Subject: Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
From: Muchun Song <muchun.song@linux.dev>
Date: Wed, 17 Jul 2024 11:04:40 +0800
To: Kairui Song <ryncsn@gmail.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
 Andrew Morton <akpm@linux-foundation.org>,
 Matthew Wilcox <willy@infradead.org>,
 Johannes Weiner <hannes@cmpxchg.org>,
 Roman Gushchin <roman.gushchin@linux.dev>,
 Waiman Long <longman@redhat.com>,
 Shakeel Butt <shakeelb@google.com>,
 Nhat Pham <nphamcs@gmail.com>,
 Michal Hocko <mhocko@suse.com>,
 Chengming Zhou <zhouchengming@bytedance.com>,
 Qi Zheng <zhengqi.arch@bytedance.com>,
 Chris Li <chrisl@kernel.org>,
 Yosry Ahmed <yosryahmed@google.com>,
 "Huang, Ying" <ying.huang@intel.com>
In-Reply-To: <20240624175313.47329-6-ryncsn@gmail.com>
References: <20240624175313.47329-1-ryncsn@gmail.com> <20240624175313.47329-6-ryncsn@gmail.com>

> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
> 
> From: Kairui Song <kasong@tencent.com>
> 
> Currently, there is a lot of code for detecting reparenting races
> using kmemcg_id as the synchronization flag, and an intermediate
> table is required to record and compare the kmemcg_id.
> 
> We can simplify this by just checking the cgroup css status, and
> skipping if the cgroup is being offlined. On the reparenting side,
> ensure no allocation is ongoing and no further allocation will
> occur by using the XArray lock as a barrier.
> 
> Combined with an O(n^2) top-down walk for the allocation, we get rid
> of the intermediate table allocation completely. Despite being O(n^2),
> it should actually be faster, because it is not practical to have a
> very deep cgroup hierarchy.

I kind of agree with you, though only instinctively.
> 
> This also avoids changing kmemcg_id before reparenting, giving
> cgroups a stable index for list_lru_memcg. After this change it is
> possible for a dying cgroup to see a NULL value in the XArray
> corresponding to its kmemcg_id, because the kmemcg_id will point
> to an empty slot. In that case, just fall back to its parent.
> 
> As a result the code is simpler, and the following test also showed
> a performance gain (6 test runs):
> 
> mkdir /tmp/test-fs
> modprobe brd rd_nr=1 rd_size=16777216
> mkfs.xfs /dev/ram0
> mount -t xfs /dev/ram0 /tmp/test-fs
> worker() {
>     echo TEST-CONTENT > "/tmp/test-fs/$1"
> }
> do_test() {
>     for i in $(seq 1 2048); do
>         (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
>         worker "$i" &
>     done; wait
>     echo 1 > /proc/sys/vm/drop_caches
> }
> mkdir -p /sys/fs/cgroup/benchmark
> echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
> for i in $(seq 1 2048); do
>     rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
>     mkdir -p "/sys/fs/cgroup/benchmark/$i"
> done
> time do_test
> 
> Before:
> real 0m5.932s user 0m2.366s sys 0m5.583s
> real 0m5.939s user 0m2.347s sys 0m5.597s
> real 0m6.149s user 0m2.398s sys 0m5.761s
> real 0m5.945s user 0m2.403s sys 0m5.547s
> real 0m5.925s user 0m2.293s sys 0m5.651s
> real 0m6.017s user 0m2.367s sys 0m5.686s
> 
> After:
> real 0m5.712s user 0m2.343s sys 0m5.307s
> real 0m5.885s user 0m2.326s sys 0m5.518s
> real 0m5.694s user 0m2.347s sys 0m5.264s
> real 0m5.865s user 0m2.300s sys 0m5.545s
> real 0m5.748s user 0m2.273s sys 0m5.424s
> real 0m5.756s user 0m2.318s sys 0m5.398s
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/list_lru.c | 182 ++++++++++++++++++++++----------------------------
> mm/zswap.c    |   7 +-
> 2 files changed, 81 insertions(+), 108 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 4c619857e916..ac8aec8451dd 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>  	}
>  	return &lru->node[nid].lru;
>  }
> +
> +static inline struct list_lru_one *
> +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
> +{
> +	struct list_lru_one *l;
> +again:
> +	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +	if (likely(l))
> +		return l;
> +
> +	memcg = parent_mem_cgroup(memcg);
> +	WARN_ON(!css_is_dying(&memcg->css));
> +	goto again;

Just want to confirm: this will loop at most twice, right?
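For reference, here is a minimal userspace sketch of why that bound should hold. The toy_* names are hypothetical stand-ins for the kernel types, not the real API; the idea is that a dying cgroup's slot is only erased after its entries were spliced into the parent, so the parent's slot stays populated and a single parent step resolves the lookup:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel types, for illustration only. */
struct toy_memcg {
	struct toy_memcg *parent;
	int dying;	/* models css_is_dying() */
	void *mlru;	/* models this cgroup's slot in lru->xa */
};

/* Mirrors the fallback in list_lru_from_memcg(): walk up while the
 * current cgroup's slot is empty, which should only happen when the
 * cgroup is dying (hence the WARN_ON in the patch). */
static void *toy_lookup(struct toy_memcg *memcg)
{
	while (!memcg->mlru) {
		assert(memcg->dying);
		memcg = memcg->parent;
	}
	return memcg->mlru;
}

int main(void)
{
	char parent_slot, child_slot;
	struct toy_memcg root = { .parent = NULL, .mlru = &parent_slot };
	struct toy_memcg child = { .parent = &root, .mlru = &child_slot };

	/* Reparenting erases the child's slot only after splicing its
	 * entries into the parent, so the parent stays populated and
	 * the walk ends after a single parent step. */
	child.dying = 1;
	child.mlru = NULL;
	assert(toy_lookup(&child) == (void *)&parent_slot);
	printf("fallback resolved after one parent step\n");
	return 0;
}
```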
> +}
>  #else
>  static void list_lru_register(struct list_lru *lru)
>  {
> @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>  {
>  	return &lru->node[nid].lru;
>  }
> +
> +static inline struct list_lru_one *
> +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
> +{
> +	return &lru->node[nid].lru;
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
> 
>  bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
> @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
> 
>  	spin_lock(&nlru->lock);
>  	if (list_empty(item)) {
> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		l = list_lru_from_memcg(lru, nid, memcg);
>  		list_add_tail(item, &l->list);
>  		/* Set shrinker bit if the first element was added */
>  		if (!l->nr_items++)
> @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
> 
>  	spin_lock(&nlru->lock);
>  	if (!list_empty(item)) {
> -		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		l = list_lru_from_memcg(lru, nid, memcg);
>  		list_del_init(item);
>  		l->nr_items--;
>  		nlru->nr_items--;
> @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
>  	return mlru;
>  }
> 
> -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
> -{
> -	struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
> -
> -	/*
> -	 * The __list_lru_walk_one() can walk the list of this node.
> -	 * We need kvfree_rcu() here. And the walking of the list
> -	 * is under lru->node[nid]->lock, which can serve as a RCU
> -	 * read-side critical section.
> -	 */
> -	if (mlru)
> -		kvfree_rcu(mlru, rcu);
> -}
> -
>  static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
>  {
>  	if (memcg_aware)
> @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
>  }
> 
>  static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> -					 int src_idx, struct mem_cgroup *dst_memcg)
> +					 struct list_lru_one *src,
> +					 struct mem_cgroup *dst_memcg)
>  {
>  	struct list_lru_node *nlru = &lru->node[nid];
> -	int dst_idx = dst_memcg->kmemcg_id;
> -	struct list_lru_one *src, *dst;
> +	struct list_lru_one *dst;
> 
>  	/*
>  	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
>  	 * we have to use IRQ-safe primitives here to avoid deadlock.
>  	 */
>  	spin_lock_irq(&nlru->lock);
> -
> -	src = list_lru_from_memcg_idx(lru, nid, src_idx);
> -	if (!src)
> -		goto out;
> -	dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
> +	dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
> 
>  	list_splice_init(&src->list, &dst->list);
> 
> @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>  		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>  		src->nr_items = 0;
>  	}
> -out:
>  	spin_unlock_irq(&nlru->lock);
>  }
> 
>  void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>  {
> -	struct cgroup_subsys_state *css;
>  	struct list_lru *lru;
> -	int src_idx = memcg->kmemcg_id, i;
> -
> -	/*
> -	 * Change kmemcg_id of this cgroup and all its descendants to the
> -	 * parent's id, and then move all entries from this cgroup's list_lrus
> -	 * to ones of the parent.
> -	 */
> -	rcu_read_lock();
> -	css_for_each_descendant_pre(css, &memcg->css) {
> -		struct mem_cgroup *child;
> -
> -		child = mem_cgroup_from_css(css);
> -		WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
> -	}
> -	rcu_read_unlock();
> +	int i;
> 
> -	/*
> -	 * With kmemcg_id set to parent, holding the lru lock below can
> -	 * prevent list_lru_{add,del,isolate} from touching the lru, safe
> -	 * to reparent.
> -	 */
>  	mutex_lock(&list_lrus_mutex);
>  	list_for_each_entry(lru, &memcg_list_lrus, list) {
> +		struct list_lru_memcg *mlru;
> +		XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
> +
> +		/*
> +		 * Lock the Xarray to ensure no on going allocation and

ongoing?

I'd like to rewrite the comment here, like:

  The XArray lock is used to make sure that future observers will see
  that the current memory cgroup is dying (i.e. css_is_dying() will
  return true), which is significant for memcg_list_lru_alloc() to make
  sure it will not allocate a new mlru for this dead memory cgroup,
  which we just free below.

> +		 * further allocation will see css_is_dying().
> +		 */
> +		xas_lock_irq(&xas);
> +		mlru = xas_load(&xas);
> +		if (mlru)
> +			xas_store(&xas, NULL);
> +		xas_unlock_irq(&xas);

The above block could be replaced with xa_erase_irq(), which is simpler,
right?

> +		if (!mlru)
> +			continue;
> +
> +		/*
> +		 * With Xarray value set to NULL, holding the lru lock below
> +		 * prevents list_lru_{add,del,isolate} from touching the lru,
> +		 * safe to reparent.
> +		 */
>  		for_each_node(i)
> -			memcg_reparent_list_lru_node(lru, i, src_idx, parent);
> +			memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
> 
>  		/*
>  		 * Here all list_lrus corresponding to the cgroup are guaranteed
>  		 * to remain empty, we can safely free this lru, any further
>  		 * memcg_list_lru_alloc() call will simply bail out.
>  		 */
> -		memcg_list_lru_free(lru, src_idx);
> +		kvfree_rcu(mlru, rcu);
>  	}
>  	mutex_unlock(&list_lrus_mutex);
>  }
> @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
>  int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
>  			 gfp_t gfp)
>  {
> -	int i;
>  	unsigned long flags;
> -	struct list_lru_memcg_table {
> -		struct list_lru_memcg *mlru;
> -		struct mem_cgroup *memcg;
> -	} *table;
> +	struct list_lru_memcg *mlru;
> +	struct mem_cgroup *pos, *parent;
>  	XA_STATE(xas, &lru->xa, 0);
> 
>  	if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
>  		return 0;
> 
>  	gfp &= GFP_RECLAIM_MASK;
> -	table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
> -	if (!table)
> -		return -ENOMEM;
> -
>  	/*
>  	 * Because the list_lru can be reparented to the parent cgroup's
>  	 * list_lru, we should make sure that this cgroup and all its
>  	 * ancestors have allocated list_lru_memcg.
>  	 */
> -	for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
> -		if (memcg_list_lru_allocated(memcg, lru))
> -			break;
> -
> -		table[i].memcg = memcg;
> -		table[i].mlru = memcg_init_list_lru_one(gfp);
> -		if (!table[i].mlru) {
> -			while (i--)
> -				kfree(table[i].mlru);
> -			kfree(table);
> -			return -ENOMEM;
> +	do {
> +		/*
> +		 * Keep finding the farest parent that wasn't populated
> +		 * until found memcg itself.
> +		 */
> +		pos = memcg;
> +		parent = parent_mem_cgroup(pos);
> +		while (parent && !memcg_list_lru_allocated(parent, lru)) {

The check of @parent is unnecessary, since memcg_list_lru_allocated()
always returns true for the root memory cgroup. Right?
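If so, here is a tiny userspace sketch of the simplified loop shape. The toy_* names are hypothetical stand-ins, and it assumes the caller has already handled the case where memcg itself is populated (as the early memcg_list_lru_allocated() check above does), so the parent pointer dereferenced in the loop is never NULL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel structures. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL only for the root */
	int allocated;			/* models memcg_list_lru_allocated() */
};

/* Walk up to the farthest ancestor whose slot is not yet populated.
 * If the root always reports "allocated", the walk must stop at or
 * below it, so no NULL check on the parent is needed in the loop. */
static struct toy_memcg *farthest_unallocated(struct toy_memcg *memcg)
{
	struct toy_memcg *pos = memcg;
	struct toy_memcg *parent = pos->parent;

	while (!parent->allocated) {
		pos = parent;
		parent = pos->parent;
	}
	return pos;
}

int main(void)
{
	struct toy_memcg root = { .parent = NULL, .allocated = 1 };
	struct toy_memcg a = { .parent = &root, .allocated = 0 };
	struct toy_memcg b = { .parent = &a, .allocated = 0 };

	/* From b the walk stops at a, since the root is "allocated". */
	assert(farthest_unallocated(&b) == &a);
	printf("walk stopped below the root\n");
	return 0;
}
```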
> +			pos = parent;
> +			parent = parent_mem_cgroup(pos);
>  		}
> -	}
> 
> -	xas_lock_irqsave(&xas, flags);
> -	while (i--) {
> -		int index = READ_ONCE(table[i].memcg->kmemcg_id);
> -		struct list_lru_memcg *mlru = table[i].mlru;
> -
> -		xas_set(&xas, index);
> -retry:
> -		if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
> -			kfree(mlru);
> -		} else {
> -			xas_store(&xas, mlru);
> -			if (xas_error(&xas) == -ENOMEM) {
> +		mlru = memcg_init_list_lru_one(gfp);
> +		do {
> +			bool alloced = false;
> +
> +			xas_set(&xas, pos->kmemcg_id);
> +			xas_lock_irqsave(&xas, flags);
> +			if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
> +				xas_store(&xas, mlru);
> +				alloced = true;
> +			}
> +			if (!alloced || xas_error(&xas)) {
>  				xas_unlock_irqrestore(&xas, flags);
> -				if (xas_nomem(&xas, gfp))
> -					xas_set_err(&xas, 0);
> -				xas_lock_irqsave(&xas, flags);
> -				/*
> -				 * The xas lock has been released, this memcg
> -				 * can be reparented before us. So reload
> -				 * memcg id. More details see the comments
> -				 * in memcg_reparent_list_lrus().
> -				 */
> -				index = READ_ONCE(table[i].memcg->kmemcg_id);
> -				if (index < 0)
> -					xas_set_err(&xas, 0);
> -				else if (!xas_error(&xas) && index != xas.xa_index)
> -					xas_set(&xas, index);
> -				goto retry;
> +				kfree(mlru);
> +				goto out;

1) If xas_error() returns true, we should continue, since xas_nomem()
   will handle the -ENOMEM case for us.
2) If xas_load() returns a non-NULL pointer, meaning a concurrent thread
   has already assigned an mlru at this index, we should also continue,
   since we might not have assigned an mlru for the leaf memory cgroup
   yet.

Suppose the following hierarchy:

      A
     / \
    B   C

If two threads are allocating an mlru for A concurrently (say, on behalf
of B and C respectively), then either B or C will be left without an
mlru. Here is a code snippet FYI. Thanks.

```c
struct list_lru_memcg *mlru = NULL;

do {
	/* ... */
	if (!mlru)
		mlru = memcg_init_list_lru_one(gfp);

	xas_set(&xas, pos->kmemcg_id);
	do {
		xas_lock_irqsave(&xas, flags);
		if (css_is_dying(&pos->css) || xas_load(&xas))
			goto unlock;
		xas_store(&xas, mlru);
		if (!xas_error(&xas))
			mlru = NULL;
unlock:
		xas_unlock_irqrestore(&xas, flags);
	} while (xas_nomem(&xas, gfp));
} while (pos != memcg);

if (mlru)
	kfree(mlru);
/* ... */
```

>  			}
> -		}
> -	}
> -	/* xas_nomem() is used to free memory instead of memory allocation. */
> -	if (xas.xa_alloc)
> -		xas_nomem(&xas, gfp);
> -	xas_unlock_irqrestore(&xas, flags);
> -	kfree(table);
> -
> +			xas_unlock_irqrestore(&xas, flags);
> +		} while (xas_nomem(&xas, gfp));
> +	} while (pos != memcg);
> +out:
>  	return xas_error(&xas);
>  }
>  #else
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..c6e2256347ff 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -718,12 +718,11 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> 
>  	/*
>  	 * Note that it is safe to use rcu_read_lock() here, even in the face of
> -	 * concurrent memcg offlining. Thanks to the memcg->kmemcg_id indirection
> -	 * used in list_lru lookup, only two scenarios are possible:
> +	 * concurrent memcg offlining:
>  	 *
> -	 * 1. list_lru_add() is called before memcg->kmemcg_id is updated. The
> +	 * 1. list_lru_add() is called before list_lru_memcg is erased. The
>  	 *    new entry will be reparented to memcg's parent's list_lru.
> -	 * 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
> +	 * 2. list_lru_add() is called after list_lru_memcg is erased. The
>  	 *    new entry will be added directly to memcg's parent's list_lru.
>  	 *
>  	 * Similar reasoning holds for list_lru_del().
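As a side note, here is a minimal userspace sketch of the two interleavings that comment describes. The toy_* names are hypothetical stand-ins and the real locking is elided; it only shows that both orderings land the entry on the parent's lru:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel structures, locking elided. */
struct toy_lru { int nr_items; };

struct toy_slot {
	struct toy_lru *lru;		/* NULL once erased by reparenting */
	struct toy_lru *parent_lru;	/* fallback target */
};

/* Models list_lru_add(): use the cgroup's lru while its slot is still
 * populated, otherwise fall back to the parent's lru. */
static void toy_add(struct toy_slot *slot)
{
	struct toy_lru *l = slot->lru ? slot->lru : slot->parent_lru;

	l->nr_items++;
}

/* Models reparenting: splice the child's items into the parent, then
 * erase the child's slot. */
static void toy_reparent(struct toy_slot *slot)
{
	slot->parent_lru->nr_items += slot->lru->nr_items;
	slot->lru->nr_items = 0;
	slot->lru = NULL;
}

int main(void)
{
	struct toy_lru child = { 0 }, parent = { 0 };
	struct toy_slot slot = { .lru = &child, .parent_lru = &parent };

	toy_add(&slot);		/* case 1: added before the erase ... */
	toy_reparent(&slot);	/* ... and then moved to the parent  */
	toy_add(&slot);		/* case 2: added after the erase goes
				 * to the parent directly */
	assert(parent.nr_items == 2 && child.nr_items == 0);
	printf("both interleavings end up on the parent's lru\n");
	return 0;
}
```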
> -- 
> 2.45.2