* [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-07-17 3:25 ` Muchun Song
2024-07-19 1:34 ` Shakeel Butt
2024-06-24 17:53 ` [PATCH 2/7] mm/list_lru: don't pass unnecessary key parameters Kairui Song
` (7 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, the (shadow) nodes of the swap cache are not accounted to
their corresponding memory cgroup; instead, they are all accounted
to the root cgroup. This leads to inaccurate accounting and
ineffective reclaiming.
This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
convergence regression"), where page cache shadow nodes were incorrectly
accounted. That was due to the accidental dropping of the accounting
flag during the XArray conversion in commit a28334862993
("page cache: Finish XArray conversion").
However, this issue has a different cause. Swap cache shadow nodes were
never accounted even before the XArray conversion, since they did not
exist until commit 3852f6768ede ("mm/swapcache: support to handle the
shadow entries"), which came years after the XArray conversion. Without
shadow nodes, swap cache nodes can only use a very small amount of memory,
so reclaiming them is not very important.
But now, with shadow nodes, the swap cache can take up a lot of memory
if a cgroup swaps out a large amount of memory.
This can be easily fixed by adding proper flags and LRU setters.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_state.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 642c30d8376c..68afaaf1c09b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,6 +95,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
void *old;
xas_set_update(&xas, workingset_update_node);
+ xas_set_lru(&xas, &shadow_nodes);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -714,7 +715,7 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
return -ENOMEM;
for (i = 0; i < nr; i++) {
space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+ xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
atomic_set(&space->i_mmap_writable, 0);
space->a_ops = &swap_aops;
/* swap cache doesn't use writeback related tags */
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware
2024-06-24 17:53 ` [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware Kairui Song
@ 2024-07-17 3:25 ` Muchun Song
2024-07-18 11:33 ` Kairui Song
2024-07-19 1:34 ` Shakeel Butt
1 sibling, 1 reply; 22+ messages in thread
From: Muchun Song @ 2024-07-17 3:25 UTC (permalink / raw)
To: Kairui Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently, the (shadow) nodes of the swap cache are not accounted to
> their corresponding memory cgroup, instead, they are all accounted
> to the root cgroup. This leads to inaccurate accounting and
> ineffective reclaiming.
>
> This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> convergence regression"), where page cache shadow nodes were incorrectly
> accounted. That was due to the accidental dropping of the accounting
> flag during the XArray conversion in commit a28334862993
> ("page cache: Finish XArray conversion").
>
> However, this fix has a different cause. Swap cache shadow nodes were
> never accounted even before the XArray conversion, since they did not
> exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> shadow entries"), which was years after the XArray conversion. Without
> shadow nodes, swap cache nodes can only use a very small amount of memory
> and so reclaiming is not very important.
>
> But now with shadow nodes, if a cgroup swaps out a large amount of
> memory, it could take up a lot of memory.
>
> This can be easily fixed by adding proper flags and LRU setters.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
I haven't looked at the details of this patch yet, but I think it is not
related to this series; it could be sent as an individual patch (if it is a real
problem, I think it will be quickly and easily merged).
Thanks.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware
2024-07-17 3:25 ` Muchun Song
@ 2024-07-18 11:33 ` Kairui Song
0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2024-07-18 11:33 UTC (permalink / raw)
To: Muchun Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
On Wed, Jul 17, 2024 at 11:25 AM Muchun Song <muchun.song@linux.dev> wrote:
> > On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, the (shadow) nodes of the swap cache are not accounted to
> > their corresponding memory cgroup, instead, they are all accounted
> > to the root cgroup. This leads to inaccurate accounting and
> > ineffective reclaiming.
> >
> > This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> > convergence regression"), where page cache shadow nodes were incorrectly
> > accounted. That was due to the accidental dropping of the accounting
> > flag during the XArray conversion in commit a28334862993
> > ("page cache: Finish XArray conversion").
> >
> > However, this fix has a different cause. Swap cache shadow nodes were
> > never accounted even before the XArray conversion, since they did not
> > exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> > shadow entries"), which was years after the XArray conversion. Without
> > shadow nodes, swap cache nodes can only use a very small amount of memory
> > and so reclaiming is not very important.
> >
> > But now with shadow nodes, if a cgroup swaps out a large amount of
> > memory, it could take up a lot of memory.
> >
> > This can be easily fixed by adding proper flags and LRU setters.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> I haven't looked at the details of this patch yet, but I think it is not
> related to this series, it could be as an individual patch (If it is a real
> problem, I think it will be quickly and easily merged).
>
> Thanks.
>
The gain of this patch is not as big on its own as it is as part of the
series, but yes, it can be sent independently.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware
2024-06-24 17:53 ` [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware Kairui Song
2024-07-17 3:25 ` Muchun Song
@ 2024-07-19 1:34 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2024-07-19 1:34 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Johannes Weiner,
Roman Gushchin, Waiman Long, Shakeel Butt, Nhat Pham,
Michal Hocko, Chengming Zhou, Qi Zheng, Muchun Song, Chris Li,
Yosry Ahmed, Huang, Ying
On Tue, Jun 25, 2024 at 01:53:07AM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, the (shadow) nodes of the swap cache are not accounted to
> their corresponding memory cgroup, instead, they are all accounted
> to the root cgroup. This leads to inaccurate accounting and
> ineffective reclaiming.
>
> This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> convergence regression"), where page cache shadow nodes were incorrectly
> accounted. That was due to the accidental dropping of the accounting
> flag during the XArray conversion in commit a28334862993
> ("page cache: Finish XArray conversion").
>
> However, this fix has a different cause. Swap cache shadow nodes were
> never accounted even before the XArray conversion, since they did not
> exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> shadow entries"), which was years after the XArray conversion. Without
> shadow nodes, swap cache nodes can only use a very small amount of memory
> and so reclaiming is not very important.
>
> But now with shadow nodes, if a cgroup swaps out a large amount of
> memory, it could take up a lot of memory.
>
> This can be easily fixed by adding proper flags and LRU setters.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
As Muchun said, please send this patch separately. However, as I think
more about this patch, I believe it is incomplete, and the full
solution may be much more involved and might not be worth it.
One of the differences between the file page cache and the swap page cache
is the context in which the underlying xarray node can be allocated. For
file pages, such allocations happen in the context of the process/cgroup
owning the file pages, and thus the memcg of current is used for
charging. However, for the swap cache, xarray node allocations happen when
a page is added to the swap cache, which often happens during reclaim, and
reclaim can run in any context, i.e. a kernel thread, an unrelated
process/cgroup, etc. So, we may charge an unrelated memcg for these nodes.
Now you may argue that we can use the memcg of the page which is being
swapped out, but then you may have an xarray node containing pointers to
pages (or shadows) of different memcgs. Who should be charged?
BTW, a filesystem shared between multiple cgroups can face this issue as
well, but users have more control over a shared filesystem than over the
shared swap address space.
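To make the difference concrete, the two allocation paths look roughly like
this (only a sketch, call chains abbreviated from the current kernel, not
part of this patch):

```c
/*
 * File page cache: the xa_node is allocated while the owning task
 * inserts the folio, so XA_FLAGS_ACCOUNT (which adds __GFP_ACCOUNT to
 * node allocations) charges the right memcg:
 *
 *	filemap_add_folio()
 *	  __filemap_add_folio()
 *	    xas_store() / xas_nomem()	-> charged to current == owner
 *
 * Swap cache: the xa_node is allocated when the folio enters the swap
 * cache during reclaim, which may run in kswapd or in any task doing
 * direct reclaim:
 *
 *	shrink_folio_list()
 *	  add_to_swap()
 *	    add_to_swap_cache()
 *	      xas_store() / xas_nomem()	-> charged to current, which may
 *					   be unrelated to the folio's memcg
 */
```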
Shakeel
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/7] mm/list_lru: don't pass unnecessary key parameters
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
2024-06-24 17:53 ` [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-06-24 17:53 ` [PATCH 3/7] mm/list_lru: don't export list_lru_add Kairui Song
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
When LOCKDEP is not enabled, lock_class_key is an empty struct that
is never used. But the list_lru initialization function still takes
a placeholder pointer as a parameter, and the compiler cannot optimize
it out because the function is not static and is exported.
Remove this parameter and move it inside the list_lru struct. Only
use it when LOCKDEP is enabled. Kernel builds with LOCKDEP will be
slightly larger, while !LOCKDEP builds (the common case) will be
slightly smaller.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/list_lru.h | 18 +++++++++++++++---
mm/list_lru.c | 9 +++++----
mm/workingset.c | 4 ++--
3 files changed, 22 insertions(+), 9 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 792b67ceb631..2e5132905f42 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -56,16 +56,28 @@ struct list_lru {
bool memcg_aware;
struct xarray xa;
#endif
+#ifdef CONFIG_LOCKDEP
+ struct lock_class_key *key;
+#endif
};
void list_lru_destroy(struct list_lru *lru);
int __list_lru_init(struct list_lru *lru, bool memcg_aware,
- struct lock_class_key *key, struct shrinker *shrinker);
+ struct shrinker *shrinker);
#define list_lru_init(lru) \
- __list_lru_init((lru), false, NULL, NULL)
+ __list_lru_init((lru), false, NULL)
#define list_lru_init_memcg(lru, shrinker) \
- __list_lru_init((lru), true, NULL, shrinker)
+ __list_lru_init((lru), true, shrinker)
+
+static inline int list_lru_init_memcg_key(struct list_lru *lru, struct shrinker *shrinker,
+ struct lock_class_key *key)
+{
+#ifdef CONFIG_LOCKDEP
+ lru->key = key;
+#endif
+ return list_lru_init_memcg(lru, shrinker);
+}
int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
gfp_t gfp);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 3fd64736bc45..264713caa713 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -546,8 +546,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
}
#endif /* CONFIG_MEMCG_KMEM */
-int __list_lru_init(struct list_lru *lru, bool memcg_aware,
- struct lock_class_key *key, struct shrinker *shrinker)
+int __list_lru_init(struct list_lru *lru, bool memcg_aware, struct shrinker *shrinker)
{
int i;
@@ -567,8 +566,10 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
for_each_node(i) {
spin_lock_init(&lru->node[i].lock);
- if (key)
- lockdep_set_class(&lru->node[i].lock, key);
+#ifdef CONFIG_LOCKDEP
+ if (lru->key)
+ lockdep_set_class(&lru->node[i].lock, lru->key);
+#endif
init_one_lru(&lru->node[i].lru);
}
diff --git a/mm/workingset.c b/mm/workingset.c
index c22adb93622a..1801fbe5183c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -815,8 +815,8 @@ static int __init workingset_init(void)
if (!workingset_shadow_shrinker)
goto err;
- ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key,
- workingset_shadow_shrinker);
+ ret = list_lru_init_memcg_key(&shadow_nodes, workingset_shadow_shrinker,
+ &shadow_nodes_key);
if (ret)
goto err_list_lru;
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/7] mm/list_lru: don't export list_lru_add
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
2024-06-24 17:53 ` [PATCH 1/7] mm/swap, workingset: make anon workingset nodes memcg aware Kairui Song
2024-06-24 17:53 ` [PATCH 2/7] mm/list_lru: don't pass unnecessary key parameters Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-07-17 3:12 ` Muchun Song
2024-06-24 17:53 ` [PATCH 4/7] mm/list_lru: code clean up for reparenting Kairui Song
` (5 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
It's no longer used by any module, so just remove it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/list_lru.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 264713caa713..9d9ec8661354 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -105,7 +105,6 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
spin_unlock(&nlru->lock);
return false;
}
-EXPORT_SYMBOL_GPL(list_lru_add);
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
{
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 3/7] mm/list_lru: don't export list_lru_add
2024-06-24 17:53 ` [PATCH 3/7] mm/list_lru: don't export list_lru_add Kairui Song
@ 2024-07-17 3:12 ` Muchun Song
0 siblings, 0 replies; 22+ messages in thread
From: Muchun Song @ 2024-07-17 3:12 UTC (permalink / raw)
To: Kairui Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> It's no longer used by any module, just remove it.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/7] mm/list_lru: code clean up for reparenting
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (2 preceding siblings ...)
2024-06-24 17:53 ` [PATCH 3/7] mm/list_lru: don't export list_lru_add Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-07-15 9:10 ` Muchun Song
2024-06-24 17:53 ` [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation Kairui Song
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
No functional change, just a change of code structure and a comment fix.
The list lrus are not empty until the memcg_reparent_list_lru_node() calls
are all done, so the comments in memcg_offline_kmem() were slightly
inaccurate.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/list_lru.c | 39 +++++++++++++++++----------------------
mm/memcontrol.c | 7 -------
2 files changed, 17 insertions(+), 29 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 9d9ec8661354..4c619857e916 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -405,35 +405,16 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
spin_unlock_irq(&nlru->lock);
}
-static void memcg_reparent_list_lru(struct list_lru *lru,
- int src_idx, struct mem_cgroup *dst_memcg)
-{
- int i;
-
- for_each_node(i)
- memcg_reparent_list_lru_node(lru, i, src_idx, dst_memcg);
-
- memcg_list_lru_free(lru, src_idx);
-}
-
void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
struct cgroup_subsys_state *css;
struct list_lru *lru;
- int src_idx = memcg->kmemcg_id;
+ int src_idx = memcg->kmemcg_id, i;
/*
* Change kmemcg_id of this cgroup and all its descendants to the
* parent's id, and then move all entries from this cgroup's list_lrus
* to ones of the parent.
- *
- * After we have finished, all list_lrus corresponding to this cgroup
- * are guaranteed to remain empty. So we can safely free this cgroup's
- * list lrus in memcg_list_lru_free().
- *
- * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
- * from allocating list lrus for this cgroup after memcg_list_lru_free()
- * call.
*/
rcu_read_lock();
css_for_each_descendant_pre(css, &memcg->css) {
@@ -444,9 +425,23 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
}
rcu_read_unlock();
+ /*
+ * With kmemcg_id set to parent, holding the lru lock below can
+ * prevent list_lru_{add,del,isolate} from touching the lru, safe
+ * to reparent.
+ */
mutex_lock(&list_lrus_mutex);
- list_for_each_entry(lru, &memcg_list_lrus, list)
- memcg_reparent_list_lru(lru, src_idx, parent);
+ list_for_each_entry(lru, &memcg_list_lrus, list) {
+ for_each_node(i)
+ memcg_reparent_list_lru_node(lru, i, src_idx, parent);
+
+ /*
+ * Here all list_lrus corresponding to the cgroup are guaranteed
+ * to remain empty, we can safely free this lru, any further
+ * memcg_list_lru_alloc() call will simply bail out.
+ */
+ memcg_list_lru_free(lru, src_idx);
+ }
mutex_unlock(&list_lrus_mutex);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71fe2a95b8bd..fc35c1d3f109 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4187,13 +4187,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
parent = root_mem_cgroup;
memcg_reparent_objcgs(memcg, parent);
-
- /*
- * After we have finished memcg_reparent_objcgs(), all list_lrus
- * corresponding to this cgroup are guaranteed to remain empty.
- * The ordering is imposed by list_lru_node->lock taken by
- * memcg_reparent_list_lrus().
- */
memcg_reparent_list_lrus(memcg, parent);
}
#else
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 4/7] mm/list_lru: code clean up for reparenting
2024-06-24 17:53 ` [PATCH 4/7] mm/list_lru: code clean up for reparenting Kairui Song
@ 2024-07-15 9:10 ` Muchun Song
2024-07-16 8:15 ` Kairui Song
0 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2024-07-15 9:10 UTC (permalink / raw)
To: Kairui Song
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Chris Li, Yosry Ahmed, Huang, Ying,
linux-mm
On 2024/6/25 01:53, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, just change of code structure and fix comment.
>
> The list lrus are not empty until memcg_reparent_list_lru_node() calls
> are all done, so the comments in memcg_offline_kmem were slightly
> inaccurate.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/list_lru.c | 39 +++++++++++++++++----------------------
> mm/memcontrol.c | 7 -------
> 2 files changed, 17 insertions(+), 29 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 9d9ec8661354..4c619857e916 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -405,35 +405,16 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> spin_unlock_irq(&nlru->lock);
> }
>
> -static void memcg_reparent_list_lru(struct list_lru *lru,
> - int src_idx, struct mem_cgroup *dst_memcg)
> -{
> - int i;
> -
> - for_each_node(i)
> - memcg_reparent_list_lru_node(lru, i, src_idx, dst_memcg);
> -
> - memcg_list_lru_free(lru, src_idx);
> -}
> -
> void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> struct cgroup_subsys_state *css;
> struct list_lru *lru;
> - int src_idx = memcg->kmemcg_id;
> + int src_idx = memcg->kmemcg_id, i;
>
> /*
> * Change kmemcg_id of this cgroup and all its descendants to the
> * parent's id, and then move all entries from this cgroup's list_lrus
> * to ones of the parent.
> - *
> - * After we have finished, all list_lrus corresponding to this cgroup
> - * are guaranteed to remain empty. So we can safely free this cgroup's
> - * list lrus in memcg_list_lru_free().
> - *
> - * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
> - * from allocating list lrus for this cgroup after memcg_list_lru_free()
> - * call.
> */
> rcu_read_lock();
> css_for_each_descendant_pre(css, &memcg->css) {
> @@ -444,9 +425,23 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
> }
> rcu_read_unlock();
>
> + /*
> + * With kmemcg_id set to parent, holding the lru lock below can
The word of "below" confuses me a bit. "lru lock below" refers to 1)
"list_lrus_mutex"
or the 2) lock in "struct list_lru_node"?
I think it is 2), right?
The chnges LGTM.
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
> + * prevent list_lru_{add,del,isolate} from touching the lru, safe
> + * to reparent.
> + */
> mutex_lock(&list_lrus_mutex);
> - list_for_each_entry(lru, &memcg_list_lrus, list)
> - memcg_reparent_list_lru(lru, src_idx, parent);
> + list_for_each_entry(lru, &memcg_list_lrus, list) {
> + for_each_node(i)
> + memcg_reparent_list_lru_node(lru, i, src_idx, parent);
> +
> + /*
> + * Here all list_lrus corresponding to the cgroup are guaranteed
> + * to remain empty, we can safely free this lru, any further
> + * memcg_list_lru_alloc() call will simply bail out.
> + */
> + memcg_list_lru_free(lru, src_idx);
> + }
> mutex_unlock(&list_lrus_mutex);
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..fc35c1d3f109 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4187,13 +4187,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
> parent = root_mem_cgroup;
>
> memcg_reparent_objcgs(memcg, parent);
> -
> - /*
> - * After we have finished memcg_reparent_objcgs(), all list_lrus
> - * corresponding to this cgroup are guaranteed to remain empty.
> - * The ordering is imposed by list_lru_node->lock taken by
> - * memcg_reparent_list_lrus().
> - */
> memcg_reparent_list_lrus(memcg, parent);
> }
> #else
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/7] mm/list_lru: code clean up for reparenting
2024-07-15 9:10 ` Muchun Song
@ 2024-07-16 8:15 ` Kairui Song
0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2024-07-16 8:15 UTC (permalink / raw)
To: Muchun Song
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Chris Li, Yosry Ahmed, Huang, Ying,
linux-mm
On Mon, Jul 15, 2024 at 5:12 PM Muchun Song <muchun.song@linux.dev> wrote:
> On 2024/6/25 01:53, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, just change of code structure and fix comment.
> >
> > The list lrus are not empty until memcg_reparent_list_lru_node() calls
> > are all done, so the comments in memcg_offline_kmem were slightly
> > inaccurate.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/list_lru.c | 39 +++++++++++++++++----------------------
> > mm/memcontrol.c | 7 -------
> > 2 files changed, 17 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 9d9ec8661354..4c619857e916 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -405,35 +405,16 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> > spin_unlock_irq(&nlru->lock);
> > }
> >
> > -static void memcg_reparent_list_lru(struct list_lru *lru,
> > - int src_idx, struct mem_cgroup *dst_memcg)
> > -{
> > - int i;
> > -
> > - for_each_node(i)
> > - memcg_reparent_list_lru_node(lru, i, src_idx, dst_memcg);
> > -
> > - memcg_list_lru_free(lru, src_idx);
> > -}
> > -
> > void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> > {
> > struct cgroup_subsys_state *css;
> > struct list_lru *lru;
> > - int src_idx = memcg->kmemcg_id;
> > + int src_idx = memcg->kmemcg_id, i;
> >
> > /*
> > * Change kmemcg_id of this cgroup and all its descendants to the
> > * parent's id, and then move all entries from this cgroup's list_lrus
> > * to ones of the parent.
> > - *
> > - * After we have finished, all list_lrus corresponding to this cgroup
> > - * are guaranteed to remain empty. So we can safely free this cgroup's
> > - * list lrus in memcg_list_lru_free().
> > - *
> > - * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
> > - * from allocating list lrus for this cgroup after memcg_list_lru_free()
> > - * call.
> > */
> > rcu_read_lock();
> > css_for_each_descendant_pre(css, &memcg->css) {
> > @@ -444,9 +425,23 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
> > }
> > rcu_read_unlock();
> >
> > + /*
> > + * With kmemcg_id set to parent, holding the lru lock below can
>
> The word of "below" confuses me a bit. "lru lock below" refers to 1)
> "list_lrus_mutex"
> or the 2) lock in "struct list_lru_node"?
>
> I think it is 2), right?
Yes, that's 2).
> The chnges LGTM.
>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks, I can make the comments more concrete in the next version.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (3 preceding siblings ...)
2024-06-24 17:53 ` [PATCH 4/7] mm/list_lru: code clean up for reparenting Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-07-17 3:04 ` Muchun Song
2024-06-24 17:53 ` [PATCH 6/7] mm/list_lru: split the lock to per-cgroup scope Kairui Song
` (3 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, there is a lot of code for detecting reparenting races,
using kmemcg_id as the synchronization flag, and an intermediate
table is required to record and compare the kmemcg_id.
We can simplify this by just checking the cgroup css status and
skipping if the cgroup is being offlined. On the reparenting side,
ensure no allocation is ongoing and no further allocation will occur
by using the XArray lock as a barrier.
Combined with an O(n^2) top-down walk for the allocation, we get rid
of the intermediate table allocation completely. Despite being O(n^2),
it should actually be faster in practice, because cgroup hierarchies
are rarely very deep.
This also avoids changing kmemcg_id before reparenting, so cgroups
have a stable index for list_lru_memcg. After this change
it's possible that a dying cgroup will see a NULL value in the XArray
corresponding to its kmemcg_id, because the kmemcg_id will point
to an empty slot. In that case, just fall back to its parent.
As a result the code is simpler, and the following test also showed a
performance gain (6 test runs):
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=16777216
mkfs.xfs /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
worker() {
    echo TEST-CONTENT > "/tmp/test-fs/$1"
}
do_test() {
    for i in $(seq 1 2048); do
        (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
        worker "$i" &
    done; wait
    echo 1 > /proc/sys/vm/drop_caches
}
mkdir -p /sys/fs/cgroup/benchmark
echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
for i in $(seq 1 2048); do
    rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
    mkdir -p "/sys/fs/cgroup/benchmark/$i"
done
time do_test
Before:
real 0m5.932s user 0m2.366s sys 0m5.583s
real 0m5.939s user 0m2.347s sys 0m5.597s
real 0m6.149s user 0m2.398s sys 0m5.761s
real 0m5.945s user 0m2.403s sys 0m5.547s
real 0m5.925s user 0m2.293s sys 0m5.651s
real 0m6.017s user 0m2.367s sys 0m5.686s
After:
real 0m5.712s user 0m2.343s sys 0m5.307s
real 0m5.885s user 0m2.326s sys 0m5.518s
real 0m5.694s user 0m2.347s sys 0m5.264s
real 0m5.865s user 0m2.300s sys 0m5.545s
real 0m5.748s user 0m2.273s sys 0m5.424s
real 0m5.756s user 0m2.318s sys 0m5.398s
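Averaged over the six runs, that is roughly a 3.5% reduction in real time
(about 5.98s -> 5.78s) and about 4% in sys time (about 5.64s -> 5.41s).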
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/list_lru.c | 182 ++++++++++++++++++++++----------------------------
mm/zswap.c | 7 +-
2 files changed, 81 insertions(+), 108 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 4c619857e916..ac8aec8451dd 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
}
return &lru->node[nid].lru;
}
+
+static inline struct list_lru_one *
+list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
+{
+ struct list_lru_one *l;
+again:
+ l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+ if (likely(l))
+ return l;
+
+ memcg = parent_mem_cgroup(memcg);
+ WARN_ON(!css_is_dying(&memcg->css));
+ goto again;
+}
#else
static void list_lru_register(struct list_lru *lru)
{
@@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
{
return &lru->node[nid].lru;
}
+
+static inline struct list_lru_one *
+list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
+{
+ return &lru->node[nid].lru;
+}
#endif /* CONFIG_MEMCG_KMEM */
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
@@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
spin_lock(&nlru->lock);
if (list_empty(item)) {
- l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+ l = list_lru_from_memcg(lru, nid, memcg);
list_add_tail(item, &l->list);
/* Set shrinker bit if the first element was added */
if (!l->nr_items++)
@@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
spin_lock(&nlru->lock);
if (!list_empty(item)) {
- l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+ l = list_lru_from_memcg(lru, nid, memcg);
list_del_init(item);
l->nr_items--;
nlru->nr_items--;
@@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
return mlru;
}
-static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
-{
- struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
-
- /*
- * The __list_lru_walk_one() can walk the list of this node.
- * We need kvfree_rcu() here. And the walking of the list
- * is under lru->node[nid]->lock, which can serve as a RCU
- * read-side critical section.
- */
- if (mlru)
- kvfree_rcu(mlru, rcu);
-}
-
static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
{
if (memcg_aware)
@@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
}
static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
- int src_idx, struct mem_cgroup *dst_memcg)
+ struct list_lru_one *src,
+ struct mem_cgroup *dst_memcg)
{
struct list_lru_node *nlru = &lru->node[nid];
- int dst_idx = dst_memcg->kmemcg_id;
- struct list_lru_one *src, *dst;
+ struct list_lru_one *dst;
/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
* we have to use IRQ-safe primitives here to avoid deadlock.
*/
spin_lock_irq(&nlru->lock);
-
- src = list_lru_from_memcg_idx(lru, nid, src_idx);
- if (!src)
- goto out;
- dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
+ dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
list_splice_init(&src->list, &dst->list);
@@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
src->nr_items = 0;
}
-out:
spin_unlock_irq(&nlru->lock);
}
void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
- struct cgroup_subsys_state *css;
struct list_lru *lru;
- int src_idx = memcg->kmemcg_id, i;
-
- /*
- * Change kmemcg_id of this cgroup and all its descendants to the
- * parent's id, and then move all entries from this cgroup's list_lrus
- * to ones of the parent.
- */
- rcu_read_lock();
- css_for_each_descendant_pre(css, &memcg->css) {
- struct mem_cgroup *child;
-
- child = mem_cgroup_from_css(css);
- WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
- }
- rcu_read_unlock();
+ int i;
- /*
- * With kmemcg_id set to parent, holding the lru lock below can
- * prevent list_lru_{add,del,isolate} from touching the lru, safe
- * to reparent.
- */
mutex_lock(&list_lrus_mutex);
list_for_each_entry(lru, &memcg_list_lrus, list) {
+ struct list_lru_memcg *mlru;
+ XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
+
+ /*
+ * Lock the Xarray to ensure no on going allocation and
+ * further allocation will see css_is_dying().
+ */
+ xas_lock_irq(&xas);
+ mlru = xas_load(&xas);
+ if (mlru)
+ xas_store(&xas, NULL);
+ xas_unlock_irq(&xas);
+ if (!mlru)
+ continue;
+
+ /*
+ * With Xarray value set to NULL, holding the lru lock below
+ * prevents list_lru_{add,del,isolate} from touching the lru,
+ * safe to reparent.
+ */
for_each_node(i)
- memcg_reparent_list_lru_node(lru, i, src_idx, parent);
+ memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
/*
* Here all list_lrus corresponding to the cgroup are guaranteed
* to remain empty, we can safely free this lru, any further
* memcg_list_lru_alloc() call will simply bail out.
*/
- memcg_list_lru_free(lru, src_idx);
+ kvfree_rcu(mlru, rcu);
}
mutex_unlock(&list_lrus_mutex);
}
@@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
gfp_t gfp)
{
- int i;
unsigned long flags;
- struct list_lru_memcg_table {
- struct list_lru_memcg *mlru;
- struct mem_cgroup *memcg;
- } *table;
+ struct list_lru_memcg *mlru;
+ struct mem_cgroup *pos, *parent;
XA_STATE(xas, &lru->xa, 0);
if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
return 0;
gfp &= GFP_RECLAIM_MASK;
- table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
- if (!table)
- return -ENOMEM;
-
/*
* Because the list_lru can be reparented to the parent cgroup's
* list_lru, we should make sure that this cgroup and all its
* ancestors have allocated list_lru_memcg.
*/
- for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
- if (memcg_list_lru_allocated(memcg, lru))
- break;
-
- table[i].memcg = memcg;
- table[i].mlru = memcg_init_list_lru_one(gfp);
- if (!table[i].mlru) {
- while (i--)
- kfree(table[i].mlru);
- kfree(table);
- return -ENOMEM;
+ do {
+ /*
+ * Keep finding the farest parent that wasn't populated
+ * until found memcg itself.
+ */
+ pos = memcg;
+ parent = parent_mem_cgroup(pos);
+ while (parent && !memcg_list_lru_allocated(parent, lru)) {
+ pos = parent;
+ parent = parent_mem_cgroup(pos);
}
- }
- xas_lock_irqsave(&xas, flags);
- while (i--) {
- int index = READ_ONCE(table[i].memcg->kmemcg_id);
- struct list_lru_memcg *mlru = table[i].mlru;
-
- xas_set(&xas, index);
-retry:
- if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
- kfree(mlru);
- } else {
- xas_store(&xas, mlru);
- if (xas_error(&xas) == -ENOMEM) {
+ mlru = memcg_init_list_lru_one(gfp);
+ do {
+ bool alloced = false;
+
+ xas_set(&xas, pos->kmemcg_id);
+ xas_lock_irqsave(&xas, flags);
+ if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
+ xas_store(&xas, mlru);
+ alloced = true;
+ }
+ if (!alloced || xas_error(&xas)) {
xas_unlock_irqrestore(&xas, flags);
- if (xas_nomem(&xas, gfp))
- xas_set_err(&xas, 0);
- xas_lock_irqsave(&xas, flags);
- /*
- * The xas lock has been released, this memcg
- * can be reparented before us. So reload
- * memcg id. More details see the comments
- * in memcg_reparent_list_lrus().
- */
- index = READ_ONCE(table[i].memcg->kmemcg_id);
- if (index < 0)
- xas_set_err(&xas, 0);
- else if (!xas_error(&xas) && index != xas.xa_index)
- xas_set(&xas, index);
- goto retry;
+ kfree(mlru);
+ goto out;
}
- }
- }
- /* xas_nomem() is used to free memory instead of memory allocation. */
- if (xas.xa_alloc)
- xas_nomem(&xas, gfp);
- xas_unlock_irqrestore(&xas, flags);
- kfree(table);
-
+ xas_unlock_irqrestore(&xas, flags);
+ } while (xas_nomem(&xas, gfp));
+ } while (pos != memcg);
+out:
return xas_error(&xas);
}
#else
diff --git a/mm/zswap.c b/mm/zswap.c
index a50e2986cd2f..c6e2256347ff 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -718,12 +718,11 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
/*
* Note that it is safe to use rcu_read_lock() here, even in the face of
- * concurrent memcg offlining. Thanks to the memcg->kmemcg_id indirection
- * used in list_lru lookup, only two scenarios are possible:
+ * concurrent memcg offlining:
*
- * 1. list_lru_add() is called before memcg->kmemcg_id is updated. The
+ * 1. list_lru_add() is called before list_lru_memcg is erased. The
* new entry will be reparented to memcg's parent's list_lru.
- * 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
+ * 2. list_lru_add() is called after list_lru_memcg is erased. The
* new entry will be added directly to memcg's parent's list_lru.
*
* Similar reasoning holds for list_lru_del().
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
2024-06-24 17:53 ` [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation Kairui Song
@ 2024-07-17 3:04 ` Muchun Song
2024-07-18 11:49 ` Kairui Song
0 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2024-07-17 3:04 UTC (permalink / raw)
To: Kairui Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently, there is a lot of code for detecting reparent racing
> using kmemcg_id as the synchronization flag. And an intermediate
> table is required to record and compare the kmemcg_id.
>
> We can simplify this by just checking the cgroup css status, skip
> if cgroup is being offlined. On the reparenting side, ensure no
> more allocation is on going and no further allocation will occur
> by using the XArray lock as barrier.
>
> Combined with a O(n^2) top-down walk for the allocation, we get rid
> of the intermediate table allocation completely. Despite being O(n^2),
> it should be actually faster because it's not practical to have a very
> deep cgroup level.
I kind of agree with you, but just instinctively.
>
> This also avoided changing kmemcg_id before reparenting, making
> cgroups have a stable index for list_lru_memcg. After this change
> it's possible that a dying cgroup will see a NULL value in XArray
> corresponding to the kmemcg_id, because the kmemcg_id will point
> to an empty slot. In such case, just fallback to use its parent.
>
> As a result the code is simpler, following test also showed a
> performance gain (6 test runs):
>
> mkdir /tmp/test-fs
> modprobe brd rd_nr=1 rd_size=16777216
> mkfs.xfs /dev/ram0
> mount -t xfs /dev/ram0 /tmp/test-fs
> worker() {
> echo TEST-CONTENT > "/tmp/test-fs/$1"
> }
> do_test() {
> for i in $(seq 1 2048); do
> (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
> worker "$i" &
> done; wait
> echo 1 > /proc/sys/vm/drop_caches
> }
> mkdir -p /sys/fs/cgroup/benchmark
> echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
> for i in $(seq 1 2048); do
> rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
> mkdir -p "/sys/fs/cgroup/benchmark/$i"
> done
> time do_test
>
> Before:
> real 0m5.932s user 0m2.366s sys 0m5.583s
> real 0m5.939s user 0m2.347s sys 0m5.597s
> real 0m6.149s user 0m2.398s sys 0m5.761s
> real 0m5.945s user 0m2.403s sys 0m5.547s
> real 0m5.925s user 0m2.293s sys 0m5.651s
> real 0m6.017s user 0m2.367s sys 0m5.686s
>
> After:
> real 0m5.712s user 0m2.343s sys 0m5.307s
> real 0m5.885s user 0m2.326s sys 0m5.518s
> real 0m5.694s user 0m2.347s sys 0m5.264s
> real 0m5.865s user 0m2.300s sys 0m5.545s
> real 0m5.748s user 0m2.273s sys 0m5.424s
> real 0m5.756s user 0m2.318s sys 0m5.398s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/list_lru.c | 182 ++++++++++++++++++++++----------------------------
> mm/zswap.c | 7 +-
> 2 files changed, 81 insertions(+), 108 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 4c619857e916..ac8aec8451dd 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> }
> return &lru->node[nid].lru;
> }
> +
> +static inline struct list_lru_one *
> +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
> +{
> + struct list_lru_one *l;
> +again:
> + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> + if (likely(l))
> + return l;
> +
> + memcg = parent_mem_cgroup(memcg);
> + WARN_ON(!css_is_dying(&memcg->css));
> + goto again;
Just want to confirm this will only loop max twice, right?
> +}
> #else
> static void list_lru_register(struct list_lru *lru)
> {
> @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> {
> return &lru->node[nid].lru;
> }
> +
> +static inline struct list_lru_one *
> +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
> +{
> + return &lru->node[nid].lru;
> +}
> #endif /* CONFIG_MEMCG_KMEM */
>
> bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
> @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>
> spin_lock(&nlru->lock);
> if (list_empty(item)) {
> - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> + l = list_lru_from_memcg(lru, nid, memcg);
> list_add_tail(item, &l->list);
> /* Set shrinker bit if the first element was added */
> if (!l->nr_items++)
> @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
>
> spin_lock(&nlru->lock);
> if (!list_empty(item)) {
> - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> + l = list_lru_from_memcg(lru, nid, memcg);
> list_del_init(item);
> l->nr_items--;
> nlru->nr_items--;
> @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
> return mlru;
> }
>
> -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
> -{
> - struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
> -
> - /*
> - * The __list_lru_walk_one() can walk the list of this node.
> - * We need kvfree_rcu() here. And the walking of the list
> - * is under lru->node[nid]->lock, which can serve as a RCU
> - * read-side critical section.
> - */
> - if (mlru)
> - kvfree_rcu(mlru, rcu);
> -}
> -
> static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
> {
> if (memcg_aware)
> @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
> }
>
> static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> - int src_idx, struct mem_cgroup *dst_memcg)
> + struct list_lru_one *src,
> + struct mem_cgroup *dst_memcg)
> {
> struct list_lru_node *nlru = &lru->node[nid];
> - int dst_idx = dst_memcg->kmemcg_id;
> - struct list_lru_one *src, *dst;
> + struct list_lru_one *dst;
>
> /*
> * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> * we have to use IRQ-safe primitives here to avoid deadlock.
> */
> spin_lock_irq(&nlru->lock);
> -
> - src = list_lru_from_memcg_idx(lru, nid, src_idx);
> - if (!src)
> - goto out;
> - dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
> + dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
>
> list_splice_init(&src->list, &dst->list);
>
> @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> src->nr_items = 0;
> }
> -out:
> spin_unlock_irq(&nlru->lock);
> }
>
> void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> - struct cgroup_subsys_state *css;
> struct list_lru *lru;
> - int src_idx = memcg->kmemcg_id, i;
> -
> - /*
> - * Change kmemcg_id of this cgroup and all its descendants to the
> - * parent's id, and then move all entries from this cgroup's list_lrus
> - * to ones of the parent.
> - */
> - rcu_read_lock();
> - css_for_each_descendant_pre(css, &memcg->css) {
> - struct mem_cgroup *child;
> -
> - child = mem_cgroup_from_css(css);
> - WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
> - }
> - rcu_read_unlock();
> + int i;
>
> - /*
> - * With kmemcg_id set to parent, holding the lru lock below can
> - * prevent list_lru_{add,del,isolate} from touching the lru, safe
> - * to reparent.
> - */
> mutex_lock(&list_lrus_mutex);
> list_for_each_entry(lru, &memcg_list_lrus, list) {
> + struct list_lru_memcg *mlru;
> + XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
> +
> + /*
> + * Lock the Xarray to ensure no on going allocation and
ongoing? I'd like to rewrite the comment here, like:
The XArray lock is used to make sure future observers will see that the current
memory cgroup is dying (i.e. css_is_dying() will return true), which
is important for memcg_list_lru_alloc() to make sure it will not allocate
a new mlru for this dying memory cgroup, whose mlru we just free below.
> + * further allocation will see css_is_dying().
> + */
> + xas_lock_irq(&xas);
> + mlru = xas_load(&xas);
> + if (mlru)
> + xas_store(&xas, NULL);
> + xas_unlock_irq(&xas);
The above block could be replaced with xa_erase_irq(), simpler, right?
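That is, roughly (just a sketch of the suggestion; it uses the same index the
XA_STATE above is initialized with, and xa_erase_irq() takes the xa_lock_irq
internally, like the old memcg_list_lru_free() did):

```c
	mlru = xa_erase_irq(&lru->xa, memcg->kmemcg_id);
```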
> + if (!mlru)
> + continue;
> +
> + /*
> + * With Xarray value set to NULL, holding the lru lock below
> + * prevents list_lru_{add,del,isolate} from touching the lru,
> + * safe to reparent.
> + */
> for_each_node(i)
> - memcg_reparent_list_lru_node(lru, i, src_idx, parent);
> + memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
>
> /*
> * Here all list_lrus corresponding to the cgroup are guaranteed
> * to remain empty, we can safely free this lru, any further
> * memcg_list_lru_alloc() call will simply bail out.
> */
> - memcg_list_lru_free(lru, src_idx);
> + kvfree_rcu(mlru, rcu);
> }
> mutex_unlock(&list_lrus_mutex);
> }
> @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
> int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
> gfp_t gfp)
> {
> - int i;
> unsigned long flags;
> - struct list_lru_memcg_table {
> - struct list_lru_memcg *mlru;
> - struct mem_cgroup *memcg;
> - } *table;
> + struct list_lru_memcg *mlru;
> + struct mem_cgroup *pos, *parent;
> XA_STATE(xas, &lru->xa, 0);
>
> if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
> return 0;
>
> gfp &= GFP_RECLAIM_MASK;
> - table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
> - if (!table)
> - return -ENOMEM;
> -
> /*
> * Because the list_lru can be reparented to the parent cgroup's
> * list_lru, we should make sure that this cgroup and all its
> * ancestors have allocated list_lru_memcg.
> */
> - for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
> - if (memcg_list_lru_allocated(memcg, lru))
> - break;
> -
> - table[i].memcg = memcg;
> - table[i].mlru = memcg_init_list_lru_one(gfp);
> - if (!table[i].mlru) {
> - while (i--)
> - kfree(table[i].mlru);
> - kfree(table);
> - return -ENOMEM;
> + do {
> + /*
> + * Keep finding the farest parent that wasn't populated
> + * until found memcg itself.
> + */
> + pos = memcg;
> + parent = parent_mem_cgroup(pos);
> + while (parent && !memcg_list_lru_allocated(parent, lru)) {
The check of @parent is unnecessary since memcg_list_lru_allocated() always
returns true for the root memory cgroup. Right?
> + pos = parent;
> + parent = parent_mem_cgroup(pos);
> }
> - }
>
> - xas_lock_irqsave(&xas, flags);
> - while (i--) {
> - int index = READ_ONCE(table[i].memcg->kmemcg_id);
> - struct list_lru_memcg *mlru = table[i].mlru;
> -
> - xas_set(&xas, index);
> -retry:
> - if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
> - kfree(mlru);
> - } else {
> - xas_store(&xas, mlru);
> - if (xas_error(&xas) == -ENOMEM) {
> + mlru = memcg_init_list_lru_one(gfp);
> + do {
> + bool alloced = false;
> +
> + xas_set(&xas, pos->kmemcg_id);
> + xas_lock_irqsave(&xas, flags);
> + if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
> + xas_store(&xas, mlru);
> + alloced = true;
> + }
> + if (!alloced || xas_error(&xas)) {
> xas_unlock_irqrestore(&xas, flags);
> - if (xas_nomem(&xas, gfp))
> - xas_set_err(&xas, 0);
> - xas_lock_irqsave(&xas, flags);
> - /*
> - * The xas lock has been released, this memcg
> - * can be reparented before us. So reload
> - * memcg id. More details see the comments
> - * in memcg_reparent_list_lrus().
> - */
> - index = READ_ONCE(table[i].memcg->kmemcg_id);
> - if (index < 0)
> - xas_set_err(&xas, 0);
> - else if (!xas_error(&xas) && index != xas.xa_index)
> - xas_set(&xas, index);
> - goto retry;
> + kfree(mlru);
> + goto out;
1) If xas_error() returns true, we should continue since xas_nomem() will handle the NOMEM
case for us.
2) If xas_load() returns a non-NULL pointer, meaning a concurrent thread has assigned a
new mlru for us, we should continue since we might not have assigned an mlru for the
leaf memory cgroup yet. Suppose the following case:
      A
     / \
    B   C
If two threads race to allocate the mlru for A (one on behalf of B, one on
behalf of C), then either B or C will not get an mlru allocated.
Here is a code snippet FYI.
Thanks.
```c
struct list_lru_memcg *mlru = NULL;

do {
	/* ... */
	if (!mlru)
		mlru = memcg_init_list_lru_one(gfp);

	xas_set(&xas, pos->kmemcg_id);
	do {
		xas_lock_irqsave(&xas, flags);
		if (css_is_dying(&pos->css) || xas_load(&xas))
			goto unlock;
		xas_store(&xas, mlru);
		if (!xas_error(&xas))
			mlru = NULL;
unlock:
		xas_unlock_irqrestore(&xas, flags);
	} while (xas_nomem(&xas, gfp));
} while (pos != memcg);

if (mlru)
	kfree(mlru);

/* ... */
```
> }
> - }
> - }
> - /* xas_nomem() is used to free memory instead of memory allocation. */
> - if (xas.xa_alloc)
> - xas_nomem(&xas, gfp);
> - xas_unlock_irqrestore(&xas, flags);
> - kfree(table);
> -
> + xas_unlock_irqrestore(&xas, flags);
> + } while (xas_nomem(&xas, gfp));
> + } while (pos != memcg);
> +out:
> return xas_error(&xas);
> }
> #else
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..c6e2256347ff 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -718,12 +718,11 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>
> /*
> * Note that it is safe to use rcu_read_lock() here, even in the face of
> - * concurrent memcg offlining. Thanks to the memcg->kmemcg_id indirection
> - * used in list_lru lookup, only two scenarios are possible:
> + * concurrent memcg offlining:
> *
> - * 1. list_lru_add() is called before memcg->kmemcg_id is updated. The
> + * 1. list_lru_add() is called before list_lru_memcg is erased. The
> * new entry will be reparented to memcg's parent's list_lru.
> - * 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
> + * 2. list_lru_add() is called after list_lru_memcg is erased. The
> * new entry will be added directly to memcg's parent's list_lru.
> *
> * Similar reasoning holds for list_lru_del().
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
2024-07-17 3:04 ` Muchun Song
@ 2024-07-18 11:49 ` Kairui Song
2024-07-19 2:45 ` Muchun Song
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2024-07-18 11:49 UTC (permalink / raw)
To: Muchun Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
On Wed, Jul 17, 2024 at 11:05 AM Muchun Song <muchun.song@linux.dev> wrote:
> > On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, there is a lot of code for detecting reparent racing
> > using kmemcg_id as the synchronization flag. And an intermediate
> > table is required to record and compare the kmemcg_id.
> >
> > We can simplify this by just checking the cgroup css status, skip
> > if cgroup is being offlined. On the reparenting side, ensure no
> > more allocation is on going and no further allocation will occur
> > by using the XArray lock as barrier.
> >
> > Combined with a O(n^2) top-down walk for the allocation, we get rid
> > of the intermediate table allocation completely. Despite being O(n^2),
> > it should be actually faster because it's not practical to have a very
> > deep cgroup level.
>
> I kind of agree with you. But just instinctually.
>
> >
> > This also avoided changing kmemcg_id before reparenting, making
> > cgroups have a stable index for list_lru_memcg. After this change
> > it's possible that a dying cgroup will see a NULL value in XArray
> > corresponding to the kmemcg_id, because the kmemcg_id will point
> > to an empty slot. In such case, just fallback to use its parent.
> >
> > As a result the code is simpler, following test also showed a
> > performance gain (6 test runs):
> >
> > mkdir /tmp/test-fs
> > modprobe brd rd_nr=1 rd_size=16777216
> > mkfs.xfs /dev/ram0
> > mount -t xfs /dev/ram0 /tmp/test-fs
> > worker() {
> > echo TEST-CONTENT > "/tmp/test-fs/$1"
> > }
> > do_test() {
> > for i in $(seq 1 2048); do
> > (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
> > worker "$i" &
> > done; wait
> > echo 1 > /proc/sys/vm/drop_caches
> > }
> > mkdir -p /sys/fs/cgroup/benchmark
> > echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
> > for i in $(seq 1 2048); do
> > rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
> > mkdir -p "/sys/fs/cgroup/benchmark/$i"
> > done
> > time do_test
> >
> > Before:
> > real 0m5.932s user 0m2.366s sys 0m5.583s
> > real 0m5.939s user 0m2.347s sys 0m5.597s
> > real 0m6.149s user 0m2.398s sys 0m5.761s
> > real 0m5.945s user 0m2.403s sys 0m5.547s
> > real 0m5.925s user 0m2.293s sys 0m5.651s
> > real 0m6.017s user 0m2.367s sys 0m5.686s
> >
> > After:
> > real 0m5.712s user 0m2.343s sys 0m5.307s
> > real 0m5.885s user 0m2.326s sys 0m5.518s
> > real 0m5.694s user 0m2.347s sys 0m5.264s
> > real 0m5.865s user 0m2.300s sys 0m5.545s
> > real 0m5.748s user 0m2.273s sys 0m5.424s
> > real 0m5.756s user 0m2.318s sys 0m5.398s
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/list_lru.c | 182 ++++++++++++++++++++++----------------------------
> > mm/zswap.c | 7 +-
> > 2 files changed, 81 insertions(+), 108 deletions(-)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 4c619857e916..ac8aec8451dd 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> > }
> > return &lru->node[nid].lru;
> > }
> > +
> > +static inline struct list_lru_one *
> > +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
> > +{
> > + struct list_lru_one *l;
> > +again:
> > + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> > + if (likely(l))
> > + return l;
> > +
> > + memcg = parent_mem_cgroup(memcg);
> > + WARN_ON(!css_is_dying(&memcg->css));
> > + goto again;
>
> Just want to confirm this will only loop max twice, right?
If the memcg is dying, it finds the nearest parent that is still
active and returns the parent's lru. If the parent is also dying, it
tries the parent's parent. It may repeat multiple times until it
reaches the root cgroup, but that is very unlikely.
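For illustration, the fallback above can be read as the following loop (a
sketch only, not part of the patch; the helpers are the ones used in the
hunk, the function name itself is made up):
```c
/*
 * Sketch only: a dying memcg may find its slot in lru->xa already
 * cleared by reparenting; in that case walk up the ancestors until a
 * populated list_lru_one is found, in the worst case stopping at the
 * root cgroup as noted above.
 */
static struct list_lru_one *
sketch_list_lru_from_memcg(struct list_lru *lru, int nid,
			   struct mem_cgroup *memcg)
{
	struct list_lru_one *l;

	while (!(l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg))))
		memcg = parent_mem_cgroup(memcg);

	return l;
}
```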
>
> > +}
> > #else
> > static void list_lru_register(struct list_lru *lru)
> > {
> > @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> > {
> > return &lru->node[nid].lru;
> > }
> > +
> > +static inline struct list_lru_one *
> > +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
> > +{
> > + return &lru->node[nid].lru;
> > +}
> > #endif /* CONFIG_MEMCG_KMEM */
> >
> > bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
> > @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
> >
> > spin_lock(&nlru->lock);
> > if (list_empty(item)) {
> > - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> > + l = list_lru_from_memcg(lru, nid, memcg);
> > list_add_tail(item, &l->list);
> > /* Set shrinker bit if the first element was added */
> > if (!l->nr_items++)
> > @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
> >
> > spin_lock(&nlru->lock);
> > if (!list_empty(item)) {
> > - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> > + l = list_lru_from_memcg(lru, nid, memcg);
> > list_del_init(item);
> > l->nr_items--;
> > nlru->nr_items--;
> > @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
> > return mlru;
> > }
> >
> > -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
> > -{
> > - struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
> > -
> > - /*
> > - * The __list_lru_walk_one() can walk the list of this node.
> > - * We need kvfree_rcu() here. And the walking of the list
> > - * is under lru->node[nid]->lock, which can serve as a RCU
> > - * read-side critical section.
> > - */
> > - if (mlru)
> > - kvfree_rcu(mlru, rcu);
> > -}
> > -
> > static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
> > {
> > if (memcg_aware)
> > @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
> > }
> >
> > static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> > - int src_idx, struct mem_cgroup *dst_memcg)
> > + struct list_lru_one *src,
> > + struct mem_cgroup *dst_memcg)
> > {
> > struct list_lru_node *nlru = &lru->node[nid];
> > - int dst_idx = dst_memcg->kmemcg_id;
> > - struct list_lru_one *src, *dst;
> > + struct list_lru_one *dst;
> >
> > /*
> > * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > * we have to use IRQ-safe primitives here to avoid deadlock.
> > */
> > spin_lock_irq(&nlru->lock);
> > -
> > - src = list_lru_from_memcg_idx(lru, nid, src_idx);
> > - if (!src)
> > - goto out;
> > - dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
> > + dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
> >
> > list_splice_init(&src->list, &dst->list);
> >
> > @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> > set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> > src->nr_items = 0;
> > }
> > -out:
> > spin_unlock_irq(&nlru->lock);
> > }
> >
> > void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> > {
> > - struct cgroup_subsys_state *css;
> > struct list_lru *lru;
> > - int src_idx = memcg->kmemcg_id, i;
> > -
> > - /*
> > - * Change kmemcg_id of this cgroup and all its descendants to the
> > - * parent's id, and then move all entries from this cgroup's list_lrus
> > - * to ones of the parent.
> > - */
> > - rcu_read_lock();
> > - css_for_each_descendant_pre(css, &memcg->css) {
> > - struct mem_cgroup *child;
> > -
> > - child = mem_cgroup_from_css(css);
> > - WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
> > - }
> > - rcu_read_unlock();
> > + int i;
> >
> > - /*
> > - * With kmemcg_id set to parent, holding the lru lock below can
> > - * prevent list_lru_{add,del,isolate} from touching the lru, safe
> > - * to reparent.
> > - */
> > mutex_lock(&list_lrus_mutex);
> > list_for_each_entry(lru, &memcg_list_lrus, list) {
> > + struct list_lru_memcg *mlru;
> > + XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
> > +
> > + /*
> > + * Lock the Xarray to ensure no on going allocation and
>
> ongoing? I'd like to rewrite the comment here, like:
>
> The XArray lock is used to make the future observers will see the current
> memory cgroup has been dying (i.e. css_is_dying() will return true), which
> is significant for memcg_list_lru_alloc() to make sure it will not allocate
> a new mlru for this died memory cgroup for which we just free it below.
Good suggestion.
>
> > + * further allocation will see css_is_dying().
> > + */
> > + xas_lock_irq(&xas);
> > + mlru = xas_load(&xas);
> > + if (mlru)
> > + xas_store(&xas, NULL);
> > + xas_unlock_irq(&xas);
>
> The above block could be replaced with xa_erase_irq(), simpler, right?
I wanted to reuse the `xas` here; it will be used by later commits,
and this helps reduce total stack usage. The saving might be trivial,
so I can use xa_erase_irq() here for simpler code and expand it back
to this form in a later patch.
>
> > + if (!mlru)
> > + continue;
> > +
> > + /*
> > + * With Xarray value set to NULL, holding the lru lock below
> > + * prevents list_lru_{add,del,isolate} from touching the lru,
> > + * safe to reparent.
> > + */
> > for_each_node(i)
> > - memcg_reparent_list_lru_node(lru, i, src_idx, parent);
> > + memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
> >
> > /*
> > * Here all list_lrus corresponding to the cgroup are guaranteed
> > * to remain empty, we can safely free this lru, any further
> > * memcg_list_lru_alloc() call will simply bail out.
> > */
> > - memcg_list_lru_free(lru, src_idx);
> > + kvfree_rcu(mlru, rcu);
> > }
> > mutex_unlock(&list_lrus_mutex);
> > }
> > @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
> > int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
> > gfp_t gfp)
> > {
> > - int i;
> > unsigned long flags;
> > - struct list_lru_memcg_table {
> > - struct list_lru_memcg *mlru;
> > - struct mem_cgroup *memcg;
> > - } *table;
> > + struct list_lru_memcg *mlru;
> > + struct mem_cgroup *pos, *parent;
> > XA_STATE(xas, &lru->xa, 0);
> >
> > if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
> > return 0;
> >
> > gfp &= GFP_RECLAIM_MASK;
> > - table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
> > - if (!table)
> > - return -ENOMEM;
> > -
> > /*
> > * Because the list_lru can be reparented to the parent cgroup's
> > * list_lru, we should make sure that this cgroup and all its
> > * ancestors have allocated list_lru_memcg.
> > */
> > - for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
> > - if (memcg_list_lru_allocated(memcg, lru))
> > - break;
> > -
> > - table[i].memcg = memcg;
> > - table[i].mlru = memcg_init_list_lru_one(gfp);
> > - if (!table[i].mlru) {
> > - while (i--)
> > - kfree(table[i].mlru);
> > - kfree(table);
> > - return -ENOMEM;
> > + do {
> > + /*
> > + * Keep finding the farest parent that wasn't populated
> > + * until found memcg itself.
> > + */
> > + pos = memcg;
> > + parent = parent_mem_cgroup(pos);
> > + while (parent && !memcg_list_lru_allocated(parent, lru)) {
>
> The check of @parent is unnecessary since memcg_list_lru_allocated() always
> return true for root memory cgroup. Right?
Right, I just wrote this to be less error prone, as
memcg_list_lru_allocated() can't handle NULL. But yeah, parent == NULL
shouldn't happen here, so the check can be removed.
>
> > + pos = parent;
> > + parent = parent_mem_cgroup(pos);
> > }
> > - }
> >
> > - xas_lock_irqsave(&xas, flags);
> > - while (i--) {
> > - int index = READ_ONCE(table[i].memcg->kmemcg_id);
> > - struct list_lru_memcg *mlru = table[i].mlru;
> > -
> > - xas_set(&xas, index);
> > -retry:
> > - if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
> > - kfree(mlru);
> > - } else {
> > - xas_store(&xas, mlru);
> > - if (xas_error(&xas) == -ENOMEM) {
> > + mlru = memcg_init_list_lru_one(gfp);
> > + do {
> > + bool alloced = false;
> > +
> > + xas_set(&xas, pos->kmemcg_id);
> > + xas_lock_irqsave(&xas, flags);
> > + if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
> > + xas_store(&xas, mlru);
> > + alloced = true;
> > + }
> > + if (!alloced || xas_error(&xas)) {
> > xas_unlock_irqrestore(&xas, flags);
> > - if (xas_nomem(&xas, gfp))
> > - xas_set_err(&xas, 0);
> > - xas_lock_irqsave(&xas, flags);
> > - /*
> > - * The xas lock has been released, this memcg
> > - * can be reparented before us. So reload
> > - * memcg id. More details see the comments
> > - * in memcg_reparent_list_lrus().
> > - */
> > - index = READ_ONCE(table[i].memcg->kmemcg_id);
> > - if (index < 0)
> > - xas_set_err(&xas, 0);
> > - else if (!xas_error(&xas) && index != xas.xa_index)
> > - xas_set(&xas, index);
> > - goto retry;
> > + kfree(mlru);
> > + goto out;
>
> 1) If xas_error() returns true, we should continue since xas_mem() will handle the NOMEM
> case for us.
Oh, my bad. I think I lost an intermediate commit for this part and
ended up exiting too early in the error case; it also needs to check
the !mlru case above and return -ENOMEM. I will update this part.
>
> 2) If xas_load() returns a non-NULL pointer meaning there is a concurrent thread has
> assigned a new mlru for us, we should continue since we might have not assigned a mlru
> for the leaf memory cgroup. Suppose the following case:
>
> A
> / \
> B C
>
> If there are two threads allocating mlru for A, then either B or C will not allocate a mlru.
> Here is a code snippet FYI.
Nice find, I forgot the case where multiple children could race for one parent.
>
> Thanks.
>
> ```c
> struct list_lru_memcg *mlru = NULL;
>
> do {
> /* ... */
> if (!mlru)
> mlru = memcg_init_list_lru_one(gfp);
>
> xas_set(&xas, pos->kmemcg_id);
> do {
> xas_lock_irqsave(&xas, flags);
> if (css_is_dying(&pos->css) || xas_load(&xas))
> goto unlock;
> xas_store(&xas, mlru);
> if (!xas_error(&xas))
> mlru = NULL;
> unlock:
> xas_unlock_irqrestore(&xas, flags);
> } while (xas_nomem(&xas, gfp));
> } while (pos != memcg);
>
> if (mlru)
> kfree(mlru);
Good suggestion. One more thing: should it abort the iteration if
the memcg is dead? Also taking the handling of memcg_init_list_lru_one()
failure into account, the loop could look like this:
+ mlru = NULL;
+ do {
+ pos = memcg;
+ parent = parent_mem_cgroup(pos);
+ while (!memcg_list_lru_allocated(parent, lru)) {
+ pos = parent;
+ parent = parent_mem_cgroup(pos);
+ }
+
+ mlru = mlru ?: memcg_init_list_lru_one(gfp);
+ if (!mlru)
+ return -ENOMEM;
+ xas_set(&xas, pos->kmemcg_id);
+ do {
+ xas_lock_irqsave(&xas, flags);
+ if (!xas_load(&xas) && !css_is_dying(&pos->css)) {
+ xas_store(&xas, mlru);
+ if (!xas_error(&xas))
+ mlru = NULL;
+ }
+ xas_unlock_irqrestore(&xas, flags);
+ } while (xas_nomem(&xas, gfp));
+ } while (!css_is_dying(&pos->css) && pos != memcg);
+
+ if (mlru)
+ kfree(mlru);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation
2024-07-18 11:49 ` Kairui Song
@ 2024-07-19 2:45 ` Muchun Song
0 siblings, 0 replies; 22+ messages in thread
From: Muchun Song @ 2024-07-19 2:45 UTC (permalink / raw)
To: Kairui Song
Cc: Linux Memory Management List, Andrew Morton, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Chris Li,
Yosry Ahmed, Huang, Ying
> On Jul 18, 2024, at 19:49, Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Jul 17, 2024 at 11:05 AM Muchun Song <muchun.song@linux.dev> wrote:
>>> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
>>>
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Currently, there is a lot of code for detecting reparent racing
>>> using kmemcg_id as the synchronization flag. And an intermediate
>>> table is required to record and compare the kmemcg_id.
>>>
>>> We can simplify this by just checking the cgroup css status, skip
>>> if cgroup is being offlined. On the reparenting side, ensure no
>>> more allocation is on going and no further allocation will occur
>>> by using the XArray lock as barrier.
>>>
>>> Combined with a O(n^2) top-down walk for the allocation, we get rid
>>> of the intermediate table allocation completely. Despite being O(n^2),
>>> it should be actually faster because it's not practical to have a very
>>> deep cgroup level.
>>
>> I kind of agree with you. But just instinctually.
>>
>>>
>>> This also avoided changing kmemcg_id before reparenting, making
>>> cgroups have a stable index for list_lru_memcg. After this change
>>> it's possible that a dying cgroup will see a NULL value in XArray
>>> corresponding to the kmemcg_id, because the kmemcg_id will point
>>> to an empty slot. In such case, just fallback to use its parent.
>>>
>>> As a result the code is simpler, following test also showed a
>>> performance gain (6 test runs):
>>>
>>> mkdir /tmp/test-fs
>>> modprobe brd rd_nr=1 rd_size=16777216
>>> mkfs.xfs /dev/ram0
>>> mount -t xfs /dev/ram0 /tmp/test-fs
>>> worker() {
>>> echo TEST-CONTENT > "/tmp/test-fs/$1"
>>> }
>>> do_test() {
>>> for i in $(seq 1 2048); do
>>> (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
>>> worker "$i" &
>>> done; wait
>>> echo 1 > /proc/sys/vm/drop_caches
>>> }
>>> mkdir -p /sys/fs/cgroup/benchmark
>>> echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
>>> for i in $(seq 1 2048); do
>>> rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
>>> mkdir -p "/sys/fs/cgroup/benchmark/$i"
>>> done
>>> time do_test
>>>
>>> Before:
>>> real 0m5.932s user 0m2.366s sys 0m5.583s
>>> real 0m5.939s user 0m2.347s sys 0m5.597s
>>> real 0m6.149s user 0m2.398s sys 0m5.761s
>>> real 0m5.945s user 0m2.403s sys 0m5.547s
>>> real 0m5.925s user 0m2.293s sys 0m5.651s
>>> real 0m6.017s user 0m2.367s sys 0m5.686s
>>>
>>> After:
>>> real 0m5.712s user 0m2.343s sys 0m5.307s
>>> real 0m5.885s user 0m2.326s sys 0m5.518s
>>> real 0m5.694s user 0m2.347s sys 0m5.264s
>>> real 0m5.865s user 0m2.300s sys 0m5.545s
>>> real 0m5.748s user 0m2.273s sys 0m5.424s
>>> real 0m5.756s user 0m2.318s sys 0m5.398s
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/list_lru.c | 182 ++++++++++++++++++++++----------------------------
>>> mm/zswap.c | 7 +-
>>> 2 files changed, 81 insertions(+), 108 deletions(-)
>>>
>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>> index 4c619857e916..ac8aec8451dd 100644
>>> --- a/mm/list_lru.c
>>> +++ b/mm/list_lru.c
>>> @@ -59,6 +59,20 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> }
>>> return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
>>> +{
>>> + struct list_lru_one *l;
>>> +again:
>>> + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> + if (likely(l))
>>> + return l;
>>> +
>>> + memcg = parent_mem_cgroup(memcg);
>>> + WARN_ON(!css_is_dying(&memcg->css));
>>> + goto again;
>>
>> Just want to confirm this will only loop max twice, right?
>
> If the memcg is dying, it finds the nearest parent that is still
> active and returns the parent's lru. If the parent is also dying, it
> tries the parent's parent. It may repeat multiple times until it
> reaches the root cgroup, but that is very unlikely.
Right.
>
>>
>>> +}
>>> #else
>>> static void list_lru_register(struct list_lru *lru)
>>> {
>>> @@ -83,6 +97,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>> {
>>> return &lru->node[nid].lru;
>>> }
>>> +
>>> +static inline struct list_lru_one *
>>> +list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
>>> +{
>>> + return &lru->node[nid].lru;
>>> +}
>>> #endif /* CONFIG_MEMCG_KMEM */
>>>
>>> bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>> @@ -93,7 +113,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
>>>
>>> spin_lock(&nlru->lock);
>>> if (list_empty(item)) {
>>> - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> + l = list_lru_from_memcg(lru, nid, memcg);
>>> list_add_tail(item, &l->list);
>>> /* Set shrinker bit if the first element was added */
>>> if (!l->nr_items++)
>>> @@ -124,7 +144,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
>>>
>>> spin_lock(&nlru->lock);
>>> if (!list_empty(item)) {
>>> - l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> + l = list_lru_from_memcg(lru, nid, memcg);
>>> list_del_init(item);
>>> l->nr_items--;
>>> nlru->nr_items--;
>>> @@ -339,20 +359,6 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
>>> return mlru;
>>> }
>>>
>>> -static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
>>> -{
>>> - struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx);
>>> -
>>> - /*
>>> - * The __list_lru_walk_one() can walk the list of this node.
>>> - * We need kvfree_rcu() here. And the walking of the list
>>> - * is under lru->node[nid]->lock, which can serve as a RCU
>>> - * read-side critical section.
>>> - */
>>> - if (mlru)
>>> - kvfree_rcu(mlru, rcu);
>>> -}
>>> -
>>> static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
>>> {
>>> if (memcg_aware)
>>> @@ -377,22 +383,18 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
>>> }
>>>
>>> static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> - int src_idx, struct mem_cgroup *dst_memcg)
>>> + struct list_lru_one *src,
>>> + struct mem_cgroup *dst_memcg)
>>> {
>>> struct list_lru_node *nlru = &lru->node[nid];
>>> - int dst_idx = dst_memcg->kmemcg_id;
>>> - struct list_lru_one *src, *dst;
>>> + struct list_lru_one *dst;
>>>
>>> /*
>>> * Since list_lru_{add,del} may be called under an IRQ-safe lock,
>>> * we have to use IRQ-safe primitives here to avoid deadlock.
>>> */
>>> spin_lock_irq(&nlru->lock);
>>> -
>>> - src = list_lru_from_memcg_idx(lru, nid, src_idx);
>>> - if (!src)
>>> - goto out;
>>> - dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
>>> + dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
>>>
>>> list_splice_init(&src->list, &dst->list);
>>>
>>> @@ -401,46 +403,45 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>> set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>>> src->nr_items = 0;
>>> }
>>> -out:
>>> spin_unlock_irq(&nlru->lock);
>>> }
>>>
>>> void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>>> {
>>> - struct cgroup_subsys_state *css;
>>> struct list_lru *lru;
>>> - int src_idx = memcg->kmemcg_id, i;
>>> -
>>> - /*
>>> - * Change kmemcg_id of this cgroup and all its descendants to the
>>> - * parent's id, and then move all entries from this cgroup's list_lrus
>>> - * to ones of the parent.
>>> - */
>>> - rcu_read_lock();
>>> - css_for_each_descendant_pre(css, &memcg->css) {
>>> - struct mem_cgroup *child;
>>> -
>>> - child = mem_cgroup_from_css(css);
>>> - WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id);
>>> - }
>>> - rcu_read_unlock();
>>> + int i;
>>>
>>> - /*
>>> - * With kmemcg_id set to parent, holding the lru lock below can
>>> - * prevent list_lru_{add,del,isolate} from touching the lru, safe
>>> - * to reparent.
>>> - */
>>> mutex_lock(&list_lrus_mutex);
>>> list_for_each_entry(lru, &memcg_list_lrus, list) {
>>> + struct list_lru_memcg *mlru;
>>> + XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
>>> +
>>> + /*
>>> + * Lock the Xarray to ensure no on going allocation and
>>
>> ongoing? I'd like to rewrite the comment here, like:
>>
>> The XArray lock is used to make the future observers will see the current
>> memory cgroup has been dying (i.e. css_is_dying() will return true), which
>> is significant for memcg_list_lru_alloc() to make sure it will not allocate
>> a new mlru for this died memory cgroup for which we just free it below.
>
> Good suggestion.
>
>>
>>> + * further allocation will see css_is_dying().
>>> + */
>>> + xas_lock_irq(&xas);
>>> + mlru = xas_load(&xas);
>>> + if (mlru)
>>> + xas_store(&xas, NULL);
>>> + xas_unlock_irq(&xas);
>>
>> The above block could be replaced with xa_erase_irq(), simpler, right?
>
> I wanted to reuse the `xas` here; it will be used by later commits,
> and this helps reduce total stack usage. The saving might be trivial,
> so I can use xa_erase_irq() here for simpler code and expand it back
> to this form in a later patch.
Yes.
>
>>
>>> + if (!mlru)
>>> + continue;
>>> +
>>> + /*
>>> + * With Xarray value set to NULL, holding the lru lock below
>>> + * prevents list_lru_{add,del,isolate} from touching the lru,
>>> + * safe to reparent.
>>> + */
>>> for_each_node(i)
>>> - memcg_reparent_list_lru_node(lru, i, src_idx, parent);
>>> + memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
>>>
>>> /*
>>> * Here all list_lrus corresponding to the cgroup are guaranteed
>>> * to remain empty, we can safely free this lru, any further
>>> * memcg_list_lru_alloc() call will simply bail out.
>>> */
>>> - memcg_list_lru_free(lru, src_idx);
>>> + kvfree_rcu(mlru, rcu);
>>> }
>>> mutex_unlock(&list_lrus_mutex);
>>> }
>>> @@ -456,78 +457,51 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
>>> int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
>>> gfp_t gfp)
>>> {
>>> - int i;
>>> unsigned long flags;
>>> - struct list_lru_memcg_table {
>>> - struct list_lru_memcg *mlru;
>>> - struct mem_cgroup *memcg;
>>> - } *table;
>>> + struct list_lru_memcg *mlru;
>>> + struct mem_cgroup *pos, *parent;
>>> XA_STATE(xas, &lru->xa, 0);
>>>
>>> if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
>>> return 0;
>>>
>>> gfp &= GFP_RECLAIM_MASK;
>>> - table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp);
>>> - if (!table)
>>> - return -ENOMEM;
>>> -
>>> /*
>>> * Because the list_lru can be reparented to the parent cgroup's
>>> * list_lru, we should make sure that this cgroup and all its
>>> * ancestors have allocated list_lru_memcg.
>>> */
>>> - for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
>>> - if (memcg_list_lru_allocated(memcg, lru))
>>> - break;
>>> -
>>> - table[i].memcg = memcg;
>>> - table[i].mlru = memcg_init_list_lru_one(gfp);
>>> - if (!table[i].mlru) {
>>> - while (i--)
>>> - kfree(table[i].mlru);
>>> - kfree(table);
>>> - return -ENOMEM;
>>> + do {
>>> + /*
>>> + * Keep finding the farest parent that wasn't populated
>>> + * until found memcg itself.
>>> + */
>>> + pos = memcg;
>>> + parent = parent_mem_cgroup(pos);
>>> + while (parent && !memcg_list_lru_allocated(parent, lru)) {
>>
>> The check of @parent is unnecessary since memcg_list_lru_allocated() always
>> return true for root memory cgroup. Right?
>
> Right, I just wrote this to be less error prone, as
> memcg_list_lru_allocated() can't handle NULL. But yeah, parent == NULL
> shouldn't happen here, so the check can be removed.
>
>>
>>> + pos = parent;
>>> + parent = parent_mem_cgroup(pos);
>>> }
>>> - }
>>>
>>> - xas_lock_irqsave(&xas, flags);
>>> - while (i--) {
>>> - int index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> - struct list_lru_memcg *mlru = table[i].mlru;
>>> -
>>> - xas_set(&xas, index);
>>> -retry:
>>> - if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) {
>>> - kfree(mlru);
>>> - } else {
>>> - xas_store(&xas, mlru);
>>> - if (xas_error(&xas) == -ENOMEM) {
>>> + mlru = memcg_init_list_lru_one(gfp);
>>> + do {
>>> + bool alloced = false;
>>> +
>>> + xas_set(&xas, pos->kmemcg_id);
>>> + xas_lock_irqsave(&xas, flags);
>>> + if (!css_is_dying(&pos->css) && !xas_load(&xas)) {
>>> + xas_store(&xas, mlru);
>>> + alloced = true;
>>> + }
>>> + if (!alloced || xas_error(&xas)) {
>>> xas_unlock_irqrestore(&xas, flags);
>>> - if (xas_nomem(&xas, gfp))
>>> - xas_set_err(&xas, 0);
>>> - xas_lock_irqsave(&xas, flags);
>>> - /*
>>> - * The xas lock has been released, this memcg
>>> - * can be reparented before us. So reload
>>> - * memcg id. More details see the comments
>>> - * in memcg_reparent_list_lrus().
>>> - */
>>> - index = READ_ONCE(table[i].memcg->kmemcg_id);
>>> - if (index < 0)
>>> - xas_set_err(&xas, 0);
>>> - else if (!xas_error(&xas) && index != xas.xa_index)
>>> - xas_set(&xas, index);
>>> - goto retry;
>>> + kfree(mlru);
>>> + goto out;
>>
>> 1) If xas_error() returns true, we should continue since xas_mem() will handle the NOMEM
>> case for us.
>
> Oh, my bad. I think I lost an intermediate commit for this part and
> ended up exiting too early in the error case; it also needs to check
> the !mlru case above and return -ENOMEM. I will update this part.
>
>>
>> 2) If xas_load() returns a non-NULL pointer meaning there is a concurrent thread has
>> assigned a new mlru for us, we should continue since we might have not assigned a mlru
>> for the leaf memory cgroup. Suppose the following case:
>>
>> A
>> / \
>> B C
>>
>> If there are two threads allocating mlru for A, then either B or C will not allocate a mlru.
>> Here is a code snippet FYI.
>
> Nice find, I forgot the case where multiple children could race for one parent.
>
>>
>> Thanks.
>>
>> ```c
>> struct list_lru_memcg *mlru = NULL;
>>
>> do {
>> /* ... */
>> if (!mlru)
>> mlru = memcg_init_list_lru_one(gfp);
>>
>> xas_set(&xas, pos->kmemcg_id);
>> do {
>> xas_lock_irqsave(&xas, flags);
>> if (css_is_dying(&pos->css) || xas_load(&xas))
>> goto unlock;
>> xas_store(&xas, mlru);
>> if (!xas_error(&xas))
>> mlru = NULL;
>> unlock:
>> xas_unlock_irqrestore(&xas, flags);
>> } while (xas_nomem(&xas, gfp));
>> } while (pos != memcg);
>>
>> if (mlru)
>> kfree(mlru);
>
> Good suggestion. One more thing: should it abort the iteration if
> the memcg is dead? Also taking the handling of memcg_init_list_lru_one()
> failure into account, the loop could look like this:
Agree.
>
> + mlru = NULL;
> + do {
> + pos = memcg;
> + parent = parent_mem_cgroup(pos);
> + while (!memcg_list_lru_allocated(parent, lru)) {
> + pos = parent;
> + parent = parent_mem_cgroup(pos);
> + }
> +
> + mlru = mlru ?: memcg_init_list_lru_one(gfp);
> + if (!mlru)
> + return -ENOMEM;
> + xas_set(&xas, pos->kmemcg_id);
> + do {
> + xas_lock_irqsave(&xas, flags);
> + if (!xas_load(&xas) && !css_is_dying(&pos->css)) {
> + xas_store(&xas, mlru);
> + if (!xas_error(&xas))
> + mlru = NULL;
> + }
> + xas_unlock_irqrestore(&xas, flags);
> + } while (xas_nomem(&xas, gfp));
> + } while (!css_is_dying(&pos->css) && pos != memcg);
> +
> + if (mlru)
> + kfree(mlru);
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 6/7] mm/list_lru: split the lock to per-cgroup scope
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (4 preceding siblings ...)
2024-06-24 17:53 ` [PATCH 5/7] mm/list_lru: simplify reparenting and initial allocation Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-06-24 17:53 ` [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function Kairui Song
` (2 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, every list_lru has a per-node lock that protects addition,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. Contention on this lock is
heavy when multiple cgroups modify the same list_lru.
This lock can be split into per-cgroup scope to reduce contention.
To achieve this, we need a stable list_lru_one for every cgroup.
This commit adds a lock to each list_lru_one and introduces a
helper function, lock_list_lru_of_memcg(), making it possible to pin
the list_lru of a memcg. The reparenting process is then reworked.
Reparenting will switch the list_lru_one instances one by one.
By locking each instance and marking it dead using the nr_items
counter, reparenting ensures that all items in the corresponding
cgroup (on-list or not, because items have a stable cgroup, see below)
will see the list_lru_one switch synchronously.
Objcg reparenting is also moved after list_lru reparenting so that items
have a stable memcg until all list_lru_one instances are drained.
The only callers that don't go through the *_obj interfaces are the direct
calls to list_lru_{add,del}. Those are only used by zswap, which is
also objcg based, so it's fine.
This also changes the behaviour of the isolation function when
LRU_RETRY or LRU_REMOVED_RETRY is returned: because releasing the
lock could now unblock reparenting and free the list_lru_one, the
isolation function has to return without re-locking the lru.
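To make the reparent-and-mark-dead step described above concrete, here is a
condensed sketch following memcg_reparent_list_lru_one() in the diff below
(only the function name is made up; lockdep and shrinker details are as in
the actual hunk):
```c
/*
 * Sketch: splice the child's items into the parent's list under both
 * per-cgroup locks, then mark the child's list_lru_one dead
 * (nr_items = LONG_MIN) so that lock_list_lru_of_memcg() falls back
 * to the parent from now on.
 */
static void sketch_reparent_one(struct list_lru *lru, int nid,
				struct list_lru_one *src,
				struct mem_cgroup *dst_memcg)
{
	struct list_lru_one *dst;

	spin_lock_irq(&src->lock);
	dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
	spin_lock_nested(&dst->lock, SINGLE_DEPTH_NESTING);

	list_splice_init(&src->list, &dst->list);
	if (src->nr_items) {
		dst->nr_items += src->nr_items;
		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
	}
	/* mark the source list_lru_one dead */
	src->nr_items = LONG_MIN;

	spin_unlock(&dst->lock);
	spin_unlock_irq(&src->lock);
}
```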
I see a 25% performance gain for SWAP after this change:
modprobe zram
echo 128G > /sys/block/zram0/disksize
mkswap /dev/zram0; swapon /dev/zram0
for i in $(seq 1 64); do
mkdir /sys/fs/cgroup/benchmark-$i
echo 64M > /sys/fs/cgroup/benchmark-$i/memory.max
done
do_test() {
for i in $(seq 1 64); do
cd /sys/fs/cgroup/benchmark-$i
(exec sh -c 'echo "$PPID"') > cgroup.procs
memhog 1G &
done; wait
}
time do_test
Before:
real 0m20.328s user 0m4.315s sys 10m23.639s
real 0m20.440s user 0m4.142s sys 10m34.756s
real 0m20.381s user 0m4.164s sys 10m29.035s
After:
real 0m15.156s user 0m4.590s sys 7m34.361s
real 0m15.161s user 0m4.776s sys 7m35.086s
real 0m15.429s user 0m4.734s sys 7m42.919s
Similar performance gain with inode / dentry workload:
prepare() {
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=16777216
mkfs.xfs -f /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
for i in $(seq 1 256); do
mkdir "/tmp/test-fs/$i"
for j in $(seq 1 10000); do
echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
done
done
}
do_test() {
read_worker() {
for j in $(seq 1 10000); do
read -r __TMP < "/tmp/test-fs/$1/$j"
done
}
read_in_all() {
for i in $(seq 1 256); do
(exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
read_worker "$i" &
done; wait
}
echo 3 > /proc/sys/vm/drop_caches
mkdir -p /sys/fs/cgroup/benchmark
echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
echo 512M > /sys/fs/cgroup/benchmark/memory.max
for i in $(seq 1 256); do
rmdir "/sys/fs/cgroup/benchmark/$i" &>/dev/null
mkdir -p "/sys/fs/cgroup/benchmark/$i"
done
}
Before this series:
real 0m26.939s user 0m36.322s sys 6m30.248s
real 0m15.111s user 0m33.749s sys 5m4.991s
real 0m16.796s user 0m33.438s sys 5m22.865s
real 0m15.256s user 0m34.060s sys 4m56.870s
real 0m14.826s user 0m33.531s sys 4m55.907s
real 0m15.664s user 0m35.619s sys 6m3.638s
real 0m15.746s user 0m34.066s sys 4m56.519s
After this commit (>10%):
real 0m22.166s user 0m35.155s sys 6m21.045s
real 0m13.753s user 0m34.554s sys 4m40.982s
real 0m13.815s user 0m34.693s sys 4m39.605s
real 0m13.495s user 0m34.372s sys 4m40.776s
real 0m13.895s user 0m34.005s sys 4m39.061s
real 0m13.629s user 0m33.476s sys 4m43.626s
real 0m14.001s user 0m33.463s sys 4m41.261s
Signed-off-by: Kairui Song <kasong@tencent.com>
---
drivers/android/binder_alloc.c | 1 -
fs/inode.c | 1 -
fs/xfs/xfs_qm.c | 1 -
include/linux/list_lru.h | 6 +-
mm/list_lru.c | 224 +++++++++++++++++++--------------
mm/memcontrol.c | 7 +-
mm/workingset.c | 1 -
mm/zswap.c | 5 +-
8 files changed, 141 insertions(+), 105 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 2e1f261ec5c8..dd47d621e561 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1106,7 +1106,6 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
mmput_async(mm);
__free_page(page_to_free);
- spin_lock(lock);
return LRU_REMOVED_RETRY;
err_invalid_vma:
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..35da4e54e365 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -856,7 +856,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
mm_account_reclaimed_pages(reap);
}
iput(inode);
- spin_lock(lru_lock);
return LRU_RETRY;
}
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 47120b745c47..8d17099765ae 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -496,7 +496,6 @@ xfs_qm_dquot_isolate(
trace_xfs_dqreclaim_busy(dqp);
XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
xfs_dqunlock(dqp);
- spin_lock(lru_lock);
return LRU_RETRY;
}
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 2e5132905f42..b84483ef93a7 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -32,6 +32,8 @@ struct list_lru_one {
struct list_head list;
/* may become negative during memcg reparenting */
long nr_items;
+ /* protects all fields above */
+ spinlock_t lock;
};
struct list_lru_memcg {
@@ -41,11 +43,9 @@ struct list_lru_memcg {
};
struct list_lru_node {
- /* protects all lists on the node, including per cgroup */
- spinlock_t lock;
/* global list, used for the root cgroup in cgroup aware lrus */
struct list_lru_one lru;
- long nr_items;
+ atomic_long_t nr_items;
} ____cacheline_aligned_in_smp;
struct list_lru {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index ac8aec8451dd..c503921cbb13 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -61,18 +61,48 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
}
static inline struct list_lru_one *
-list_lru_from_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg)
+lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
+ bool irq, bool skip_empty)
{
struct list_lru_one *l;
+ rcu_read_lock();
again:
l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
- if (likely(l))
- return l;
-
- memcg = parent_mem_cgroup(memcg);
+ if (likely(l)) {
+ if (irq)
+ spin_lock_irq(&l->lock);
+ else
+ spin_lock(&l->lock);
+ if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
+ WARN_ON(l->nr_items < 0);
+ rcu_read_unlock();
+ return l;
+ }
+ if (irq)
+ spin_unlock_irq(&l->lock);
+ else
+ spin_unlock(&l->lock);
+ }
+ /*
+ * Caller may simply bail out if raced with reparenting or
+ * may iterate through the list_lru and expect empty slots.
+ */
+ if (skip_empty) {
+ rcu_read_unlock();
+ return NULL;
+ }
WARN_ON(!css_is_dying(&memcg->css));
+ memcg = parent_mem_cgroup(memcg);
goto again;
}
+
+static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
+{
+ if (irq_off)
+ spin_unlock_irq(&l->lock);
+ else
+ spin_unlock(&l->lock);
+}
#else
static void list_lru_register(struct list_lru *lru)
{
@@ -99,30 +129,47 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
}
static inline struct list_lru_one *
-list_lru_from_memcg(struct list_lru *lru, int nid, int idx)
+lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
+ bool irq, bool skip_empty)
{
- return &lru->node[nid].lru;
+ struct list_lru_one *l = &lru->node[nid].lru;
+
+ if (irq)
+ spin_lock_irq(&l->lock);
+ else
+ spin_lock(&l->lock);
+
+ return l;
+}
+
+static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
+{
+ if (irq_off)
+ spin_unlock_irq(&l->lock);
+ else
+ spin_unlock(&l->lock);
}
#endif /* CONFIG_MEMCG_KMEM */
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg)
{
struct list_lru_node *nlru = &lru->node[nid];
struct list_lru_one *l;
- spin_lock(&nlru->lock);
+ l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
+ if (!l)
+ return false;
if (list_empty(item)) {
- l = list_lru_from_memcg(lru, nid, memcg);
list_add_tail(item, &l->list);
/* Set shrinker bit if the first element was added */
if (!l->nr_items++)
set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
- nlru->nr_items++;
- spin_unlock(&nlru->lock);
+ unlock_list_lru(l, false);
+ atomic_long_inc(&nlru->nr_items);
return true;
}
- spin_unlock(&nlru->lock);
+ unlock_list_lru(l, false);
return false;
}
@@ -137,24 +184,23 @@ bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
EXPORT_SYMBOL_GPL(list_lru_add_obj);
bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg)
{
struct list_lru_node *nlru = &lru->node[nid];
struct list_lru_one *l;
-
- spin_lock(&nlru->lock);
+ l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
+ if (!l)
+ return false;
if (!list_empty(item)) {
- l = list_lru_from_memcg(lru, nid, memcg);
list_del_init(item);
l->nr_items--;
- nlru->nr_items--;
- spin_unlock(&nlru->lock);
+ unlock_list_lru(l, false);
+ atomic_long_dec(&nlru->nr_items);
return true;
}
- spin_unlock(&nlru->lock);
+ unlock_list_lru(l, false);
return false;
}
-EXPORT_SYMBOL_GPL(list_lru_del);
bool list_lru_del_obj(struct list_lru *lru, struct list_head *item)
{
@@ -204,25 +250,24 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
struct list_lru_node *nlru;
nlru = &lru->node[nid];
- return nlru->nr_items;
+ return atomic_long_read(&nlru->nr_items);
}
EXPORT_SYMBOL_GPL(list_lru_count_node);
static unsigned long
-__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
+__list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
list_lru_walk_cb isolate, void *cb_arg,
- unsigned long *nr_to_walk)
+ unsigned long *nr_to_walk, bool irq_off)
{
struct list_lru_node *nlru = &lru->node[nid];
- struct list_lru_one *l;
+ struct list_lru_one *l = NULL;
struct list_head *item, *n;
unsigned long isolated = 0;
restart:
- l = list_lru_from_memcg_idx(lru, nid, memcg_idx);
+ l = lock_list_lru_of_memcg(lru, nid, memcg, irq_off, true);
if (!l)
- goto out;
-
+ return isolated;
list_for_each_safe(item, n, &l->list) {
enum lru_status ret;
@@ -234,19 +279,20 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
break;
--*nr_to_walk;
- ret = isolate(item, l, &nlru->lock, cb_arg);
+ ret = isolate(item, l, &l->lock, cb_arg);
switch (ret) {
+ /*
+ * LRU_RETRY and LRU_REMOVED_RETRY will drop the lru lock,
+ * the list traversal will be invalid and have to restart from
+ * scratch.
+ */
+ case LRU_RETRY:
+ goto restart;
case LRU_REMOVED_RETRY:
- assert_spin_locked(&nlru->lock);
fallthrough;
case LRU_REMOVED:
isolated++;
- nlru->nr_items--;
- /*
- * If the lru lock has been dropped, our list
- * traversal is now invalid and so we have to
- * restart from scratch.
- */
+ atomic_long_dec(&nlru->nr_items);
if (ret == LRU_REMOVED_RETRY)
goto restart;
break;
@@ -255,21 +301,15 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
break;
case LRU_SKIP:
break;
- case LRU_RETRY:
- /*
- * The lru lock has been dropped, our list traversal is
- * now invalid and so we have to restart from scratch.
- */
- assert_spin_locked(&nlru->lock);
- goto restart;
case LRU_STOP:
- assert_spin_locked(&nlru->lock);
+ assert_spin_locked(&l->lock);
goto out;
default:
BUG();
}
}
out:
+ unlock_list_lru(l, irq_off);
return isolated;
}
@@ -278,14 +318,8 @@ list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
list_lru_walk_cb isolate, void *cb_arg,
unsigned long *nr_to_walk)
{
- struct list_lru_node *nlru = &lru->node[nid];
- unsigned long ret;
-
- spin_lock(&nlru->lock);
- ret = __list_lru_walk_one(lru, nid, memcg_kmem_id(memcg), isolate,
- cb_arg, nr_to_walk);
- spin_unlock(&nlru->lock);
- return ret;
+ return __list_lru_walk_one(lru, nid, memcg, isolate,
+ cb_arg, nr_to_walk, false);
}
EXPORT_SYMBOL_GPL(list_lru_walk_one);
@@ -294,14 +328,8 @@ list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
list_lru_walk_cb isolate, void *cb_arg,
unsigned long *nr_to_walk)
{
- struct list_lru_node *nlru = &lru->node[nid];
- unsigned long ret;
-
- spin_lock_irq(&nlru->lock);
- ret = __list_lru_walk_one(lru, nid, memcg_kmem_id(memcg), isolate,
- cb_arg, nr_to_walk);
- spin_unlock_irq(&nlru->lock);
- return ret;
+ return __list_lru_walk_one(lru, nid, memcg, isolate,
+ cb_arg, nr_to_walk, true);
}
unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
@@ -316,16 +344,21 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
#ifdef CONFIG_MEMCG_KMEM
if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
struct list_lru_memcg *mlru;
+ struct mem_cgroup *memcg;
unsigned long index;
xa_for_each(&lru->xa, index, mlru) {
- struct list_lru_node *nlru = &lru->node[nid];
-
- spin_lock(&nlru->lock);
- isolated += __list_lru_walk_one(lru, nid, index,
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(index);
+ if (!mem_cgroup_tryget(memcg)) {
+ rcu_read_unlock();
+ continue;
+ }
+ rcu_read_unlock();
+ isolated += __list_lru_walk_one(lru, nid, memcg,
isolate, cb_arg,
- nr_to_walk);
- spin_unlock(&nlru->lock);
+ nr_to_walk, false);
+ mem_cgroup_put(memcg);
if (*nr_to_walk <= 0)
break;
@@ -337,14 +370,19 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
}
EXPORT_SYMBOL_GPL(list_lru_walk_node);
-static void init_one_lru(struct list_lru_one *l)
+static void init_one_lru(struct list_lru *lru, struct list_lru_one *l)
{
INIT_LIST_HEAD(&l->list);
+ spin_lock_init(&l->lock);
l->nr_items = 0;
+#ifdef CONFIG_LOCKDEP
+ if (lru->key)
+ lockdep_set_class(&l->lock, lru->key);
+#endif
}
#ifdef CONFIG_MEMCG_KMEM
-static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
+static struct list_lru_memcg *memcg_init_list_lru_one(struct list_lru *lru, gfp_t gfp)
{
int nid;
struct list_lru_memcg *mlru;
@@ -354,7 +392,7 @@ static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp)
return NULL;
for_each_node(nid)
- init_one_lru(&mlru->node[nid]);
+ init_one_lru(lru, &mlru->node[nid]);
return mlru;
}
@@ -382,28 +420,27 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
xas_unlock_irq(&xas);
}
-static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
- struct list_lru_one *src,
- struct mem_cgroup *dst_memcg)
+static void memcg_reparent_list_lru_one(struct list_lru *lru, int nid,
+ struct list_lru_one *src,
+ struct mem_cgroup *dst_memcg)
{
- struct list_lru_node *nlru = &lru->node[nid];
+ int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *dst;
- /*
- * Since list_lru_{add,del} may be called under an IRQ-safe lock,
- * we have to use IRQ-safe primitives here to avoid deadlock.
- */
- spin_lock_irq(&nlru->lock);
- dst = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(dst_memcg));
+ spin_lock_irq(&src->lock);
+ dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
+ spin_lock_nested(&dst->lock, SINGLE_DEPTH_NESTING);
list_splice_init(&src->list, &dst->list);
-
if (src->nr_items) {
dst->nr_items += src->nr_items;
set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
- src->nr_items = 0;
}
- spin_unlock_irq(&nlru->lock);
+ /* Mark the list_lru_one dead */
+ src->nr_items = LONG_MIN;
+
+ spin_unlock(&dst->lock);
+ spin_unlock_irq(&src->lock);
}
void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
@@ -422,8 +459,6 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
*/
xas_lock_irq(&xas);
mlru = xas_load(&xas);
- if (mlru)
- xas_store(&xas, NULL);
xas_unlock_irq(&xas);
if (!mlru)
continue;
@@ -434,13 +469,20 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
* safe to reparent.
*/
for_each_node(i)
- memcg_reparent_list_lru_node(lru, i, &mlru->node[i], parent);
+ memcg_reparent_list_lru_one(lru, i, &mlru->node[i], parent);
/*
* Here all list_lrus corresponding to the cgroup are guaranteed
* to remain empty, we can safely free this lru, any further
* memcg_list_lru_alloc() call will simply bail out.
+ *
+ * To ensure callers see a stable list_lru_one, have to set it
+ * to NULL after memcg_reparent_list_lru_one().
*/
+ xas_lock_irq(&xas);
+ xas_reset(&xas);
+ xas_store(&xas, NULL);
+ xas_unlock_irq(&xas);
kvfree_rcu(mlru, rcu);
}
mutex_unlock(&list_lrus_mutex);
@@ -483,7 +525,7 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
parent = parent_mem_cgroup(pos);
}
- mlru = memcg_init_list_lru_one(gfp);
+ mlru = memcg_init_list_lru_one(lru, gfp);
do {
bool alloced = false;
@@ -532,14 +574,8 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, struct shrinker *shr
if (!lru->node)
return -ENOMEM;
- for_each_node(i) {
- spin_lock_init(&lru->node[i].lock);
-#ifdef CONFIG_LOCKDEP
- if (lru->key)
- lockdep_set_class(&lru->node[i].lock, lru->key);
-#endif
- init_one_lru(&lru->node[i].lru);
- }
+ for_each_node(i)
+ init_one_lru(lru, &lru->node[i].lru);
memcg_init_list_lru(lru, memcg_aware);
list_lru_register(lru);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc35c1d3f109..945290a53bf1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4186,8 +4186,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;
- memcg_reparent_objcgs(memcg, parent);
memcg_reparent_list_lrus(memcg, parent);
+
+ /*
+ * Objcg's reparenting must be after list_lru's, make sure list_lru
+ * helpers won't use parent's list_lru until child is drained.
+ */
+ memcg_reparent_objcgs(memcg, parent);
}
#else
static int memcg_online_kmem(struct mem_cgroup *memcg)
diff --git a/mm/workingset.c b/mm/workingset.c
index 1801fbe5183c..947423c3e719 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -769,7 +769,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
ret = LRU_REMOVED_RETRY;
out:
cond_resched();
- spin_lock_irq(lru_lock);
return ret;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index c6e2256347ff..f7a2afaeea53 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -720,9 +720,9 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
* Note that it is safe to use rcu_read_lock() here, even in the face of
* concurrent memcg offlining:
*
- * 1. list_lru_add() is called before list_lru_memcg is erased. The
+ * 1. list_lru_add() is called before list_lru_one is dead. The
* new entry will be reparented to memcg's parent's list_lru.
- * 2. list_lru_add() is called after list_lru_memcg is erased. The
+ * 2. list_lru_add() is called after list_lru_one is dead. The
* new entry will be added directly to memcg's parent's list_lru.
*
* Similar reasoning holds for list_lru_del().
@@ -1164,7 +1164,6 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
zswap_written_back_pages++;
}
- spin_lock(lock);
return ret;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (5 preceding siblings ...)
2024-06-24 17:53 ` [PATCH 6/7] mm/list_lru: split the lock to per-cgroup scope Kairui Song
@ 2024-06-24 17:53 ` Kairui Song
2024-06-27 19:58 ` kernel test robot
2024-06-24 21:26 ` [PATCH 0/7] Split list_lru lock into per-cgroup scope Andrew Morton
2024-08-27 18:35 ` Shakeel Butt
8 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2024-06-24 17:53 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now isolation no longer takes the list_lru global node lock, only the
per-cgroup lock instead. And since this lock is embedded in the
list_lru_one being walked, there is no longer any need to pass the lock
explicitly.
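A minimal sketch of a walk callback under the new prototype, condensed from
the inode/xfs hunks below (the object_* and do_blocking_work helpers are
purely illustrative, not real kernel functions):
```c
/*
 * Sketch: the per-cgroup lock is reached through the list_lru_one
 * itself; if the callback drops it, it returns LRU_RETRY or
 * LRU_REMOVED_RETRY without taking it again -- the walker restarts
 * the traversal instead.
 */
static enum lru_status example_isolate(struct list_head *item,
				       struct list_lru_one *lru, void *cb_arg)
{
	struct list_head *dispose = cb_arg;

	if (object_is_busy(item))		/* illustrative helper */
		return LRU_SKIP;

	if (object_needs_blocking_work(item)) {	/* illustrative helper */
		spin_unlock(&lru->lock);	/* was: spin_unlock(lru_lock) */
		do_blocking_work(item);		/* illustrative helper */
		return LRU_RETRY;		/* no re-lock before returning */
	}

	list_lru_isolate_move(lru, item, dispose);
	return LRU_REMOVED;
}
```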
Signed-off-by: Kairui Song <kasong@tencent.com>
---
drivers/android/binder_alloc.c | 5 ++---
drivers/android/binder_alloc.h | 2 +-
fs/dcache.c | 4 ++--
fs/gfs2/quota.c | 2 +-
fs/inode.c | 4 ++--
fs/nfs/nfs42xattr.c | 4 ++--
fs/nfsd/filecache.c | 5 +----
fs/xfs/xfs_buf.c | 2 --
fs/xfs/xfs_qm.c | 5 ++---
include/linux/list_lru.h | 2 +-
mm/list_lru.c | 2 +-
mm/workingset.c | 15 +++++++--------
mm/zswap.c | 4 ++--
13 files changed, 24 insertions(+), 32 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index dd47d621e561..c55cce54f20c 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1055,9 +1055,8 @@ void binder_alloc_vma_close(struct binder_alloc *alloc)
*/
enum lru_status binder_alloc_free_page(struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lock,
void *cb_arg)
- __must_hold(lock)
+ __must_hold(&lru->lock)
{
struct binder_lru_page *page = container_of(item, typeof(*page), lru);
struct binder_alloc *alloc = page->alloc;
@@ -1092,7 +1091,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
list_lru_isolate(lru, item);
spin_unlock(&alloc->lock);
- spin_unlock(lock);
+ spin_unlock(&lru->lock);
if (vma) {
trace_binder_unmap_user_start(alloc, index);
diff --git a/drivers/android/binder_alloc.h b/drivers/android/binder_alloc.h
index 70387234477e..c02c8ebcb466 100644
--- a/drivers/android/binder_alloc.h
+++ b/drivers/android/binder_alloc.h
@@ -118,7 +118,7 @@ static inline void binder_selftest_alloc(struct binder_alloc *alloc) {}
#endif
enum lru_status binder_alloc_free_page(struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lock, void *cb_arg);
+ void *cb_arg);
struct binder_buffer *binder_alloc_new_buf(struct binder_alloc *alloc,
size_t data_size,
size_t offsets_size,
diff --git a/fs/dcache.c b/fs/dcache.c
index 407095188f83..4e5f8382ee3f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1077,7 +1077,7 @@ void shrink_dentry_list(struct list_head *list)
}
static enum lru_status dentry_lru_isolate(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *freeable = arg;
struct dentry *dentry = container_of(item, struct dentry, d_lru);
@@ -1158,7 +1158,7 @@ long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
}
static enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *freeable = arg;
struct dentry *dentry = container_of(item, struct dentry, d_lru);
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index aa9cf0102848..31aece125a75 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -152,7 +152,7 @@ static void gfs2_qd_list_dispose(struct list_head *list)
static enum lru_status gfs2_qd_isolate(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *dispose = arg;
struct gfs2_quota_data *qd =
diff --git a/fs/inode.c b/fs/inode.c
index 35da4e54e365..1fb52253a843 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -803,7 +803,7 @@ void invalidate_inodes(struct super_block *sb)
* with this flag set because they are the inodes that are out of order.
*/
static enum lru_status inode_lru_isolate(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *freeable = arg;
struct inode *inode = container_of(item, struct inode, i_lru);
@@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(lru_lock);
+ spin_unlock(&lru->lock);
if (remove_inode_buffers(inode)) {
unsigned long reap;
reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c
index b6e3d8f77b91..37d79400e5f4 100644
--- a/fs/nfs/nfs42xattr.c
+++ b/fs/nfs/nfs42xattr.c
@@ -802,7 +802,7 @@ static struct shrinker *nfs4_xattr_large_entry_shrinker;
static enum lru_status
cache_lru_isolate(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *dispose = arg;
struct inode *inode;
@@ -867,7 +867,7 @@ nfs4_xattr_cache_count(struct shrinker *shrink, struct shrink_control *sc)
static enum lru_status
entry_lru_isolate(struct list_head *item,
- struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+ struct list_lru_one *lru, void *arg)
{
struct list_head *dispose = arg;
struct nfs4_xattr_bucket *bucket;
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index ad9083ca144b..f68c4a1c529f 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -456,7 +456,6 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
* nfsd_file_lru_cb - Examine an entry on the LRU list
* @item: LRU entry to examine
* @lru: controlling LRU
- * @lock: LRU list lock (unused)
* @arg: dispose list
*
* Return values:
@@ -466,9 +465,7 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
*/
static enum lru_status
nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
- spinlock_t *lock, void *arg)
- __releases(lock)
- __acquires(lock)
+ void *arg)
{
struct list_head *head = arg;
struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index aa4dbda7b536..43b914c1f621 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1857,7 +1857,6 @@ static enum lru_status
xfs_buftarg_drain_rele(
struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lru_lock,
void *arg)
{
@@ -1956,7 +1955,6 @@ static enum lru_status
xfs_buftarg_isolate(
struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lru_lock,
void *arg)
{
struct xfs_buf *bp = container_of(item, struct xfs_buf, b_lru);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 8d17099765ae..f1b6e73c0e68 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -412,9 +412,8 @@ static enum lru_status
xfs_qm_dquot_isolate(
struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lru_lock,
void *arg)
- __releases(lru_lock) __acquires(lru_lock)
+ __releases(&lru->lock) __acquires(&lru->lock)
{
struct xfs_dquot *dqp = container_of(item,
struct xfs_dquot, q_lru);
@@ -460,7 +459,7 @@ xfs_qm_dquot_isolate(
trace_xfs_dqreclaim_dirty(dqp);
/* we have to drop the LRU lock to flush the dquot */
- spin_unlock(lru_lock);
+ spin_unlock(&lru->lock);
error = xfs_qm_dqflush(dqp, &bp);
if (error)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b84483ef93a7..df6b9374ca68 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -184,7 +184,7 @@ void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
struct list_head *head);
typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
- struct list_lru_one *list, spinlock_t *lock, void *cb_arg);
+ struct list_lru_one *list, void *cb_arg);
/**
* list_lru_walk_one: walk a @lru, isolating and disposing freeable items.
diff --git a/mm/list_lru.c b/mm/list_lru.c
index c503921cbb13..d8d653317c2c 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -279,7 +279,7 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
break;
--*nr_to_walk;
- ret = isolate(item, l, &l->lock, cb_arg);
+ ret = isolate(item, l, cb_arg);
switch (ret) {
/*
* LRU_RETRY and LRU_REMOVED_RETRY will drop the lru lock,
diff --git a/mm/workingset.c b/mm/workingset.c
index 947423c3e719..e3552e7318a5 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -704,8 +704,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
static enum lru_status shadow_lru_isolate(struct list_head *item,
struct list_lru_one *lru,
- spinlock_t *lru_lock,
- void *arg) __must_hold(lru_lock)
+ void *arg) __must_hold(lru->lock)
{
struct xa_node *node = container_of(item, struct xa_node, private_list);
struct address_space *mapping;
@@ -714,20 +713,20 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
/*
* Page cache insertions and deletions synchronously maintain
* the shadow node LRU under the i_pages lock and the
- * lru_lock. Because the page cache tree is emptied before
- * the inode can be destroyed, holding the lru_lock pins any
+ * &lru->lock. Because the page cache tree is emptied before
+ * the inode can be destroyed, holding the &lru->lock pins any
* address_space that has nodes on the LRU.
*
* We can then safely transition to the i_pages lock to
* pin only the address_space of the particular node we want
- * to reclaim, take the node off-LRU, and drop the lru_lock.
+ * to reclaim, take the node off-LRU, and drop the &lru->lock.
*/
mapping = container_of(node->array, struct address_space, i_pages);
/* Coming from the list, invert the lock order */
if (!xa_trylock(&mapping->i_pages)) {
- spin_unlock_irq(lru_lock);
+ spin_unlock_irq(&lru->lock);
ret = LRU_RETRY;
goto out;
}
@@ -736,7 +735,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
if (mapping->host != NULL) {
if (!spin_trylock(&mapping->host->i_lock)) {
xa_unlock(&mapping->i_pages);
- spin_unlock_irq(lru_lock);
+ spin_unlock_irq(&lru->lock);
ret = LRU_RETRY;
goto out;
}
@@ -745,7 +744,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
list_lru_isolate(lru, item);
__dec_node_page_state(virt_to_page(node), WORKINGSET_NODES);
- spin_unlock(lru_lock);
+ spin_unlock(&lru->lock);
/*
* The nodes should only contain one or more shadow entries,
diff --git a/mm/zswap.c b/mm/zswap.c
index f7a2afaeea53..24e1e0c87172 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1097,7 +1097,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* shrinker functions
**********************************/
static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
- spinlock_t *lock, void *arg)
+ void *arg)
{
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
bool *encountered_page_in_swapcache = (bool *)arg;
@@ -1143,7 +1143,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
* It's safe to drop the lock here because we return either
* LRU_REMOVED_RETRY or LRU_RETRY.
*/
- spin_unlock(lock);
+ spin_unlock(&l->lock);
writeback_result = zswap_writeback_entry(entry, swpentry);
--
2.45.2
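To illustrate the interface change in this patch: with the spinlock_t parameter removed, a walk callback reaches the lock through the struct list_lru_one it is handed. A minimal sketch under the new signature (demo_obj and its busy flag are hypothetical, only for illustration, not part of the series):

#include <linux/list_lru.h>

/* Hypothetical object kept on a list_lru, for illustration only. */
struct demo_obj {
	struct list_head lru;
	bool busy;
};

static enum lru_status demo_isolate(struct list_head *item,
				    struct list_lru_one *lru,
				    void *cb_arg)
{
	struct demo_obj *obj = container_of(item, struct demo_obj, lru);
	struct list_head *dispose = cb_arg;

	/* Skip objects that cannot be reclaimed right now. */
	if (obj->busy)
		return LRU_SKIP;

	/* Still under the per-cgroup lock: move the item to the dispose list. */
	list_lru_isolate_move(lru, item, dispose);

	/*
	 * Callbacks that must block instead drop the lock now embedded in
	 * the list_lru_one, i.e. spin_unlock(&lru->lock), and must return
	 * LRU_RETRY or LRU_REMOVED_RETRY, as the zswap and workingset
	 * hunks above do.
	 */
	return LRU_REMOVED;
}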
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function
2024-06-24 17:53 ` [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function Kairui Song
@ 2024-06-27 19:58 ` kernel test robot
0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2024-06-27 19:58 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Matthew Wilcox, Johannes Weiner, Roman Gushchin, Waiman Long,
Shakeel Butt, Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng,
Muchun Song, Chris Li, Yosry Ahmed, Huang, Ying, Kairui Song
Hi Kairui,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on staging/staging-testing staging/staging-next staging/staging-linus xfs-linux/for-next linus/master v6.10-rc5 next-20240626]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-workingset-make-anon-workingset-nodes-memcg-aware/20240625-231015
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240624175313.47329-8-ryncsn%40gmail.com
patch subject: [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function
config: i386-buildonly-randconfig-001-20240628 (https://download.01.org/0day-ci/archive/20240628/202406280355.Qwjjjiug-lkp@intel.com/config)
compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280355.Qwjjjiug-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406280355.Qwjjjiug-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/android/binder_alloc.c:1060: warning: Function parameter or struct member 'lru' not described in 'binder_alloc_free_page'
>> drivers/android/binder_alloc.c:1060: warning: Excess function parameter 'lock' description in 'binder_alloc_free_page'
vim +1060 drivers/android/binder_alloc.c
0c972a05cde66e Todd Kjos 2017-06-29 1046
f2517eb76f1f2f Sherry Yang 2017-08-23 1047 /**
f2517eb76f1f2f Sherry Yang 2017-08-23 1048 * binder_alloc_free_page() - shrinker callback to free pages
f2517eb76f1f2f Sherry Yang 2017-08-23 1049 * @item: item to free
f2517eb76f1f2f Sherry Yang 2017-08-23 1050 * @lock: lock protecting the item
f2517eb76f1f2f Sherry Yang 2017-08-23 1051 * @cb_arg: callback argument
f2517eb76f1f2f Sherry Yang 2017-08-23 1052 *
f2517eb76f1f2f Sherry Yang 2017-08-23 1053 * Called from list_lru_walk() in binder_shrink_scan() to free
f2517eb76f1f2f Sherry Yang 2017-08-23 1054 * up pages when the system is under memory pressure.
f2517eb76f1f2f Sherry Yang 2017-08-23 1055 */
f2517eb76f1f2f Sherry Yang 2017-08-23 1056 enum lru_status binder_alloc_free_page(struct list_head *item,
f2517eb76f1f2f Sherry Yang 2017-08-23 1057 struct list_lru_one *lru,
f2517eb76f1f2f Sherry Yang 2017-08-23 1058 void *cb_arg)
5672291f7d5dc5 Kairui Song 2024-06-25 1059 __must_hold(&lru->lock)
f2517eb76f1f2f Sherry Yang 2017-08-23 @1060 {
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1061 struct binder_lru_page *page = container_of(item, typeof(*page), lru);
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1062 struct binder_alloc *alloc = page->alloc;
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1063 struct mm_struct *mm = alloc->mm;
a1b2289cef92ef Sherry Yang 2017-10-03 1064 struct vm_area_struct *vma;
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1065 struct page *page_to_free;
df9aabead791d7 Carlos Llamas 2023-12-01 1066 unsigned long page_addr;
f2517eb76f1f2f Sherry Yang 2017-08-23 1067 size_t index;
f2517eb76f1f2f Sherry Yang 2017-08-23 1068
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1069 if (!mmget_not_zero(mm))
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1070 goto err_mmget;
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1071 if (!mmap_read_trylock(mm))
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1072 goto err_mmap_read_lock_failed;
7710e2cca32e7f Carlos Llamas 2023-12-01 1073 if (!spin_trylock(&alloc->lock))
7710e2cca32e7f Carlos Llamas 2023-12-01 1074 goto err_get_alloc_lock_failed;
f2517eb76f1f2f Sherry Yang 2017-08-23 1075 if (!page->page_ptr)
f2517eb76f1f2f Sherry Yang 2017-08-23 1076 goto err_page_already_freed;
f2517eb76f1f2f Sherry Yang 2017-08-23 1077
f2517eb76f1f2f Sherry Yang 2017-08-23 1078 index = page - alloc->pages;
df9aabead791d7 Carlos Llamas 2023-12-01 1079 page_addr = alloc->buffer + index * PAGE_SIZE;
5cec2d2e5839f9 Todd Kjos 2019-03-01 1080
3f489c2067c582 Carlos Llamas 2023-12-01 1081 vma = vma_lookup(mm, page_addr);
3f489c2067c582 Carlos Llamas 2023-12-01 1082 if (vma && vma != binder_alloc_get_vma(alloc))
3f489c2067c582 Carlos Llamas 2023-12-01 1083 goto err_invalid_vma;
a1b2289cef92ef Sherry Yang 2017-10-03 1084
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1085 trace_binder_unmap_kernel_start(alloc, index);
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1086
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1087 page_to_free = page->page_ptr;
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1088 page->page_ptr = NULL;
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1089
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1090 trace_binder_unmap_kernel_end(alloc, index);
a1b2289cef92ef Sherry Yang 2017-10-03 1091
a1b2289cef92ef Sherry Yang 2017-10-03 1092 list_lru_isolate(lru, item);
7710e2cca32e7f Carlos Llamas 2023-12-01 1093 spin_unlock(&alloc->lock);
5672291f7d5dc5 Kairui Song 2024-06-25 1094 spin_unlock(&lru->lock);
f2517eb76f1f2f Sherry Yang 2017-08-23 1095
a1b2289cef92ef Sherry Yang 2017-10-03 1096 if (vma) {
e41e164c3cdff6 Sherry Yang 2017-08-23 1097 trace_binder_unmap_user_start(alloc, index);
e41e164c3cdff6 Sherry Yang 2017-08-23 1098
e9adcfecf572fc Mike Kravetz 2023-01-03 1099 zap_page_range_single(vma, page_addr, PAGE_SIZE, NULL);
f2517eb76f1f2f Sherry Yang 2017-08-23 1100
e41e164c3cdff6 Sherry Yang 2017-08-23 1101 trace_binder_unmap_user_end(alloc, index);
f2517eb76f1f2f Sherry Yang 2017-08-23 1102 }
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1103
d8ed45c5dcd455 Michel Lespinasse 2020-06-08 1104 mmap_read_unlock(mm);
f867c771f98891 Tetsuo Handa 2020-07-17 1105 mmput_async(mm);
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1106 __free_page(page_to_free);
e41e164c3cdff6 Sherry Yang 2017-08-23 1107
a1b2289cef92ef Sherry Yang 2017-10-03 1108 return LRU_REMOVED_RETRY;
f2517eb76f1f2f Sherry Yang 2017-08-23 1109
3f489c2067c582 Carlos Llamas 2023-12-01 1110 err_invalid_vma:
e50f4e6cc9bfac Carlos Llamas 2023-12-01 1111 err_page_already_freed:
7710e2cca32e7f Carlos Llamas 2023-12-01 1112 spin_unlock(&alloc->lock);
7710e2cca32e7f Carlos Llamas 2023-12-01 1113 err_get_alloc_lock_failed:
3f489c2067c582 Carlos Llamas 2023-12-01 1114 mmap_read_unlock(mm);
3e4e28c5a8f01e Michel Lespinasse 2020-06-08 1115 err_mmap_read_lock_failed:
a1b2289cef92ef Sherry Yang 2017-10-03 1116 mmput_async(mm);
a0c2baaf81bd53 Sherry Yang 2017-10-20 1117 err_mmget:
f2517eb76f1f2f Sherry Yang 2017-08-23 1118 return LRU_SKIP;
f2517eb76f1f2f Sherry Yang 2017-08-23 1119 }
f2517eb76f1f2f Sherry Yang 2017-08-23 1120
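For reference, the warning points at the stale kernel-doc block quoted above, which still documents the removed @lock parameter and says nothing about the new @lru one. A possible follow-up fix (not part of the posted series; the @lru wording is only a suggestion) might read:

/**
 * binder_alloc_free_page() - shrinker callback to free pages
 * @item:   item to free
 * @lru:    list_lru instance of the item
 * @cb_arg: callback argument
 *
 * Called from list_lru_walk() in binder_shrink_scan() to free
 * up pages when the system is under memory pressure.
 */
/* Note: the @lru description above is a suggested wording, not taken from the posted series. */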
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/7] Split list_lru lock into per-cgroup scope
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (6 preceding siblings ...)
2024-06-24 17:53 ` [PATCH 7/7] mm/list_lru: Simplify the list_lru walk callback function Kairui Song
@ 2024-06-24 21:26 ` Andrew Morton
2024-06-25 7:47 ` Kairui Song
2024-06-25 17:00 ` Shakeel Butt
2024-08-27 18:35 ` Shakeel Butt
8 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2024-06-24 21:26 UTC (permalink / raw)
To: Kairui Song
Cc: Kairui Song, linux-mm, Matthew Wilcox, Johannes Weiner,
Roman Gushchin, Waiman Long, Shakeel Butt, Nhat Pham,
Michal Hocko, Chengming Zhou, Qi Zheng, Muchun Song, Chris Li,
Yosry Ahmed, Huang, Ying
On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
> Currently, every list_lru has a per-node lock that protects adding,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. This lock contention is heavy
> when multiple cgroups modify the same list_lru.
>
> This can be alleviated by splitting the lock into per-cgroup scope.
I'm wavering over this. We're at -rc5 and things generally feel a bit
unstable at present.
The performance numbers are nice for extreme workloads, but can you
suggest how much benefit users will see in more typical workloads?
Anyway, opinions are sought and I'd ask people to please review this
work promptly if they feel it is sufficiently beneficial.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/7] Split list_lru lock into per-cgroup scope
2024-06-24 21:26 ` [PATCH 0/7] Split list_lru lock into per-cgroup scope Andrew Morton
@ 2024-06-25 7:47 ` Kairui Song
2024-06-25 17:00 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Kairui Song @ 2024-06-25 7:47 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
Waiman Long, Shakeel Butt, Nhat Pham, Michal Hocko,
Chengming Zhou, Qi Zheng, Muchun Song, Chris Li, Yosry Ahmed,
Huang, Ying
On Tue, Jun 25, 2024 at 5:26 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Currently, every list_lru has a per-node lock that protects adding,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. This lock contention is heavy
> > when multiple cgroups modify the same list_lru.
> >
> > This can be alleviated by splitting the lock into per-cgroup scope.
>
> I'm wavering over this. We're at -rc5 and things generally feel a bit
> unstable at present.
>
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?
Hi, the contention issue might be minor when memory stress is low, but
the change is still beneficial, and this series also optimizes cgroup
initialization.
The memhog test I provided was run on a 32-core system with 64
cgroups (I forgot to provide this detail, sorry), which is not a very
extreme configuration, considering it's not rare to have thousands of
cgroups on a system nowadays. Having them all share a global lock is
definitely not a good idea.
The issue is barely observable for things like desktop usage though.
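To make the contention pattern concrete, a simplified sketch of where the lock lives before and after the series (not the exact definitions; unrelated members and exact layout are omitted):

/* Before: one lock per node, shared by every cgroup's list on this node. */
struct list_lru_node {
	spinlock_t		lock;
	struct list_lru_one	lru;
	long			nr_items;
};

/* After this series: each per-memcg, per-node list carries its own lock,
 * so different cgroups no longer contend with each other. */
struct list_lru_one {
	struct list_head	list;
	spinlock_t		lock;
	long			nr_items;
};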
>
> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.
More reviews are definitely beneficial.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/7] Split list_lru lock into per-cgroup scope
2024-06-24 21:26 ` [PATCH 0/7] Split list_lru lock into per-cgroup scope Andrew Morton
2024-06-25 7:47 ` Kairui Song
@ 2024-06-25 17:00 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2024-06-25 17:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Kairui Song, linux-mm, Matthew Wilcox,
Johannes Weiner, Roman Gushchin, Waiman Long, Shakeel Butt,
Nhat Pham, Michal Hocko, Chengming Zhou, Qi Zheng, Muchun Song,
Chris Li, Yosry Ahmed, Huang, Ying
On Mon, Jun 24, 2024 at 02:26:54PM GMT, Andrew Morton wrote:
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Currently, every list_lru has a per-node lock that protects adding,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. This lock contention is heavy
> > when multiple cgroups modify the same list_lru.
> >
> > This can be alleviated by splitting the lock into per-cgroup scope.
>
> I'm wavering over this. We're at -rc5 and things generally feel a bit
> unstable at present.
>
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?
>
> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.
I will hopefully take a look this week (or next), but I don't see any
issue or concern with it missing the upcoming merge window.
Shakeel
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/7] Split list_lru lock into per-cgroup scope
2024-06-24 17:53 [PATCH 0/7] Split list_lru lock into per-cgroup scope Kairui Song
` (7 preceding siblings ...)
2024-06-24 21:26 ` [PATCH 0/7] Split list_lru lock into per-cgroup scope Andrew Morton
@ 2024-08-27 18:35 ` Shakeel Butt
8 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2024-08-27 18:35 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Johannes Weiner,
Roman Gushchin, Waiman Long, Shakeel Butt, Nhat Pham,
Michal Hocko, Chengming Zhou, Qi Zheng, Muchun Song, Chris Li,
Yosry Ahmed, Huang, Ying
On Tue, Jun 25, 2024 at 01:53:06AM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, every list_lru has a per-node lock that protects adding,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. This lock contention is heavy
> when multiple cgroups modify the same list_lru.
>
> This can be alleviated by splitting the lock into per-cgroup scope.
>
> To achieve this, this series reworks and optimizes the reparenting
> process step by step, making it possible to have a stable list_lru_one,
> and making it possible to pin the list_lru_one. Then split the lock
> into per-cgroup scope.
>
> The result is reduced LOC and better performance: I see a ~25%
> improvement for multi-cgroup SWAP over ZRAM and a ~10% improvement for
> multi-cgroup inode / dentry workload, as tested in PATCH 6/7:
>
> memhog SWAP test (shadow nodes):
> Before:
> real 0m20.328s user 0m4.315s sys 10m23.639s
> real 0m20.440s user 0m4.142s sys 10m34.756s
> real 0m20.381s user 0m4.164s sys 10m29.035s
>
> After:
> real 0m15.156s user 0m4.590s sys 7m34.361s
> real 0m15.161s user 0m4.776s sys 7m35.086s
> real 0m15.429s user 0m4.734s sys 7m42.919s
>
> File read test (inode / dentry):
> Before:
> real 0m26.939s user 0m36.322s sys 6m30.248s
> real 0m15.111s user 0m33.749s sys 5m4.991s
> real 0m16.796s user 0m33.438s sys 5m22.865s
> real 0m15.256s user 0m34.060s sys 4m56.870s
> real 0m14.826s user 0m33.531s sys 4m55.907s
> real 0m15.664s user 0m35.619s sys 6m3.638s
> real 0m15.746s user 0m34.066s sys 4m56.519s
>
> After:
> real 0m22.166s user 0m35.155s sys 6m21.045s
> real 0m13.753s user 0m34.554s sys 4m40.982s
> real 0m13.815s user 0m34.693s sys 4m39.605s
> real 0m13.495s user 0m34.372s sys 4m40.776s
> real 0m13.895s user 0m34.005s sys 4m39.061s
> real 0m13.629s user 0m33.476s sys 4m43.626s
> real 0m14.001s user 0m33.463s sys 4m41.261s
>
> PATCH 1/7: Fixes a long-existing bug, so shadow nodes will be accounted
> to the right cgroup and put into the right list_lru.
> PATCH 2/7 - 4/7: Clean up
> PATCH 6/7: Reworks and optimizes reparenting process, avoids touching
> kmemcg_id on reparenting as first step.
> PATCH 7/7: Makes it possible to pin the list_lru_one and prevent racing
> with reparenting, and splits the lock.
>
> Kairui Song (7):
> mm/swap, workingset: make anon workingset nodes memcg aware
> mm/list_lru: don't pass unnecessary key parameters
> mm/list_lru: don't export list_lru_add
> mm/list_lru: code clean up for reparenting
> mm/list_lru: simplify reparenting and initial allocation
> mm/list_lru: split the lock to per-cgroup scope
> mm/list_lru: Simplify the list_lru walk callback function
>
> drivers/android/binder_alloc.c | 6 +-
> drivers/android/binder_alloc.h | 2 +-
> fs/dcache.c | 4 +-
> fs/gfs2/quota.c | 2 +-
> fs/inode.c | 5 +-
> fs/nfs/nfs42xattr.c | 4 +-
> fs/nfsd/filecache.c | 5 +-
> fs/xfs/xfs_buf.c | 2 -
> fs/xfs/xfs_qm.c | 6 +-
> include/linux/list_lru.h | 26 ++-
> mm/list_lru.c | 387 +++++++++++++++++----------------
> mm/memcontrol.c | 10 +-
> mm/swap_state.c | 3 +-
> mm/workingset.c | 20 +-
> mm/zswap.c | 12 +-
> 15 files changed, 246 insertions(+), 248 deletions(-)
>
> --
> 2.45.2
>
>
Hi Kairui, can you send the v2 of this series (without the first patch)?
thanks,
Shakeel
^ permalink raw reply [flat|nested] 22+ messages in thread