linux-mm.kvack.org archive mirror
* [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
@ 2024-01-17  9:23 Chengming Zhou
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-17  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yosry Ahmed, Chengming Zhou, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li, Nhat Pham

When testing zswap performance using a kernel build with -j32 in a tmpfs
directory, I found that the scalability of the zswap rb-tree is not good:
it is protected by a single spinlock, which causes heavy lock contention
when multiple tasks zswap_store/load concurrently.

So a simple solution is to split the single zswap rb-tree into multiple
rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").

Although this method can't eliminate the spinlock contention completely,
it mitigates much of it. Below are the results of a kernel build in tmpfs
with the zswap shrinker enabled:

     linux-next  zswap-lock-optimize
real 1m9.181s    1m3.820s
user 17m44.036s  17m40.100s
sys  7m37.297s   4m54.622s

So there are clear improvements. And it's complementary with the ongoing
zswap xarray conversion by Chris; IMHO we can merge this first. So I just
refreshed and resent this for further discussion.

Thanks for your review and comments!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
      mm/zswap: make sure each swapfile always have zswap rb-tree
      mm/zswap: split zswap rb-tree

 include/linux/zswap.h |  7 +++--
 mm/swapfile.c         | 10 ++++---
 mm/zswap.c            | 74 ++++++++++++++++++++++++++++++++-------------------
 3 files changed, 59 insertions(+), 32 deletions(-)
---
base-commit: ab27740f76654ed58dd32ac0ba0031c18a6dea3b
change-id: 20240117-b4-zswap-lock-optimize-44e071c13427

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>



* [PATCH 1/2] mm/zswap: make sure each swapfile always have zswap rb-tree
  2024-01-17  9:23 [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree Chengming Zhou
@ 2024-01-17  9:23 ` Chengming Zhou
  2024-01-18 15:05   ` Johannes Weiner
                     ` (2 more replies)
  2024-01-17  9:23 ` [PATCH 2/2] mm/zswap: split " Chengming Zhou
  2024-01-17 18:37 ` [PATCH 0/2] mm/zswap: optimize the scalability of " Yosry Ahmed
  2 siblings, 3 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-17  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yosry Ahmed, Chengming Zhou, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li, Nhat Pham

Not all zswap interfaces can handle the absence of the zswap rb-tree;
actually, only zswap_store() handles it for now.

To keep things simple, make sure each swapfile always has its zswap
rb-tree prepared before being enabled and used. The preparation is
unlikely to fail in practice; this patch just makes that explicit.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/zswap.h |  7 +++++--
 mm/swapfile.c         | 10 +++++++---
 mm/zswap.c            |  7 ++++---
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 0b709f5bc65f..eca388229d9a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -30,7 +30,7 @@ struct zswap_lruvec_state {
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
 void zswap_invalidate(int type, pgoff_t offset);
-void zswap_swapon(int type);
+int zswap_swapon(int type);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
 void zswap_lruvec_state_init(struct lruvec *lruvec);
@@ -51,7 +51,10 @@ static inline bool zswap_load(struct folio *folio)
 }
 
 static inline void zswap_invalidate(int type, pgoff_t offset) {}
-static inline void zswap_swapon(int type) {}
+static inline int zswap_swapon(int type)
+{
+	return 0;
+}
 static inline void zswap_swapoff(int type) {}
 static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
 static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3eec686484ef..6c53ea06626b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2347,8 +2347,6 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
 				unsigned char *swap_map,
 				struct swap_cluster_info *cluster_info)
 {
-	zswap_swapon(p->type);
-
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
 	setup_swap_info(p, prio, swap_map, cluster_info);
@@ -3166,6 +3164,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (error)
 		goto bad_swap_unlock_inode;
 
+	error = zswap_swapon(p->type);
+	if (error)
+		goto free_swap_address_space;
+
 	/*
 	 * Flush any pending IO and dirty mappings before we start using this
 	 * swap device.
@@ -3174,7 +3176,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	error = inode_drain_writes(inode);
 	if (error) {
 		inode->i_flags &= ~S_SWAPFILE;
-		goto free_swap_address_space;
+		goto free_swap_zswap;
 	}
 
 	mutex_lock(&swapon_mutex);
@@ -3198,6 +3200,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	error = 0;
 	goto out;
+free_swap_zswap:
+	zswap_swapoff(p->type);
 free_swap_address_space:
 	exit_swap_address_space(p->type);
 bad_swap_unlock_inode:
diff --git a/mm/zswap.c b/mm/zswap.c
index ca25b676048e..d88faea85978 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1519,7 +1519,7 @@ bool zswap_store(struct folio *folio)
 	if (folio_test_large(folio))
 		return false;
 
-	if (!zswap_enabled || !tree)
+	if (!zswap_enabled)
 		return false;
 
 	/*
@@ -1772,19 +1772,20 @@ void zswap_invalidate(int type, pgoff_t offset)
 	spin_unlock(&tree->lock);
 }
 
-void zswap_swapon(int type)
+int zswap_swapon(int type)
 {
 	struct zswap_tree *tree;
 
 	tree = kzalloc(sizeof(*tree), GFP_KERNEL);
 	if (!tree) {
 		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
-		return;
+		return -ENOMEM;
 	}
 
 	tree->rbroot = RB_ROOT;
 	spin_lock_init(&tree->lock);
 	zswap_trees[type] = tree;
+	return 0;
 }
 
 void zswap_swapoff(int type)

-- 
b4 0.10.1



* [PATCH 2/2] mm/zswap: split zswap rb-tree
  2024-01-17  9:23 [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree Chengming Zhou
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
@ 2024-01-17  9:23 ` Chengming Zhou
  2024-01-18 15:11   ` Johannes Weiner
  2024-01-18 19:24   ` Nhat Pham
  2024-01-17 18:37 ` [PATCH 0/2] mm/zswap: optimize the scalability of " Yosry Ahmed
  2 siblings, 2 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-17  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yosry Ahmed, Chengming Zhou, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li, Nhat Pham

Each swapfile has one rb-tree to look up the mapping of swp_entry_t to
zswap_entry, protected by a single spinlock, which can cause heavy lock
contention if multiple tasks zswap_store/load concurrently.

Optimize this scalability problem by splitting the zswap rb-tree into
multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
just like we did for the swap cache address_space splitting.

Although this method can't eliminate the spinlock contention completely,
it mitigates much of it. Below are the results of a kernel build in tmpfs
with the zswap shrinker enabled:

     linux-next  zswap-lock-optimize
real 1m9.181s    1m3.820s
user 17m44.036s  17m40.100s
sys  7m37.297s   4m54.622s

So there are clear improvements.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/zswap.h |  4 +--
 mm/swapfile.c         |  2 +-
 mm/zswap.c            | 69 ++++++++++++++++++++++++++++++++-------------------
 3 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index eca388229d9a..91895ce1fdbc 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -30,7 +30,7 @@ struct zswap_lruvec_state {
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
 void zswap_invalidate(int type, pgoff_t offset);
-int zswap_swapon(int type);
+int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
 void zswap_lruvec_state_init(struct lruvec *lruvec);
@@ -51,7 +51,7 @@ static inline bool zswap_load(struct folio *folio)
 }
 
 static inline void zswap_invalidate(int type, pgoff_t offset) {}
-static inline int zswap_swapon(int type)
+static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c53ea06626b..35aa17b2a2fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3164,7 +3164,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (error)
 		goto bad_swap_unlock_inode;
 
-	error = zswap_swapon(p->type);
+	error = zswap_swapon(p->type, maxpages);
 	if (error)
 		goto free_swap_address_space;
 
diff --git a/mm/zswap.c b/mm/zswap.c
index d88faea85978..4a6dbc620c7c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -239,6 +239,7 @@ struct zswap_tree {
 };
 
 static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static unsigned int nr_zswap_trees[MAX_SWAPFILES];
 
 /* RCU-protected iteration */
 static LIST_HEAD(zswap_pools);
@@ -265,6 +266,10 @@ static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
+#define swap_zswap_tree(entry)					\
+	(&zswap_trees[swp_type(entry)][swp_offset(entry)	\
+		>> SWAP_ADDRESS_SPACE_SHIFT])
+
 #define zswap_pool_debug(msg, p)				\
 	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
 		 zpool_get_type((p)->zpools[0]))
@@ -865,7 +870,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 * until the entry is verified to still be alive in the tree.
 	 */
 	swpoffset = swp_offset(entry->swpentry);
-	tree = zswap_trees[swp_type(entry->swpentry)];
+	tree = swap_zswap_tree(entry->swpentry);
 	list_lru_isolate(l, item);
 	/*
 	 * It's safe to drop the lock here because we return either
@@ -1494,10 +1499,9 @@ static void zswap_fill_page(void *ptr, unsigned long value)
 bool zswap_store(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
-	int type = swp_type(swp);
 	pgoff_t offset = swp_offset(swp);
 	struct page *page = &folio->page;
-	struct zswap_tree *tree = zswap_trees[type];
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry, *dupentry;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
@@ -1569,7 +1573,7 @@ bool zswap_store(struct folio *folio)
 		src = kmap_local_page(page);
 		if (zswap_is_page_same_filled(src, &value)) {
 			kunmap_local(src);
-			entry->swpentry = swp_entry(type, offset);
+			entry->swpentry = swp;
 			entry->length = 0;
 			entry->value = value;
 			atomic_inc(&zswap_same_filled_pages);
@@ -1651,7 +1655,7 @@ bool zswap_store(struct folio *folio)
 	mutex_unlock(&acomp_ctx->mutex);
 
 	/* populate entry */
-	entry->swpentry = swp_entry(type, offset);
+	entry->swpentry = swp;
 	entry->handle = handle;
 	entry->length = dlen;
 
@@ -1711,10 +1715,9 @@ bool zswap_store(struct folio *folio)
 bool zswap_load(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
-	int type = swp_type(swp);
 	pgoff_t offset = swp_offset(swp);
 	struct page *page = &folio->page;
-	struct zswap_tree *tree = zswap_trees[type];
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 	u8 *dst;
 
@@ -1757,7 +1760,7 @@ bool zswap_load(struct folio *folio)
 
 void zswap_invalidate(int type, pgoff_t offset)
 {
-	struct zswap_tree *tree = zswap_trees[type];
+	struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
 	struct zswap_entry *entry;
 
 	/* find */
@@ -1772,37 +1775,53 @@ void zswap_invalidate(int type, pgoff_t offset)
 	spin_unlock(&tree->lock);
 }
 
-int zswap_swapon(int type)
+int zswap_swapon(int type, unsigned long nr_pages)
 {
-	struct zswap_tree *tree;
+	struct zswap_tree *trees, *tree;
+	unsigned int nr, i;
 
-	tree = kzalloc(sizeof(*tree), GFP_KERNEL);
-	if (!tree) {
+	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
+	if (!trees) {
 		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
 		return -ENOMEM;
 	}
 
-	tree->rbroot = RB_ROOT;
-	spin_lock_init(&tree->lock);
-	zswap_trees[type] = tree;
+	for (i = 0; i < nr; i++) {
+		tree = trees + i;
+		tree->rbroot = RB_ROOT;
+		spin_lock_init(&tree->lock);
+	}
+
+	nr_zswap_trees[type] = nr;
+	zswap_trees[type] = trees;
 	return 0;
 }
 
 void zswap_swapoff(int type)
 {
-	struct zswap_tree *tree = zswap_trees[type];
-	struct zswap_entry *entry, *n;
+	struct zswap_tree *trees = zswap_trees[type];
+	unsigned int i;
 
-	if (!tree)
+	if (!trees)
 		return;
 
-	/* walk the tree and free everything */
-	spin_lock(&tree->lock);
-	rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
-		zswap_free_entry(entry);
-	tree->rbroot = RB_ROOT;
-	spin_unlock(&tree->lock);
-	kfree(tree);
+	for (i = 0; i < nr_zswap_trees[type]; i++) {
+		struct zswap_tree *tree = trees + i;
+		struct zswap_entry *entry, *n;
+
+		/* walk the tree and free everything */
+		spin_lock(&tree->lock);
+		rbtree_postorder_for_each_entry_safe(entry, n,
+						     &tree->rbroot,
+						     rbnode)
+			zswap_free_entry(entry);
+		tree->rbroot = RB_ROOT;
+		spin_unlock(&tree->lock);
+	}
+
+	kvfree(trees);
+	nr_zswap_trees[type] = 0;
 	zswap_trees[type] = NULL;
 }
 

-- 
b4 0.10.1



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-17  9:23 [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree Chengming Zhou
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
  2024-01-17  9:23 ` [PATCH 2/2] mm/zswap: split " Chengming Zhou
@ 2024-01-17 18:37 ` Yosry Ahmed
  2024-01-17 23:41   ` Chris Li
  2024-01-18 15:34   ` Johannes Weiner
  2 siblings, 2 replies; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-17 18:37 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, linux-kernel, Johannes Weiner, linux-mm, Chris Li,
	Nhat Pham

On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> When testing zswap performance using a kernel build with -j32 in a tmpfs
> directory, I found that the scalability of the zswap rb-tree is not good:
> it is protected by a single spinlock, which causes heavy lock contention
> when multiple tasks zswap_store/load concurrently.
>
> So a simple solution is to split the single zswap rb-tree into multiple
> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
> trunks").
>
> Although this method can't eliminate the spinlock contention completely,
> it mitigates much of it. Below are the results of a kernel build in tmpfs
> with the zswap shrinker enabled:
>
>      linux-next  zswap-lock-optimize
> real 1m9.181s    1m3.820s
> user 17m44.036s  17m40.100s
> sys  7m37.297s   4m54.622s
>
> So there are clear improvements. And it's complementary with the ongoing
> zswap xarray conversion by Chris; IMHO we can merge this first. So I just
> refreshed and resent this for further discussion.

The reason I think we should wait for the xarray patch(es) is that there
is a chance we may see less improvement from splitting the tree if it
were an xarray. If we merge this series first, there is no way to know.

Chris, do you intend to send the xarray patch(es) anytime soon?



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-17 18:37 ` [PATCH 0/2] mm/zswap: optimize the scalability of " Yosry Ahmed
@ 2024-01-17 23:41   ` Chris Li
  2024-01-17 23:47     ` Yosry Ahmed
  2024-01-18 15:34   ` Johannes Weiner
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Li @ 2024-01-17 23:41 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

Hi Yosry and Chengming,


On Wed, Jan 17, 2024 at 10:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > When testing zswap performance using a kernel build with -j32 in a tmpfs
> > directory, I found that the scalability of the zswap rb-tree is not good:
> > it is protected by a single spinlock, which causes heavy lock contention
> > when multiple tasks zswap_store/load concurrently.
> >
> > So a simple solution is to split the single zswap rb-tree into multiple
> > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> > comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
> > trunks").
> >
> > Although this method can't eliminate the spinlock contention completely,
> > it mitigates much of it. Below are the results of a kernel build in tmpfs
> > with the zswap shrinker enabled:
> >
> >      linux-next  zswap-lock-optimize
> > real 1m9.181s    1m3.820s
> > user 17m44.036s  17m40.100s
> > sys  7m37.297s   4m54.622s
> >
> > So there are clear improvements. And it's complementary with the ongoing
> > zswap xarray conversion by Chris; IMHO we can merge this first. So I just
> > refreshed and resent this for further discussion.
>

Sorry, I have been radio silent, busy with a few refreshes of the xarray
patch on recent kernel trees. There was an assertion triggered where the
xarray and the rb-tree did not agree with each other, which took some
time to debug. I ironed that out, and I'm glad the assert did catch a
bug.

Currently the xarray patch should have everything it takes to use the
RCU read lock. However, taking out the tree spinlock is more work than
previously expected. If we are going to remove the tree spinlock, I think
we should revert to doing a zswap tree lookup that returns the zswap
entry with its reference count increased. The tree mapping can still be
decoupled from the zswap entry reference count dropping to zero. Anyway,
my v1 of the xarray patch will not include removing the tree spinlock.

> The reason I think we should wait for the xarray patch(es) is that there
> is a chance we may see less improvement from splitting the tree if it
> were an xarray. If we merge this series first, there is no way to know.
>
> Chris, do you intend to send the xarray patch(es) anytime soon?

Thanks for the heads up. Let me send it out now.

Chris



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-17 23:41   ` Chris Li
@ 2024-01-17 23:47     ` Yosry Ahmed
  2024-01-18  0:17       ` Chris Li
  0 siblings, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-17 23:47 UTC (permalink / raw)
  To: Chris Li
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

> Currently the xarray patch should have everything it takes to use the
> RCU read lock. However, taking out the tree spinlock is more work than
> previously expected. If we are going to remove the tree spinlock, I think
> we should revert to doing a zswap tree lookup that returns the zswap
> entry with its reference count increased. The tree mapping can still be
> decoupled from the zswap entry reference count dropping to zero. Anyway,
> my v1 of the xarray patch will not include removing the tree spinlock.

Interesting. What do you mean by removing the tree spinlock? My
assumption was that the xarray reduces lock contention because we do
not need a lock to do lookups, but we still need the lock otherwise.
Did you have something in mind to completely remove the tree lock?

>
> > The reason I think we should wait for the xarray patch(es) is that there
> > is a chance we may see less improvement from splitting the tree if it
> > were an xarray. If we merge this series first, there is no way to know.
> >
> > Chris, do you intend to send the xarray patch(es) anytime soon?
>
> Thanks for the heads up. Let me send it out now.

Awesome, thanks! I assume Chengming can test whether this series
provides the same benefits with the xarray.



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-17 23:47     ` Yosry Ahmed
@ 2024-01-18  0:17       ` Chris Li
  2024-01-18  0:34         ` Yosry Ahmed
  2024-01-18  0:49         ` Nhat Pham
  0 siblings, 2 replies; 23+ messages in thread
From: Chris Li @ 2024-01-18  0:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

Hi Yosry,

On Wed, Jan 17, 2024 at 3:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Currently the xarray patch should have everything it takes to use the
> > RCU read lock. However, taking out the tree spinlock is more work than
> > previously expected. If we are going to remove the tree spinlock, I think
> > we should revert to doing a zswap tree lookup that returns the zswap
> > entry with its reference count increased. The tree mapping can still be
> > decoupled from the zswap entry reference count dropping to zero. Anyway,
> > my v1 of the xarray patch will not include removing the tree spinlock.
>
> Interesting. What do you mean by removing the tree spinlock? My
> assumption was that the xarray reduces lock contention because we do
> not need a lock to do lookups, but we still need the lock otherwise.
> Did you have something in mind to completely remove the tree lock?

My current xarray series adds the xarray alongside the rb-tree. The
xarray has its own internal lock as well, so effectively zswap now has
two locks instead of just one. The xarray lock will not see any
contention because it is taken inside the zswap rb-tree lock. The
eventual goal is to reduce the two locks back to one (the xarray lock),
which is not in my v1 patch. Your understanding is correct: the xarray
still needs one lock to protect the write path.

> > > The reason I think we should wait for the xarray patch(es) is that there
> > > is a chance we may see less improvement from splitting the tree if it
> > > were an xarray. If we merge this series first, there is no way to know.
> > >
> > > Chris, do you intend to send the xarray patch(es) anytime soon?
> >
> > Thanks for the heads up. Let me send it out now.
>
> Awesome, thanks! I assume Chengming can test whether this series
> provides the same benefits with the xarray.

That would be great. The xarray patch needs more testing for sure.

Chris



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18  0:17       ` Chris Li
@ 2024-01-18  0:34         ` Yosry Ahmed
  2024-01-18  1:03           ` Chris Li
  2024-01-18  0:49         ` Nhat Pham
  1 sibling, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-18  0:34 UTC (permalink / raw)
  To: Chris Li
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Wed, Jan 17, 2024 at 3:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > Currently the xarray patch should have everything it takes to use the
> > > RCU read lock. However, taking out the tree spinlock is more work than
> > > previously expected. If we are going to remove the tree spinlock, I think
> > > we should revert to doing a zswap tree lookup that returns the zswap
> > > entry with its reference count increased. The tree mapping can still be
> > > decoupled from the zswap entry reference count dropping to zero. Anyway,
> > > my v1 of the xarray patch will not include removing the tree spinlock.
> >
> > Interesting. What do you mean by removing the tree spinlock? My
> > assumption was that the xarray reduces lock contention because we do
> > not need a lock to do lookups, but we still need the lock otherwise.
> > Did you have something in mind to completely remove the tree lock?
>
> My current xarray series adds the xarray alongside the rb-tree. The
> xarray has its own internal lock as well, so effectively zswap now has
> two locks instead of just one. The xarray lock will not see any
> contention because it is taken inside the zswap rb-tree lock. The
> eventual goal is to reduce the two locks back to one (the xarray lock),
> which is not in my v1 patch. Your understanding is correct: the xarray
> still needs one lock to protect the write path.

Hmm I don't understand. What's the point of keeping the rbtree if we
have the xarray? Doesn't it end up being more expensive and bug-prone
to maintain both trees?

When you say "eventual goal", do you mean what the patch would morph
into in later versions (as in v1 is just a proof of concept without
removing the rbtree), or follow up patches?



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18  0:17       ` Chris Li
  2024-01-18  0:34         ` Yosry Ahmed
@ 2024-01-18  0:49         ` Nhat Pham
  1 sibling, 0 replies; 23+ messages in thread
From: Nhat Pham @ 2024-01-18  0:49 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, Chengming Zhou, Andrew Morton, linux-kernel,
	Johannes Weiner, linux-mm

On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Wed, Jan 17, 2024 at 3:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > Currently the xarray patch should have everything it takes to use the
> > > RCU read lock. However, taking out the tree spinlock is more work than
> > > previously expected. If we are going to remove the tree spinlock, I think
> > > we should revert to doing a zswap tree lookup that returns the zswap
> > > entry with its reference count increased. The tree mapping can still be
> > > decoupled from the zswap entry reference count dropping to zero. Anyway,
> > > my v1 of the xarray patch will not include removing the tree spinlock.
> >
> > Interesting. What do you mean by removing the tree spinlock? My
> > assumption was that the xarray reduces lock contention because we do
> > not need a lock to do lookups, but we still need the lock otherwise.
> > Did you have something in mind to completely remove the tree lock?
>
> My current xarray series adds the xarray alongside the rb-tree.

Hmmm why? Is there a reason to keep the rb tree around?



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18  0:34         ` Yosry Ahmed
@ 2024-01-18  1:03           ` Chris Li
  2024-01-18  3:51             ` Yosry Ahmed
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Li @ 2024-01-18  1:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

Hi Yosry,

On Wed, Jan 17, 2024 at 4:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> Hmm I don't understand. What's the point of keeping the rbtree if we
> have the xarray? Doesn't it end up being more expensive and bug-prone
> to maintain both trees?

Patch 2/2 removes the rb-tree code, keeping just the tree spinlock.

>
> When you say "eventual goal", do you mean what the patch would morph
> into in later versions (as in v1 is just a proof of concept without
> removing the rbtree), or follow up patches?

v1 will remove the rb-tree but will not merge the rb-tree lock with
the xarray lock.

Hi Nhat,

> Hmmm why? Is there a reason to keep the rb tree around?

No, not at all.


Chris



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18  1:03           ` Chris Li
@ 2024-01-18  3:51             ` Yosry Ahmed
  0 siblings, 0 replies; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-18  3:51 UTC (permalink / raw)
  To: Chris Li
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, Johannes Weiner,
	linux-mm, Nhat Pham

On Wed, Jan 17, 2024 at 5:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Wed, Jan 17, 2024 at 4:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > Hmm I don't understand. What's the point of keeping the rbtree if we
> > have the xarray? Doesn't it end up being more expensive and bug-prone
> > to maintain both trees?
>
> Patch 2/2 remove the rb tree code. Just keeping the tree spinlock.
>
> >
> > When you say "eventual goal", do you mean what the patch would morph
> > into in later versions (as in v1 is just a proof of concept without
> > removing the rbtree), or follow up patches?
>
> v1 will remove the rb-tree but will not merge the rb-tree lock with
> the xarray lock.

I see you already posted the patches, let's move the discussion there.
I will take a look at them as soon as I get the chance to.



* Re: [PATCH 1/2] mm/zswap: make sure each swapfile always have zswap rb-tree
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
@ 2024-01-18 15:05   ` Johannes Weiner
  2024-01-18 17:37   ` Yosry Ahmed
  2024-01-18 18:16   ` Nhat Pham
  2 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2024-01-18 15:05 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Wed, Jan 17, 2024 at 09:23:18AM +0000, Chengming Zhou wrote:
> Not all zswap interfaces can handle the absence of the zswap rb-tree;
> actually, only zswap_store() handles it for now.
>
> To keep things simple, make sure each swapfile always has its zswap
> rb-tree prepared before being enabled and used. The preparation is
> unlikely to fail in practice; this patch just makes that explicit.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH 2/2] mm/zswap: split zswap rb-tree
  2024-01-17  9:23 ` [PATCH 2/2] mm/zswap: split " Chengming Zhou
@ 2024-01-18 15:11   ` Johannes Weiner
  2024-01-19  6:20     ` Chengming Zhou
  2024-01-18 19:24   ` Nhat Pham
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2024-01-18 15:11 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Wed, Jan 17, 2024 at 09:23:19AM +0000, Chengming Zhou wrote:
> Each swapfile has one rb-tree to look up the mapping of swp_entry_t to
> zswap_entry, protected by a single spinlock, which can cause heavy lock
> contention if multiple tasks zswap_store/load concurrently.
>
> Optimize this scalability problem by splitting the zswap rb-tree into
> multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
> just like we did for the swap cache address_space splitting.
>
> Although this method can't eliminate the spinlock contention completely,
> it mitigates much of it. Below are the results of a kernel build in tmpfs
> with the zswap shrinker enabled:
> 
>      linux-next  zswap-lock-optimize
> real 1m9.181s    1m3.820s
> user 17m44.036s  17m40.100s
> sys  7m37.297s   4m54.622s
> 
> So there are clear improvements.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

One minor nit:

> @@ -265,6 +266,10 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>  
> +#define swap_zswap_tree(entry)					\
> +	(&zswap_trees[swp_type(entry)][swp_offset(entry)	\
> +		>> SWAP_ADDRESS_SPACE_SHIFT])

Make this a static inline function instead?



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-17 18:37 ` [PATCH 0/2] mm/zswap: optimize the scalability of " Yosry Ahmed
  2024-01-17 23:41   ` Chris Li
@ 2024-01-18 15:34   ` Johannes Weiner
  2024-01-18 17:30     ` Yosry Ahmed
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2024-01-18 15:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > When testing zswap performance with a kernel build -j32 in a tmpfs
> > directory, I found that the scalability of the zswap rb-tree, which is
> > protected by a single spinlock, is not good. That causes heavy lock
> > contention if multiple tasks zswap_store/load concurrently.
> >
> > So a simple solution is to split the single zswap rb-tree into multiple
> > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> > comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> >
> > Although this method can't solve the spinlock contention completely, it
> > can mitigate much of it. Below are the results of a kernel build
> > in tmpfs with the zswap shrinker enabled:
> >
> >      linux-next  zswap-lock-optimize
> > real 1m9.181s    1m3.820s
> > user 17m44.036s  17m40.100s
> > sys  7m37.297s   4m54.622s
> >
> > So there are clear improvements. This work is also complementary to the
> > ongoing zswap xarray conversion by Chris, so IMHO we can merge this
> > first. I'm refreshing and resending it for further discussion.
> 
> The reason I think we should wait for the xarray patch(es) is that
> there is a chance we may see smaller improvements from splitting the
> tree if it were an xarray. If we merge this series first, there is no
> way to know.

I mentioned this before, but I disagree quite strongly with this
general sentiment.

Chengming's patches are simple, mature, and have convincing
numbers. IMO it's poor form to hold something like that for "let's see
how our other experiment works out". The only exception would be if we
all agree that the earlier change flies in the face of the overall
direction we want to pursue, which I don't think is the case here.

With the xarray we'll still have a per-swapfile lock for writes. That
lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
the swapcache in the first place. Lockless reads help of course, but
read-only accesses to swap are in the minority - stores will write, and
loads are commonly followed by invalidations. Somebody already went
through the trouble of proving that xarrays + segmentation are worth
it for swap load and store access patterns. Why dismiss that?

So my vote is that we follow the usual upstreaming process here:
merge the ready patches now, and rebase future work on top of it.



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18 15:34   ` Johannes Weiner
@ 2024-01-18 17:30     ` Yosry Ahmed
  2024-01-18 18:06       ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-18 17:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> > >
> > > When testing zswap performance with a kernel build -j32 in a tmpfs
> > > directory, I found that the scalability of the zswap rb-tree, which is
> > > protected by a single spinlock, is not good. That causes heavy lock
> > > contention if multiple tasks zswap_store/load concurrently.
> > >
> > > So a simple solution is to split the single zswap rb-tree into multiple
> > > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> > > comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> > >
> > > Although this method can't solve the spinlock contention completely, it
> > > can mitigate much of it. Below are the results of a kernel build
> > > in tmpfs with the zswap shrinker enabled:
> > >
> > >      linux-next  zswap-lock-optimize
> > > real 1m9.181s    1m3.820s
> > > user 17m44.036s  17m40.100s
> > > sys  7m37.297s   4m54.622s
> > >
> > > So there are clear improvements. This work is also complementary to the
> > > ongoing zswap xarray conversion by Chris, so IMHO we can merge this
> > > first. I'm refreshing and resending it for further discussion.
> >
> > The reason I think we should wait for the xarray patch(es) is that
> > there is a chance we may see smaller improvements from splitting the
> > tree if it were an xarray. If we merge this series first, there is no
> > way to know.
>
> I mentioned this before, but I disagree quite strongly with this
> general sentiment.
>
> Chengming's patches are simple, mature, and have convincing
> numbers. IMO it's poor form to hold something like that for "let's see
> how our other experiment works out". The only exception would be if we
> all agree that the earlier change flies in the face of the overall
> direction we want to pursue, which I don't think is the case here.

My intention was not to delay merging these patches until the xarray
patches are merged in. It was only to wait until the xarray patches
are *posted*, so that we can redo the testing on top of them and
verify that the gains are still there. That should have been around
now, but the xarray patches were posted in a form that does not allow
this testing (because we still have a lock on the read path), so I am
less inclined to wait.

My rationale was that if the gains from splitting the tree become
minimal after we switch to an xarray, we won't know. It's more
difficult to remove optimizations than to add them, because we may
cause a regression. I am kind of paranoid about having code sitting
around without full information about how much it's actually needed.

In this case, I suppose we can redo the testing (1 tree vs. split
trees) once the xarray patches are in a testable form, and before we
have formed any strong dependencies on the split trees (we have time
until v6.9 is released, I assume).

How about that?

>
> With the xarray we'll still have a per-swapfile lock for writes. That
> lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> the swapcache in the first place. Lockless reads help of course, but
> read-only access to swap are in the minority - stores will write, and
> loads are commonly followed by invalidations. Somebody already went
> through the trouble of proving that xarrays + segmentation are worth
> it for swap load and store access patterns. Why dismiss that?

Fair point, although I think the swapcache lock may be more contended
than the zswap tree lock.

> So my vote is that we follow the usual upstreaming process here:
> merge the ready patches now, and rebase future work on top of it.

No objections given the current state of the xarray patches as I
mentioned earlier, but I prefer we redo the testing once possible with
the xarray.



* Re: [PATCH 1/2] mm/zswap: make sure each swapfile always have zswap rb-tree
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
  2024-01-18 15:05   ` Johannes Weiner
@ 2024-01-18 17:37   ` Yosry Ahmed
  2024-01-18 18:16   ` Nhat Pham
  2 siblings, 0 replies; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-18 17:37 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, linux-kernel, Johannes Weiner, linux-mm, Chris Li,
	Nhat Pham

On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Not all zswap interfaces can handle the absence of the zswap rb-tree;
> currently only zswap_store() handles it.
>
> To keep things simple, we make sure each swapfile always has the
> zswap rb-tree prepared before being enabled and used. The preparation
> is unlikely to fail in practice; this patch just makes it explicit.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Yosry Ahmed <yosryahmed@google.com>



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18 17:30     ` Yosry Ahmed
@ 2024-01-18 18:06       ` Johannes Weiner
  2024-01-18 18:37         ` Yosry Ahmed
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2024-01-18 18:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > > On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> > > <zhouchengming@bytedance.com> wrote:
> > > >
> > > > When testing zswap performance with a kernel build -j32 in a tmpfs
> > > > directory, I found that the scalability of the zswap rb-tree, which is
> > > > protected by a single spinlock, is not good. That causes heavy lock
> > > > contention if multiple tasks zswap_store/load concurrently.
> > > >
> > > > So a simple solution is to split the single zswap rb-tree into multiple
> > > > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> > > > comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> > > >
> > > > Although this method can't solve the spinlock contention completely, it
> > > > can mitigate much of it. Below are the results of a kernel build
> > > > in tmpfs with the zswap shrinker enabled:
> > > >
> > > >      linux-next  zswap-lock-optimize
> > > > real 1m9.181s    1m3.820s
> > > > user 17m44.036s  17m40.100s
> > > > sys  7m37.297s   4m54.622s
> > > >
> > > > So there are clear improvements. This work is also complementary to the
> > > > ongoing zswap xarray conversion by Chris, so IMHO we can merge this
> > > > first. I'm refreshing and resending it for further discussion.
> > >
> > > The reason I think we should wait for the xarray patch(es) is that
> > > there is a chance we may see smaller improvements from splitting the
> > > tree if it were an xarray. If we merge this series first, there is no
> > > way to know.
> >
> > I mentioned this before, but I disagree quite strongly with this
> > general sentiment.
> >
> > Chengming's patches are simple, mature, and have convincing
> > numbers. IMO it's poor form to hold something like that for "let's see
> > how our other experiment works out". The only exception would be if we
> > all agree that the earlier change flies in the face of the overall
> > direction we want to pursue, which I don't think is the case here.
> 
> My intention was not to delay merging these patches until the xarray
> patches are merged in. It was only to wait until the xarray patches
> are *posted*, so that we can redo the testing on top of them and
> verify that the gains are still there. That should have been around
> now, but the xarray patches were posted in a form that does not allow
> this testing (because we still have a lock on the read path), so I am
> less inclined to wait.
> 
> My rationale was that if the gains from splitting the tree become
> minimal after we switch to an xarray, we won't know. It's more
> difficult to remove optimizations than to add them, because we may
> cause a regression. I am kind of paranoid about having code sitting
> around without full information about how much it's actually needed.

Yeah I understand that fear.

I expect the splitting to help more than the move to xarray because
it's the writes that are hot. Luckily in this case it should be fairly
easy to differential-test after it's been merged by changing that tree
lookup macro/function locally to always return &trees[type][0], right?

> In this case, I suppose we can redo the testing (1 tree vs. split
> trees) once the xarray patches are in a testable form, and before we
> have formed any strong dependencies on the split trees (we have time
> until v6.9 is released, I assume).
> 
> How about that?

That sounds reasonable.

> > With the xarray we'll still have a per-swapfile lock for writes. That
> > lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> > the swapcache in the first place. Lockless reads help of course, but
> > read-only access to swap are in the minority - stores will write, and
> > loads are commonly followed by invalidations. Somebody already went
> > through the trouble of proving that xarrays + segmentation are worth
> > it for swap load and store access patterns. Why dismiss that?
> 
> Fair point, although I think the swapcache lock may be more contended
> than the zswap tree lock.

Right, it has two updates for each transition, compared to the one for
zswap. But we know that in a concurrent system under pressure a
globally shared swap lock will hurt. There is plenty of history here:
Chengming's numbers, your previous patch to split the zpools,
commit 235b62176712b970c815923e36b9a9cc05d4d901, etc.

> > So my vote is that we follow the usual upstreaming process here:
> > merge the ready patches now, and rebase future work on top of it.
> 
> No objections given the current state of the xarray patches as I
> mentioned earlier, but I prefer we redo the testing once possible with
> the xarray.

Cool, sounds good to me.



* Re: [PATCH 1/2] mm/zswap: make sure each swapfile always have zswap rb-tree
  2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
  2024-01-18 15:05   ` Johannes Weiner
  2024-01-18 17:37   ` Yosry Ahmed
@ 2024-01-18 18:16   ` Nhat Pham
  2 siblings, 0 replies; 23+ messages in thread
From: Nhat Pham @ 2024-01-18 18:16 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li

On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Not all zswap interfaces can handle the absence of the zswap rb-tree;
> currently only zswap_store() handles it.
>
> To keep things simple, we make sure each swapfile always has the
> zswap rb-tree prepared before being enabled and used. The preparation
> is unlikely to fail in practice; this patch just makes it explicit.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

This seems fine to me. IIUC, zswap_swapon() only fails when the rbtree
allocation fails, and the tree's memory footprint is small so that's
unlikely anyway.

Acked-by: Nhat Pham <nphamcs@gmail.com>



> ---
>  include/linux/zswap.h |  7 +++++--
>  mm/swapfile.c         | 10 +++++++---
>  mm/zswap.c            |  7 ++++---
>  3 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 0b709f5bc65f..eca388229d9a 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -30,7 +30,7 @@ struct zswap_lruvec_state {
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
> -void zswap_swapon(int type);
> +int zswap_swapon(int type);
>  void zswap_swapoff(int type);
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>  void zswap_lruvec_state_init(struct lruvec *lruvec);
> @@ -51,7 +51,10 @@ static inline bool zswap_load(struct folio *folio)
>  }
>
>  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> -static inline void zswap_swapon(int type) {}
> +static inline int zswap_swapon(int type)
> +{
> +       return 0;
> +}
>  static inline void zswap_swapoff(int type) {}
>  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>  static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3eec686484ef..6c53ea06626b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2347,8 +2347,6 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>                                 unsigned char *swap_map,
>                                 struct swap_cluster_info *cluster_info)
>  {
> -       zswap_swapon(p->type);
> -
>         spin_lock(&swap_lock);
>         spin_lock(&p->lock);
>         setup_swap_info(p, prio, swap_map, cluster_info);
> @@ -3166,6 +3164,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (error)
>                 goto bad_swap_unlock_inode;
>
> +       error = zswap_swapon(p->type);
> +       if (error)
> +               goto free_swap_address_space;
> +
>         /*
>          * Flush any pending IO and dirty mappings before we start using this
>          * swap device.
> @@ -3174,7 +3176,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         error = inode_drain_writes(inode);
>         if (error) {
>                 inode->i_flags &= ~S_SWAPFILE;
> -               goto free_swap_address_space;
> +               goto free_swap_zswap;
>         }
>
>         mutex_lock(&swapon_mutex);
> @@ -3198,6 +3200,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>
>         error = 0;
>         goto out;
> +free_swap_zswap:
> +       zswap_swapoff(p->type);
>  free_swap_address_space:
>         exit_swap_address_space(p->type);
>  bad_swap_unlock_inode:
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ca25b676048e..d88faea85978 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1519,7 +1519,7 @@ bool zswap_store(struct folio *folio)
>         if (folio_test_large(folio))
>                 return false;
>
> -       if (!zswap_enabled || !tree)
> +       if (!zswap_enabled)
>                 return false;
>
>         /*
> @@ -1772,19 +1772,20 @@ void zswap_invalidate(int type, pgoff_t offset)
>         spin_unlock(&tree->lock);
>  }
>
> -void zswap_swapon(int type)
> +int zswap_swapon(int type)
>  {
>         struct zswap_tree *tree;
>
>         tree = kzalloc(sizeof(*tree), GFP_KERNEL);
>         if (!tree) {
>                 pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> -               return;
> +               return -ENOMEM;
>         }
>
>         tree->rbroot = RB_ROOT;
>         spin_lock_init(&tree->lock);
>         zswap_trees[type] = tree;
> +       return 0;
>  }
>
>  void zswap_swapoff(int type)
>
> --
> b4 0.10.1



* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18 18:06       ` Johannes Weiner
@ 2024-01-18 18:37         ` Yosry Ahmed
  2024-01-19  6:40           ` Chengming Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-01-18 18:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Andrew Morton, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> > On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > > > On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> > > > <zhouchengming@bytedance.com> wrote:
> > > > >
> > > > > When testing zswap performance with a kernel build -j32 in a tmpfs
> > > > > directory, I found that the scalability of the zswap rb-tree, which is
> > > > > protected by a single spinlock, is not good. That causes heavy lock
> > > > > contention if multiple tasks zswap_store/load concurrently.
> > > > >
> > > > > So a simple solution is to split the single zswap rb-tree into multiple
> > > > > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
> > > > > comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> > > > >
> > > > > Although this method can't solve the spinlock contention completely, it
> > > > > can mitigate much of it. Below are the results of a kernel build
> > > > > in tmpfs with the zswap shrinker enabled:
> > > > >
> > > > >      linux-next  zswap-lock-optimize
> > > > > real 1m9.181s    1m3.820s
> > > > > user 17m44.036s  17m40.100s
> > > > > sys  7m37.297s   4m54.622s
> > > > >
> > > > > So there are clear improvements. This work is also complementary to the
> > > > > ongoing zswap xarray conversion by Chris, so IMHO we can merge this
> > > > > first. I'm refreshing and resending it for further discussion.
> > > >
> > > > The reason I think we should wait for the xarray patch(es) is that
> > > > there is a chance we may see smaller improvements from splitting the
> > > > tree if it were an xarray. If we merge this series first, there is no
> > > > way to know.
> > >
> > > I mentioned this before, but I disagree quite strongly with this
> > > general sentiment.
> > >
> > > Chengming's patches are simple, mature, and have convincing
> > > numbers. IMO it's poor form to hold something like that for "let's see
> > > how our other experiment works out". The only exception would be if we
> > > all agree that the earlier change flies in the face of the overall
> > > direction we want to pursue, which I don't think is the case here.
> >
> > My intention was not to delay merging these patches until the xarray
> > patches are merged in. It was only to wait until the xarray patches
> > are *posted*, so that we can redo the testing on top of them and
> > verify that the gains are still there. That should have been around
> > now, but the xarray patches were posted in a form that does not allow
> > this testing (because we still have a lock on the read path), so I am
> > less inclined to wait.
> >
> > My rationale was that if the gains from splitting the tree become
> > minimal after we switch to an xarray, we won't know. It's more
> > difficult to remove optimizations than to add them, because we may
> > cause a regression. I am kind of paranoid about having code sitting
> > around without full information about how much it's actually needed.
>
> Yeah I understand that fear.
>
> I expect the splitting to help more than the move to xarray because
> it's the writes that are hot. Luckily in this case it should be fairly
> easy to differential-test after it's been merged by changing that tree
> lookup macro/function locally to always return &trees[type][0], right?

Yeah that's exactly what I had in mind. Once we have a version of the
xarray patch without the locking on the read side we can test with
that. Chengming, does this sound reasonable to you?



* Re: [PATCH 2/2] mm/zswap: split zswap rb-tree
  2024-01-17  9:23 ` [PATCH 2/2] mm/zswap: split " Chengming Zhou
  2024-01-18 15:11   ` Johannes Weiner
@ 2024-01-18 19:24   ` Nhat Pham
  2024-01-19  6:24     ` Chengming Zhou
  1 sibling, 1 reply; 23+ messages in thread
From: Nhat Pham @ 2024-01-18 19:24 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li

On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Each swapfile has one rb-tree to search the mapping of swp_entry_t to
> zswap_entry, protected by a single spinlock, which can cause heavy lock
> contention if multiple tasks zswap_store/load concurrently.
>
> Optimize this scalability problem by splitting the zswap rb-tree into
> multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
> just like we did in the swap cache address_space splitting.
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of it. Below are the results of a kernel build
> in tmpfs with the zswap shrinker enabled:
>
>      linux-next  zswap-lock-optimize
> real 1m9.181s    1m3.820s
> user 17m44.036s  17m40.100s
> sys  7m37.297s   4m54.622s

That's really impressive, especially the sys time reduction :) Well done.

>
> So there are clearly improvements.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Code looks solid too. I haven't read the xarray patch series too
closely yet, but this patch series is clearly already an improvement.
It is simple, with existing precedent (from swap cache), and
experiments show that it works quite well to improve zswap's
performance.

If the xarray patch proves to be even better, we can always combine it
with this approach (a per-range xarray?), or replace it with the
xarray. But for now:

Acked-by: Nhat Pham <nphamcs@gmail.com>

> ---
>  include/linux/zswap.h |  4 +--
>  mm/swapfile.c         |  2 +-
>  mm/zswap.c            | 69 ++++++++++++++++++++++++++++++++-------------------
>  3 files changed, 47 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index eca388229d9a..91895ce1fdbc 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -30,7 +30,7 @@ struct zswap_lruvec_state {
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
> -int zswap_swapon(int type);
> +int zswap_swapon(int type, unsigned long nr_pages);
>  void zswap_swapoff(int type);
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>  void zswap_lruvec_state_init(struct lruvec *lruvec);
> @@ -51,7 +51,7 @@ static inline bool zswap_load(struct folio *folio)
>  }
>
>  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> -static inline int zswap_swapon(int type)
> +static inline int zswap_swapon(int type, unsigned long nr_pages)
>  {
>         return 0;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6c53ea06626b..35aa17b2a2fa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3164,7 +3164,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (error)
>                 goto bad_swap_unlock_inode;
>
> -       error = zswap_swapon(p->type);
> +       error = zswap_swapon(p->type, maxpages);
>         if (error)
>                 goto free_swap_address_space;
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index d88faea85978..4a6dbc620c7c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -239,6 +239,7 @@ struct zswap_tree {
>  };
>
>  static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> +static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>
>  /* RCU-protected iteration */
>  static LIST_HEAD(zswap_pools);
> @@ -265,6 +266,10 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>
> +#define swap_zswap_tree(entry)                                 \
> +       (&zswap_trees[swp_type(entry)][swp_offset(entry)        \
> +               >> SWAP_ADDRESS_SPACE_SHIFT])
> +
>  #define zswap_pool_debug(msg, p)                               \
>         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
>                  zpool_get_type((p)->zpools[0]))
> @@ -865,7 +870,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>          * until the entry is verified to still be alive in the tree.
>          */
>         swpoffset = swp_offset(entry->swpentry);
> -       tree = zswap_trees[swp_type(entry->swpentry)];
> +       tree = swap_zswap_tree(entry->swpentry);
>         list_lru_isolate(l, item);
>         /*
>          * It's safe to drop the lock here because we return either
> @@ -1494,10 +1499,9 @@ static void zswap_fill_page(void *ptr, unsigned long value)
>  bool zswap_store(struct folio *folio)
>  {
>         swp_entry_t swp = folio->swap;
> -       int type = swp_type(swp);
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> -       struct zswap_tree *tree = zswap_trees[type];
> +       struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry, *dupentry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> @@ -1569,7 +1573,7 @@ bool zswap_store(struct folio *folio)
>                 src = kmap_local_page(page);
>                 if (zswap_is_page_same_filled(src, &value)) {
>                         kunmap_local(src);
> -                       entry->swpentry = swp_entry(type, offset);
> +                       entry->swpentry = swp;
>                         entry->length = 0;
>                         entry->value = value;
>                         atomic_inc(&zswap_same_filled_pages);
> @@ -1651,7 +1655,7 @@ bool zswap_store(struct folio *folio)
>         mutex_unlock(&acomp_ctx->mutex);
>
>         /* populate entry */
> -       entry->swpentry = swp_entry(type, offset);
> +       entry->swpentry = swp;
>         entry->handle = handle;
>         entry->length = dlen;
>
> @@ -1711,10 +1715,9 @@ bool zswap_store(struct folio *folio)
>  bool zswap_load(struct folio *folio)
>  {
>         swp_entry_t swp = folio->swap;
> -       int type = swp_type(swp);
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> -       struct zswap_tree *tree = zswap_trees[type];
> +       struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
>
> @@ -1757,7 +1760,7 @@ bool zswap_load(struct folio *folio)
>
>  void zswap_invalidate(int type, pgoff_t offset)
>  {
> -       struct zswap_tree *tree = zswap_trees[type];
> +       struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
>         struct zswap_entry *entry;
>
>         /* find */
> @@ -1772,37 +1775,53 @@ void zswap_invalidate(int type, pgoff_t offset)
>         spin_unlock(&tree->lock);
>  }
>
> -int zswap_swapon(int type)
> +int zswap_swapon(int type, unsigned long nr_pages)
>  {
> -       struct zswap_tree *tree;
> +       struct zswap_tree *trees, *tree;
> +       unsigned int nr, i;
>
> -       tree = kzalloc(sizeof(*tree), GFP_KERNEL);
> -       if (!tree) {
> +       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> +       trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
> +       if (!trees) {
>                 pr_err("alloc failed, zswap disabled for swap type %d\n", type);
>                 return -ENOMEM;
>         }
>
> -       tree->rbroot = RB_ROOT;
> -       spin_lock_init(&tree->lock);
> -       zswap_trees[type] = tree;
> +       for (i = 0; i < nr; i++) {
> +               tree = trees + i;
> +               tree->rbroot = RB_ROOT;
> +               spin_lock_init(&tree->lock);
> +       }
> +
> +       nr_zswap_trees[type] = nr;
> +       zswap_trees[type] = trees;
>         return 0;
>  }
>
>  void zswap_swapoff(int type)
>  {
> -       struct zswap_tree *tree = zswap_trees[type];
> -       struct zswap_entry *entry, *n;
> +       struct zswap_tree *trees = zswap_trees[type];
> +       unsigned int i;
>
> -       if (!tree)
> +       if (!trees)
>                 return;
>
> -       /* walk the tree and free everything */
> -       spin_lock(&tree->lock);
> -       rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
> -               zswap_free_entry(entry);
> -       tree->rbroot = RB_ROOT;
> -       spin_unlock(&tree->lock);
> -       kfree(tree);
> +       for (i = 0; i < nr_zswap_trees[type]; i++) {
> +               struct zswap_tree *tree = trees + i;
> +               struct zswap_entry *entry, *n;
> +
> +               /* walk the tree and free everything */
> +               spin_lock(&tree->lock);
> +               rbtree_postorder_for_each_entry_safe(entry, n,
> +                                                    &tree->rbroot,
> +                                                    rbnode)
> +                       zswap_free_entry(entry);
> +               tree->rbroot = RB_ROOT;
> +               spin_unlock(&tree->lock);
> +       }
> +
> +       kvfree(trees);
> +       nr_zswap_trees[type] = 0;
>         zswap_trees[type] = NULL;
>  }
>
>
> --
> b4 0.10.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm/zswap: split zswap rb-tree
  2024-01-18 15:11   ` Johannes Weiner
@ 2024-01-19  6:20     ` Chengming Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-19  6:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm, Chris Li,
	Nhat Pham

On 2024/1/18 23:11, Johannes Weiner wrote:
> On Wed, Jan 17, 2024 at 09:23:19AM +0000, Chengming Zhou wrote:
>> Each swapfile has one rb-tree to search the mapping of swp_entry_t to
>> zswap_entry, which is protected by a spinlock; that can cause heavy
>> lock contention if multiple tasks zswap_store/load concurrently.
>>
>> Optimize the scalability problem by splitting the zswap rb-tree into
>> multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
>> just like we did in the swap cache address_space splitting.
>>
>> Although this method can't solve the spinlock contention completely, it
>> can mitigate much of that contention. Below are the results of a kernel
>> build in tmpfs with the zswap shrinker enabled:
>>
>>      linux-next  zswap-lock-optimize
>> real 1m9.181s    1m3.820s
>> user 17m44.036s  17m40.100s
>> sys  7m37.297s   4m54.622s
>>
>> So there are clear improvements.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> One minor nit:
> 
>> @@ -265,6 +266,10 @@ static bool zswap_has_pool;
>>  * helpers and fwd declarations
>>  **********************************/
>>  
>> +#define swap_zswap_tree(entry)					\
>> +	(&zswap_trees[swp_type(entry)][swp_offset(entry)	\
>> +		>> SWAP_ADDRESS_SPACE_SHIFT])
> 
> Make this a static inline function instead?

Good suggestion, will do.

Thanks.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm/zswap: split zswap rb-tree
  2024-01-18 19:24   ` Nhat Pham
@ 2024-01-19  6:24     ` Chengming Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-19  6:24 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, Yosry Ahmed, linux-kernel, Johannes Weiner,
	linux-mm, Chris Li

On 2024/1/19 03:24, Nhat Pham wrote:
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Each swapfile has one rb-tree to search the mapping of swp_entry_t to
>> zswap_entry, which is protected by a spinlock; that can cause heavy
>> lock contention if multiple tasks zswap_store/load concurrently.
>>
>> Optimize the scalability problem by splitting the zswap rb-tree into
>> multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
>> just like we did in the swap cache address_space splitting.
>>
>> Although this method can't solve the spinlock contention completely, it
>> can mitigate much of that contention. Below are the results of a kernel
>> build in tmpfs with the zswap shrinker enabled:
>>
>>      linux-next  zswap-lock-optimize
>> real 1m9.181s    1m3.820s
>> user 17m44.036s  17m40.100s
>> sys  7m37.297s   4m54.622s
> 
> That's really impressive, especially the sys time reduction :) Well done.
> 

Thanks!

>>
>> So there are clear improvements.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Code looks solid too. I haven't read the xarray patch series too
> closely yet, but this patch series is clearly already an improvement.
> It is simple, with existing precedent (from swap cache), and
> experiments show that it works quite well to improve zswap's
> performance.
> 
> If the xarray patch proves to be even better, we can always combine it
> with this approach (a per-range xarray?), or replace it with the
> xarray. But for now:
> 
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> 

Right, I agree. We should combine both approaches.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree
  2024-01-18 18:37         ` Yosry Ahmed
@ 2024-01-19  6:40           ` Chengming Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Chengming Zhou @ 2024-01-19  6:40 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: Andrew Morton, linux-kernel, linux-mm, Chris Li, Nhat Pham

On 2024/1/19 02:37, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
>>> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>
>>>> On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
>>>>> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
>>>>> <zhouchengming@bytedance.com> wrote:
>>>>>>
>>>>>> When testing zswap performance using a kernel build -j32 in a tmpfs
>>>>>> directory, I found the scalability of the zswap rb-tree is not good,
>>>>>> since it is protected by a single spinlock. That causes heavy lock
>>>>>> contention if multiple tasks zswap_store/load concurrently.
>>>>>>
>>>>>> So a simple solution is to split the single zswap rb-tree into
>>>>>> multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES
>>>>>> (64M). This idea is from commit 4b3ef9daa4fc ("mm/swap: split swap
>>>>>> cache into 64MB trunks").
>>>>>>
>>>>>> Although this method can't solve the spinlock contention completely,
>>>>>> it can mitigate much of that contention. Below are the results of a
>>>>>> kernel build in tmpfs with the zswap shrinker enabled:
>>>>>>
>>>>>>      linux-next  zswap-lock-optimize
>>>>>> real 1m9.181s    1m3.820s
>>>>>> user 17m44.036s  17m40.100s
>>>>>> sys  7m37.297s   4m54.622s
>>>>>>
>>>>>> So there are clear improvements. And it's complementary to the
>>>>>> ongoing zswap xarray conversion by Chris, so I think we can merge
>>>>>> this first. I'm refreshing and resending it for further discussion.
>>>>>
>>>>> The reason why I think we should wait for the xarray patch(es) is
>>>>> there is a chance we may see less improvements from splitting the tree
>>>>> if it was an xarray. If we merge this series first, there is no way to
>>>>> know.
>>>>
>>>> I mentioned this before, but I disagree quite strongly with this
>>>> general sentiment.
>>>>
>>>> Chengming's patches are simple, mature, and have convincing
>>>> numbers. IMO it's poor form to hold something like that for "let's see
>>>> how our other experiment works out". The only exception would be if we
>>>> all agree that the earlier change flies in the face of the overall
>>>> direction we want to pursue, which I don't think is the case here.
>>>
>>> My intention was not to delay merging these patches until the xarray
>>> patches are merged in. It was only to wait until the xarray patches
>>> are *posted*, so that we can redo the testing on top of them and
>>> verify that the gains are still there. That should have been around
>>> now, but the xarray patches were posted in a form that does not allow
>>> this testing (because we still have a lock on the read path), so I am
>>> less inclined.
>>>
>>> My rationale was that if the gains from splitting the tree become
>>> minimal after we switch to an xarray, we won't know. It's more
>>> difficult to remove optimizations than to add them, because we may
>>> cause a regression. I am kind of paranoid about having code sitting
>>> around that we don't have full information about how much it's needed.
>>
>> Yeah I understand that fear.
>>
>> I expect the splitting to help more than the move to xarray because
>> it's the writes that are hot. Luckily in this case it should be fairly
>> easy to differential-test after it's been merged by changing that tree
>> lookup macro/function locally to always return &trees[type][0], right?
> 
> Yeah that's exactly what I had in mind. Once we have a version of the
> xarray patch without the locking on the read side we can test with
> that. Chengming, does this sound reasonable to you?

It's ok, sounds reasonable to me. I agree with Johannes that we will need
both, since the xarray still has a spinlock on the write side, so it's
clearly better to split it. As for testing, we can always return
&trees[type][0].

Thanks!


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-01-19  6:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-17  9:23 [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree Chengming Zhou
2024-01-17  9:23 ` [PATCH 1/2] mm/zswap: make sure each swapfile always have " Chengming Zhou
2024-01-18 15:05   ` Johannes Weiner
2024-01-18 17:37   ` Yosry Ahmed
2024-01-18 18:16   ` Nhat Pham
2024-01-17  9:23 ` [PATCH 2/2] mm/zswap: split " Chengming Zhou
2024-01-18 15:11   ` Johannes Weiner
2024-01-19  6:20     ` Chengming Zhou
2024-01-18 19:24   ` Nhat Pham
2024-01-19  6:24     ` Chengming Zhou
2024-01-17 18:37 ` [PATCH 0/2] mm/zswap: optimize the scalability of " Yosry Ahmed
2024-01-17 23:41   ` Chris Li
2024-01-17 23:47     ` Yosry Ahmed
2024-01-18  0:17       ` Chris Li
2024-01-18  0:34         ` Yosry Ahmed
2024-01-18  1:03           ` Chris Li
2024-01-18  3:51             ` Yosry Ahmed
2024-01-18  0:49         ` Nhat Pham
2024-01-18 15:34   ` Johannes Weiner
2024-01-18 17:30     ` Yosry Ahmed
2024-01-18 18:06       ` Johannes Weiner
2024-01-18 18:37         ` Yosry Ahmed
2024-01-19  6:40           ` Chengming Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).