* [PATCH v2 0/2] mm/zswap: optimize the scalability of zswap rb-tree
From: Chengming Zhou @ 2024-01-19 11:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Nhat Pham, Yosry Ahmed, Chris Li, linux-kernel, linux-mm,
Johannes Weiner, Chengming Zhou
Changes in v2:
- Change swap_zswap_tree() to static inline function.
- Collect Acked-by tags.
- Link to v1: https://lore.kernel.org/r/20240117-b4-zswap-lock-optimize-v1-0-23f6effe5775@bytedance.com
When testing zswap performance with a kernel build -j32 in a tmpfs
directory, I found that the scalability of the zswap rb-tree is poor: the
whole tree is protected by a single spinlock, which causes heavy lock
contention when multiple tasks zswap_store()/zswap_load() concurrently.
A simple solution is to split the single zswap rb-tree into multiple
rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). The idea
comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
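
For illustration, the per-entry lookup then becomes roughly the helper below
(this is the swap_zswap_tree() helper added in patch 2 of this series):

    /*
     * Entries whose offsets fall in different SWAP_ADDRESS_SPACE_PAGES (64M)
     * ranges map to different rb-trees, so they contend on different locks.
     */
    static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
    {
        return &zswap_trees[swp_type(swp)][swp_offset(swp)
                >> SWAP_ADDRESS_SPACE_SHIFT];
    }
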
Although this method can't eliminate the spinlock contention completely, it
mitigates much of it. Below are the results of a kernel build in tmpfs with
the zswap shrinker enabled:

        linux-next   zswap-lock-optimize
real    1m9.181s     1m3.820s
user    17m44.036s   17m40.100s
sys     7m37.297s    4m54.622s
So there are clear improvements. This work is also complementary to the
ongoing zswap xarray conversion by Chris; IMHO we can merge this first, so I
have refreshed and resent it for further discussion.

Thanks for your review and comments!
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
mm/zswap: make sure each swapfile always has zswap rb-tree
mm/zswap: split zswap rb-tree
include/linux/zswap.h | 7 +++--
mm/swapfile.c | 10 +++++--
mm/zswap.c | 76 +++++++++++++++++++++++++++++++++------------------
3 files changed, 61 insertions(+), 32 deletions(-)
---
base-commit: ab27740f76654ed58dd32ac0ba0031c18a6dea3b
change-id: 20240117-b4-zswap-lock-optimize-44e071c13427
Best regards,
--
Chengming Zhou <zhouchengming@bytedance.com>
* [PATCH v2 1/2] mm/zswap: make sure each swapfile always has zswap rb-tree
From: Chengming Zhou @ 2024-01-19 11:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Nhat Pham, Yosry Ahmed, Chris Li, linux-kernel, linux-mm,
Johannes Weiner, Chengming Zhou
Not all zswap interfaces can handle the absence of the zswap rb-tree; in
fact, only zswap_store() handles it for now.

To keep things simple, make sure each swapfile always has its zswap rb-tree
prepared before being enabled and used. The preparation is unlikely to fail
in practice; this patch just makes the error handling explicit.
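
For illustration only, the calling convention on the swapon() path becomes
roughly the following (simplified from the diff below):

    /*
     * zswap_swapon() now returns 0 on success or -ENOMEM on failure instead
     * of silently leaving zswap_trees[type] NULL, so swapon() can bail out
     * before the swap device is enabled.
     */
    error = zswap_swapon(p->type);
    if (error)
        goto free_swap_address_space;
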
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
include/linux/zswap.h | 7 +++++--
mm/swapfile.c | 10 +++++++---
mm/zswap.c | 7 ++++---
3 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 0b709f5bc65f..eca388229d9a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -30,7 +30,7 @@ struct zswap_lruvec_state {
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
void zswap_invalidate(int type, pgoff_t offset);
-void zswap_swapon(int type);
+int zswap_swapon(int type);
void zswap_swapoff(int type);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
@@ -51,7 +51,10 @@ static inline bool zswap_load(struct folio *folio)
}
static inline void zswap_invalidate(int type, pgoff_t offset) {}
-static inline void zswap_swapon(int type) {}
+static inline int zswap_swapon(int type)
+{
+ return 0;
+}
static inline void zswap_swapoff(int type) {}
static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3eec686484ef..6c53ea06626b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2347,8 +2347,6 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
unsigned char *swap_map,
struct swap_cluster_info *cluster_info)
{
- zswap_swapon(p->type);
-
spin_lock(&swap_lock);
spin_lock(&p->lock);
setup_swap_info(p, prio, swap_map, cluster_info);
@@ -3166,6 +3164,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
+ error = zswap_swapon(p->type);
+ if (error)
+ goto free_swap_address_space;
+
/*
* Flush any pending IO and dirty mappings before we start using this
* swap device.
@@ -3174,7 +3176,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = inode_drain_writes(inode);
if (error) {
inode->i_flags &= ~S_SWAPFILE;
- goto free_swap_address_space;
+ goto free_swap_zswap;
}
mutex_lock(&swapon_mutex);
@@ -3198,6 +3200,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = 0;
goto out;
+free_swap_zswap:
+ zswap_swapoff(p->type);
free_swap_address_space:
exit_swap_address_space(p->type);
bad_swap_unlock_inode:
diff --git a/mm/zswap.c b/mm/zswap.c
index ca25b676048e..d88faea85978 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1519,7 +1519,7 @@ bool zswap_store(struct folio *folio)
if (folio_test_large(folio))
return false;
- if (!zswap_enabled || !tree)
+ if (!zswap_enabled)
return false;
/*
@@ -1772,19 +1772,20 @@ void zswap_invalidate(int type, pgoff_t offset)
spin_unlock(&tree->lock);
}
-void zswap_swapon(int type)
+int zswap_swapon(int type)
{
struct zswap_tree *tree;
tree = kzalloc(sizeof(*tree), GFP_KERNEL);
if (!tree) {
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
- return;
+ return -ENOMEM;
}
tree->rbroot = RB_ROOT;
spin_lock_init(&tree->lock);
zswap_trees[type] = tree;
+ return 0;
}
void zswap_swapoff(int type)
--
b4 0.10.1
* [PATCH v2 2/2] mm/zswap: split zswap rb-tree
From: Chengming Zhou @ 2024-01-19 11:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Nhat Pham, Yosry Ahmed, Chris Li, linux-kernel, linux-mm,
Johannes Weiner, Chengming Zhou
Each swapfile has one rb-tree to look up the mapping from swp_entry_t to
zswap_entry, protected by a single spinlock, which can cause heavy lock
contention when multiple tasks zswap_store()/zswap_load() concurrently.
Optimize the scalability problem by splitting the zswap rb-tree into
multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M),
just as was done for the swap cache address_space split.
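
For comparison, the swap cache uses the same offset-based indexing for its
split address_spaces; quoted from memory, so the exact definition in the
tree may differ slightly:

    #define swap_address_space(entry)                           \
        (&swapper_spaces[swp_type(entry)][swp_offset(entry)     \
            >> SWAP_ADDRESS_SPACE_SHIFT])
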
Although this method can't eliminate the spinlock contention completely, it
mitigates much of it. Below are the results of a kernel build in tmpfs with
the zswap shrinker enabled:

        linux-next   zswap-lock-optimize
real    1m9.181s     1m3.820s
user    17m44.036s   17m40.100s
sys     7m37.297s    4m54.622s

So there are clear improvements.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
include/linux/zswap.h | 4 +--
mm/swapfile.c | 2 +-
mm/zswap.c | 71 +++++++++++++++++++++++++++++++++------------------
3 files changed, 49 insertions(+), 28 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index eca388229d9a..91895ce1fdbc 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -30,7 +30,7 @@ struct zswap_lruvec_state {
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
void zswap_invalidate(int type, pgoff_t offset);
-int zswap_swapon(int type);
+int zswap_swapon(int type, unsigned long nr_pages);
void zswap_swapoff(int type);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
@@ -51,7 +51,7 @@ static inline bool zswap_load(struct folio *folio)
}
static inline void zswap_invalidate(int type, pgoff_t offset) {}
-static inline int zswap_swapon(int type)
+static inline int zswap_swapon(int type, unsigned long nr_pages)
{
return 0;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c53ea06626b..35aa17b2a2fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3164,7 +3164,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- error = zswap_swapon(p->type);
+ error = zswap_swapon(p->type, maxpages);
if (error)
goto free_swap_address_space;
diff --git a/mm/zswap.c b/mm/zswap.c
index d88faea85978..2885f4fb6dcb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -239,6 +239,7 @@ struct zswap_tree {
};
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static unsigned int nr_zswap_trees[MAX_SWAPFILES];
/* RCU-protected iteration */
static LIST_HEAD(zswap_pools);
@@ -265,6 +266,12 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
+static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
+{
+ return &zswap_trees[swp_type(swp)][swp_offset(swp)
+ >> SWAP_ADDRESS_SPACE_SHIFT];
+}
+
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
zpool_get_type((p)->zpools[0]))
@@ -865,7 +872,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
* until the entry is verified to still be alive in the tree.
*/
swpoffset = swp_offset(entry->swpentry);
- tree = zswap_trees[swp_type(entry->swpentry)];
+ tree = swap_zswap_tree(entry->swpentry);
list_lru_isolate(l, item);
/*
* It's safe to drop the lock here because we return either
@@ -1494,10 +1501,9 @@ static void zswap_fill_page(void *ptr, unsigned long value)
bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
- int type = swp_type(swp);
pgoff_t offset = swp_offset(swp);
struct page *page = &folio->page;
- struct zswap_tree *tree = zswap_trees[type];
+ struct zswap_tree *tree = swap_zswap_tree(swp);
struct zswap_entry *entry, *dupentry;
struct scatterlist input, output;
struct crypto_acomp_ctx *acomp_ctx;
@@ -1569,7 +1575,7 @@ bool zswap_store(struct folio *folio)
src = kmap_local_page(page);
if (zswap_is_page_same_filled(src, &value)) {
kunmap_local(src);
- entry->swpentry = swp_entry(type, offset);
+ entry->swpentry = swp;
entry->length = 0;
entry->value = value;
atomic_inc(&zswap_same_filled_pages);
@@ -1651,7 +1657,7 @@ bool zswap_store(struct folio *folio)
mutex_unlock(&acomp_ctx->mutex);
/* populate entry */
- entry->swpentry = swp_entry(type, offset);
+ entry->swpentry = swp;
entry->handle = handle;
entry->length = dlen;
@@ -1711,10 +1717,9 @@ bool zswap_store(struct folio *folio)
bool zswap_load(struct folio *folio)
{
swp_entry_t swp = folio->swap;
- int type = swp_type(swp);
pgoff_t offset = swp_offset(swp);
struct page *page = &folio->page;
- struct zswap_tree *tree = zswap_trees[type];
+ struct zswap_tree *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
u8 *dst;
@@ -1757,7 +1762,7 @@ bool zswap_load(struct folio *folio)
void zswap_invalidate(int type, pgoff_t offset)
{
- struct zswap_tree *tree = zswap_trees[type];
+ struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
struct zswap_entry *entry;
/* find */
@@ -1772,37 +1777,53 @@ void zswap_invalidate(int type, pgoff_t offset)
spin_unlock(&tree->lock);
}
-int zswap_swapon(int type)
+int zswap_swapon(int type, unsigned long nr_pages)
{
- struct zswap_tree *tree;
+ struct zswap_tree *trees, *tree;
+ unsigned int nr, i;
- tree = kzalloc(sizeof(*tree), GFP_KERNEL);
- if (!tree) {
+ nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+ trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
+ if (!trees) {
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
return -ENOMEM;
}
- tree->rbroot = RB_ROOT;
- spin_lock_init(&tree->lock);
- zswap_trees[type] = tree;
+ for (i = 0; i < nr; i++) {
+ tree = trees + i;
+ tree->rbroot = RB_ROOT;
+ spin_lock_init(&tree->lock);
+ }
+
+ nr_zswap_trees[type] = nr;
+ zswap_trees[type] = trees;
return 0;
}
void zswap_swapoff(int type)
{
- struct zswap_tree *tree = zswap_trees[type];
- struct zswap_entry *entry, *n;
+ struct zswap_tree *trees = zswap_trees[type];
+ unsigned int i;
- if (!tree)
+ if (!trees)
return;
- /* walk the tree and free everything */
- spin_lock(&tree->lock);
- rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
- zswap_free_entry(entry);
- tree->rbroot = RB_ROOT;
- spin_unlock(&tree->lock);
- kfree(tree);
+ for (i = 0; i < nr_zswap_trees[type]; i++) {
+ struct zswap_tree *tree = trees + i;
+ struct zswap_entry *entry, *n;
+
+ /* walk the tree and free everything */
+ spin_lock(&tree->lock);
+ rbtree_postorder_for_each_entry_safe(entry, n,
+ &tree->rbroot,
+ rbnode)
+ zswap_free_entry(entry);
+ tree->rbroot = RB_ROOT;
+ spin_unlock(&tree->lock);
+ }
+
+ kvfree(trees);
+ nr_zswap_trees[type] = 0;
zswap_trees[type] = NULL;
}
--
b4 0.10.1
* Re: [PATCH v2 2/2] mm/zswap: split zswap rb-tree
From: Yosry Ahmed @ 2024-01-22 19:49 UTC (permalink / raw)
To: Chengming Zhou
Cc: Andrew Morton, Nhat Pham, Chris Li, linux-kernel, linux-mm,
Johannes Weiner
On Fri, Jan 19, 2024 at 3:22 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Each swapfile has one rb-tree to search the mapping of swp_entry_t to
> zswap_entry, that use a spinlock to protect, which can cause heavy lock
> contention if multiple tasks zswap_store/load concurrently.
>
> Optimize the scalability problem by splitting the zswap rb-tree into
> multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M),
> just like we did in the swap cache address_space splitting.
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of that contention. Below is the results of kernel build
> in tmpfs with zswap shrinker enabled:
>
> linux-next zswap-lock-optimize
> real 1m9.181s 1m3.820s
> user 17m44.036s 17m40.100s
> sys 7m37.297s 4m54.622s
>
> So there are clearly improvements.
If/when you respin this, can you mention that testing was done with a
single swapfile? I assume the improvements will be smaller with multiple
swapfiles, since lock contention would already be lower in that case.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
I think the diff in zswap_swapoff() should be much simpler with the
tree(s) cleanup removed. Otherwise LGTM.
Acked-by: Yosry Ahmed <yosryahmed@google.com>
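
FWIW, what I have in mind is roughly the sketch below. It is just an
illustration and assumes that by then entries are already invalidated when
their swap slots are freed, so no tree walk is needed at swapoff time:

    void zswap_swapoff(int type)
    {
        unsigned int i;

        if (!zswap_trees[type])
            return;

        /* every per-range tree should already be empty at this point */
        for (i = 0; i < nr_zswap_trees[type]; i++)
            WARN_ON_ONCE(!RB_EMPTY_ROOT(&zswap_trees[type][i].rbroot));

        kvfree(zswap_trees[type]);
        zswap_trees[type] = NULL;
        nr_zswap_trees[type] = 0;
    }
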
* Re: [PATCH v2 2/2] mm/zswap: split zswap rb-tree
From: Chengming Zhou @ 2024-01-23 7:46 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Nhat Pham, Chris Li, linux-kernel, linux-mm,
Johannes Weiner
On 2024/1/23 03:49, Yosry Ahmed wrote:
> On Fri, Jan 19, 2024 at 3:22 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Each swapfile has one rb-tree to search the mapping of swp_entry_t to
>> zswap_entry, that use a spinlock to protect, which can cause heavy lock
>> contention if multiple tasks zswap_store/load concurrently.
>>
>> Optimize the scalability problem by splitting the zswap rb-tree into
>> multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M),
>> just like we did in the swap cache address_space splitting.
>>
>> Although this method can't solve the spinlock contention completely, it
>> can mitigate much of that contention. Below is the results of kernel build
>> in tmpfs with zswap shrinker enabled:
>>
>> linux-next zswap-lock-optimize
>> real 1m9.181s 1m3.820s
>> user 17m44.036s 17m40.100s
>> sys 7m37.297s 4m54.622s
>>
>> So there are clearly improvements.
>
> If/when you respin this, can you mention that testing was done with a
> single swapfile? I assume the improvements will be less with multiple
> swapfiles as lock contention should be better.
>
Ok. Not sure how much improvement there would be; I may do some tests later.
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Nhat Pham <nphamcs@gmail.com>
>
> I think the diff in zswap_swapoff() should be much simpler with the
> tree(s) cleanup removed. Otherwise LGTM.
>
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
Right, thanks!