[PATCH v2 0/2] improving dynamic zswap shrinker protection scheme

* [PATCH v2 0/2] improving dynamic zswap shrinker protection scheme
@ 2024-07-30 22:27 Nhat Pham
  2024-07-30 22:27 ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Nhat Pham
  2024-07-30 22:27 ` [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages Nhat Pham
  0 siblings, 2 replies; 10+ messages in thread
From: Nhat Pham @ 2024-07-30 22:27 UTC
  To: akpm
  Cc: hannes, yosryahmed, shakeel.butt, linux-mm, kernel-team,
	linux-kernel, flintglass, chengming.zhou

v2:
  * Add more details in comments, patch changelog, documentation, etc.
    about the second chance scheme and its ability to modulate the
    writeback rate (patch 1) (suggested by Yosry Ahmed).
  * Move the referenced bit (patch 1) (suggested by Yosry Ahmed).

When experimenting with the memory-pressure based (i.e. "dynamic") zswap
shrinker in production, we observed a sharp increase in the number of
swapins, which led to a performance regression. We traced this regression
to the following problems with the shrinker's warm page protection
scheme:

1. The protection decays far too rapidly, and the decay is coupled with
   zswap stores. This leads to anomalous patterns in which a small batch
   of zswap stores effectively erases all the protection in place for the
   warmer pages in the zswap LRU.

   This observation has also been corroborated upstream by Takero Funaki
   (in [1]).

2. We inaccurately track the number of swapped in pages, missing the
   non-pivot pages that are part of the readahead window, while counting
   the pages that are found in the zswap pool.

To alleviate these two issues, this patch series improves the dynamic
zswap shrinker in the following manner:

1. Replace the protection size tracking scheme with a second chance
   algorithm. This new scheme removes the need for haphazard stats
   decaying. It automatically adjusts the pace of page aging with memory
   pressure, and the writeback rate with pool activity: slowing down when
   the pool is dominated by zswpouts, and speeding up when the pool is
   dominated by stale entries.

2. Fix the tracking of the number of swapins to take into account
   non-pivot pages in the readahead window.

With these two changes in place, in a kernel-building benchmark without
any cold data added, the number of swapins is reduced by 64.12%. This
translates to a 10.32% reduction in build time. We also observe a 3%
reduction in kernel CPU time.

In another benchmark, with cold data added (to gauge the new algorithm's
ability to offload cold data), the new second chance scheme outperforms
the old protection scheme by around 0.7%, and actually wrote back around
21% more pages to the backing swap device. So the new scheme is at least
as good as the old scheme on this front as well.

[1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/

Nhat Pham (2):
  zswap: implement a second chance algorithm for dynamic zswap shrinker
  zswap: increment swapin count for non-pivot swapped in pages

 include/linux/zswap.h |  16 +++---
 mm/page_io.c          |  11 ++++-
 mm/swap_state.c       |   8 +--
 mm/zswap.c            | 110 ++++++++++++++++++++++++------------------
 4 files changed, 82 insertions(+), 63 deletions(-)

base-commit: cca1345bd26a67fc61a92ff0c6d81766c259e522
-- 
2.43.0

* [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
  2024-07-30 22:27 [PATCH v2 0/2] improving dynamic zswap shrinker protection scheme Nhat Pham
@ 2024-07-30 22:27 ` Nhat Pham
  2024-07-30 22:47   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker (fix) Nhat Pham
  2024-08-01 19:57   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Yosry Ahmed
  2024-07-30 22:27 ` [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages Nhat Pham
  1 sibling, 2 replies; 10+ messages in thread
From: Nhat Pham @ 2024-07-30 22:27 UTC
  To: akpm
  Cc: hannes, yosryahmed, shakeel.butt, linux-mm, kernel-team,
	linux-kernel, flintglass, chengming.zhou

The current zswap shrinker's heuristics for preventing overshrinking are
brittle and inaccurate, specifically in the way we decay the protection
size (i.e. making pages in the zswap LRU eligible for reclaim).

We currently decay protection aggressively in zswap_lru_add() calls.
This leads to the following unfortunate effect: when a new batch of
pages enters zswap, the protection size rapidly decays to below 25% of
the zswap LRU size, which is way too low.
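
The decay described above can be reproduced with a small userspace model.
This is only a sketch: plain integers stand in for the kernel's
atomic_long_t, concurrency is ignored, and the names
old_protection_model, old_model_store() and old_model_store_batch() are
made up for illustration. Only the halving rule is taken from the code
this patch removes.

```c
#include <assert.h>

/*
 * Userspace model of the decay step removed from zswap_lru_add().
 * All names here are hypothetical; only the rule
 * "halve once above lru_size / 4" comes from the removed code.
 */
struct old_protection_model {
	unsigned long lru_size;     /* entries on the zswap LRU */
	unsigned long nr_protected; /* stand-in for nr_zswap_protected */
};

/* One zswap store: grow the LRU, bump protection, then apply the decay. */
static void old_model_store(struct old_protection_model *m)
{
	assert(m);
	m->lru_size++;
	m->nr_protected++;
	if (m->nr_protected > m->lru_size / 4)
		m->nr_protected /= 2;
}

/* Store a batch of n entries and return the resulting protection size. */
static unsigned long old_model_store_batch(struct old_protection_model *m,
					   unsigned long n)
{
	while (n--)
		old_model_store(m);
	return m->nr_protected;
}
```

In this model, starting from a hypothetical state of 900 protected pages
on a 1000-entry LRU, two stores are enough to drop the protection size
to 225, which is already below a quarter of the LRU.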

We have observed this effect in production, when experimenting with the
zswap shrinker: the rate of shrinking shoots up massively right after a
new batch of zswap stores. This is roughly the opposite of what we want:
when new pages enter zswap, we want to protect both these new pages AND
the pages that are already protected in the zswap LRU.

Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its reference bit
   is set.
2. When the zswap shrinker encounters a zswap entry with the reference
   bit set, it gives the entry a second chance: it only clears the
   reference bit and rotates the entry in the LRU.
3. If the shrinker encounters the entry again, this time with its
   reference bit unset, then it can reclaim the entry.

In this manner, the aging of the pages in the zswap LRUs is decoupled
from zswap stores, and picks up pace with increasing memory pressure
(which is what we want).
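
The three steps above can be sketched in userspace as follows. This is a
simplified model, not the kernel implementation: a fixed-size ring
stands in for the list_lru, there is no locking or writeback, and
model_entry, model_lru, model_lru_add() and model_lru_shrink() are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LRU_CAP 64

struct model_entry {
	int id;
	bool referenced;
};

/* FIFO ring: head is the cold end that the shrinker scans first. */
struct model_lru {
	struct model_entry slots[MODEL_LRU_CAP];
	size_t head;
	size_t count;
};

/* Step 1: a newly stored entry goes to the tail with referenced set. */
static bool model_lru_add(struct model_lru *lru, int id)
{
	size_t tail;

	assert(lru);
	if (lru->count == MODEL_LRU_CAP)
		return false;
	tail = (lru->head + lru->count) % MODEL_LRU_CAP;
	lru->slots[tail].id = id;
	lru->slots[tail].referenced = true;
	lru->count++;
	return true;
}

/* Scan up to nr_to_scan entries; return how many were reclaimed. */
static size_t model_lru_shrink(struct model_lru *lru, size_t nr_to_scan)
{
	size_t reclaimed = 0;

	assert(lru);
	while (nr_to_scan-- && lru->count) {
		struct model_entry e = lru->slots[lru->head];
		size_t tail;

		/* Take the entry off the cold end. */
		lru->head = (lru->head + 1) % MODEL_LRU_CAP;
		lru->count--;

		if (e.referenced) {
			/* Step 2: clear the bit and rotate to the tail. */
			e.referenced = false;
			tail = (lru->head + lru->count) % MODEL_LRU_CAP;
			lru->slots[tail] = e;
			lru->count++;
			continue;
		}
		/* Step 3: seen again with the bit unset, so reclaim. */
		reclaimed++;
	}
	return reclaimed;
}
```

In this model a first pass over freshly stored entries reclaims nothing.
Entries are only reclaimed once the scan reaches them a second time, so
the aging is driven by how often the shrinker runs rather than by how
many stores happen.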

The second chance scheme allows us to modulate the writeback rate based
on recent pool activity. Entries that recently entered the pool will be
protected, so if the pool is dominated by such entries the writeback
rate will drop proportionally, protecting the workload's working set. On
the other hand, stale entries will be written back quickly, which
increases the effective writeback rate.

We will still maintain the count of swapins, which is consumed and
subtracted from the lru size in zswap_shrinker_count(), to further
penalize past overshrinking that led to disk swapins. The idea is that
had we considered this many more pages in the LRU active/protected, they
would not have been written back and we would not have had to swap them
in.
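
The "consumed and subtracted" step can be modelled in userspace with C11
atomics. This is a sketch under assumptions: atomic_ulong and
atomic_compare_exchange_weak() stand in for the kernel's atomic_long_t
and atomic_long_try_cmpxchg(), and model_consume_swapins() is a
hypothetical helper name, not a function in the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Consume up to lru_size pending swapins from *nr_swapins and return
 * how many LRU entries remain eligible for reclaim afterwards. Any
 * swapins beyond lru_size stay in the counter for a later call.
 */
static unsigned long model_consume_swapins(atomic_ulong *nr_swapins,
					   unsigned long lru_size)
{
	unsigned long cur, remain;

	assert(nr_swapins);
	cur = atomic_load(nr_swapins);
	do {
		/* Leftover swapins that this LRU is too small to absorb. */
		remain = cur > lru_size ? cur - lru_size : 0;
		/* On failure, cur is reloaded and remain is recomputed. */
	} while (!atomic_compare_exchange_weak(nr_swapins, &cur, remain));

	return lru_size - (cur - remain);
}
```

With 30 pending swapins and an LRU of 100 entries, this model reports 70
eligible entries and leaves the counter at 0. With 150 pending swapins
it reports 0 eligible entries and leaves 50 in the counter.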

To test the new heuristics, I built the kernel under a cgroup with
memory.max set to 2G, on a host with 36 cores:

With the old shrinker:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

With the second chance algorithm:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

(average over 5 runs)

We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
reduction in real time. Note that the number of swapped in pages
dropped by 58%.
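
For reference, these percentages follow from the averages listed above
as (before - after) / before. A minimal helper, with the hypothetical
name pct_reduction(), makes the arithmetic explicit:

```c
#include <assert.h>

/* Relative reduction from before to after, in percent. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Applied to the numbers above: pct_reduction(673.29, 664.39) is the sys
time reduction, pct_reduction(263.89, 244.85) is the real time
reduction, and pct_reduction(227300.5, 94663) is the swapin reduction.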

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/zswap.h |  16 +++---
 mm/zswap.c            | 110 ++++++++++++++++++++++++------------------
 2 files changed, 70 insertions(+), 56 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..b94b6ae262d5 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
 
 struct zswap_lruvec_state {
 	/*
-	 * Number of pages in zswap that should be protected from the shrinker.
-	 * This number is an estimate of the following counts:
+	 * Number of swapped in pages, i.e not found in the zswap pool.
 	 *
-	 * a) Recent page faults.
-	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
-	 *    as well as recent zswap LRU rotations.
-	 *
-	 * These pages are likely to be warm, and might incur IO if the are written
-	 * to swap.
+	 * This is consumed and subtracted from the lru size in
+	 * zswap_shrinker_count() to penalize past overshrinking that led to disk
+	 * swapins. The idea is that had we considered this many more pages in the
+	 * LRU active/protected and not written them back, we would not have had to
+	 * swapped them in.
 	 */
-	atomic_long_t nr_zswap_protected;
+	atomic_long_t nr_swapins;
 };
 
 unsigned long zswap_total_pages(void);
diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..f4e001c9e7e0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -184,6 +184,10 @@ static struct shrinker *zswap_shrinker;
  * page within zswap.
  *
  * swpentry - associated swap entry, the offset indexes into the red-black tree
+ * referenced - true if the entry recently entered the zswap pool. Unset by the
+ *              dynamic shrinker. The entry is only reclaimed by the dynamic
+ *              shrinker if referenced is unset. See comments in the shrinker
+ *              section for context.
  * length - the length in bytes of the compressed page data.  Needed during
  *          decompression. For a same value filled page length is 0, and both
  *          pool and lru are invalid and must be ignored.
@@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
 struct zswap_entry {
 	swp_entry_t swpentry;
 	unsigned int length;
+	bool referenced;
 	struct zswap_pool *pool;
 	union {
 		unsigned long handle;
@@ -700,11 +705,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
 
 static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 {
-	atomic_long_t *nr_zswap_protected;
-	unsigned long lru_size, old, new;
 	int nid = entry_to_nid(entry);
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
+
+	entry->referenced = true;
 
 	/*
 	 * Note that it is safe to use rcu_read_lock() here, even in the face of
@@ -722,19 +726,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 	memcg = mem_cgroup_from_entry(entry);
 	/* will always succeed */
 	list_lru_add(list_lru, &entry->lru, nid, memcg);
-
-	/* Update the protection area */
-	lru_size = list_lru_count_one(list_lru, nid, memcg);
-	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-	nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
-	old = atomic_long_inc_return(nr_zswap_protected);
-	/*
-	 * Decay to avoid overflow and adapt to changing workloads.
-	 * This is based on LRU reclaim cost decaying heuristics.
-	 */
-	do {
-		new = old > lru_size / 4 ? old / 2 : old;
-	} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
 	rcu_read_unlock();
 }
 
@@ -752,7 +743,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
 
 void zswap_lruvec_state_init(struct lruvec *lruvec)
 {
-	atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
+	atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
 }
 
 void zswap_folio_swapin(struct folio *folio)
@@ -761,7 +752,7 @@ void zswap_folio_swapin(struct folio *folio)
 
 	if (folio) {
 		lruvec = folio_lruvec(folio);
-		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
 	}
 }
 
@@ -1082,6 +1073,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 /*********************************
 * shrinker functions
 **********************************/
+/*
+ * The dynamic shrinker is modulated by the following factors:
+ *
+ * 1. Each zswap entry has a referenced bit, which the shrinker unsets (giving
+ *    the entry a second chance) before rotating it in the LRU list. If the
+ *    entry is considered again by the shrinker, with its referenced bit unset,
+ *    it is written back. The writeback rate as a result is dynamically
+ *    adjusted by the pool activities - if the pool is dominated by new entries
+ *    (i.e lots of recent zswapouts), these entries will be protected and
+ *    the writeback rate will slow down. On the other hand, if the pool has a
+ *    lot of stagnant entries, these entries will be reclaimed immediately,
+ *    effectively increasing the writeback rate.
+ *
+ * 2. Swapins counter: If we observe swapins, it is a sign that we are
+ *    overshrinking and should slow down. We maintain a swapins counter, which
+ *    is consumed and subtract from the number of eligible objects on the LRU
+ *    in zswap_shrinker_count().
+ *
+ * 3. Compression ratio. The better the workload compresses, the less gains we
+ *    can expect from writeback. We scale down the number of objects available
+ *    for reclaim by this ratio.
+ */
 static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
 				       spinlock_t *lock, void *arg)
 {
@@ -1091,6 +1104,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
 
+	/*
+	 * Second chance algorithm: if the entry has its referenced bit set, give it
+	 * a second chance. Only clear the referenced bit and rotate it in the
+	 * zswap's LRU list.
+	 */
+	if (entry->referenced) {
+		entry->referenced = false;
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * As soon as we drop the LRU lock, the entry can be freed by
 	 * a concurrent invalidation. This means the following:
@@ -1157,8 +1180,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
-	unsigned long shrink_ret, nr_protected, lru_size;
+	unsigned long shrink_ret;
 	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
@@ -1167,25 +1189,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
-
-	/*
-	 * Abort if we are shrinking into the protected region.
-	 *
-	 * This short-circuiting is necessary because if we have too many multiple
-	 * concurrent reclaimers getting the freeable zswap object counts at the
-	 * same time (before any of them made reasonable progress), the total
-	 * number of reclaimed objects might be more than the number of unprotected
-	 * objects (i.e the reclaimers will reclaim into the protected area of the
-	 * zswap LRU).
-	 */
-	if (nr_protected >= lru_size - sc->nr_to_scan) {
-		sc->nr_scanned = 0;
-		return SHRINK_STOP;
-	}
-
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
@@ -1200,7 +1203,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 {
 	struct mem_cgroup *memcg = sc->memcg;
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
-	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
+	atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins;
+	unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain;
 
 	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
@@ -1233,14 +1237,26 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!nr_stored)
 		return 0;
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
+	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
+	if (!lru_size)
+		return 0;
+
 	/*
-	 * Subtract the lru size by an estimate of the number of pages
-	 * that should be protected.
+	 * Subtract the lru size by the number of pages that are recently swapped
+	 * in. The idea is that had we protect the zswap's LRU by this amount of
+	 * pages, these swap in would not have happened.
 	 */
-	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
+	nr_swapins_cur = atomic_long_read(nr_swapins);
+	do {
+		if (lru_size >= nr_swapins_cur)
+			nr_remain = 0;
+		else
+			nr_remain = nr_swapins_cur - lru_size;
+	} while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
+
+	lru_size -= nr_swapins_cur - nr_remain;
+	if (!lru_size)
+		return 0;
 
 	/*
 	 * Scale the number of freeable pages by the memory saving factor.
@@ -1253,7 +1269,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	 * space. Hence, we may scale nr_freeable down a little bit more than we
 	 * should if we have a lot of same-filled pages.
 	 */
-	return mult_frac(nr_freeable, nr_backing, nr_stored);
+	return mult_frac(lru_size, nr_backing, nr_stored);
 }
 
 static struct shrinker *zswap_alloc_shrinker(void)
-- 
2.43.0


* [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker (fix)
  2024-07-30 22:27 ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Nhat Pham
@ 2024-07-30 22:47   ` Nhat Pham
  2024-08-01 19:57   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Yosry Ahmed
  1 sibling, 0 replies; 10+ messages in thread
From: Nhat Pham @ 2024-07-30 22:47 UTC
  To: akpm
  Cc: hannes, yosryahmed, shakeel.butt, linux-mm, kernel-team,
	linux-kernel, flintglass, chengming.zhou

I put the referenced bit documentation in the wrong order. Fix this.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f4e001c9e7e0..f7f6bbb400c4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -184,13 +184,13 @@ static struct shrinker *zswap_shrinker;
  * page within zswap.
  *
  * swpentry - associated swap entry, the offset indexes into the red-black tree
+ * length - the length in bytes of the compressed page data.  Needed during
+ *          decompression. For a same value filled page length is 0, and both
+ *          pool and lru are invalid and must be ignored.
  * referenced - true if the entry recently entered the zswap pool. Unset by the
  *              dynamic shrinker. The entry is only reclaimed by the dynamic
  *              shrinker if referenced is unset. See comments in the shrinker
  *              section for context.
- * length - the length in bytes of the compressed page data.  Needed during
- *          decompression. For a same value filled page length is 0, and both
- *          pool and lru are invalid and must be ignored.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
  * value - value of the same-value filled pages which have same content
-- 
2.43.0


* Re: [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
  2024-07-30 22:27 ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Nhat Pham
  2024-07-30 22:47   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker (fix) Nhat Pham
@ 2024-08-01 19:57   ` Yosry Ahmed
  2024-08-05 23:11     ` Nhat Pham
  1 sibling, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2024-08-01 19:57 UTC
  To: Nhat Pham
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Current zswap shrinker's heuristics to prevent overshrinking is brittle
> and inaccurate, specifically in the way we decay the protection size
> (i.e making pages in the zswap LRU eligible for reclaim).
>
> We currently decay protection aggressively in zswap_lru_add() calls.
> This leads to the following unfortunate effect: when a new batch of
> pages enter zswap, the protection size rapidly decays to below 25% of
> the zswap LRU size, which is way too low.
>
> We have observed this effect in production, when experimenting with the
> zswap shrinker: the rate of shrinking shoots up massively right after a
> new batch of zswap stores. This is somewhat the opposite of what we want
> originally - when new pages enter zswap, we want to protect both these
> new pages AND the pages that are already protected in the zswap LRU.
>
> Replace existing heuristics with a second chance algorithm
>
> 1. When a new zswap entry is stored in the zswap pool, its reference bit
>    is set.

Probably worth mentioning that this is added in a hole and doesn't
consume any extra memory.
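
The point about the hole can be illustrated with a userspace sketch. It
assumes an LP64 ABI (8-byte unsigned long and pointers) and uses
simplified stand-ins for the leading fields of struct zswap_entry.
model_entry_before, model_entry_after and referenced_fits_in_hole() are
hypothetical names, and the real structure has more members than shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Leading fields before the patch (simplified stand-ins). */
struct model_entry_before {
	unsigned long swpentry; /* stands in for swp_entry_t */
	unsigned int length;
	void *pool;             /* stands in for struct zswap_pool * */
};

/* Same layout with the new field placed right after length. */
struct model_entry_after {
	unsigned long swpentry;
	unsigned int length;
	bool referenced;
	void *pool;
};

/* True when adding the field does not grow the structure. */
static bool referenced_fits_in_hole(void)
{
	return sizeof(struct model_entry_after) ==
	       sizeof(struct model_entry_before);
}
```

On LP64, length ends at offset 12 and pool must start at offset 16, so
the one-byte bool lands in the existing padding and both structures have
the same size. On other ABIs the result may differ.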

> 2. When the zswap shrinker encounters a zswap entry with the reference
>    bit set, give it a second chance - only flips the reference bit and
>    rotate it in the LRU.
> 3. If the shrinker encounters the entry again, this time with its
>    reference bit unset, then it can reclaim the entry.
>
> In this manner, the aging of the pages in the zswap LRUs are decoupled
> from zswap stores, and picks up the pace with increasing memory pressure
> (which is what we want).
>
> The second chance scheme allows us to modulate the writeback rate based
> on recent pool activities. Entries that recently entered the pool will
> be protected, so if the pool is dominated by such entries the writeback
> rate will reduce proportionally, protecting the workload's workingset.On
> the other hand, stale entries will be written back quickly, which
> increases the effective writeback rate.
>
> We will still maintain the count of swapins, which is consumed and
> subtracted from the lru size in zswap_shrinker_count(), to further
> penalize past overshrinking that led to disk swapins. The idea is that
> had we considered this many more pages in the LRU active/protected, they
> would not have been written back and we would not have had to swapped
> them in.
>
> To test this new heuristics, I built the kernel under a cgroup with
> memory.max set to 2G, on a host with 36 cores:
>
> With the old shrinker:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> With the second chance algorithm:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> (average over 5 runs)
>
> We observe an 1.3% reduction in kernel CPU usage, and around 7.2%
> reduction in real time. Note that the number of swapped in pages
> dropped by 58%.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  include/linux/zswap.h |  16 +++---
>  mm/zswap.c            | 110 ++++++++++++++++++++++++------------------
>  2 files changed, 70 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 6cecb4a4f68b..b94b6ae262d5 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
>
>  struct zswap_lruvec_state {
>         /*
> -        * Number of pages in zswap that should be protected from the shrinker.
> -        * This number is an estimate of the following counts:
> +        * Number of swapped in pages, i.e not found in the zswap pool.

With the next patch, this should be "Number of swapped in pages from
disk". Without the "from disk", the second part about not being found
in the zswap pool doesn't really make sense.

Maybe also the variable name should be changed to nr_disk_swapins or similar.

>          *
> -        * a) Recent page faults.
> -        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> -        *    as well as recent zswap LRU rotations.
> -        *
> -        * These pages are likely to be warm, and might incur IO if the are written
> -        * to swap.
> +        * This is consumed and subtracted from the lru size in
> +        * zswap_shrinker_count() to penalize past overshrinking that led to disk
> +        * swapins. The idea is that had we considered this many more pages in the
> +        * LRU active/protected and not written them back, we would not have had to
> +        * swapped them in.
>          */
> -       atomic_long_t nr_zswap_protected;
> +       atomic_long_t nr_swapins;
>  };
>
>  unsigned long zswap_total_pages(void);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index adeaf9c97fde..f4e001c9e7e0 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -184,6 +184,10 @@ static struct shrinker *zswap_shrinker;
>   * page within zswap.
>   *
>   * swpentry - associated swap entry, the offset indexes into the red-black tree
> + * referenced - true if the entry recently entered the zswap pool. Unset by the
> + *              dynamic shrinker. The entry is only reclaimed by the dynamic
> + *              shrinker if referenced is unset. See comments in the shrinker
> + *              section for context.
>   * length - the length in bytes of the compressed page data.  Needed during
>   *          decompression. For a same value filled page length is 0, and both
>   *          pool and lru are invalid and must be ignored.
> @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
>  struct zswap_entry {
>         swp_entry_t swpentry;
>         unsigned int length;
> +       bool referenced;
>         struct zswap_pool *pool;
>         union {
>                 unsigned long handle;
> @@ -700,11 +705,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
>
>  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -       atomic_long_t *nr_zswap_protected;
> -       unsigned long lru_size, old, new;
>         int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
> +
> +       entry->referenced = true;

Would it be clearer to initialize this in zswap_store() with the rest
of the zswap_entry initialization?

>
>         /*
>          * Note that it is safe to use rcu_read_lock() here, even in the face of
> @@ -722,19 +726,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>         memcg = mem_cgroup_from_entry(entry);
>         /* will always succeed */
>         list_lru_add(list_lru, &entry->lru, nid, memcg);
> -
> -       /* Update the protection area */
> -       lru_size = list_lru_count_one(list_lru, nid, memcg);
> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> -       nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> -       old = atomic_long_inc_return(nr_zswap_protected);
> -       /*
> -        * Decay to avoid overflow and adapt to changing workloads.
> -        * This is based on LRU reclaim cost decaying heuristics.
> -        */
> -       do {
> -               new = old > lru_size / 4 ? old / 2 : old;
> -       } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
>         rcu_read_unlock();
>  }
>
> @@ -752,7 +743,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>
>  void zswap_lruvec_state_init(struct lruvec *lruvec)
>  {
> -       atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
> +       atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
>  }
>
>  void zswap_folio_swapin(struct folio *folio)
> @@ -761,7 +752,7 @@ void zswap_folio_swapin(struct folio *folio)
>
>         if (folio) {
>                 lruvec = folio_lruvec(folio);
> -               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> +               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
>         }
>  }
>
> @@ -1082,6 +1073,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  /*********************************
>  * shrinker functions
>  **********************************/
> +/*
> + * The dynamic shrinker is modulated by the following factors:
> + *
> + * 1. Each zswap entry has a referenced bit, which the shrinker unsets (giving
> + *    the entry a second chance) before rotating it in the LRU list. If the
> + *    entry is considered again by the shrinker, with its referenced bit unset,
> + *    it is written back. The writeback rate as a result is dynamically
> + *    adjusted by the pool activities - if the pool is dominated by new entries
> + *    (i.e lots of recent zswapouts), these entries will be protected and
> + *    the writeback rate will slow down. On the other hand, if the pool has a
> + *    lot of stagnant entries, these entries will be reclaimed immediately,
> + *    effectively increasing the writeback rate.
> + *
> + * 2. Swapins counter: If we observe swapins, it is a sign that we are
> + *    overshrinking and should slow down. We maintain a swapins counter, which
> + *    is consumed and subtract from the number of eligible objects on the LRU
> + *    in zswap_shrinker_count().
> + *
> + * 3. Compression ratio. The better the workload compresses, the less gains we
> + *    can expect from writeback. We scale down the number of objects available
> + *    for reclaim by this ratio.
> + */

Nice :)

>  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>                                        spinlock_t *lock, void *arg)
>  {
> @@ -1091,6 +1104,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>         enum lru_status ret = LRU_REMOVED_RETRY;
>         int writeback_result;
>
> +       /*
> +        * Second chance algorithm: if the entry has its referenced bit set, give it
> +        * a second chance. Only clear the referenced bit and rotate it in the
> +        * zswap's LRU list.
> +        */
> +       if (entry->referenced) {
> +               entry->referenced = false;
> +               return LRU_ROTATE;
> +       }
> +
>         /*
>          * As soon as we drop the LRU lock, the entry can be freed by
>          * a concurrent invalidation. This means the following:
> @@ -1157,8 +1180,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>                 struct shrink_control *sc)
>  {
> -       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> -       unsigned long shrink_ret, nr_protected, lru_size;
> +       unsigned long shrink_ret;
>         bool encountered_page_in_swapcache = false;
>
>         if (!zswap_shrinker_enabled ||
> @@ -1167,25 +1189,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>                 return SHRINK_STOP;
>         }
>
> -       nr_protected =
> -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> -
> -       /*
> -        * Abort if we are shrinking into the protected region.
> -        *
> -        * This short-circuiting is necessary because if we have too many multiple
> -        * concurrent reclaimers getting the freeable zswap object counts at the
> -        * same time (before any of them made reasonable progress), the total
> -        * number of reclaimed objects might be more than the number of unprotected
> -        * objects (i.e the reclaimers will reclaim into the protected area of the
> -        * zswap LRU).
> -        */
> -       if (nr_protected >= lru_size - sc->nr_to_scan) {
> -               sc->nr_scanned = 0;
> -               return SHRINK_STOP;
> -       }
> -

Do we need a similar mechanism to protect against concurrent shrinkers
quickly consuming nr_swapins?

>         shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
>                 &encountered_page_in_swapcache);
>
> @@ -1200,7 +1203,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>  {
>         struct mem_cgroup *memcg = sc->memcg;
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> -       unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> +       atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins;
> +       unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain;
>
>         if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
>                 return 0;
> @@ -1233,14 +1237,26 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         if (!nr_stored)
>                 return 0;
>
> -       nr_protected =
> -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -       nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
> +       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> +       if (!lru_size)
> +               return 0;
> +
>         /*
> -        * Subtract the lru size by an estimate of the number of pages
> -        * that should be protected.
> +        * Subtract the lru size by the number of pages that are recently swapped

nit: I don't think "subtract by" is correct, it's usually "subtract
from". So maybe "Subtract the number of pages that are recently
swapped in from the lru size"? Also, should we remain consistent about
mentioning that these are disk swapins throughout all the comments to
keep things clear?

> +        * in. The idea is that had we protect the zswap's LRU by this amount of
> +        * pages, these swap in would not have happened.
>          */
> -       nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> +       nr_swapins_cur = atomic_long_read(nr_swapins);
> +       do {
> +               if (lru_size >= nr_swapins_cur)
> +                       nr_remain = 0;
> +               else
> +                       nr_remain = nr_swapins_cur - lru_size;
> +       } while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
> +
> +       lru_size -= nr_swapins_cur - nr_remain;

It's a little bit weird that we reduce the variable named "lru_size"
by the consumed swapins. Maybe we should keep this named as
"nr_freeable", or add another variable here to hold the value after
subtraction?

> +       if (!lru_size)
> +               return 0;
>
>         /*
>          * Scale the number of freeable pages by the memory saving factor.
> @@ -1253,7 +1269,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>          * space. Hence, we may scale nr_freeable down a little bit more than we
>          * should if we have a lot of same-filled pages.
>          */
> -       return mult_frac(nr_freeable, nr_backing, nr_stored);
> +       return mult_frac(lru_size, nr_backing, nr_stored);
>  }
>
>  static struct shrinker *zswap_alloc_shrinker(void)
> --
> 2.43.0



* Re: [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
  2024-08-01 19:57   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Yosry Ahmed
@ 2024-08-05 23:11     ` Nhat Pham
  2024-08-05 23:58       ` Yosry Ahmed
  0 siblings, 1 reply; 10+ messages in thread
From: Nhat Pham @ 2024-08-05 23:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

On Thu, Aug 1, 2024 at 12:57 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > The current zswap shrinker's heuristics to prevent overshrinking are brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e making pages in the zswap LRU eligible for reclaim).
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enter zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we want
> > originally - when new pages enter zswap, we want to protect both these
> > new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> >    is set.
>
> Probably worth mentioning that this is added in a hole and doesn't
> consume any extra memory.

Will do!

>
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> >    bit set, give it a second chance - only clear the reference bit and
> >    rotate the entry in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> >    reference bit unset, then it can reclaim the entry.
> >
> > In this manner, the aging of the pages in the zswap LRUs is decoupled
> > from zswap stores, and picks up pace with increasing memory pressure
> > (which is what we want).
> >
> > The second chance scheme allows us to modulate the writeback rate based
> > on recent pool activities. Entries that recently entered the pool will
> > be protected, so if the pool is dominated by such entries the writeback
> > rate will reduce proportionally, protecting the workload's workingset. On
> > the other hand, stale entries will be written back quickly, which
> > increases the effective writeback rate.
> >
> > We will still maintain the count of swapins, which is consumed and
> > subtracted from the lru size in zswap_shrinker_count(), to further
> > penalize past overshrinking that led to disk swapins. The idea is that
> > had we considered this many more pages in the LRU active/protected, they
> > would not have been written back and we would not have had to swap
> > them in.
> >
> > To test the new heuristic, I built the kernel under a cgroup with
> > memory.max set to 2G, on a host with 36 cores:
> >
> > With the old shrinker:
> >
> > real: 263.89s
> > user: 4318.11s
> > sys: 673.29s
> > swapins: 227300.5
> >
> > With the second chance algorithm:
> >
> > real: 244.85s
> > user: 4327.22s
> > sys: 664.39s
> > swapins: 94663
> >
> > (average over 5 runs)
> >
> > We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
> > reduction in real time. Note that the number of swapped in pages
> > dropped by 58%.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  include/linux/zswap.h |  16 +++---
> >  mm/zswap.c            | 110 ++++++++++++++++++++++++------------------
> >  2 files changed, 70 insertions(+), 56 deletions(-)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 6cecb4a4f68b..b94b6ae262d5 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
> >
> >  struct zswap_lruvec_state {
> >         /*
> > -        * Number of pages in zswap that should be protected from the shrinker.
> > -        * This number is an estimate of the following counts:
> > +        * Number of swapped in pages, i.e not found in the zswap pool.
>
> With the next patch, this should be "Number of swapped in pages from
> disk". Without the "from disk", the second part about not being found
> in the zswap pool doesn't really make sense.
>
> Maybe also the variable name should be changed to nr_disk_swapins or similar.

Sounds good :)

>
> >          *
> > -        * a) Recent page faults.
> > -        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> > -        *    as well as recent zswap LRU rotations.
> > -        *
> > -        * These pages are likely to be warm, and might incur IO if the are written
> > -        * to swap.
> > +        * This is consumed and subtracted from the lru size in
> > +        * zswap_shrinker_count() to penalize past overshrinking that led to disk
> > +        * swapins. The idea is that had we considered this many more pages in the
> > +        * LRU active/protected and not written them back, we would not have had to
> > +        * swapped them in.
> >          */
> > -       atomic_long_t nr_zswap_protected;
> > +       atomic_long_t nr_swapins;
> >  };
> >
> >  unsigned long zswap_total_pages(void);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index adeaf9c97fde..f4e001c9e7e0 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -184,6 +184,10 @@ static struct shrinker *zswap_shrinker;
> >   * page within zswap.
> >   *
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> > + * referenced - true if the entry recently entered the zswap pool. Unset by the
> > + *              dynamic shrinker. The entry is only reclaimed by the dynamic
> > + *              shrinker if referenced is unset. See comments in the shrinker
> > + *              section for context.
> >   * length - the length in bytes of the compressed page data.  Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> >   *          pool and lru are invalid and must be ignored.
> > @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
> >  struct zswap_entry {
> >         swp_entry_t swpentry;
> >         unsigned int length;
> > +       bool referenced;
> >         struct zswap_pool *pool;
> >         union {
> >                 unsigned long handle;
> > @@ -700,11 +705,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
> >
> >  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> >  {
> > -       atomic_long_t *nr_zswap_protected;
> > -       unsigned long lru_size, old, new;
> >         int nid = entry_to_nid(entry);
> >         struct mem_cgroup *memcg;
> > -       struct lruvec *lruvec;
> > +
> > +       entry->referenced = true;
>
> Would it be clearer to initialize this in zswap_store() with the rest
> of the zswap_entry initialization?

Sure thing!

>
> >
> >         /*
> >          * Note that it is safe to use rcu_read_lock() here, even in the face of
> > @@ -722,19 +726,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> >         memcg = mem_cgroup_from_entry(entry);
> >         /* will always succeed */
> >         list_lru_add(list_lru, &entry->lru, nid, memcg);
> > -
> > -       /* Update the protection area */
> > -       lru_size = list_lru_count_one(list_lru, nid, memcg);
> > -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> > -       nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> > -       old = atomic_long_inc_return(nr_zswap_protected);
> > -       /*
> > -        * Decay to avoid overflow and adapt to changing workloads.
> > -        * This is based on LRU reclaim cost decaying heuristics.
> > -        */
> > -       do {
> > -               new = old > lru_size / 4 ? old / 2 : old;
> > -       } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
> >         rcu_read_unlock();
> >  }
> >
> > @@ -752,7 +743,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> >
> >  void zswap_lruvec_state_init(struct lruvec *lruvec)
> >  {
> > -       atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
> > +       atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
> >  }
> >
> >  void zswap_folio_swapin(struct folio *folio)
> > @@ -761,7 +752,7 @@ void zswap_folio_swapin(struct folio *folio)
> >
> >         if (folio) {
> >                 lruvec = folio_lruvec(folio);
> > -               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
> >         }
> >  }
> >
> > @@ -1082,6 +1073,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >                                        spinlock_t *lock, void *arg)
> >  {
> > @@ -1091,6 +1104,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >         enum lru_status ret = LRU_REMOVED_RETRY;
> >         int writeback_result;
> >
> > +       /*
> > +        * Second chance algorithm: if the entry has its referenced bit set, give it
> > +        * a second chance. Only clear the referenced bit and rotate it in the
> > +        * zswap's LRU list.
> > +        */
> > +       if (entry->referenced) {
> > +               entry->referenced = false;
> > +               return LRU_ROTATE;
> > +       }
> > +
> >         /*
> >          * As soon as we drop the LRU lock, the entry can be freed by
> >          * a concurrent invalidation. This means the following:
> > @@ -1157,8 +1180,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >                 struct shrink_control *sc)
> >  {
> > -       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > -       unsigned long shrink_ret, nr_protected, lru_size;
> > +       unsigned long shrink_ret;
> >         bool encountered_page_in_swapcache = false;
> >
> >         if (!zswap_shrinker_enabled ||
> > @@ -1167,25 +1189,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >                 return SHRINK_STOP;
> >         }
> >
> > -       nr_protected =
> > -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > -       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> > -
> > -       /*
> > -        * Abort if we are shrinking into the protected region.
> > -        *
> > -        * This short-circuiting is necessary because if we have too many multiple
> > -        * concurrent reclaimers getting the freeable zswap object counts at the
> > -        * same time (before any of them made reasonable progress), the total
> > -        * number of reclaimed objects might be more than the number of unprotected
> > -        * objects (i.e the reclaimers will reclaim into the protected area of the
> > -        * zswap LRU).
> > -        */
> > -       if (nr_protected >= lru_size - sc->nr_to_scan) {
> > -               sc->nr_scanned = 0;
> > -               return SHRINK_STOP;
> > -       }
> > -
>
> Do we need a similar mechanism to protect against concurrent shrinkers
> quickly consuming nr_swapins?

Not for nr_swapins consumption per se, and the original reason why I
included this (racy) check is just so that concurrent reclaimers do
not disrespect the protection scheme. We had no guarantee that we
wouldn't just reclaim into the protected region (well even with this
racy check technically). With the second chance scheme, a "protected"
page (i.e with its referenced bit set) would not be reclaimed right
away - a shrinker encountering it would have to "age" it first (by
unsetting the referenced bit), so the intended protection is enforced.
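
To make the aging step concrete, here is a minimal userspace sketch of
the second chance rule (not kernel code). struct entry, scan_one() and
scan_pass() are hypothetical stand-ins for struct zswap_entry and the
shrinker callback; only the referenced-bit handling mirrors the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zswap_entry: just the bits we model. */
struct entry {
	bool referenced;	/* set when the entry enters the pool */
	bool reclaimed;		/* set once the model "writes it back" */
};

/* Illustrative outcomes; the kernel callback returns enum lru_status. */
enum scan_result { SCAN_ROTATED, SCAN_RECLAIMED };

/* One shrinker visit: age a referenced entry, reclaim an unreferenced one. */
static enum scan_result scan_one(struct entry *e)
{
	if (e->referenced) {
		e->referenced = false;	/* first encounter: clear bit, rotate */
		return SCAN_ROTATED;
	}
	e->reclaimed = true;		/* later encounter: eligible for reclaim */
	return SCAN_RECLAIMED;
}

/* Walk every live entry once; return how many were reclaimed in this pass. */
static size_t scan_pass(struct entry *entries, size_t n)
{
	size_t reclaimed = 0;

	for (size_t i = 0; i < n; i++) {
		if (entries[i].reclaimed)
			continue;
		if (scan_one(&entries[i]) == SCAN_RECLAIMED)
			reclaimed++;
	}
	return reclaimed;
}
```

With three entries whose bits are {set, unset, set}, the first pass
reclaims one entry and ages the other two, and the second pass reclaims
the remaining two. No matter how many reclaimers run, a referenced entry
always needs two visits before it can be written back.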

That said, I do believe we need a mechanism to limit the concurrency
here. The amount of pages aged/reclaimed should scale (linearly?
proportionally?) with the reclaim pressure, i.e more reclaimers ==
more pages reclaimed/aged, so the current behavior is desired.
However, at some point, if we have more shrinkers than there is work
assigned to each of them, we might be unnecessarily wasting resources
(and potentially building up the nr_deferred counter that we discussed
in v1 of the patch series). Additionally, we might be overshrinking in
a very short amount of time, without letting the system have the
chance to react and provide feedback (through swapins/refaults) to the
memory reclaimers.

But let's do this as a follow-up work :) It seems orthogonal to what
we have here.

> > -        * Subtract the lru size by an estimate of the number of pages
> > -        * that should be protected.
> > +        * Subtract the lru size by the number of pages that are recently swapped
>
> nit: I don't think "subtract by" is correct, it's usually "subtract
> from". So maybe "Subtract the number of pages that are recently
> swapped in from the lru size"? Also, should we remain consistent about
> mentioning that these are disk swapins throughout all the comments to
> keep things clear?

Yeah I should be clearer here - it should be swapped in from disk, or
more generally (accurately?) swapped in from the backing swap device
(but the latter can change once we decouple swap from zswap). Or
maybe swapped in from the secondary tier?

Let's just not overthink and go with swapped in from disk for now :)

>
> > +        * in. The idea is that had we protect the zswap's LRU by this amount of
> > +        * pages, these swap in would not have happened.
> >          */
> > -       nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +       nr_swapins_cur = atomic_long_read(nr_swapins);
> > +       do {
> > +               if (lru_size >= nr_swapins_cur)
> > +                       nr_remain = 0;
> > +               else
> > +                       nr_remain = nr_swapins_cur - lru_size;
> > +       } while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
> > +
> > +       lru_size -= nr_swapins_cur - nr_remain;
>
> It's a little bit weird that we reduce the variable named "lru_size"
> by the consumed swapins. Maybe we should keep this named as
> "nr_freeable", or add another variable here to hold the value after
> subtraction?

Hmmmm yeah now I remember why I called it nr_freeable back then. Seems
like past Nhat is a bit smarter than present Nhat :)
I'll just revert this part then.
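
For reference, a rough userspace sketch of how the consumption loop
could read with a separate return value instead of reusing lru_size.
This is an assumption about the shape of the next version, not the
actual code: consume_swapins() is a hypothetical helper, and C11
atomics stand in for atomic_long_try_cmpxchg().

```c
#include <stdatomic.h>

/*
 * Consume up to lru_size pending disk swapins from *nr_swapins and return
 * the number of LRU entries that are still considered freeable.
 */
static unsigned long consume_swapins(atomic_ulong *nr_swapins,
				     unsigned long lru_size)
{
	unsigned long cur = atomic_load(nr_swapins);
	unsigned long remain;

	do {
		/* Leave behind whatever exceeds the current LRU size. */
		remain = cur > lru_size ? cur - lru_size : 0;
		/* On failure, cur is refreshed with the latest value; retry. */
	} while (!atomic_compare_exchange_weak(nr_swapins, &cur, remain));

	/* cur - remain is the amount consumed; it never exceeds lru_size. */
	return lru_size - (cur - remain);
}
```

With 30 pending swapins and an LRU of 100 entries, this returns 70 and
leaves the counter at 0. With 250 pending swapins it returns 0 and
leaves 150 behind for the next count.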



* Re: [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
  2024-08-05 23:11     ` Nhat Pham
@ 2024-08-05 23:58       ` Yosry Ahmed
  0 siblings, 0 replies; 10+ messages in thread
From: Yosry Ahmed @ 2024-08-05 23:58 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

[..]
> > > @@ -1167,25 +1189,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > >                 return SHRINK_STOP;
> > >         }
> > >
> > > -       nr_protected =
> > > -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > > -       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> > > -
> > > -       /*
> > > -        * Abort if we are shrinking into the protected region.
> > > -        *
> > > -        * This short-circuiting is necessary because if we have too many multiple
> > > -        * concurrent reclaimers getting the freeable zswap object counts at the
> > > -        * same time (before any of them made reasonable progress), the total
> > > -        * number of reclaimed objects might be more than the number of unprotected
> > > -        * objects (i.e the reclaimers will reclaim into the protected area of the
> > > -        * zswap LRU).
> > > -        */
> > > -       if (nr_protected >= lru_size - sc->nr_to_scan) {
> > > -               sc->nr_scanned = 0;
> > > -               return SHRINK_STOP;
> > > -       }
> > > -
> >
> > Do we need a similar mechanism to protect against concurrent shrinkers
> > quickly consuming nr_swapins?
>
> Not for nr_swapins consumption per se, and the original reason why I
> included this (racy) check is just so that concurrent reclaimers do
> not disrespect the protection scheme. We had no guarantee that we
> wouldn't just reclaim into the protected region (well even with this
> racy check technically). With the second chance scheme, a "protected"
> page (i.e with its referenced bit set) would not be reclaimed right
> away - a shrinker encountering it would have to "age" it first (by
> unsetting the referenced bit), so the intended protection is enforced.
>
> That said, I do believe we need a mechanism to limit the concurrency
> here. The amount of pages aged/reclaimed should scale (linearly?
> proportionally?) with the reclaim pressure, i.e more reclaimers ==
> more pages reclaimed/aged, so the current behavior is desired.
> However, at some point, if we have more shrinkers than there is work
> assigned to each of them, we might be unnecessarily wasting resources
> (and potentially building up the nr_deferred counter that we discussed
> in v1 of the patch series). Additionally, we might be overshrinking in
> a very short amount of time, without letting the system have the
> chance to react and provide feedback (through swapins/refaults) to the
> memory reclaimers.
>
> But let's do this as a follow-up work :) It seems orthogonal to what
> we have here.

Agreed, as long as the data shows we don't regress by removing this
part I am fine with doing this as a follow-up work.

>
> > > -        * Subtract the lru size by an estimate of the number of pages
> > > -        * that should be protected.
> > > +        * Subtract the lru size by the number of pages that are recently swapped
> >
> > nit: I don't think "subtract by" is correct, it's usually "subtract
> > from". So maybe "Subtract the number of pages that are recently
> > swapped in from the lru size"? Also, should we remain consistent about
> > mentioning that these are disk swapins throughout all the comments to
> > keep things clear?
>
> Yeah I should be clearer here - it should be swapped in from disk, or
> more generally (accurately?) swapped in from the backing swap device
> (but the latter can change once we decouple swap from zswap). Or
> maybe swapped in from the secondary tier?
>
> Let's just not overthink and go with swapped in from disk for now :)

Agreed :)

I will take a look at the new version soon, thanks for working on this.



* [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages
  2024-07-30 22:27 [PATCH v2 0/2] improving dynamic zswap shrinker protection scheme Nhat Pham
  2024-07-30 22:27 ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Nhat Pham
@ 2024-07-30 22:27 ` Nhat Pham
  2024-08-01 20:02   ` Yosry Ahmed
  1 sibling, 1 reply; 10+ messages in thread
From: Nhat Pham @ 2024-07-30 22:27 UTC (permalink / raw)
  To: akpm
  Cc: hannes, yosryahmed, shakeel.butt, linux-mm, kernel-team,
	linux-kernel, flintglass, chengming.zhou

Currently, we only increment the swapin counter on pivot pages. This
means we are not taking into account pages that also need to be swapped
in, but are already taken care of as part of the readahead window. We
are also incrementing when the pages are read from the zswap pool, which
is inaccurate.

This patch rectifies this issue by incrementing whenever we need to
perform a non-zswap read.
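
The counting rule can be summarized with a small userspace model (a
sketch only; struct swapin_stats and model_swap_read() are hypothetical
names, and the real counter is an atomic in the lruvec state):

```c
#include <stdbool.h>

/* Hypothetical per-lruvec counter; the patch uses an atomic_long_t. */
struct swapin_stats {
	unsigned long nr_disk_swapins;
};

/*
 * Model of the read path after this patch: a folio is counted if and only
 * if zswap could not serve it. Returns true when the read had to go to the
 * backing swap device.
 */
static bool model_swap_read(struct swapin_stats *st, bool found_in_zswap)
{
	if (found_in_zswap)
		return false;	/* zswap hit: no disk IO, no protection bump */

	st->nr_disk_swapins++;	/* pivot and readahead pages alike */
	return true;
}
```

Because the increment sits in the common read path rather than in the
readahead callers, it no longer matters whether the folio was the
faulting (pivot) page or part of the readahead window.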

To test this change, I built the kernel under a cgroup with its
memory.max set to 2 GB:

real: 236.66s
user: 4286.06s
sys: 652.86s
swapins: 81552

For comparison, with just the new second chance algorithm, the build
time is as follows:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

With neither change:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

(average over 5 runs)

With this change, the kernel CPU time reduces by a further 1.7%, and
the real time is reduced by another 3.3%, compared to just the second
chance algorithm by itself. The swapins count also reduces by another
13.85%.

Combining the two changes, we reduce the real time by 10.32%, kernel CPU
time by 3%, and number of swapins by 64.12%.
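
The percentages above are plain relative reductions against the
corresponding baseline. A tiny helper (pct_reduction() is a name made
up for this illustration) reproduces them from the raw numbers:

```c
/* Percentage by which `after` is lower than `before`. */
static double pct_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

For example, pct_reduction(263.89, 236.66) is roughly 10.32 for the
combined real time, pct_reduction(673.29, 652.86) is roughly 3.03 for
kernel CPU time, and pct_reduction(227300.5, 81552) is roughly 64.12
for swapins.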

To gauge the new scheme's ability to offload cold data, I ran another
benchmark, in which the kernel was built under a cgroup with memory.max
set to 3 GB, but with 0.5 GB worth of cold data allocated before each
build (in a shmem file).

Under the old scheme:

real: 197.18s
user: 4365.08s
sys: 289.02s
zswpwb: 72115.2

Under the new scheme:

real: 195.8s
user: 4362.25s
sys: 290.14s
zswpwb: 87277.8

(average over 5 runs)

Notice that we actually observe a 21% increase in the number of written
back pages - so the new scheme is just as good, if not better at
offloading pages from the zswap pool when they are cold. Build time
reduces by around 0.7% as a result.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/page_io.c    | 11 ++++++++++-
 mm/swap_state.c |  8 ++------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index ff8c99ee3af7..0004c9fbf7e8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)

 	if (zswap_load(folio)) {
 		folio_unlock(folio);
-	} else if (data_race(sis->flags & SWP_FS_OPS)) {
+		goto finish;
+	}
+
+	/*
+	 * We have to read the page from slower devices. Increase zswap protection.
+	 */
+	zswap_folio_swapin(folio);
+
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_read_folio_fs(folio, plug);
 	} else if (synchronous) {
 		swap_read_folio_bdev_sync(folio, sis);
@@ -529,6 +537,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		swap_read_folio_bdev_async(folio, sis);
 	}

+finish:
 	if (workingset) {
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a1726e49a5eb..3a0cf965f32b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,10 +698,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	/* The page was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }

@@ -850,10 +848,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	/* The folio was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }

-- 
2.43.0


* Re: [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages
  2024-07-30 22:27 ` [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages Nhat Pham
@ 2024-08-01 20:02   ` Yosry Ahmed
  2024-08-02 23:46     ` Nhat Pham
  0 siblings, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2024-08-01 20:02 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, we only increment the swapin counter on pivot pages. This
> means we are not taking into account pages that also need to be swapped
> in, but are already taken care of as part of the readahead window. We

Hmm, but there is a chance that these pages are not actually needed,
in which case we will unnecessarily increase the zswap protection.
Does the readahead window self-correct if the pages were not used?

> are also incrementing when the pages are read from the zswap pool, which
> is inaccurate.

I feel like this is the more important part. It should be the focus of
the commit log with more details (i.e. why is it wrong to increment
the zswap protection in this case).

Do we need a Fixes and cc:stable for this one? Maybe it can be moved
first to make backports easy.

>
> This patch rectifies this issue by incrementing whenever we need to
> perform a non-zswap read.
>
> To test this change, I built the kernel under a cgroup with its
> memory.max set to 2 GB:
>
> real: 236.66s
> user: 4286.06s
> sys: 652.86s
> swapins: 81552
>
> For comparison, with just the new second chance algorithm, the build
> time is as follows:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> With neither change:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> (average over 5 runs)
>
> With this change, the kernel CPU time reduces by a further 1.7%, and
> the real time is reduced by another 3.3%, compared to just the second
> chance algorithm by itself. The swapins count also reduces by another
> 13.85%.
>
> Combining the two changes, we reduce the real time by 10.32%, kernel CPU
> time by 3%, and number of swapins by 64.12%.
>
> To gauge the new scheme's ability to offload cold data, I ran another
> benchmark, in which the kernel was built under a cgroup with memory.max
> set to 3 GB, but with 0.5 GB worth of cold data allocated before each
> build (in a shmem file).
>
> Under the old scheme:
>
> real: 197.18s
> user: 4365.08s
> sys: 289.02s
> zswpwb: 72115.2
>
> Under the new scheme:
>
> real: 195.8s
> user: 4362.25s
> sys: 290.14s
> zswpwb: 87277.8
>
> (average over 5 runs)
>
> Notice that we actually observe a 21% increase in the number of written
> back pages - so the new scheme is just as good, if not better at
> offloading pages from the zswap pool when they are cold. Build time
> reduces by around 0.7% as a result.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/page_io.c    | 11 ++++++++++-
>  mm/swap_state.c |  8 ++------
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ff8c99ee3af7..0004c9fbf7e8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>
>         if (zswap_load(folio)) {
>                 folio_unlock(folio);
> -       } else if (data_race(sis->flags & SWP_FS_OPS)) {
> +               goto finish;
> +       }
> +
> +       /*
> +        * We have to read the page from slower devices. Increase zswap protection.
> +        */
> +       zswap_folio_swapin(folio);
> +
> +       if (data_race(sis->flags & SWP_FS_OPS)) {
>                 swap_read_folio_fs(folio, plug);
>         } else if (synchronous) {
>                 swap_read_folio_bdev_sync(folio, sis);
> @@ -529,6 +537,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>                 swap_read_folio_bdev_async(folio, sis);
>         }
>
> +finish:
>         if (workingset) {
>                 delayacct_thrashing_end(&in_thrashing);
>                 psi_memstall_leave(&pflags);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index a1726e49a5eb..3a0cf965f32b 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -698,10 +698,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         /* The page was likely read above, so no need for plugging here */
>         folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>                                         &page_allocated, false);
> -       if (unlikely(page_allocated)) {
> -               zswap_folio_swapin(folio);
> +       if (unlikely(page_allocated))
>                 swap_read_folio(folio, NULL);
> -       }
>         return folio;
>  }
>
> @@ -850,10 +848,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         /* The folio was likely read above, so no need for plugging here */
>         folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
>                                         &page_allocated, false);
> -       if (unlikely(page_allocated)) {
> -               zswap_folio_swapin(folio);
> +       if (unlikely(page_allocated))
>                 swap_read_folio(folio, NULL);
> -       }
>         return folio;
>  }
>
> --
> 2.43.0



* Re: [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages
  2024-08-01 20:02   ` Yosry Ahmed
@ 2024-08-02 23:46     ` Nhat Pham
  2024-08-03  3:22       ` Yosry Ahmed
  0 siblings, 1 reply; 10+ messages in thread
From: Nhat Pham @ 2024-08-02 23:46 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

On Thu, Aug 1, 2024 at 1:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> Hmm, but there is a chance that these pages are not actually needed,
> in which case we will unnecessarily increase the zswap protection.
> Does the readahead window self-correct if the pages were not used?

Hmm yeah it's kinda hard to predict if a swapped in page is strictly
necessary in this context. We don't have this information at the time
of the read.

That said, I think erring on the side of safety is OK here - my
understanding is that readahead, while predictive in nature, only gets
progressively more aggressive if we get signals that it's helpful (i.e.
the memory access patterns display sequential behavior).
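
To make the self-correcting idea concrete, here is a minimal sketch in C
of a window that only grows when earlier readahead pages were actually
used. The names (toy_next_window, TOY_MAX_WINDOW) and the doubling and
halving policy are assumptions made up for illustration; this is not the
kernel's actual readahead heuristic.

```c
/* Hypothetical toy model of an adaptive readahead window. */
#define TOY_MAX_WINDOW 8U

/*
 * Return the next window size given the current one and the number of
 * previously read-ahead pages that were later accessed ("hits").
 */
static unsigned int toy_next_window(unsigned int window, unsigned int hits)
{
	if (hits > 0) {
		/* Readahead was useful: grow, up to the cap. */
		window *= 2;
		if (window > TOY_MAX_WINDOW)
			window = TOY_MAX_WINDOW;
	} else if (window > 1) {
		/* Readahead was wasted: back off, never below one page. */
		window /= 2;
	}
	return window;
}
```

In this model, a run of unused readahead pages shrinks the window back
toward a single page, which bounds how many unneeded swapins could be
counted.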

I think we also accept this slight inaccuracy (i.e. for pages in the
readahead window that might not necessarily be needed) in the
workingset refault handling behavior. Could you fact check me,
Johannes?

>
> > are also incrementing when the pages are read from the zswap pool, which
> > is inaccurate.
>
> I feel like this is the more important part. It should be the focus of
> the commit log with more details (i.e. why is it wrong to increment
> the zswap protection in this case).

Yeah this is pretty important too :) Maybe I should make it clearer in
the patch commit.

>
> Do we need a Fixes and cc:stable for this one? Maybe it can be moved
> first to make backports easy.

Hmm.

*Technically*, this is broken in older versions of the shrinker as
well, but it's more of an optimization than a bug that can crash the
kernel, so I don't know if it qualifies for a Fixes tag?

Another factor is, under the old scheme, this does not move the needle
much - at least in my benchmarks. This is because the decaying
behavior is so aggressive that incrementing the counter in a couple
places does not matter, when it will be rapidly halved later.
This fix only shows clear improvements when applied on top of the new
second chance scheme.
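
As a rough illustration of why a few extra increments barely register
under the old scheme, here is a toy model in C. The struct and function
names (toy_protection, toy_swapin, toy_decay, toy_run) and the numbers
are made up for the example and are not the kernel's actual code; the
only behavior taken from the discussion above is that the counter is
bumped on swapin and halved on decay.

```c
/* Hypothetical toy model of the old protection counter. */
struct toy_protection {
	unsigned long nr_protected;
};

/* A swapin bumps the protection counter by one. */
static void toy_swapin(struct toy_protection *p)
{
	p->nr_protected++;
}

/* A decay step halves the counter (integer division). */
static void toy_decay(struct toy_protection *p)
{
	p->nr_protected /= 2;
}

/* Apply some extra swapins, then a number of decay steps. */
static unsigned long toy_run(unsigned long start, unsigned int extra_swapins,
			     unsigned int decays)
{
	struct toy_protection p = { .nr_protected = start };
	unsigned int i;

	for (i = 0; i < extra_swapins; i++)
		toy_swapin(&p);
	for (i = 0; i < decays; i++)
		toy_decay(&p);
	return p.nr_protected;
}
```

With a starting value of 1000 and five decay steps, toy_run(1000, 0, 5)
and toy_run(1000, 8, 5) both return 31, so eight extra increments leave
no trace once the halving kicks in.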

I don't have a strong opinion here, but it doesn't seem worth it to
backport IMHO :)


* Re: [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages
  2024-08-02 23:46     ` Nhat Pham
@ 2024-08-03  3:22       ` Yosry Ahmed
  0 siblings, 0 replies; 10+ messages in thread
From: Yosry Ahmed @ 2024-08-03  3:22 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, hannes, shakeel.butt, linux-mm, kernel-team, linux-kernel,
	flintglass, chengming.zhou

On Fri, Aug 2, 2024 at 4:46 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 1:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > Hmm, but there is a chance that these pages are not actually needed,
> > in which case we will unnecessarily increase the zswap protection.
> > Does the readahead window self-correct if the pages were not used?
>
> Hmm yeah it's kinda hard to predict if a swapped in page is strictly
> necessary in this context. We don't have this information at the time
> of the read.
>
> That said, I think erring on the side of safety is OK here - my
> understanding is that readahead, while predictive in nature, only gets
> progressively more aggressive if we get signals that it's helpful (i.e.
> the memory access patterns display sequential behavior).

If the readahead logic is expected to adapt in these situations (and I
think it is), then I think we are fine. Perhaps we should just leave a
comment that we may increase the protection more than we should for
those readahead cases.

>
> I think we also accept this slight inaccuracy (i.e. for pages in the
> readahead window that might not necessarily be needed) in the
> workingset refault handling behavior. Could you fact check me,
> Johannes?
>
>
> >
> > > are also incrementing when the pages are read from the zswap pool, which
> > > is inaccurate.
> >
> > I feel like this is the more important part. It should be the focus of
> > the commit log with more details (i.e. why is it wrong to increment
> > the zswap protection in this case).
>
> Yeah this is pretty important too :) Maybe I should make it clearer in
> the patch commit.
>
> >
> > Do we need a Fixes and cc:stable for this one? Maybe it can be moved
> > first to make backports easy.
>
> Hmm.
>
> *Technically*, this is broken in older versions of the shrinker as
> well, but it's more of an optimization than a bug that can crash the
> kernel, so I don't know if it qualifies for a Fixes tag?
>
> Another factor is, under the old scheme, this does not move the needle
> much - at least in my benchmarks. This is because the decaying
> behavior is so aggressive that incrementing the counter in a couple
> places does not matter, when it will be rapidly divided by half later.
> This fix only shows clear improvements when applied on top of the new
> second chance scheme.
>
> I don't have a strong opinion here, but it doesn't seem worth it to
> backport IMHO :)

I thought it's a simple change worth backporting, but if it doesn't
move the needle without the second chance algorithm then it's probably
not worth it.

I would still add the "Fixes" tag because technically the logic is
wrong without this patch: it increases the zswap protection when there
are swapins from zswap, which doesn't make much sense.



Thread overview: 10+ messages
2024-07-30 22:27 [PATCH v2 0/2] improving dynamic zswap shrinker protection scheme Nhat Pham
2024-07-30 22:27 ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Nhat Pham
2024-07-30 22:47   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker (fix) Nhat Pham
2024-08-01 19:57   ` [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker Yosry Ahmed
2024-08-05 23:11     ` Nhat Pham
2024-08-05 23:58       ` Yosry Ahmed
2024-07-30 22:27 ` [PATCH v2 2/2] zswap: increment swapin count for non-pivot swapped in pages Nhat Pham
2024-08-01 20:02   ` Yosry Ahmed
2024-08-02 23:46     ` Nhat Pham
2024-08-03  3:22       ` Yosry Ahmed
