linux-mm.kvack.org archive mirror
* [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
@ 2024-07-06  2:25 Takero Funaki
  2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel


This series addresses some issues and introduces a minor improvement in
the zswap global shrinker.

This version addresses issues discovered during the review of v1:
https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
and includes three additional patches to fix another issue uncovered by
applying v1.

Changes in v2:
- Added 3 patches to reduce resource contention.
mm: zswap: fix global shrinker memcg iteration:
- Change the loop style (Yosry, Nhat, Shakeel)
- To avoid skipping online memcgs, save zswap_last_shrink to detect cursor changes made by the cleaner.  (Yosry, Nhat, Shakeel)
mm: zswap: fix global shrinker error handling logic:
- Change error code for no-writeback memcg. (Yosry)
- Use nr_scanned to check if lru is empty. (Yosry)

Changes in v1:
mm: zswap: fix global shrinker memcg iteration:
- Drop and reacquire spinlock before skipping a memcg.
- Add some comment to clarify the locking mechanism.
mm: zswap: proactive shrinking before pool size limit is hit:
- Remove unneeded check before scheduling work.
- Change shrink start threshold to accept_thr_percent + 1%.


Patches
===============================

1. Fix the memcg iteration logic that aborts iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking at accept threshold + 1%.
4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag so that it does
   not block pageout.
5. Store incompressible pages as-is to accept all pages.
6. Interrupt the shrinker to avoid blocking page-in/out.

Patches 1 to 3 should be applied in this order to avoid potential loops
caused by the first issue. Patches 3 and later can be applied
independently, but the first two issues must be resolved to ensure the
shrinker can evict pages. Patches 4 to 6 address resource contention
issues in the current shrinker that were uncovered by applying patch 3.

With this series applied, the shrinker proactively continues to evict
pages down to accept_threshold_percent, as documented in patch 3. As a
side effect of the changes to the hysteresis logic, zswap no longer
rejects pages below the max pool limit.

The series is rebased on mainline 6.10-rc6.

Proactive shrinking (Patch 1 to 3)
===============================

Visible issue to resolve
-------------------------------
The visible issue with the current global shrinker is that pageout/in
operations from active processes are slow when zswap is near its max
pool size. This is particularly significant on small memory systems
where total swap usage exceeds what zswap can store. This results in old
pages occupying most of the zswap pool space, with recent pages using
the swap disk directly.

Root cause of the issue
-------------------------------
This issue is caused by zswap maintaining the pool size near 100%. Since
the shrinker fails to shrink the pool to accept_threshold_percent, zswap
rejects incoming pages more frequently than it should. The rejected
pages are directly written to disk while zswap protects old pages from
eviction, leading to slow pageout/in performance for recent pages.

Changes to resolve the issue
-------------------------------
If the pool size were shrunk proactively, rejections caused by pool
limit hits would be less likely. New incoming pages could be accepted
as the pool gains some space in advance, while older pages are written
back in the background. zswap would then be filled with recent pages,
as expected by the LRU logic.

Patches 1 and 2 ensure the shrinker reduces the pool to the
accept_threshold_percent. Without patch 2, shrink_worker() stops
evicting pages on the 16th encounter with a memcg that has no stored
pages (e.g., tiny services managed under systemd).
Patch 3 makes zswap_store() trigger the shrinker before reaching the max
pool size.
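
For illustration, with the default tunables (max_pool_percent=20,
accept_threshold_percent=90) on a 1GB machine, the thresholds work out
roughly as follows (approximate numbers, derived from the defaults):

  max pool     = 1024MB * 20%         ~= 205MB
  shrink start = 205MB  * (90% + 1%)  ~= 186MB
  shrink stop  = 205MB  * 90%         ~= 184MB

so the shrinker is queued once the pool exceeds ~186MB and keeps
evicting in the background until the pool drops back to ~184MB.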

With this series, zswap keeps some space available in advance to reduce
the probability of problematic pool_limit_hit situations, thus reducing
slow reclaim and the page priority inversion against the LRU order.

Visible changes
-------------------------------
Once proactive shrinking reduces the pool size, pageouts complete
instantly as long as the space prepared by proactive shrinking can store
the reclaimed pages. If an admin sees a large pool_limit_hit, lowering
the accept_threshold_percent will improve active process performance.
The optimal point depends on the tradeoff between active process
pageout/in performance and background services' pagein performance.

Benchmark
-------------------------------
The benchmark below observes active process performance under memory
pressure, coexisting with background services occupying the zswap pool.

The tests were run on a VM with 2 vCPUs and 1GB RAM + 4GB swap, using a
20% max pool, a 50% accept threshold (about 100MB of proactive shrink
headroom), and zsmalloc/lz4. This test favors the active process over
the background processes to demonstrate my intended workload.

It runs `apt -qq update` 10 times under a 60MB memcg memory.max
constraint to emulate an active process running under pressure. The
zswap pool is filled prior to the test by allocating a large amount of
memory (~1.5GB) to emulate inactive background services.
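
(For reference, with this configuration the max pool is about
1024MB * 20% ~= 205MB, and the 50% accept threshold keeps about 102MB
resident after shrinking, which is where the ~100MB of proactive shrink
headroom quoted above comes from; approximate figures.)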

The counters below are from vmstat and the zswap debugfs stats. The
times are averages in seconds, measured with /usr/bin/time -v.

pre-patch, 6.10-rc4
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 845         | 845         | 0          |
| pool_total_size    | 201539584   | 201539584   | 0          |
| stored_pages       | 63138       | 63138       | 0          |
| written_back_pages | 12          | 12          | 0          |
| pswpin             | 387         | 32412       | 32025      |
| pswpout            | 153078      | 197138      | 44060      |
| zswpin             | 0           | 0           | 0          |
| zswpout            | 63150       | 63150       | 0          |
| zswpwb             | 12          | 12          | 0          |

| Time              |              |
|-------------------|--------------|
| elapsed           | 8.473        |
| user              | 1.881        |
| system            | 0.814        |

post-patch, 6.10-rc4 with patch 1 to 5
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 81861       | 81861       | 0          |
| pool_total_size    | 75001856    | 87117824    | 12115968   |
| reject_reclaim_fail| 0           | 32          | 32         |
| same_filled_pages  | 135         | 135         | 0          |
| stored_pages       | 23919       | 27549       | 3630       |
| written_back_pages | 97665       | 106994      | 10329      |
| pswpin             | 4981        | 8223        | 3242       |
| pswpout            | 179554      | 188883      | 9329       |
| zswpin             | 293863      | 590014      | 296151     |
| zswpout            | 455338      | 764882      | 309544     |
| zswpwb             | 97665       | 106994      | 10329      |

| Time              |              |
|-------------------|--------------|
| elapsed           | 4.525        |
| user              | 1.839        |
| system            | 1.467        |

Although pool_limit_hit did not increase in either case, zswap_store()
was rejecting pages before this patch. Note that, before this series,
zswap_store() did not increment pool_limit_hit on rejections caused by
the limit-hit hysteresis (only the first few hits were counted).

The pre-patch result shows that the existing zswap global shrinker
cannot write back effectively and locks the old pages in the pool. The
pswpin/out counters indicate that the active process uses the swap
device directly.

The post-patch result shows that zswpin/out/wb increased as expected,
indicating that the active process uses zswap and the old pages of the
background services are evicted from the pool. pswpin/out are
significantly reduced compared to the pre-patch results.


System responsiveness issue (patch 4 to 6)
===============================
After applying patches 1 to 3, I encountered severe responsiveness
degradation while the zswap global shrinker was running under heavy
memory pressure.

Visible issue to resolve
-------------------------------
The visible issue happens with patches 1 to 3 applied when a large
amount of memory allocation occurs and zswap cannot store the incoming
pages. While the global shrinker is writing back pages, the system
stops responding as if under heavy memory thrashing.

This issue is less likely to happen without patches 1 to 3 or when
zswap is disabled. I believe this is because the global shrinker could
not write back a meaningful number of pages, as described in patch 2.

Root cause and changes to resolve the issue
-------------------------------
The root cause of this responsiveness issue appears to be the zswap
shrinker blocking IO needed for memory reclaim and faults. I introduced
three patches to reduce possible blocking in the following problematic
situations:

1. Contention on workqueue thread pools by shrink_worker() using
WQ_MEM_RECLAIM unnecessarily.

Although the shrinker runs concurrently with memory reclaim, shrinking
is not required to reclaim memory, since zswap_store() can reject pages
without interfering with memory reclaim progress. shrink_worker()
should not use WQ_MEM_RECLAIM and should be deferred while other work
on WQ_MEM_RECLAIM workqueues is reclaiming memory. The existing code
requires allocating memory inside shrink_worker(), potentially blocking
other latency-sensitive reclaim work.

2. Contention on swap IO.

Since zswap_writeback_entry() performs write IO in 4KB pages, it
consumes a lot of IOPS, increasing the IO latency of swapout/in. We
should not perform IO for background shrinking while zswap_store() is
rejecting pages or zswap_load() is failing to find stored pages. This
series implements two mitigations to reduce the IO contention:

2-a. Do not reject pages in zswap_store().
This is mostly achieved by patch 3. With patch 3, zswap can prepare
space proactively and accept pages while the global shrinker is running.

To further reduce rejections, patch 5 (store incompressible pages) is
added. It reduces rejections by storing incompressible pages as-is.
When zsmalloc is used, we can accept incompressible pages with a small
memory overhead. It is a minor optimization, but I think it is worth
implementing. It does not improve performance on the current zbud, but
it does not incur a performance penalty either.
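
For scale, assuming a 4KB page (the figures below come from the
comments in patch 5):

  as-is threshold:      dlen > PAGE_SIZE - 64 = 4032 bytes
  zsmalloc huge class:  objects larger than 3632 bytes already occupy
                        a dedicated page

so a page stored as 4096 bytes consumes the same single zsmalloc page
as a page that compressed to 3633 bytes, while skipping decompression
on load.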

2-b. Interrupt writeback while pagein/out.
Once zswap runs out of prepared space, it cannot accept incoming pages,
incurring direct writes to the swap disk. At this point, the shrinker
is proactively evicting pages, leading to IO contention with memory
reclaim.

Performing low-priority IO would be straightforward but would require
reimplementing a low-priority version of __swap_writepage(). Instead,
in patch 6, I implemented a heuristic that delays the next zswap
writeback based on the elapsed time since zswap_store() rejected a
page.

When zswap_store() hits the max pool size and rejects a page,
swap_writepage() immediately performs the writeback to disk. The
current jiffies value is saved to tell shrink_worker() to sleep for up
to ZSWAP_GLOBAL_SHRINKER_DELAY_MS msec.

The same logic applies to zswap_load(). When zswap cannot find a page
in the stored pool, the pagein requires read IO from the swap device,
so the global shrinker should be interrupted here.

This patch proposes a constant delay of 500 milliseconds, aligning with
the mq-deadline target latency.
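
For reference, the delay check at the top of the shrink_worker() loop
in patch 6 boils down to the following logic (a simplified sketch; the
variable names match the patch, but the helper function is only for
illustration since the patch open-codes this in shrink_worker()):

#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500

/* last rejected store / missed load, in jiffies (set by the hot paths) */
static unsigned long zswap_shrinker_delay_start;

/* Called at the top of the shrink_worker() reclaim loop. */
static void zswap_shrinker_maybe_delay(void)
{
	const unsigned long delay =
		msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
	unsigned long sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
	unsigned long now = jiffies;

	/*
	 * Sleep only if the saved timestamp is recent, i.e.
	 * now < sleepuntil < now + delay + 1. Otherwise the timestamp
	 * is stale (or jiffies wrapped) and writeback proceeds.
	 */
	if (time_before(now, sleepuntil) &&
	    time_before(sleepuntil, now + delay + 1))
		fsleep(jiffies_to_usecs(sleepuntil - now));
}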

Visible change
-------------------------------
With patches 4 to 6, the global shrinker pauses the writeback while
pagein/out operations are using the swap device. This change reduces
resource contention and makes memory reclaim/faults complete faster,
thereby reducing system responsiveness degradation. 

Intended scenario for memory reclaim:
1. zswap pool < accept_threshold as the initial state. This is achieved
   by patch 3, proactive shrinking.
2. Active processes start allocating pages. Pageout is buffered by zswap
   without IO.
3. zswap reaches shrink_start_threshold. zswap continues to buffer
   incoming pages and starts writeback immediately in the background.
4. zswap reaches max pool size. zswap interrupts the global shrinker and
   starts rejecting pages. Write IO for the rejected page will consume
   all IO resources.
5. Active processes stop allocating pages. After the delay, the shrinker
   resumes writeback until the accept threshold.

Benchmark
-------------------------------
To demonstrate that the shrinker writeback does not interfere with
pagein/out operations, I measured the elapsed time of allocating 2GB of
3/4-compressible data with a Python script, averaged over 10 runs:

|                      | elapsed| user  | sys   |
|----------------------|--------|-------|-------|
| With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
| With all patches     | 11.17  | 0.116 | 1.490 |
| zswap off (baseline) | 11.81  | 0.149 | 1.381 |

Although this test cannot distinguish responsiveness issues caused by
zswap writeback from ordinary memory thrashing between plain pagein/out,
the difference from the baseline indicates that the patches reduce the
pageout performance degradation caused by zswap writeback.

The tests were run on kernel 6.10-rc5 on a VM with 1GB RAM (idling Azure
VM with persistent block swap device), 2 vCPUs, zsmalloc/lz4, 25% max
pool, and 50% accept threshold.
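
(For scale, with these settings the max pool is roughly 1024MB * 25%
~= 256MB, and the 50% accept threshold keeps roughly 128MB resident
after shrinking; approximate figures derived from the configuration
above.)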

---


Takero Funaki (6):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit
  mm: zswap: make writeback run in the background
  mm: zswap: store incompressible page as-is
  mm: zswap: interrupt shrinker writeback while pagein/out IO

 Documentation/admin-guide/mm/zswap.rst |  17 +-
 mm/zswap.c                             | 264 ++++++++++++++++++++-----
 2 files changed, 219 insertions(+), 62 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-08  4:54   ` Chengming Zhou
  2024-07-17  1:54   ` Yosry Ahmed
  2024-07-06  2:25 ` [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic Takero Funaki
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.

The problem was that shrink_worker() would stop iterating when a memcg
was being offlined and restart from the tree root.  Now, it properly
handles the offline memcg and continues shrinking with the next memcg.

Note that, to avoid a refcount leak of offline memcg encountered during
the memcg tree walking, shrink_worker() must continue iterating to find
the next online memcg.

The following minor issues in the existing code are also resolved by the
change in the iteration logic:

- A rare temporary refcount leak in the offline memcg cleaner, where the
  next memcg of the offlined memcg is also offline.  The leaked memcg
  cannot be freed until the next shrink_worker() releases the reference.

- One memcg was skipped during shrinking when the offline memcg cleaner
  advanced the cursor of the memcg tree.  This is addressed by a flag
  indicating that the cursor has already been advanced.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 94 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 73 insertions(+), 21 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index a50e2986cd2f..29944d8145af 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -171,6 +171,7 @@ static struct list_lru zswap_list_lru;
 /* The lock protects zswap_next_shrink updates. */
 static DEFINE_SPINLOCK(zswap_shrink_lock);
 static struct mem_cgroup *zswap_next_shrink;
+static bool zswap_next_shrink_changed;
 static struct work_struct zswap_shrink_work;
 static struct shrinker *zswap_shrinker;
 
@@ -775,12 +776,39 @@ void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
+/*
+ * This function should be called when a memcg is being offlined.
+ *
+ * Since the global shrinker shrink_worker() may hold a reference
+ * of the memcg, we must check and release the reference in
+ * zswap_next_shrink.
+ *
+ * shrink_worker() must handle the case where this function releases
+ * the reference of memcg being shrunk.
+ */
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 {
 	/* lock out zswap shrinker walking memcg tree */
 	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg)
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+	if (zswap_next_shrink == memcg) {
+		/*
+		 * We advance the cursor to put back the offlined memcg.
+		 * shrink_worker() should not advance the cursor again.
+		 */
+		zswap_next_shrink_changed = true;
+
+		do {
+			zswap_next_shrink = mem_cgroup_iter(NULL,
+					zswap_next_shrink, NULL);
+		} while (zswap_next_shrink &&
+				!mem_cgroup_online(zswap_next_shrink));
+		/*
+		 * We verified the next memcg is online.  Even if the next
+		 * memcg is being offlined here, another cleaner must be
+		 * waiting for our lock.  We can leave the online memcg
+		 * reference.
+		 */
+	}
 	spin_unlock(&zswap_shrink_lock);
 }
 
@@ -1319,18 +1347,42 @@ static void shrink_worker(struct work_struct *w)
 	/* Reclaim down to the accept threshold */
 	thr = zswap_accept_thr_pages();
 
-	/* global reclaim will select cgroup in a round-robin fashion. */
+	/* global reclaim will select cgroup in a round-robin fashion.
+	 *
+	 * We save iteration cursor memcg into zswap_next_shrink,
+	 * which can be modified by the offline memcg cleaner
+	 * zswap_memcg_offline_cleanup().
+	 *
+	 * Since the offline cleaner is called only once, we cannot leave an
+	 * offline memcg reference in zswap_next_shrink.
+	 * We can rely on the cleaner only if we get online memcg under lock.
+	 *
+	 * If we get an offline memcg, we cannot determine the cleaner has
+	 * already been called or will be called later. We must put back the
+	 * reference before returning from this function. Otherwise, the
+	 * offline memcg left in zswap_next_shrink will hold the reference
+	 * until the next run of shrink_worker().
+	 */
 	do {
 		spin_lock(&zswap_shrink_lock);
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		memcg = zswap_next_shrink;
 
 		/*
-		 * We need to retry if we have gone through a full round trip, or if we
-		 * got an offline memcg (or else we risk undoing the effect of the
-		 * zswap memcg offlining cleanup callback). This is not catastrophic
-		 * per se, but it will keep the now offlined memcg hostage for a while.
-		 *
+		 * Start shrinking from the next memcg after zswap_next_shrink.
+		 * To not skip a memcg, do not advance the cursor when it has
+		 * already been advanced by the offline cleaner.
+		 */
+		do {
+			if (zswap_next_shrink_changed) {
+				/* cleaner advanced the cursor */
+				zswap_next_shrink_changed = false;
+			} else {
+				zswap_next_shrink = mem_cgroup_iter(NULL,
+						zswap_next_shrink, NULL);
+			}
+			memcg = zswap_next_shrink;
+		} while (memcg && !mem_cgroup_tryget_online(memcg));
+
+		/*
 		 * Note that if we got an online memcg, we will keep the extra
 		 * reference in case the original reference obtained by mem_cgroup_iter
 		 * is dropped by the zswap memcg offlining callback, ensuring that the
@@ -1344,17 +1396,11 @@ static void shrink_worker(struct work_struct *w)
 			goto resched;
 		}
 
-		if (!mem_cgroup_tryget_online(memcg)) {
-			/* drop the reference from mem_cgroup_iter() */
-			mem_cgroup_iter_break(NULL, memcg);
-			zswap_next_shrink = NULL;
-			spin_unlock(&zswap_shrink_lock);
-
-			if (++failures == MAX_RECLAIM_RETRIES)
-				break;
-
-			goto resched;
-		}
+		/*
+		 * We verified the memcg is online and got an extra memcg
+		 * reference.  Our memcg might be offlined concurrently but the
+		 * respective offline cleaner must be waiting for our lock.
+		 */
 		spin_unlock(&zswap_shrink_lock);
 
 		ret = shrink_memcg(memcg);
@@ -1368,6 +1414,12 @@ static void shrink_worker(struct work_struct *w)
 resched:
 		cond_resched();
 	} while (zswap_total_pages() > thr);
+
+	/*
+	 * We can still hold the original memcg reference.
+	 * The reference is stored in zswap_next_shrink, and then reused
+	 * by the next shrink_worker().
+	 */
 }
 
 /*********************************
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
  2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-17  2:39   ` Yosry Ahmed
  2024-07-06  2:25 ` [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit Takero Funaki
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

This patch fixes the zswap global shrinker, which did not shrink the
zpool as expected.

The issue it addresses is that `shrink_worker()` did not distinguish
between unexpected errors and expected error codes that should be
skipped, such as when there is no stored page in a memcg. This led to
the shrinking process being aborted on the expected error codes.

The shrinker should ignore these cases and skip to the next memcg.
However,  skipping all memcgs presents another problem. To address this,
this patch tracks progress while walking the memcg tree and checks for
progress once the tree walk is completed.

To handle the empty memcg case, the helper function `shrink_memcg()` is
modified to check if the memcg is empty and then return -ENOENT.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 29944d8145af..f092932e652b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1317,10 +1317,10 @@ static struct shrinker *zswap_alloc_shrinker(void)
 
 static int shrink_memcg(struct mem_cgroup *memcg)
 {
-	int nid, shrunk = 0;
+	int nid, shrunk = 0, scanned = 0;
 
 	if (!mem_cgroup_zswap_writeback_enabled(memcg))
-		return -EINVAL;
+		return -ENOENT;
 
 	/*
 	 * Skip zombies because their LRUs are reparented and we would be
@@ -1334,19 +1334,30 @@ static int shrink_memcg(struct mem_cgroup *memcg)
 
 		shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
 					    &shrink_memcg_cb, NULL, &nr_to_walk);
+		scanned += 1 - nr_to_walk;
 	}
+
+	if (!scanned)
+		return -ENOENT;
+
 	return shrunk ? 0 : -EAGAIN;
 }
 
 static void shrink_worker(struct work_struct *w)
 {
 	struct mem_cgroup *memcg;
-	int ret, failures = 0;
+	int ret, failures = 0, progress;
 	unsigned long thr;
 
 	/* Reclaim down to the accept threshold */
 	thr = zswap_accept_thr_pages();
 
+	/*
+	 * We might start from the last memcg.
+	 * That is not a failure.
+	 */
+	progress = 1;
+
 	/* global reclaim will select cgroup in a round-robin fashion.
 	 *
 	 * We save iteration cursor memcg into zswap_next_shrink,
@@ -1390,9 +1401,12 @@ static void shrink_worker(struct work_struct *w)
 		 */
 		if (!memcg) {
 			spin_unlock(&zswap_shrink_lock);
-			if (++failures == MAX_RECLAIM_RETRIES)
+
+			/* tree walk completed but no progress */
+			if (!progress && ++failures == MAX_RECLAIM_RETRIES)
 				break;
 
+			progress = 0;
 			goto resched;
 		}
 
@@ -1407,10 +1421,15 @@ static void shrink_worker(struct work_struct *w)
 		/* drop the extra reference */
 		mem_cgroup_put(memcg);
 
-		if (ret == -EINVAL)
-			break;
+		/* not a writeback candidate memcg */
+		if (ret == -ENOENT)
+			continue;
+
 		if (ret && ++failures == MAX_RECLAIM_RETRIES)
 			break;
+
+		++progress;
+		/* reschedule as we performed some IO */
 resched:
 		cond_resched();
 	} while (zswap_total_pages() > thr);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
  2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
  2024-07-06  2:25 ` [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-12 23:18   ` Nhat Pham
  2024-07-06  2:25 ` [PATCH v2 4/6] mm: zswap: make writeback run in the background Takero Funaki
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

This patch implements proactive shrinking of the zswap pool before the
max pool size limit is reached. It also changes zswap to accept new
pages while the shrinker is running.

To prevent zswap from rejecting new pages and incurring latency when
zswap is full, this patch queues the global shrinker at a pool usage
threshold between 100% and accept_thr_percent, instead of at the max
pool size.  The pool size will be kept between 90% and 91% for the
default accept_thr_percent=90.  Since the global shrinker continues to
shrink down to accept_thr_percent, we no longer need to maintain the
hysteresis variable tracking the pool limit overage in zswap_store().

Before this patch, zswap rejected pages while the shrinker was running
without incrementing the zswap_pool_limit_hit counter. This could be
one reason why zswap wrote new pages through to disk before writing
back old pages.  With this patch, zswap accepts new pages while
shrinking, and increments the counter if and only if zswap rejects
pages because of the max pool size.

Now, reclaims smaller than the proactive shrinking amount finish
instantly and trigger background shrinking.  Admins can check if new
pages are buffered by zswap by monitoring the pool_limit_hit counter.

The name of the sysfs tunable accept_threshold_percent is unchanged as
it is still the stop condition of the shrinker.
The respective documentation is updated to describe the new behavior.

Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 Documentation/admin-guide/mm/zswap.rst | 17 ++++----
 mm/zswap.c                             | 54 ++++++++++++++++----------
 2 files changed, 42 insertions(+), 29 deletions(-)

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..a1d8f167a27a 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -111,18 +111,17 @@ checked if it is a same-value filled page before compressing it. If true, the
 compressed length of the page is set to zero and the pattern or same-filled
 value is stored.
 
-To prevent zswap from shrinking pool when zswap is full and there's a high
-pressure on swap (this will result in flipping pages in and out zswap pool
-without any real benefit but with a performance drop for the system), a
-special parameter has been introduced to implement a sort of hysteresis to
-refuse taking pages into zswap pool until it has sufficient space if the limit
-has been hit. To set the threshold at which zswap would start accepting pages
-again after it became full, use the sysfs ``accept_threshold_percent``
-attribute, e. g.::
+To prevent zswap from rejecting new pages and incurring latency when zswap is
+full, zswap initiates a worker called global shrinker that proactively evicts
+some pages from the pool to swap devices while the pool is reaching the limit.
+The global shrinker continues to evict pages until there is sufficient space to
+accept new pages. To control how many pages should remain in the pool, use the
+sysfs ``accept_threshold_percent`` attribute as a percentage of the max pool
+size, e. g.::
 
 	echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
 
-Setting this parameter to 100 will disable the hysteresis.
+Setting this parameter to 100 will disable the proactive shrinking.
 
 Some users cannot tolerate the swapping that comes with zswap store failures
 and zswap writebacks. Swapping can be disabled entirely (without disabling
diff --git a/mm/zswap.c b/mm/zswap.c
index f092932e652b..24acbab44e7a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_kmemcache_fail;
 
 /* Shrinker work queue */
 static struct workqueue_struct *shrink_wq;
-/* Pool limit was hit, we need to calm down */
-static bool zswap_pool_reached_full;
 
 /*********************************
 * tunables
@@ -118,7 +116,10 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
 static unsigned int zswap_max_pool_percent = 20;
 module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
 
-/* The threshold for accepting new pages after the max_pool_percent was hit */
+/*
+ * The percentage of pool size that the global shrinker keeps in memory.
+ * It does not protect old pages from the dynamic shrinker.
+ */
 static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
 module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
 		   uint, 0644);
@@ -488,6 +489,20 @@ static unsigned long zswap_accept_thr_pages(void)
 	return zswap_max_pages() * zswap_accept_thr_percent / 100;
 }
 
+/*
+ * Returns threshold to start proactive global shrinking.
+ */
+static inline unsigned long zswap_shrink_start_pages(void)
+{
+	/*
+	 * Shrinker will evict pages to the accept threshold.
+	 * We add 1% to not schedule shrinker too frequently
+	 * for small swapout.
+	 */
+	return zswap_max_pages() *
+		min(100, zswap_accept_thr_percent + 1) / 100;
+}
+
 unsigned long zswap_total_pages(void)
 {
 	struct zswap_pool *pool;
@@ -505,21 +520,6 @@ unsigned long zswap_total_pages(void)
 	return total;
 }
 
-static bool zswap_check_limits(void)
-{
-	unsigned long cur_pages = zswap_total_pages();
-	unsigned long max_pages = zswap_max_pages();
-
-	if (cur_pages >= max_pages) {
-		zswap_pool_limit_hit++;
-		zswap_pool_reached_full = true;
-	} else if (zswap_pool_reached_full &&
-		   cur_pages <= zswap_accept_thr_pages()) {
-			zswap_pool_reached_full = false;
-	}
-	return zswap_pool_reached_full;
-}
-
 /*********************************
 * param callbacks
 **********************************/
@@ -1489,6 +1489,8 @@ bool zswap_store(struct folio *folio)
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	unsigned long value;
+	unsigned long cur_pages;
+	bool need_global_shrink = false;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1511,8 +1513,17 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (zswap_check_limits())
+	cur_pages = zswap_total_pages();
+
+	if (cur_pages >= zswap_max_pages()) {
+		zswap_pool_limit_hit++;
+		need_global_shrink = true;
 		goto reject;
+	}
+
+	/* schedule shrink for incoming pages */
+	if (cur_pages >= zswap_shrink_start_pages())
+		queue_work(shrink_wq, &zswap_shrink_work);
 
 	/* allocate entry */
 	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
@@ -1555,6 +1566,9 @@ bool zswap_store(struct folio *folio)
 
 		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
 		zswap_reject_alloc_fail++;
+
+		/* reduce entry in array */
+		need_global_shrink = true;
 		goto store_failed;
 	}
 
@@ -1604,7 +1618,7 @@ bool zswap_store(struct folio *folio)
 	zswap_entry_cache_free(entry);
 reject:
 	obj_cgroup_put(objcg);
-	if (zswap_pool_reached_full)
+	if (need_global_shrink)
 		queue_work(shrink_wq, &zswap_shrink_work);
 check_old:
 	/*
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/6] mm: zswap: make writeback run in the background
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
                   ` (2 preceding siblings ...)
  2024-07-06  2:25 ` [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-06  2:25 ` [PATCH v2 5/6] mm: zswap: store incompressible page as-is Takero Funaki
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

Drop the WQ_MEM_RECLAIM flag from the zswap global shrinker workqueue to
resolve resource contention with actual kernel memory reclaim.

The current zswap global shrinker and its writeback contend with actual
memory reclaim, leading to system responsiveness issues when the zswap
writeback and direct reclaim run concurrently.  Unlike kernel memory
shrinkers, the global shrinker works in the background behind the zswap
pool, which acts as a large in-memory buffer. The zswap writeback is not
urgent and is not strictly necessary to reclaim kernel memory.
Even when the zswap shrinker cannot evict pages, zswap_store() can
reject reclaimed pages, and the rejected pages have swap space
preallocated. Delaying writeback or shrinker progress does not
interfere with page reclaim.

The visible issue in the current implementation occurs when a large
amount of direct reclaim happens and zswap cannot store the incoming
pages. Both the zswap global shrinker and the memory reclaimer start
writing back pages concurrently. This leads to a system-wide
responsiveness issue that does not occur without zswap.  The
shrink_worker() running on WQ_MEM_RECLAIM blocks other important work
required for memory reclamation. In this case, swap_writepage() and
zswap_writeback_entry() consume time and contend with each other for
workqueue scheduling and I/O resources, especially on slow swap devices.

Note that this issue has been masked by the global shrinker failing to
evict a considerable number of pages. This patch becomes necessary once
the shrinker is fixed to continuously reduce the pool size to the
accept threshold.

This issue can be mostly mitigated by removing the WQ_MEM_RECLAIM flag
from the zswap shrinker workqueue. With this change, the invocation of
shrink_worker() and its writeback will be delayed while reclamation is
running on a WQ_MEM_RECLAIM workqueue.

Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 24acbab44e7a..76691ca7b6a7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1806,7 +1806,7 @@ static int zswap_setup(void)
 		goto hp_fail;
 
 	shrink_wq = alloc_workqueue("zswap-shrink",
-			WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
+			WQ_UNBOUND, 1);
 	if (!shrink_wq)
 		goto shrink_wq_fail;
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
                   ` (3 preceding siblings ...)
  2024-07-06  2:25 ` [PATCH v2 4/6] mm: zswap: make writeback run in the background Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-06 23:53   ` Nhat Pham
  2024-07-08  3:56   ` Chengming Zhou
  2024-07-06  2:25 ` [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO Takero Funaki
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

This patch allows zswap to accept incompressible pages and store them
into zpool if possible.

This change is required to achieve zero rejections in zswap_store().
With a proper amount of proactive shrinking, swapout can be buffered by
zswap without IO latency. Storing incompressible pages may seem costly,
but it can reduce latency. A rare incompressible page in a large batch
of compressible pages can delay the entire batch during swapping.

The memory overhead is negligible because the underlying zsmalloc
already accepts nearly incompressible pages. zsmalloc stores data close
to PAGE_SIZE in a dedicated page, so storing the page as-is saves
decompression cycles without allocation overhead. zswap itself has not
rejected pages in these cases.

To store the page as-is, reuse the compressed data size field `length`
in struct `zswap_entry`.  A length equal to PAGE_SIZE indicates
incompressible data.

If a zpool backend does not support allocating PAGE_SIZE (zbud), the
behavior remains unchanged. The allocation failure reported by the zpool
blocks accepting the page as before.

Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 76691ca7b6a7..def0f948a4ab 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -186,6 +186,8 @@ static struct shrinker *zswap_shrinker;
  * length - the length in bytes of the compressed page data.  Needed during
  *          decompression. For a same value filled page length is 0, and both
  *          pool and lru are invalid and must be ignored.
+ *          If length is equal to PAGE_SIZE, the data stored in handle is
+ *          not compressed. The data must be copied to page as-is.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
  * value - value of the same-value filled pages which have same content
@@ -969,9 +971,23 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 	 */
 	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
-	if (comp_ret)
+
+	/* coa_compress returns -EINVAL for errors including insufficient dlen */
+	if (comp_ret && comp_ret != -EINVAL)
 		goto unlock;
 
+	/*
+	 * If the data cannot be compressed well, store the data as-is.
+	 * Switching by a threshold at
+	 * PAGE_SIZE - (allocation granularity)
+	 * zbud and z3fold use 64B granularity.
+	 * zsmalloc stores >3632B in one page for 4K page arch.
+	 */
+	if (comp_ret || dlen > PAGE_SIZE - 64) {
+		/* we do not use compressed result anymore */
+		comp_ret = 0;
+		dlen = PAGE_SIZE;
+	}
 	zpool = zswap_find_zpool(entry);
 	gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
 	if (zpool_malloc_support_movable(zpool))
@@ -981,14 +997,20 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 		goto unlock;
 
 	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
-	memcpy(buf, dst, dlen);
+
+	/* PAGE_SIZE indicates not compressed. */
+	if (dlen == PAGE_SIZE)
+		memcpy_from_folio(buf, folio, 0, PAGE_SIZE);
+	else
+		memcpy(buf, dst, dlen);
+
 	zpool_unmap_handle(zpool, handle);
 
 	entry->handle = handle;
 	entry->length = dlen;
 
 unlock:
-	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
+	if (alloc_ret == -ENOSPC)
 		zswap_reject_compress_poor++;
 	else if (comp_ret)
 		zswap_reject_compress_fail++;
@@ -1006,6 +1028,14 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
 	struct crypto_acomp_ctx *acomp_ctx;
 	u8 *src;
 
+	if (entry->length == PAGE_SIZE) {
+		/* the content is not compressed. copy back as-is.  */
+		src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+		memcpy_to_page(page, 0, src, entry->length);
+		zpool_unmap_handle(zpool, entry->handle);
+		return;
+	}
+
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
                   ` (4 preceding siblings ...)
  2024-07-06  2:25 ` [PATCH v2 5/6] mm: zswap: store incompressible page as-is Takero Funaki
@ 2024-07-06  2:25 ` Takero Funaki
  2024-07-08 19:17   ` Nhat Pham
  2024-07-09  0:57   ` Nhat Pham
  2024-07-06 17:32 ` [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Andrew Morton
  2024-07-09  0:53 ` Nhat Pham
  7 siblings, 2 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-06  2:25 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: Takero Funaki, linux-mm, linux-doc, linux-kernel

To prevent the zswap global shrinker from writing back pages
simultaneously with IO performed for memory reclaim and faults, delay
the writeback when zswap_store() rejects a page or zswap_load() cannot
find the entry in the pool.

When the zswap shrinker is running and zswap rejects an incoming page,
the simultaneous zswap writeback and the pageout of the rejected page
lead to IO contention on the swap device. In this case, the writeback
of the rejected page must have higher priority as it is necessary for
actual memory reclaim progress. The zswap global shrinker can run in
the background and should not interfere with memory reclaim.

The same logic applies to zswap_load(). When zswap cannot find the
requested page in the pool and read IO is performed, the shrinker
should be interrupted.

To avoid IO contention, save the timestamp in jiffies when zswap cannot
buffer the pagein/out IO and interrupt the global shrinker. The
shrinker resumes writeback 500 msec after the saved timestamp.

Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index def0f948a4ab..59ba4663c74f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,8 @@
 #include <linux/pagemap.h>
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -176,6 +178,14 @@ static bool zswap_next_shrink_changed;
 static struct work_struct zswap_shrink_work;
 static struct shrinker *zswap_shrinker;
 
+/*
+ * To avoid IO contention between pagein/out and global shrinker writeback,
+ * track the last jiffies of pagein/out and delay the writeback.
+ * Default to 500msec in alignment with mq-deadline read timeout.
+ */
+#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500
+static unsigned long zswap_shrinker_delay_start;
+
 /*
  * struct zswap_entry
  *
@@ -244,6 +254,14 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
 		 zpool_get_type((p)->zpools[0]))
 
+static inline void zswap_shrinker_delay_update(void)
+{
+	unsigned long now = jiffies;
+
+	if (now != zswap_shrinker_delay_start)
+		zswap_shrinker_delay_start = now;
+}
+
 /*********************************
 * pool functions
 **********************************/
@@ -1378,6 +1396,8 @@ static void shrink_worker(struct work_struct *w)
 	struct mem_cgroup *memcg;
 	int ret, failures = 0, progress;
 	unsigned long thr;
+	unsigned long now, sleepuntil;
+	const unsigned long delay = msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
 
 	/* Reclaim down to the accept threshold */
 	thr = zswap_accept_thr_pages();
@@ -1405,6 +1425,21 @@ static void shrink_worker(struct work_struct *w)
 	 * until the next run of shrink_worker().
 	 */
 	do {
+		/*
+		 * delay shrinking to allow the last rejected page to
+		 * complete its writeback
+		 */
+		sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
+		now = jiffies;
+		/*
+		 * If zswap did not reject pages for long, sleepuntil-now may
+		 * underflow.  We assume the timestamp is valid only if
+		 * now < sleepuntil < now + delay + 1
+		 */
+		if (time_before(now, sleepuntil) &&
+				time_before(sleepuntil, now + delay + 1))
+			fsleep(jiffies_to_usecs(sleepuntil - now));
+
 		spin_lock(&zswap_shrink_lock);
 
 		/*
@@ -1526,8 +1561,10 @@ bool zswap_store(struct folio *folio)
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
 	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	if (folio_test_large(folio)) {
+		zswap_shrinker_delay_update();
 		return false;
+	}
 
 	if (!zswap_enabled)
 		goto check_old;
@@ -1648,6 +1685,8 @@ bool zswap_store(struct folio *folio)
 	zswap_entry_cache_free(entry);
 reject:
 	obj_cgroup_put(objcg);
+	zswap_shrinker_delay_update();
+
 	if (need_global_shrink)
 		queue_work(shrink_wq, &zswap_shrink_work);
 check_old:
@@ -1691,8 +1730,10 @@ bool zswap_load(struct folio *folio)
 	else
 		entry = xa_load(tree, offset);
 
-	if (!entry)
+	if (!entry) {
+		zswap_shrinker_delay_update();
 		return false;
+	}
 
 	if (entry->length)
 		zswap_decompress(entry, page);
@@ -1835,6 +1876,8 @@ static int zswap_setup(void)
 	if (ret)
 		goto hp_fail;
 
+	zswap_shrinker_delay_update();
+
 	shrink_wq = alloc_workqueue("zswap-shrink",
 			WQ_UNBOUND, 1);
 	if (!shrink_wq)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
                   ` (5 preceding siblings ...)
  2024-07-06  2:25 ` [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO Takero Funaki
@ 2024-07-06 17:32 ` Andrew Morton
  2024-07-07 10:54   ` Takero Funaki
  2024-07-09  0:53 ` Nhat Pham
  7 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2024-07-06 17:32 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Sat,  6 Jul 2024 02:25:16 +0000 Takero Funaki <flintglass@gmail.com> wrote:

> This series addresses some issues and introduces a minor improvement in
> the zswap global shrinker.

Seems that patches 1 & 2 might be worthy of backporting into earlier
kernels?  Could you please provide a description of the
userspace-visible effects of the bugs so the desirability of such an
action can be better understood?



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-06  2:25 ` [PATCH v2 5/6] mm: zswap: store incompressible page as-is Takero Funaki
@ 2024-07-06 23:53   ` Nhat Pham
  2024-07-07  9:38     ` Takero Funaki
  2024-07-08  3:56   ` Chengming Zhou
  1 sibling, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-06 23:53 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch allows zswap to accept incompressible pages and store them
> into zpool if possible.
>
> This change is required to achieve zero rejection on zswap_store(). With
> proper amount of proactive shrinking, swapout can be buffered by zswap
> without IO latency. Storing incompressible pages may seem costly, but it
> can reduce latency. A rare incompressible page in a large batch of
> compressive pages can delay the entire batch during swapping.
>
> The memory overhead is negligible because the underlying zsmalloc
> already accepts nearly incompressible pages. zsmalloc stores data close
> to PAGE_SIZE to a dedicated page. Thus storing as-is saves decompression
> cycles without allocation overhead. zswap itself has not rejected pages
> in these cases.
>
> To store the page as-is, use the compressed data size field `length` in
> struct `zswap_entry`. The length == PAGE_SIZE indicates
> incompressible data.
>
> If a zpool backend does not support allocating PAGE_SIZE (zbud), the
> behavior remains unchanged. The allocation failure reported by the zpool
> blocks accepting the page as before.
>
> Signed-off-by: Takero Funaki <flintglass@gmail.com>

I tried to propose something similar in the past. Please read the
following discussion:

https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/

But the TLDR is that Yosry was (rightly) concerned that with this
approach, memory reclaiming could end up increasing memory usage
rather than reducing it (since we do not free up the pages that fail to
zswap-out, and we need extra memory for the zswap metadata of those
pages).

So my vote on this patch would be NACK, until we get around this issue
somehow :)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-06 23:53   ` Nhat Pham
@ 2024-07-07  9:38     ` Takero Funaki
  2024-07-12 22:36       ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-07  9:38 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

2024年7月7日(日) 8:53 Nhat Pham <nphamcs@gmail.com>:
>
> I tried to propose something similar in the past. Please read the
> following discussion:
>
> https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/
>
> But, the TLDR is Yosry was (rightly) concerned that with this
> approach, memory reclaiming could end up increasing memory usage
> rather than reducing (since we do not free up the page that fail to
> zswap-out, and we need extra memory for the zswap metadata of that
> page).
>
> So my vote on this patch would be NACK, until we get around this issue
> somehow :)

It seems the discussion on the thread mixed up memory allocation
failure (system runs out of memory reserve) and incompressible pages
(compression algorithm successfully compressed but the result is equal
to or larger than PAGE_SIZE).

zswap has been storing pages into dedicated pages 1:1 when they
compress to near PAGE_SIZE. Using zsmalloc, current zswap stores pages
compressed to between 3633 bytes (=hugeclass+1) and 4095 bytes
(=PAGE_SIZE-1) in one page. This patch changes the range to 3633 to
4096 bytes by treating PAGE_SIZE as a special case. I could not find a
reason to reject only PAGE_SIZE while accepting PAGE_SIZE-1.

zswap wastes memory on metadata for all accepted pages but reduces IO
amount and latency by buffering compressed data in memory. For pages
between 3633 and 4096 bytes, zswap reduces only the latency. This is
still beneficial because rare incompressible pages trigger urgent
pageout IO and incur head-of-line blocking on the subsequent pages. It
also keeps LRU priority for pagein latency.

In the worst case or with a malicious dataset, zswap will waste a
significant amount of memory, but this patch neither affects nor
resolves that scenario. For example, if a user allocates pages that
compress to 3633 bytes, current zswap using zsmalloc cannot gain memory
as the compression ratio, including the zsmalloc overhead, becomes 1:1.
This also applies to zbud: the compression ratio will be 1:1 as zbud
cannot find buddies smaller than 463 bytes. zswap will be less
efficient but still works in this situation, since the max pool percent
and the background writeback ensure the pool size does not overwhelm
usable memory.

I suppose the current zswap has accepted this possible waste of memory,
at least since the current zswap_compress() logic was implemented. If
zswap had to ensure a compression ratio better than 1:1 and only
preferred reducing the IO amount (not latency), there would be a
compression-ratio threshold rejecting pages that do not compress to
under 2048 bytes. I think accepting nearly incompressible pages is
beneficial, and extending the range to 4096 bytes does not negatively
affect the current behavior.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-06 17:32 ` [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Andrew Morton
@ 2024-07-07 10:54   ` Takero Funaki
  0 siblings, 0 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-07 10:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou,
	Jonathan Corbet, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

2024年7月7日(日) 2:32 Andrew Morton <akpm@linux-foundation.org>:
> Seems that patches 1 & 2 might be worthy of backporting into earlier
> kernels?  Could you please provide a description of the
> userspace-visible effects of the bugs so the desirability of such an
> action can be better understood?

Patches 1 and 2 partially resolve the zswap global shrinker issue that
leads to performance degradation on small systems.  However, the fix
uncovers another issue addressed in patches 3 to 6.  Backporting only
the two patches would be a tradeoff, with possible performance
degradation in some cases. I am not sure whether that possible issue is
acceptable.

The visible issue is described in the cover letter:

> Visible issue to resolve
> -------------------------------
> The visible issue with the current global shrinker is that pageout/in
> operations from active processes are slow when zswap is near its max
> pool size. This is particularly significant on small memory systems
> where total swap usage exceeds what zswap can store. This results in old
> pages occupying most of the zswap pool space, with recent pages using
> the swap disk directly.
>
> Root cause of the issue
> -------------------------------
> This issue is caused by zswap maintaining the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_threshold_percent, zswap
> rejects incoming pages more frequently than it should. The rejected
> pages are directly written to disk while zswap protects old pages from
> eviction, leading to slow pageout/in performance for recent pages.

Patches 1 and 2 partially resolve the issue by fixing the iteration
logic. With the two patches applied, the zswap shrinker starts evicting
pages once the pool limit is hit, as described in the current zswap
documentation. However, this fix might not improve performance since it
lacks the proactive shrinking required to prepare space before the pool
limit is hit, which is implemented in patch 3.

Unfortunately, the fix uncovers another issue described in the bottom
half of the cover letter. Because the shrinker performs writeback
simultaneously with pageout for rejected pages, the shrinker delays
actual memory reclaim unnecessarily. The first issue masked the second
by virtually disabling the global shrinker writeback. I think the
second issue only occurs under severe memory pressure, but may degrade
pageout performance as shown in the benchmark at the bottom of the
cover letter.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-06  2:25 ` [PATCH v2 5/6] mm: zswap: store incompressible page as-is Takero Funaki
  2024-07-06 23:53   ` Nhat Pham
@ 2024-07-08  3:56   ` Chengming Zhou
  2024-07-08 13:44     ` Takero Funaki
  1 sibling, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2024-07-08  3:56 UTC (permalink / raw)
  To: Takero Funaki, Johannes Weiner, Yosry Ahmed, Nhat Pham,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: linux-mm, linux-doc, linux-kernel

On 2024/7/6 10:25, Takero Funaki wrote:
> This patch allows zswap to accept incompressible pages and store them
> into zpool if possible.
> 
> This change is required to achieve zero rejection on zswap_store(). With
> proper amount of proactive shrinking, swapout can be buffered by zswap
> without IO latency. Storing incompressible pages may seem costly, but it
> can reduce latency. A rare incompressible page in a large batch of
> compressive pages can delay the entire batch during swapping.
> 
> The memory overhead is negligible because the underlying zsmalloc
> already accepts nearly incompressible pages. zsmalloc stores data close
> to PAGE_SIZE to a dedicated page. Thus storing as-is saves decompression
> cycles without allocation overhead. zswap itself has not rejected pages
> in these cases.
> 
> To store the page as-is, use the compressed data size field `length` in
> struct `zswap_entry`. The length == PAGE_SIZE indicates
> incompressible data.
> 
> If a zpool backend does not support allocating PAGE_SIZE (zbud), the
> behavior remains unchanged. The allocation failure reported by the zpool
> blocks accepting the page as before.
> 
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>   mm/zswap.c | 36 +++++++++++++++++++++++++++++++++---
>   1 file changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 76691ca7b6a7..def0f948a4ab 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -186,6 +186,8 @@ static struct shrinker *zswap_shrinker;
>    * length - the length in bytes of the compressed page data.  Needed during
>    *          decompression. For a same value filled page length is 0, and both
>    *          pool and lru are invalid and must be ignored.
> + *          If length is equal to PAGE_SIZE, the data stored in handle is
> + *          not compressed. The data must be copied to page as-is.
>    * pool - the zswap_pool the entry's data is in
>    * handle - zpool allocation handle that stores the compressed page data
>    * value - value of the same-value filled pages which have same content
> @@ -969,9 +971,23 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>   	 */
>   	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>   	dlen = acomp_ctx->req->dlen;
> -	if (comp_ret)
> +
> +	/* coa_compress returns -EINVAL for errors including insufficient dlen */
> +	if (comp_ret && comp_ret != -EINVAL)
>   		goto unlock;

It seems we don't need to care about that case? "comp_ret" is not
useful anymore.

Just:

if (comp_ret || dlen > PAGE_SIZE - 64)
	dlen = PAGE_SIZE;

And remove the checks of comp_ret at the end.
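
For illustration, the tail of zswap_compress() after that
simplification might look roughly like this (just a sketch of the
suggestion, not the posted patch; the zpool allocation and copy in the
middle stay as in the patch):

	/* any compression failure just means "store the page as-is" */
	if (comp_ret || dlen > PAGE_SIZE - 64)
		dlen = PAGE_SIZE;

	/* ... zpool allocation and copy as in the patch ... */

unlock:
	if (alloc_ret == -ENOSPC)
		zswap_reject_compress_poor++;
	else if (alloc_ret)
		zswap_reject_alloc_fail++;

	mutex_unlock(&acomp_ctx->mutex);
	return alloc_ret == 0;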

>   
> +	/*
> +	 * If the data cannot be compressed well, store the data as-is.
> +	 * Switching by a threshold at
> +	 * PAGE_SIZE - (allocation granularity)
> +	 * zbud and z3fold use 64B granularity.
> +	 * zsmalloc stores >3632B in one page for 4K page arch.
> +	 */
> +	if (comp_ret || dlen > PAGE_SIZE - 64) {
> +		/* we do not use compressed result anymore */
> +		comp_ret = 0;
> +		dlen = PAGE_SIZE;
> +	}
>   	zpool = zswap_find_zpool(entry);
>   	gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
>   	if (zpool_malloc_support_movable(zpool))
> @@ -981,14 +997,20 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>   		goto unlock;
>   
>   	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> -	memcpy(buf, dst, dlen);
> +
> +	/* PAGE_SIZE indicates not compressed. */
> +	if (dlen == PAGE_SIZE)
> +		memcpy_from_folio(buf, folio, 0, PAGE_SIZE);

We actually don't need to hold the mutex if we are just copying the folio.

Thanks.

> +	else
> +		memcpy(buf, dst, dlen);
> +
>   	zpool_unmap_handle(zpool, handle);
>   
>   	entry->handle = handle;
>   	entry->length = dlen;
>   
>   unlock:
> -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> +	if (alloc_ret == -ENOSPC)
>   		zswap_reject_compress_poor++;
>   	else if (comp_ret)
>   		zswap_reject_compress_fail++;
> @@ -1006,6 +1028,14 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
>   	struct crypto_acomp_ctx *acomp_ctx;
>   	u8 *src;
>   
> +	if (entry->length == PAGE_SIZE) {
> +		/* the content is not compressed. copy back as-is.  */
> +		src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +		memcpy_to_page(page, 0, src, entry->length);
> +		zpool_unmap_handle(zpool, entry->handle);
> +		return;
> +	}
> +
>   	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>   	mutex_lock(&acomp_ctx->mutex);
>   


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration
  2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
@ 2024-07-08  4:54   ` Chengming Zhou
  2024-07-17  1:54   ` Yosry Ahmed
  1 sibling, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-07-08  4:54 UTC (permalink / raw)
  To: Takero Funaki, Johannes Weiner, Yosry Ahmed, Nhat Pham,
	Jonathan Corbet, Andrew Morton, Domenico Cerasuolo
  Cc: linux-mm, linux-doc, linux-kernel

On 2024/7/6 10:25, Takero Funaki wrote:
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
> 
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offlie memcg and continues shrinking with the next memcg.

	      ^offline

> 
> Note that, to avoid a refcount leak of offline memcg encountered during
> the memcg tree walking, shrink_worker() must continue iterating to find
> the next online memcg.

Yeah, mem_cgroup_tryget_online() ensures that we get an online memcg to 
shrink.

> 
> The following minor issues in the existing code are also resolved by the
> change in the iteration logic:
> 
> - A rare temporary refcount leak in the offline memcg cleaner, where the
>    next memcg of the offlined memcg is also offline.  The leaked memcg
>    cannot be freed until the next shrink_worker() releases the reference.

Is this a problem encountered in real life? I'm a little confused:
since we move the cursor under lock protection, and any memcg being
offlined will come here to move the cursor to the next one, I don't
see how a refcount leak can happen?

> 
> - One memcg was skipped from shrinking when the offline memcg cleaner
>    advanced the cursor of memcg tree. It is addressed by a flag to
>    indicate that the cursor has already been advanced.

Sorry, I don't get this one either. Obviously we should skip offline
memcgs when shrinking; the reason is that an offline memcg reparents
its LRU and objcg, so moving the cursor to the next online one should
be fine? Maybe I missed something?

Thanks.

> 
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>   mm/zswap.c | 94 ++++++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 73 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..29944d8145af 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -171,6 +171,7 @@ static struct list_lru zswap_list_lru;
>   /* The lock protects zswap_next_shrink updates. */
>   static DEFINE_SPINLOCK(zswap_shrink_lock);
>   static struct mem_cgroup *zswap_next_shrink;
> +static bool zswap_next_shrink_changed;
>   static struct work_struct zswap_shrink_work;
>   static struct shrinker *zswap_shrinker;
>   
> @@ -775,12 +776,39 @@ void zswap_folio_swapin(struct folio *folio)
>   	}
>   }
>   
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>   void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>   {
>   	/* lock out zswap shrinker walking memcg tree */
>   	spin_lock(&zswap_shrink_lock);
> -	if (zswap_next_shrink == memcg)
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +	if (zswap_next_shrink == memcg) {
> +		/*
> +		 * We advances the cursor to put back the offlined memcg.
> +		 * shrink_worker() should not advance the cursor again.
> +		 */
> +		zswap_next_shrink_changed = true;
> +
> +		do {
> +			zswap_next_shrink = mem_cgroup_iter(NULL,
> +					zswap_next_shrink, NULL);
> +		} while (zswap_next_shrink &&
> +				!mem_cgroup_online(zswap_next_shrink));
> +		/*
> +		 * We verified the next memcg is online.  Even if the next
> +		 * memcg is being offlined here, another cleaner must be
> +		 * waiting for our lock.  We can leave the online memcg
> +		 * reference.
> +		 */
> +	}
>   	spin_unlock(&zswap_shrink_lock);
>   }
>   
> @@ -1319,18 +1347,42 @@ static void shrink_worker(struct work_struct *w)
>   	/* Reclaim down to the accept threshold */
>   	thr = zswap_accept_thr_pages();
>   
> -	/* global reclaim will select cgroup in a round-robin fashion. */
> +	/* global reclaim will select cgroup in a round-robin fashion.
> +	 *
> +	 * We save iteration cursor memcg into zswap_next_shrink,
> +	 * which can be modified by the offline memcg cleaner
> +	 * zswap_memcg_offline_cleanup().
> +	 *
> +	 * Since the offline cleaner is called only once, we cannot leave an
> +	 * offline memcg reference in zswap_next_shrink.
> +	 * We can rely on the cleaner only if we get online memcg under lock.
> +	 *
> +	 * If we get an offline memcg, we cannot determine the cleaner has
> +	 * already been called or will be called later. We must put back the
> +	 * reference before returning from this function. Otherwise, the
> +	 * offline memcg left in zswap_next_shrink will hold the reference
> +	 * until the next run of shrink_worker().
> +	 */
>   	do {
>   		spin_lock(&zswap_shrink_lock);
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -		memcg = zswap_next_shrink;
>   
>   		/*
> -		 * We need to retry if we have gone through a full round trip, or if we
> -		 * got an offline memcg (or else we risk undoing the effect of the
> -		 * zswap memcg offlining cleanup callback). This is not catastrophic
> -		 * per se, but it will keep the now offlined memcg hostage for a while.
> -		 *
> +		 * Start shrinking from the next memcg after zswap_next_shrink.
> +		 * To not skip a memcg, do not advance the cursor when it has
> +		 * already been advanced by the offline cleaner.
> +		 */
> +		do {
> +			if (zswap_next_shrink_changed) {
> +				/* cleaner advanced the cursor */
> +				zswap_next_shrink_changed = false;
> +			} else {
> +				zswap_next_shrink = mem_cgroup_iter(NULL,
> +						zswap_next_shrink, NULL);
> +			}
> +			memcg = zswap_next_shrink;
> +		} while (memcg && !mem_cgroup_tryget_online(memcg));
> +
> +		/*
>   		 * Note that if we got an online memcg, we will keep the extra
>   		 * reference in case the original reference obtained by mem_cgroup_iter
>   		 * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1344,17 +1396,11 @@ static void shrink_worker(struct work_struct *w)
>   			goto resched;
>   		}
>   
> -		if (!mem_cgroup_tryget_online(memcg)) {
> -			/* drop the reference from mem_cgroup_iter() */
> -			mem_cgroup_iter_break(NULL, memcg);
> -			zswap_next_shrink = NULL;
> -			spin_unlock(&zswap_shrink_lock);
> -
> -			if (++failures == MAX_RECLAIM_RETRIES)
> -				break;
> -
> -			goto resched;
> -		}
> +		/*
> +		 * We verified the memcg is online and got an extra memcg
> +		 * reference.  Our memcg might be offlined concurrently but the
> +		 * respective offline cleaner must be waiting for our lock.
> +		 */
>   		spin_unlock(&zswap_shrink_lock);
>   
>   		ret = shrink_memcg(memcg);
> @@ -1368,6 +1414,12 @@ static void shrink_worker(struct work_struct *w)
>   resched:
>   		cond_resched();
>   	} while (zswap_total_pages() > thr);
> +
> +	/*
> +	 * We can still hold the original memcg reference.
> +	 * The reference is stored in zswap_next_shrink, and then reused
> +	 * by the next shrink_worker().
> +	 */
>   }
>   
>   /*********************************


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-08  3:56   ` Chengming Zhou
@ 2024-07-08 13:44     ` Takero Funaki
  2024-07-09 13:26       ` Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-08 13:44 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Nhat Pham, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Mon, Jul 8, 2024 at 12:56 Chengming Zhou <chengming.zhou@linux.dev> wrote:

> >       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >       dlen = acomp_ctx->req->dlen;
> > -     if (comp_ret)
> > +
> > +     /* coa_compress returns -EINVAL for errors including insufficient dlen */
> > +     if (comp_ret && comp_ret != -EINVAL)
> >               goto unlock;
>
> Seems we don't need to care about? "comp_ret" is useless anymore.
>
> Just:
>
> if (comp_ret || dlen > PAGE_SIZE - 64)
>         dlen = PAGE_SIZE;
>
> And remove the checkings of comp_ret at the end.
>

>
> We actually don't need to hold mutex if we are just copying folio.
>
> Thanks.
>

Thanks for reviewing.

For comp_ret, can we consolidate all possible error codes as
incompressible data?
if we do not need to distinguish -EINVAL and the others, diff v2..v3
can be like:

@@ -62,8 +62,6 @@ static u64 zswap_pool_limit_hit;
 static u64 zswap_written_back_pages;
 /* Store failed due to a reclaim failure after pool limit was reached */
 static u64 zswap_reject_reclaim_fail;
-/* Store failed due to compression algorithm failure */
-static u64 zswap_reject_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Store failed because underlying allocator could not get memory */
@@ -1043,10 +1041,6 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
        dlen = acomp_ctx->req->dlen;

-       /* coa_compress returns -EINVAL for errors including insufficient dlen */
-       if (comp_ret && comp_ret != -EINVAL)
-               goto unlock;
-
        /*
         * If the data cannot be compressed well, store the data as-is.
         * Switching by a threshold at
@@ -1056,7 +1050,8 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
         */
        if (comp_ret || dlen > PAGE_SIZE - 64) {
                /* we do not use compressed result anymore */
-               comp_ret = 0;
+               mutex_unlock(&acomp_ctx->mutex);
+               acomp_ctx = NULL;
                dlen = PAGE_SIZE;
        }
        zpool = zswap_find_zpool(entry);
@@ -1083,12 +1078,11 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 unlock:
        if (alloc_ret == -ENOSPC)
                zswap_reject_compress_poor++;
-       else if (comp_ret)
-               zswap_reject_compress_fail++;
        else if (alloc_ret)
                zswap_reject_alloc_fail++;

-       mutex_unlock(&acomp_ctx->mutex);
+       if (acomp_ctx)
+               mutex_unlock(&acomp_ctx->mutex);
        return comp_ret == 0 && alloc_ret == 0;
 }

@@ -1886,8 +1880,6 @@ static int zswap_debugfs_init(void)
                           zswap_debugfs_root, &zswap_reject_alloc_fail);
        debugfs_create_u64("reject_kmemcache_fail", 0444,
                           zswap_debugfs_root, &zswap_reject_kmemcache_fail);
-       debugfs_create_u64("reject_compress_fail", 0444,
-                          zswap_debugfs_root, &zswap_reject_compress_fail);
        debugfs_create_u64("reject_compress_poor", 0444,
                           zswap_debugfs_root, &zswap_reject_compress_poor);
        debugfs_create_u64("written_back_pages", 0444,


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-06  2:25 ` [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO Takero Funaki
@ 2024-07-08 19:17   ` Nhat Pham
  2024-07-09  0:57   ` Nhat Pham
  1 sibling, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-07-08 19:17 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> To prevent the zswap global shrinker from writing back pages
> simultaneously with IO performed for memory reclaim and faults, delay
> the writeback when zswap_store() rejects pages or zswap_load() cannot
> find entry in pool.
>
> When the zswap shrinker is running and zswap rejects an incoming page,
> simulatenous zswap writeback and the rejected page lead to IO contention
> on swap device. In this case, the writeback of the rejected page must be
> higher priority as it is necessary for actual memory reclaim progress.
> The zswap global shrinker can run in the background and should not
> interfere with memory reclaim.

Do you see this problem actually manifesting in real life? Does not
sound infeasible to me, but I wonder how likely this is the case.

Do you have any userspace-visible metrics, or any tracing logs etc.
that proves that it actually happens?

This might affect the dynamic shrinker as well, FWIW.

>
> The same logic applies to zswap_load(). When zswap cannot find requested
> page from pool and read IO is performed, shrinker should be interrupted.
>
> To avoid IO contention, save the timestamp jiffies when zswap cannot
> buffer the pagein/out IO and interrupt the global shrinker. The shrinker
> resumes the writeback in 500 msec since the saved timestamp.
>
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index def0f948a4ab..59ba4663c74f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,8 @@
>  #include <linux/pagemap.h>
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -176,6 +178,14 @@ static bool zswap_next_shrink_changed;
>  static struct work_struct zswap_shrink_work;
>  static struct shrinker *zswap_shrinker;
>
> +/*
> + * To avoid IO contention between pagein/out and global shrinker writeback,
> + * track the last jiffies of pagein/out and delay the writeback.
> + * Default to 500msec in alignment with mq-deadline read timeout.

If there is a future version, could you also include the reason why
you selected 500 msec in the patch's changelog?

> + */
> +#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500
> +static unsigned long zswap_shrinker_delay_start;
> +
>  /*
>   * struct zswap_entry
>   *
> @@ -244,6 +254,14 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
>                  zpool_get_type((p)->zpools[0]))
>
> +static inline void zswap_shrinker_delay_update(void)
> +{
> +       unsigned long now = jiffies;
> +
> +       if (now != zswap_shrinker_delay_start)
> +               zswap_shrinker_delay_start = now;
> +}

Hmmm is there a reason why we do not just do:

zswap_shrinker_delay_start = jiffies;

or, more unnecessarily:

unsigned long now = jiffies;

zswap_shrinker_delay_start = now;

IOW, why the branching here? Does not seem necessary to me, but
perhaps there is a correctness/compiler reason I'm not seeing?

In fact, if it's the first version, then we could manually inline it.

Additionally/alternatively, I wonder if it is more convenient to do it
at the mm/page_io.c zswap callsites, i.e. whenever zswap_store() or
zswap_load() returns false, delay the shrinker before proceeding with
the IO steps.

> +
>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1378,6 +1396,8 @@ static void shrink_worker(struct work_struct *w)
>         struct mem_cgroup *memcg;
>         int ret, failures = 0, progress;
>         unsigned long thr;
> +       unsigned long now, sleepuntil;
> +       const unsigned long delay = msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
>
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
> @@ -1405,6 +1425,21 @@ static void shrink_worker(struct work_struct *w)
>          * until the next run of shrink_worker().
>          */
>         do {
> +               /*
> +                * delay shrinking to allow the last rejected page completes
> +                * its writeback
> +                */
> +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);

I assume we do not care about racy access here right? Same goes for
updates - I don't see any locks protecting these operations (but I
could be missing something).


> +               now = jiffies;
> +               /*
> +                * If zswap did not reject pages for long, sleepuntil-now may
> +                * underflow.  We assume the timestamp is valid only if
> +                * now < sleepuntil < now + delay + 1
> +                */
> +               if (time_before(now, sleepuntil) &&
> +                               time_before(sleepuntil, now + delay + 1))
> +                       fsleep(jiffies_to_usecs(sleepuntil - now));
> +
>                 spin_lock(&zswap_shrink_lock);
>
>                 /*
> @@ -1526,8 +1561,10 @@ bool zswap_store(struct folio *folio)
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
>         /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       if (folio_test_large(folio)) {
> +               zswap_shrinker_delay_update();
>                 return false;
> +       }
>
>         if (!zswap_enabled)
>                 goto check_old;
> @@ -1648,6 +1685,8 @@ bool zswap_store(struct folio *folio)
>         zswap_entry_cache_free(entry);
>  reject:
>         obj_cgroup_put(objcg);
> +       zswap_shrinker_delay_update();
> +
>         if (need_global_shrink)
>                 queue_work(shrink_wq, &zswap_shrink_work);
>  check_old:
> @@ -1691,8 +1730,10 @@ bool zswap_load(struct folio *folio)
>         else
>                 entry = xa_load(tree, offset);
>
> -       if (!entry)
> +       if (!entry) {
> +               zswap_shrinker_delay_update();
>                 return false;
> +       }
>
>         if (entry->length)
>                 zswap_decompress(entry, page);
> @@ -1835,6 +1876,8 @@ static int zswap_setup(void)
>         if (ret)
>                 goto hp_fail;
>
> +       zswap_shrinker_delay_update();
> +
>         shrink_wq = alloc_workqueue("zswap-shrink",
>                         WQ_UNBOUND, 1);
>         if (!shrink_wq)
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
                   ` (6 preceding siblings ...)
  2024-07-06 17:32 ` [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Andrew Morton
@ 2024-07-09  0:53 ` Nhat Pham
  2024-07-10 22:26   ` Takero Funaki
  7 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-09  0:53 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
>

Woah, thanks for the copious details :)

> This series addresses some issues and introduces a minor improvement in
> the zswap global shrinker.
>
> This version addresses issues discovered during the review of v1:
> https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
> and includes three additional patches to fix another issue uncovered by
> applying v1.
>
> Changes in v2:
> - Added 3 patches to reduce resource contention.
> mm: zswap: fix global shrinker memcg iteration:
> - Change the loop style (Yosry, Nhat, Shakeel)
> - To not skip online memcg, save zswap_last_shrink to detect cursor change by cleaner.  (Yosry, Nhat, Shakeel)
> mm: zswap: fix global shrinker error handling logic:
> - Change error code for no-writeback memcg. (Yosry)
> - Use nr_scanned to check if lru is empty. (Yosry)
>
> Changes in v1:
> mm: zswap: fix global shrinker memcg iteration:
> - Drop and reacquire spinlock before skipping a memcg.
> - Add some comment to clarify the locking mechanism.
> mm: zswap: proactive shrinking before pool size limit is hit:
> - Remove unneeded check before scheduling work.
> - Change shrink start threshold to accept_thr_percent + 1%.
>
>
> Patches
> ===============================
>
> 1. Fix the memcg iteration logic that abort iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at accept threshold + 1%.
> 4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag to not block
>    pageout.
> 5. Store incompressible pages as-is to accept all pages.
> 6. Interrupt the shrinker to avoid blocking page-in/out.
>
> Patches 1 to 3 should be applied in this order to avoid potential loops
> caused by the first issue. Patch 3 and later can be applied
> independently, but the first two issues must be resolved to ensure the
> shrinker can evict pages. Patches 4 to 6 address resource contention
> issues in the current shrinker uncovered by applying patch 3.
>
> With this series applied, the shrinker will continue to evict pages
> until the accept_threshold_percent proactively, as documented in patch
> 3. As a side effect of changes in the hysteresis logic, zswap will no
> longer reject pages under the max pool limit.
>
> The series is rebased on mainline 6.10-rc6.
>
>
> The counters below are from vmstat and zswap debugfs stats. The times
> are average seconds using /usr/bin/time -v.
>
> pre-patch, 6.10-rc4
> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 845         | 845         | 0          |
> | pool_total_size    | 201539584   | 201539584   | 0          |
> | stored_pages       | 63138       | 63138       | 0          |
> | written_back_pages | 12          | 12          | 0          |
> | pswpin             | 387         | 32412       | 32025      |
> | pswpout            | 153078      | 197138      | 44060      |
> | zswpin             | 0           | 0           | 0          |
> | zswpout            | 63150       | 63150       | 0          |
> | zswpwb             | 12          | 12          | 0          |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 8.473        |
> | user              | 1.881        |
> | system            | 0.814        |
>
> post-patch, 6.10-rc4 with patch 1 to 5

You mean 1 to 6? There are 6 patches, no?

Just out of pure curiosity, could you include the stats from patches 1-3 only?

> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 81861       | 81861       | 0          |
> | pool_total_size    | 75001856    | 87117824    | 12115968   |
> | reject_reclaim_fail| 0           | 32          | 32         |
> | same_filled_pages  | 135         | 135         | 0          |
> | stored_pages       | 23919       | 27549       | 3630       |
> | written_back_pages | 97665       | 106994      | 10329      |
> | pswpin             | 4981        | 8223        | 3242       |
> | pswpout            | 179554      | 188883      | 9329       |

The pswpin delta actually decreases. Nice :)

> | zswpin             | 293863      | 590014      | 296151     |
> | zswpout            | 455338      | 764882      | 309544     |
> | zswpwb             | 97665       | 106994      | 10329      |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 4.525        |
> | user              | 1.839        |
> | system            | 1.467        |
>
> Although the pool_limit_hit is not increased in both cases,
> zswap_store() rejected pages before this patch. Note that, before this
> series, zswap_store() did not increment pool_limit_hit on rejection by
> limit hit hysteresis (only the first few hits were counted).

Yeah hysteresis + the broken global shrinker puts the system in a very
bad state. Thanks for showing this.

>
> From the pre-patch result, the existing zswap global shrinker cannot
> write back effectively and locks the old pages in the pool. The
> pswpin/out indicates the active process uses the swap device directly.
>
> From the post-patch result, zswpin/out/wb are increased as expected,
> indicating the active process uses zswap and the old pages of the
> background services are evicted from the pool. pswpin/out are
> significantly reduced from pre-patch results.

Lovely :)

>
>
> System responsiveness issue (patch 4 to 6)
> ===============================
> After applying patches 1 to 3, I encountered severe responsiveness
> degradation while zswap global shrinker is running under heavy memory
> pressure.
>
> Visible issue to resolve
> -------------------------------
> The visible issue happens with patches 1 to 3 applied when a large
> amount of memory allocation happens and zswap cannot store the incoming
> pages.
> While global shrinker is writing back pages, system stops responding as
> if under heavy memory thrashing.
>
> This issue is less likely to happen without patches 1 to 3 or zswap is
> disabled. I believe this is because the global shrinker could not write
> back a meaningful amount of pages, as described in patch 2.
>
> Root cause and changes to resolve the issue
> -------------------------------
> It seems that zswap shrinker blocking IO for memory reclaim and faults
> is the root cause of this responsiveness issue. I introduced three
> patches to reduce possible blocking in the following problematic
> situations:
>
> 1. Contention on workqueue thread pools by shrink_worker() using
> WQ_MEM_RECLAIM unnecessarily.
>
> Although the shrinker runs simultaneously with memory reclaim, shrinking
> is not required to reclaim memory since zswap_store() can reject pages
> without interfering with memory reclaim progress. shrink_worker() should
> not use WQ_MEM_RECLAIM and should be delayed when another work in
> WQ_MEM_RECLAIM is reclaiming memory. The existing code requires
> allocating memory inside shrink_worker(), potentially blocking other
> latency-sensitive reclaim work.
>
> 2. Contention on swap IO.
>
> Since zswap_writeback_entry() performs write IO in 4KB pages, it
> consumes a lot of IOPS, increasing the IO latency of swapout/in. We
> should not perform IO for background shrinking while zswap_store() is
> rejecting pages or zswap_load() is failing to find stored pages. This
> series implements two mitigation logics to reduce the IO contention:
>
> 2-a. Do not reject pages in zswap_store().
> This is mostly achieved by patch 3. With patch 3, zswap can prepare
> space proactively and accept pages while the global shrinker is running.
>
> To avoid rejection further, patch 5 (store incompressible pages) is
> added. This reduces rejection by storing incompressible pages. When
> zsmalloc is used, we can accept incompressible pages with small memory
> overhead. It is a minor optimization, but I think it is worth
> implementing. This does not improve performance on current zbud but does
> not incur a performance penalty.
>
> 2-b. Interrupt writeback while pagein/out.
> Once zswap runs out of prepared space, we cannot accept incoming pages,
> incurring direct writes to the swap disk. At this moment, the shrinker
> is proactively evicting pages, leading to IO contention with memory
> reclaim.
>
> Performing low-priority IO is straightforward but requires
> reimplementing a low-priority version of __swap_writepage(). Instead, in
> patch 6, I implemented a heuristic, delaying the next zswap writeback
> based on the elapsed time since zswap_store() rejected a page.
>
> When zswap_store() hits the max pool size and rejects pages,
> swap_writepage() immediately performs the writeback to disk. The time
> jiffies is saved to tell shrink_worker() to sleep up to
> ZSWAP_GLOBAL_SHRINK_DELAY msec.
>
> The same logic applied to zswap_load(). When zswap cannot find a page in
> the stored pool, pagein requires read IO from the swap device. The
> global shrinker should be interrupted here.
>
> This patch proposes a constant delay of 500 milliseconds, aligning with
> the mq-deadline target latency.
>
> Visible change
> -------------------------------
> With patches 4 to 6, the global shrinker pauses the writeback while
> pagein/out operations are using the swap device. This change reduces
> resource contention and makes memory reclaim/faults complete faster,
> thereby reducing system responsiveness degradation.

Ah, this is interesting. Did you actually see improvement in your real
deployment (i.e. not the benchmark) with patches 4-6 in?

>
> Intended scenario for memory reclaim:
> 1. zswap pool < accept_threshold as the initial state. This is achieved
>    by patch 3, proactive shrinking.
> 2. Active processes start allocating pages. Pageout is buffered by zswap
>    without IO.
> 3. zswap reaches shrink_start_threshold. zswap continues to buffer
>    incoming pages and starts writeback immediately in the background.
> 4. zswap reaches max pool size. zswap interrupts the global shrinker and
>    starts rejecting pages. Write IO for the rejected page will consume
>    all IO resources.

This sounds like the proactive shrinker is still not aggressive
enough, and/or there is some misspecification of the zswap settings...
Correct me if I'm wrong, but the new proactive global shrinker begins
1% above the acceptance threshold and shrinks down to the acceptance
threshold, right? How are we still hitting the pool limit...

My concern is that we are knowingly (and perhaps unnecessarily)
creating an LRU inversion here - preferring swapping out the rejected
pages over the colder pages in the zswap pool. Shouldn't it be the
other way around? For instance, can we spiral into the following
scenario:

1. zswap pool becomes full.
2. Memory is still tight, so anonymous memory will be reclaimed. zswap
keeps rejecting incoming pages, and putting a hold on the global
shrinker.
3. The pages that are swapped out are warmer than the ones stored in
the zswap pool, so they will be more likely to be swapped in (which,
IIUC, will also further delay the global shrinker).

and the cycle keeps going on and on?

Have you experimented with synchronous reclaim when the pool is full?
Going all the way down to the acceptance threshold is too aggressive of
course - you might need to find something in between :)
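
For what it's worth, a purely hypothetical sketch of such a middle
ground inside zswap_store() could look like the following;
zswap_shrink_some() and max_pages are invented names here for
illustration, not existing zswap symbols:

	/*
	 * Hypothetical: when the pool is full, synchronously write back
	 * a small, bounded number of cold entries before deciding to
	 * reject, and let the async worker continue down to the accept
	 * threshold in the background.
	 */
	if (zswap_total_pages() > max_pages) {
		queue_work(shrink_wq, &zswap_shrink_work);
		if (!zswap_shrink_some(SWAP_CLUSTER_MAX)) /* invented helper */
			goto reject;
	}

Whether the bound should be a page count or a time budget is a
separate question, of course.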

> 5. Active processes stop allocating pages. After the delay, the shrinker
>    resumes writeback until the accept threshold.
>
> Benchmark
> -------------------------------
> To demonstrate that the shrinker writeback is not interfering with
> pagein/out operations, I measured the elapsed time of allocating 2GB of
> 3/4 compressible data by a Python script, averaged over 10 runs:
>
> |                      | elapsed| user  | sys   |
> |----------------------|--------|-------|-------|
> | With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
> | With all patches     | 11.17  | 0.116 | 1.490 |
> | zswap off (baseline) | 11.81  | 0.149 | 1.381 |
>
> Although this test cannot distinguish responsiveness issues caused by
> zswap writeback from normal memory thrashing between plain pagein/out,
> the difference from the baseline indicates that the patches reduced
> performance degradation on pageout caused by zswap writeback.
>

I wonder if this contention would show up in PSI metrics
(/proc/pressure/io, or the cgroup variants if you use them). Maybe
correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb, etc.)
with IO pressure to show the pattern, i.e. the contention problem was
there before and is now resolved? :)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-06  2:25 ` [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO Takero Funaki
  2024-07-08 19:17   ` Nhat Pham
@ 2024-07-09  0:57   ` Nhat Pham
  2024-07-10 21:21     ` Takero Funaki
  1 sibling, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-09  0:57 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> To prevent the zswap global shrinker from writing back pages
> simultaneously with IO performed for memory reclaim and faults, delay
> the writeback when zswap_store() rejects pages or zswap_load() cannot
> find entry in pool.
>
> When the zswap shrinker is running and zswap rejects an incoming page,
> simulatenous zswap writeback and the rejected page lead to IO contention
> on swap device. In this case, the writeback of the rejected page must be
> higher priority as it is necessary for actual memory reclaim progress.
> The zswap global shrinker can run in the background and should not
> interfere with memory reclaim.

Hmm what about this scenario: when we disable zswap writeback on a
cgroup, if zswap_store() fails, we are delaying the global shrinker
for no gain essentially. There is no subsequent IO. I don't think we
are currently handling this, right?

>
> The same logic applies to zswap_load(). When zswap cannot find requested
> page from pool and read IO is performed, shrinker should be interrupted.
>

Yet another (less concerning IMHO) scenario is when a cgroup disables
zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
that this cgroup's data are really cold, and/or that the workload is
latency-tolerant, and does not want it to take up valuable memory
resources of other cgroups). Every time this cgroup reclaims memory,
it would disable the global shrinker (including the new proactive
behavior) for the other cgroups, correct? And, when they do need to
swap in, it would further delay the global shrinker. Would this breach
of isolation be a problem?

There are other concerns I raised in the cover letter's response as
well - please take a look :)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-08 13:44     ` Takero Funaki
@ 2024-07-09 13:26       ` Chengming Zhou
  2024-07-12 22:47         ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2024-07-09 13:26 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Nhat Pham, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On 2024/7/8 21:44, Takero Funaki wrote:
> On Mon, Jul 8, 2024 at 12:56 Chengming Zhou <chengming.zhou@linux.dev> wrote:
> 
>>>        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>>>        dlen = acomp_ctx->req->dlen;
>>> -     if (comp_ret)
>>> +
>>> +     /* coa_compress returns -EINVAL for errors including insufficient dlen */
>>> +     if (comp_ret && comp_ret != -EINVAL)
>>>                goto unlock;
>>
>> Seems we don't need to care about? "comp_ret" is useless anymore.
>>
>> Just:
>>
>> if (comp_ret || dlen > PAGE_SIZE - 64)
>>          dlen = PAGE_SIZE;
>>
>> And remove the checkings of comp_ret at the end.
>>
> 
>>
>> We actually don't need to hold mutex if we are just copying folio.
>>
>> Thanks.
>>
> 
> Thanks for reviewing.
> 
> For comp_ret, can we consolidate all possible error codes as
> incompressible data?

Maybe we still want these debug counters? I'm not sure.

With your proposal, I think we don't care about compression failures
anymore; in all cases it's just OK to fall back to copying the folio.

> if we do not need to distinguish -EINVAL and the others, diff v2..v3
> can be like:
> 
> @@ -62,8 +62,6 @@ static u64 zswap_pool_limit_hit;
>   static u64 zswap_written_back_pages;
>   /* Store failed due to a reclaim failure after pool limit was reached */
>   static u64 zswap_reject_reclaim_fail;
> -/* Store failed due to compression algorithm failure */
> -static u64 zswap_reject_compress_fail;
>   /* Compressed page was too big for the allocator to (optimally) store */
>   static u64 zswap_reject_compress_poor;
>   /* Store failed because underlying allocator could not get memory */
> @@ -1043,10 +1041,6 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>          comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>          dlen = acomp_ctx->req->dlen;
> 
> -       /* coa_compress returns -EINVAL for errors including insufficient dlen */
> -       if (comp_ret && comp_ret != -EINVAL)
> -               goto unlock;
> -
>          /*
>           * If the data cannot be compressed well, store the data as-is.
>           * Switching by a threshold at
> @@ -1056,7 +1050,8 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>           */
>          if (comp_ret || dlen > PAGE_SIZE - 64) {
>                  /* we do not use compressed result anymore */
> -               comp_ret = 0;
> +               mutex_unlock(&acomp_ctx->mutex);
> +               acomp_ctx = NULL;
>                  dlen = PAGE_SIZE;
>          }
>          zpool = zswap_find_zpool(entry);
> @@ -1083,12 +1078,11 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>   unlock:
>          if (alloc_ret == -ENOSPC)
>                  zswap_reject_compress_poor++;
> -       else if (comp_ret)
> -               zswap_reject_compress_fail++;

If you want to keep these debug counters, you can move these forward.

>          else if (alloc_ret)
>                  zswap_reject_alloc_fail++;
> 
> -       mutex_unlock(&acomp_ctx->mutex);
> +       if (acomp_ctx)
> +               mutex_unlock(&acomp_ctx->mutex);
>          return comp_ret == 0 && alloc_ret == 0;

And here we don't care about comp_ret anymore.

Thanks.

>   }
> 
> @@ -1886,8 +1880,6 @@ static int zswap_debugfs_init(void)
>                             zswap_debugfs_root, &zswap_reject_alloc_fail);
>          debugfs_create_u64("reject_kmemcache_fail", 0444,
>                             zswap_debugfs_root, &zswap_reject_kmemcache_fail);
> -       debugfs_create_u64("reject_compress_fail", 0444,
> -                          zswap_debugfs_root, &zswap_reject_compress_fail);
>          debugfs_create_u64("reject_compress_poor", 0444,
>                             zswap_debugfs_root, &zswap_reject_compress_poor);
>          debugfs_create_u64("written_back_pages", 0444,


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-09  0:57   ` Nhat Pham
@ 2024-07-10 21:21     ` Takero Funaki
  2024-07-10 22:10       ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-10 21:21 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Tue, Jul 9, 2024 at 4:17 Nhat Pham <nphamcs@gmail.com> wrote:

>
> Do you see this problem actually manifesting in real life? Does not
> sound infeasible to me, but I wonder how likely this is the case.
>
> Do you have any userspace-visible metrics, or any tracing logs etc.
> that proves that it actually happens?
>
> This might also affect the dynamic shrinker as well FWIW.
>

Although it is rare, on a small VM with 0.5 GB of RAM, performing `apt
upgrade` for an Ubuntu kernel update degrades system responsiveness.
Since a kernel upgrade is memory consuming (it builds a zstd-compressed
initramfs), it creates heavy memory pressure similar to the benchmark.

Unfortunately, I could not get evidence that clearly indicates the
contention. Perhaps IO latency could be a metric?
While allocating a large amount of memory, perf showed that
__swap_writepage() was consuming time and was called mostly from
kswapd, with some fraction from user faults of the Python script and
from shrink_worker(). The CPU was mostly idle even on a single-CPU
system, so lock contention and compression should not be the reason. I
believe this behavior suggests contention on writeback IO.
As shown in the benchmark, reducing shrinker writeback with patches 3
to 6 reduced the elapsed time, which also indicates IO contention.

> > +/*
> > + * To avoid IO contention between pagein/out and global shrinker writeback,
> > + * track the last jiffies of pagein/out and delay the writeback.
> > + * Default to 500msec in alignment with mq-deadline read timeout.
>
> If there is a future version, could you include the reason why you
> select 500msec in the patch's changelog as well?
>

The 500 ms can be any value longer than the average interval between
pageout/in events; the exact value is not significant for the behavior.
If a subsequent pageout rejection occurs while the shrinker is
sleeping, writeback is delayed again by 500 ms from the last timestamp
update. If pageouts occur at a 1 ms interval on average, the minimal
delay would only need to be slightly more than 1 ms.

I chose 500 ms from the mq-deadline scheduler, which by default tries
to complete read IO within a 500 ms timeframe (bfq for HDDs uses a
shorter timeout).
When the shrinker performs writeback IO 500 ms after the last pagein,
the write IO is effectively lower priority than any read IO waiting in
the queue, because the pagein read reaches its deadline first and
becomes the highest priority. This logic emulates low-priority write
IO by voluntarily delaying it.


>
> Hmmm is there a reason why we do not just do:
>
> zswap_shrinker_delay_start = jiffies;
>
> or, more unnecessarily:
>
> unsigned long now = jiffies;
>
> zswap_shrinker_delay_start = now;
>
> IOW, why the branching here? Does not seem necessary to me, but
> perhaps there is a correctness/compiler reason I'm not seeing?
>
> In fact, if it's the first version, then we could manually inline it.
>

That was to avoid unnecessarily invalidating other CPUs' cached copies
of the shared variable. I will remove the branch and manually inline it
for v3.


> Additionally/alternatively, I wonder if it is more convenient to do it
> at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and
> zswap_load() returns false, then delay the shrinker before proceeding
> with the IO steps.
>

Should we expose the timestamp variable? It is only used internally by
zswap, and the timestamp is not required when zswap is disabled.

> >         do {
> > +               /*
> > +                * delay shrinking to allow the last rejected page completes
> > +                * its writeback
> > +                */
> > +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
>
> I assume we do not care about racy access here right? Same goes for
> updates - I don't see any locks protecting these operations (but I
> could be missing something).
>

Right. Do we need an atomic or a spinlock for safety?
I think a bare store/load of an unsigned long is sufficient here. The
possible deviation from concurrent updates is mostly +/- 1 jiffy, and
the sleep does not need millisecond accuracy.
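
To make the intent explicit, the pairing is nothing more than a plain
store/load; something like the following (a sketch only, assuming torn
reads/writes of the unsigned long are the only concern):

	/* writer side, in the zswap_store()/zswap_load() reject paths */
	WRITE_ONCE(zswap_shrinker_delay_start, jiffies);

	/* reader side, in shrink_worker() */
	sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);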

Ah, I found a mistake here: v2 missed a continue statement in the
loop. The delay should be extended if zswap_store() rejects another
page. In v2, only one writeback was allowed per 500 ms, which was not
my intended behavior.
The corrected logic for v3 should be:

               if (time_before(now, sleepuntil) &&
                               time_before(sleepuntil, now + delay + 1)) {
                       fsleep(jiffies_to_usecs(sleepuntil - now));
                       /* check if subsequent pageout/in extended delay */
                       continue;
               }


On Tue, Jul 9, 2024 at 9:57 Nhat Pham <nphamcs@gmail.com> wrote:
>
> Hmm what about this scenario: when we disable zswap writeback on a
> cgroup, if zswap_store() fails, we are delaying the global shrinker
> for no gain essentially. There is no subsequent IO. I don't think we
> are currently handling this, right?
>
> >
> > The same logic applies to zswap_load(). When zswap cannot find requested
> > page from pool and read IO is performed, shrinker should be interrupted.
> >
>
> Yet another (less concerning IMHO) scenario is when a cgroup disables
> zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
> that this cgroup's data are really cold, and/or that the workload is
> latency-tolerant, and do not want it to take up valuable memory
> resources of other cgroups). Every time this cgroup reclaims memory,
> it would disable the global shrinker (including the new proactive
> behavior) for other cgroup, correct? And, when they do need to swap
> in, it would further delay the global shrinker. Would this break of
> isolation be a problem?
>
> There are other concerns I raised in the cover letter's response as
> well - please take a look :)

I haven't considered these cases much, but I suppose the global
shrinker should be delayed in both cases as well. In general, any
pagein/out should be preferred over shrinker writeback throughput.

When zswap writeback is disabled for a memcg
(memory.zswap.writeback=0), I suppose delaying the shrinker writeback
is harmless.
If the rejection incurs no IO, there is no additional memory pressure
and shrinking is not urgent, so we can postpone the shrinker writeback.
If the rejection does incur IO (i.e., mm chooses another page from a
memcg with writeback enabled), again we should delay the shrinker.
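
As a rough sketch (not part of the posted series), the delay update
could simply be skipped when the rejecting folio's memcg has writeback
disabled, since the rejection cannot generate swap IO; this assumes
passing the folio to the helper and reusing
mem_cgroup_zswap_writeback_enabled():

	static inline void zswap_shrinker_delay_update(struct folio *folio)
	{
		struct mem_cgroup *memcg = folio_memcg(folio);

		/* no writeback for this memcg, so no swap IO will follow */
		if (memcg && !mem_cgroup_zswap_writeback_enabled(memcg))
			return;

		WRITE_ONCE(zswap_shrinker_delay_start, jiffies);
	}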

For pageout from a latency-tolerant memcg (zswap.max=0), I think the
pageout latency may still affect other memcgs.
For example, when a latency-sensitive workload tries to allocate
memory, mm might choose to swap out pages from the zswap-disabled
memcg. The slow direct pageout may indirectly delay the allocation of
the latency-sensitive workload. IOW, we cannot determine which workload
would be blocked by a slow pageout based on which memcg owns the page.
In this case, it would be better to delay shrinker writeback even if
the owner is a latency-tolerant memcg.
The same goes for pagein: we cannot determine how urgent the pagein is.
Also for pagein, we cannot determine how urgent the pagein is.

Delaying the shrinker on any pagein/out slows proactive shrinking
progress, but that is still better than the existing shrinker, which
cannot shrink at all.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-10 21:21     ` Takero Funaki
@ 2024-07-10 22:10       ` Nhat Pham
  2024-07-15  7:33         ` Takero Funaki
  0 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-10 22:10 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Wed, Jul 10, 2024 at 2:21 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 4:17 Nhat Pham <nphamcs@gmail.com> wrote:
>
> >
> > Do you see this problem actually manifesting in real life? Does not
> > sound infeasible to me, but I wonder how likely this is the case.
> >
> > Do you have any userspace-visible metrics, or any tracing logs etc.
> > that proves that it actually happens?
> >
> > This might also affect the dynamic shrinker as well FWIW.
> >
>
> Although it is rare, on a small VM with 0.5GB RAM, performing `apt
> upgrade` for ubuntu kernel update degrades system responsiveness.
> Since kernel upgrade is memory consuming for zstd compressed
> initramfs, there is heavy memory pressure like the benchmark.
>
> Unfortunately I could not get evidence that clearly indicates the
> contention. Perhaps IO latency can be a metric?

IO latency, IO pressure, etc. I'm not sure either, as I do not have
the benchmark setup at hand - just throwing out ideas :)

> While allocating large memory, perf showed that __swap_writepage() was
> consuming time and was called mostly from kswapd and some fraction
> from user faults of python script and from shrink_worker. CPU was
> mostly idling even in a single CPU system, so lock contention and
> compression should not be the reason. I believe these behaviors
> suggest contention on writeback IO.

Sounds fair to me.

> As shown in the benchmark,  reducing shrinker writeback by patches 3
> to 6 reduced elapsed time, which also indicates IO contention.
>
> > > +/*
> > > + * To avoid IO contention between pagein/out and global shrinker writeback,
> > > + * track the last jiffies of pagein/out and delay the writeback.
> > > + * Default to 500msec in alignment with mq-deadline read timeout.
> >
> > If there is a future version, could you include the reason why you
> > select 500msec in the patch's changelog as well?
> >
>
> The 500ms can be any value longer than the average interval of each
> pageout/in and is not significant for behavior. If subsequent pageout
> rejection occurs while the shrinker is sleeping, writeback will be
> delayed again by 500ms from the last timestamp update. If pageout
> occurs at a 1ms interval on average, the minimal delay should be 1+ms.
>
> I chose 500ms from the mq-deadline scheduler that tries to perform
> read IO in a 500ms timeframe by default (bfq for HDD uses a shorter
> timeout).
> When the shrinker performs writeback IO with a 500ms delay from the
> last pagein, the write IO will be of lower priority than the read IO
> waiting in the queue, as the pagein read becomes the highest priority
> by the deadline. This logic emulates low-priority write IO by
> voluntarily delaying IO.
>
>
> >
> > Hmmm is there a reason why we do not just do:
> >
> > zswap_shrinker_delay_start = jiffies;
> >
> > or, more unnecessarily:
> >
> > unsigned long now = jiffies;
> >
> > zswap_shrinker_delay_start = now;
> >
> > IOW, why the branching here? Does not seem necessary to me, but
> > perhaps there is a correctness/compiler reason I'm not seeing?
> >
> > In fact, if it's the first version, then we could manually inline it.
> >
>
> That was to avoid invalidating the CPU cache of the shared variable
> unnecessarily. Removing the branch and manually inlining it for v3.

Ah I see. Seems a tad overengineered IMHO, but I'll leave this up to you.

If you keep the current version, please add a comment.

>
>
> > Additionally/alternatively, I wonder if it is more convenient to do it
> > at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and
> > zswap_load() returns false, then delay the shrinker before proceeding
> > with the IO steps.
> >
>
> Should we expose the timestamp variable? It is only used in zswap
> internally, and the timestamp is not required when zswap is disabled.

Not necessarily the timestamp. If you keep the helper, you can just
expose that helper (and have a no-op when !CONFIG_ZSWAP in the zswap.h
header file).


>
> > >         do {
> > > +               /*
> > > +                * delay shrinking to allow the last rejected page completes
> > > +                * its writeback
> > > +                */
> > > +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
> >
> > I assume we do not care about racy access here right? Same goes for
> > updates - I don't see any locks protecting these operations (but I
> > could be missing something).
> >
>
> Right. Do we need atomic or spinlock for safety?
>  I think the bare store/load of unsigned long is sufficient here. The
> possible deviation by concurrent updates is mostly +/-1 jiffy. Sleep
> does not need ms accuracy.
>
> Ah, I found a mistake here. v2 missed continue statement in the loop.
> The delay should be extended if zswap_store() rejects another page. In
> v2, one writeback was allowed per 500ms, which was not my intended
> behavior.
> The corrected logic for v3 should be:
>
>                if (time_before(now, sleepuntil) &&
>                                time_before(sleepuntil, now + delay + 1)) {
>                        fsleep(jiffies_to_usecs(sleepuntil - now));
>                        /* check if subsequent pageout/in extended delay */
>                        continue;
>                }
>
>
> On Tue, Jul 9, 2024 at 9:57 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Hmm what about this scenario: when we disable zswap writeback on a
> > cgroup, if zswap_store() fails, we are delaying the global shrinker
> > for no gain essentially. There is no subsequent IO. I don't think we
> > are currently handling this, right?
> >
> > >
> > > The same logic applies to zswap_load(). When zswap cannot find the requested
> > > page in the pool and a read IO is performed, the shrinker should be interrupted.
> > >
> >
> > Yet another (less concerning IMHO) scenario is when a cgroup disables
> > zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
> > that this cgroup's data are really cold, and/or that the workload is
> > latency-tolerant, and do not want it to take up valuable memory
> > resources of other cgroups). Every time this cgroup reclaims memory,
> > it would disable the global shrinker (including the new proactive
> > behavior) for other cgroup, correct? And, when they do need to swap
> > in, it would further delay the global shrinker. Would this break of
> > isolation be a problem?
> >
> > There are other concerns I raised in the cover letter's response as
> > well - please take a look :)
>
> I haven't considered these cases much, but I suppose the global
> shrinker should be delayed in both cases as well. In general, any
> pagein/out should be preferred over shrinker writeback throughput.
>
> When zswap writeback is disabled for a memcg
> (memory.zswap.writeback=0), I suppose disabling/delaying writeback is
> harmless.
> If the rejection incurs no IO, there is no more memory pressure and
> shrinking is not urgent. We can postpone the shrinker writeback. If
> the rejection incurs IO (i.e. mm chooses another page from a memcg with
> writeback enabled), again we should delay the shrinker.

You are delaying writeback globally right? IOW, other cgroup is also affected?

Say we are under memory pressure, with two cgroups under reclaim - one
with zswap writeback disabled. The one with writeback disabled will
constantly fail at zswap_store(), and delay the global shrinking
action, which could have targeted the other cgroup (which should
proceed fully because there is no contention here since the first
cgroup is not swapping either).

>
> For pageout from latency-tolerant memcg (zswap.max=0), I think pageout
> latency may affect other memcgs.
> For example, when a latency-sensitive workload tries to allocate
> memory, mm might choose to swap out pages from zswap-disabled memcg.
> The slow direct pageout may indirectly delay allocation of the
> latency-sensitive workload. IOW, we cannot determine which workload
> would be blocked by a slow pageout based on which memcg owns the page.
> In this case, it would be better to delay shrinker writeback even if
> the owner is latency tolerant memcg.
> Also for pagein, we cannot determine how urgent the pagein is.
>
> Delaying shrinker on any pagein/out diminishes proactive shrinking
> progress, but that is still better than the existing shrinker that
> cannot shrink.

This is fair though :)



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-09  0:53 ` Nhat Pham
@ 2024-07-10 22:26   ` Takero Funaki
  2024-07-12 23:02     ` Nhat Pham
  2024-07-17  2:53     ` Yosry Ahmed
  0 siblings, 2 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-10 22:26 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Tue, Jul 9, 2024 at 9:53 Nhat Pham <nphamcs@gmail.com> wrote:

> > post-patch, 6.10-rc4 with patch 1 to 5
>
> You mean 1 to 6? There are 6 patches, no?

oops. with patches 1 to 6.

>
> Just out of pure curiosity, could you include the stats from patch 1-3 only?
>

I will rerun the bench in v3. I assume this bench does not reflect
patches 4 to 6, as delta pool_limit_hit=0 means no rejection from
zswap.

> Ah this is interesting. Did you actually see improvement in your real
> deployment (i.e not the benchmark) with patch 4-6 in?
>

As replied in patch 6, yes: with memory-consuming tasks like `apt upgrade`, for instance.

> >
> > Intended scenario for memory reclaim:
> > 1. zswap pool < accept_threshold as the initial state. This is achieved
> >    by patch 3, proactive shrinking.
> > 2. Active processes start allocating pages. Pageout is buffered by zswap
> >    without IO.
> > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> >    incoming pages and starts writeback immediately in the background.
> > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> >    starts rejecting pages. Write IO for the rejected page will consume
> >    all IO resources.
>
> This sounds like the proactive shrinker is still not aggressive
> enough, and/or there are some sort of misspecifications of the zswap
> setting... Correct me if I'm wrong, but the new proactive global
> shrinker begins 1% after the acceptance threshold, and shrinks down to
> acceptance threshold, right? How are we still hitting the pool
> limit...
>

Proactive shrinking should not be aggressive. With patches 4 and 6, I
modified the global shrinker to be less aggressive against pagein/out.
Shrinking proactively cannot avoid hitting the pool limit when memory
pressure grows faster.

> My concern is that we are knowingly (and perhaps unnecessarily)
> creating an LRU inversion here - preferring swapping out the rejected
> pages over the colder pages in the zswap pool. Shouldn't it be the
> other way around? For instance, can we spiral into the following
> scenario:
>
> 1. zswap pool becomes full.
> 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> keeps rejecting incoming pages, and putting a hold on the global
> shrinker.
> 3. The pages that are swapped out are warmer than the ones stored in
> the zswap pool, so they will be more likely to be swapped in (which,
> IIUC, will also further delay the global shrinker).
>
> and the cycle keeps going on and on?

I agree this does not follow LRU, but I think the LRU priority
inversion is unavoidable once the pool limit is hit.
The accept_thr_percent should be lowered to reduce the probability of
LRU inversion if it matters. (it is why I implemented proactive
shrinker.)

When the writeback throughput is slower than memory usage grows,
zswap_store() will have to reject pages sooner or later.
If we evict the oldest stored pages synchronously before rejecting a
new page (rotating pool to keep LRU), it will affect latency depending
how much writeback is required to store the new page. If the oldest
pages were compressed well, we would have to evict too many pages to
store a warmer page, which blocks the reclaim progress. Fragmentation
in the zspool may also increase the required writeback amount.
We cannot accomplish both maintaining LRU priority and maintaining
pageout latency.

Additionally, zswap_writeback_entry() is slower than direct pageout. I
assume this is because the shrinker performs 4KB IO synchronously. I am
seeing that shrinking throughput is limited to disk IOPS * 4KB, while much
higher throughput can be achieved by disabling zswap. Direct pageout
can be faster than zswap writeback, possibly because of bio
optimization or sequential allocation of swap space.


> Have you experimented with synchronous reclaim in the case the pool is
> full? All the way to the acceptance threshold is too aggressive of
> course - you might need to find something in between :)
>

I don't get what the expected situation is.
The benchmark of patch 6 is performing synchronous reclaim in the case
the pool is full, since bulk memory allocation (write to mmapped
space) is much faster than writeback throughput. The zswap pool is
filled instantly at the beginning of benchmark runs. The
accept_thr_percent is not significant for the benchmark, I think.


>
> I wonder if this contention would show up in PSI metrics
> (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> with IO pressure to show the pattern, i.e the contention problem was
> there before, and is now resolved? :)

Unfortunately, I could not find a reliable metric other than elapsed
time. It seems PSI does not distinguish stalls for rejected pageout
from stalls for shrinker writeback.
For counters, this issue affects latency but does not increase the
number of pagein/out. Is there any better way to observe the origin of
contention?

Thanks.



* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-07  9:38     ` Takero Funaki
@ 2024-07-12 22:36       ` Nhat Pham
  0 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-07-12 22:36 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Sun, Jul 7, 2024 at 2:38 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Sun, Jul 7, 2024 at 8:53 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > I tried to propose something similar in the past. Please read the
> > following discussion:
> >
> > https://lore.kernel.org/all/CAJD7tka6XRyzYndRNEFZmi0Zj4DD2KnVzt=vMGhfF4iN2B4VKw@mail.gmail.com/
> >
> > But, the TLDR is Yosry was (rightly) concerned that with this
> > approach, memory reclaiming could end up increasing memory usage
> > rather than reducing (since we do not free up the page that fail to
> > zswap-out, and we need extra memory for the zswap metadata of that
> > page).
> >
> > So my vote on this patch would be NACK, until we get around this issue
> > somehow :)
>
> It seems the discussion on that thread mixed up memory allocation
> failure (the system runs out of memory reserves) and incompressible pages
> (the compression algorithm succeeds but the result is equal to or larger
> than PAGE_SIZE).
>
> zswap has been storing pages into dedicated pages 1:1 when compressed
> to near PAGE_SIZE. Using zsmalloc, current zswap stores pages
> compressed to between 3633 bytes (=hugeclass+1) to 4095 bytes
> (=PAGE_SIZE-1) into 1 page. This patch changes the range to 3633 to
> 4096 by treating PAGE_SIZE as a special case. I could not find a
> reason to reject only PAGE_SIZE while accepting PAGE_SIZE-1.
>
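> For illustration, a minimal sketch of that special case (dst stands for
> the per-CPU compression buffer in zswap_compress(); this is a sketch of
> the idea, not the exact patch):
>
>        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
>                                   &acomp_ctx->wait);
>        dlen = acomp_ctx->req->dlen;
>
>        /* treat failures and full-size results as "incompressible" */
>        if (comp_ret || dlen >= PAGE_SIZE) {
>                dlen = PAGE_SIZE;
>                /* store the page content verbatim instead of rejecting it */
>                memcpy_from_folio(dst, folio, 0, PAGE_SIZE);
>        }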

I'm not actually sure if this is true in practice. While yes, zsmalloc
has the capability to store near-PAGE_SIZE objects, this also depends
on the compression algorithm.

At Meta, we use zstd. What I have found is that a lot of the time, it
just flat out rejects the page if it's too poorly compressed. Without
this change, we will not have to suffer the memory overhead of the
zswap_entry structures for these rejected pages, whereas we will with
this change.

We might need to run some tracing to get a histogram of the
distribution of post-compression sizes.

> zswap wastes memory for metadata for all accepted pages but reduces IO

Key word: accepted. The compression algorithm might already have some
built in logic to reject poorly compressed pages, preventing the cases
where the overhead might be too high for the saving.

> amount and latency by keeping pages in compressed buffer memory. For pages
> between 3633 and 4096 bytes, zswap reduces only the latency. This is still
> beneficial because the rare incompressible pages would otherwise trigger urgent
> pageout IO and incur head-of-line blocking on the subsequent pages.
> It also keeps LRU priority for pagein latency.
>
> In the worst case or with a malicious dataset, zswap will waste a
> significant amount of memory, but this patch neither affects nor
> resolves that scenario. For example, if a user allocates pages
> compressed to 3633 bytes, current zswap using zsmalloc cannot gain
> memory as the compression ratio, including zsmalloc overhead, becomes
> 1:1. This also applies to zbud. The compression ratio will be 1:1 as
> zbud cannot find buddies smaller than 463 bytes. zswap will be less
> efficient but still work in this situation since the max pool percent
> and background writeback ensure the pool size does not overwhelm
> usable memory.
>
> I suppose the current zswap has accepted the possible waste of memory,
> at least since the current zswap_compress() logic was implemented. If
> zswap had to ensure the compression ratio is better than 1:1, and only
> prefers reducing IO amount (not latency), there would have been a
> compression ratio threshold to reject pages not compressible to under
> 2048 bytes. I think accepting nearly incompressible pages is
> beneficial and changing the range to 4096 does not negatively affect
> the current behavior.

FWIW, I do agree with your approach (storing incompressible pages in
the zswap pool to maintain LRU ordering) - this is *essentially* what
I was trying to do too with the attempt I mentioned above.

I'll let Johannes and Yosry chime in as well, since they were the
original folks who raised these concerns :) If they're happy then I'll
revoke my NACK.



* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-09 13:26       ` Chengming Zhou
@ 2024-07-12 22:47         ` Nhat Pham
  2024-07-16  2:30           ` Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-12 22:47 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Takero Funaki, Johannes Weiner, Yosry Ahmed, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Tue, Jul 9, 2024 at 6:26 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/7/8 21:44, Takero Funaki wrote:
> > On Mon, Jul 8, 2024 at 12:56 Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >
> >>>        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >>>        dlen = acomp_ctx->req->dlen;
> >>> -     if (comp_ret)
> >>> +
> >>> +     /* coa_compress returns -EINVAL for errors including insufficient dlen */
> >>> +     if (comp_ret && comp_ret != -EINVAL)
> >>>                goto unlock;
> >>
> >> Seems we don't need to care about? "comp_ret" is useless anymore.
> >>
> >> Just:
> >>
> >> if (comp_ret || dlen > PAGE_SIZE - 64)
> >>          dlen = PAGE_SIZE;
> >>
> >> And remove the checkings of comp_ret at the end.
> >>
> >
> >>
> >> We actually don't need to hold mutex if we are just copying folio.
> >>
> >> Thanks.
> >>
> >
> > Thanks for reviewing.
> >
> > For comp_ret, can we consolidate all possible error codes as
> > incompressible data?
>
> Maybe we still want these debug counters? I'm not sure.

I'm a bit torn, but ATM I have no strong opinions on these two error
codes. If you do decide to consolidate these two, may I ask you to
separate it into its own patch so that we can review + discuss it
separately?



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-10 22:26   ` Takero Funaki
@ 2024-07-12 23:02     ` Nhat Pham
  2024-07-15  8:20       ` Takero Funaki
  2024-07-17  2:53     ` Yosry Ahmed
  1 sibling, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-12 23:02 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Wed, Jul 10, 2024 at 3:27 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 9:53 Nhat Pham <nphamcs@gmail.com> wrote:
>
> > > post-patch, 6.10-rc4 with patch 1 to 5
> >
> > You mean 1 to 6? There are 6 patches, no?
>
> oops. with patches 1 to 6.
>
> >
> > Just out of pure curiosity, could you include the stats from patch 1-3 only?
> >
>
> I will rerun the bench in v3. I assume this bench does not reflect
> patches 4 to 6, as delta pool_limit_hit=0 means no rejection from
> zswap.
>
> > Ah this is interesting. Did you actually see improvement in your real
> > deployment (i.e not the benchmark) with patch 4-6 in?
> >
>
> As replied in patch 6, yes: with memory-consuming tasks like `apt upgrade`, for instance.
>
> > >
> > > Intended scenario for memory reclaim:
> > > 1. zswap pool < accept_threshold as the initial state. This is achieved
> > >    by patch 3, proactive shrinking.
> > > 2. Active processes start allocating pages. Pageout is buffered by zswap
> > >    without IO.
> > > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> > >    incoming pages and starts writeback immediately in the background.
> > > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> > >    starts rejecting pages. Write IO for the rejected page will consume
> > >    all IO resources.
> >
> > This sounds like the proactive shrinker is still not aggressive
> > enough, and/or there are some sort of misspecifications of the zswap
> > setting... Correct me if I'm wrong, but the new proactive global
> > shrinker begins 1% after the acceptance threshold, and shrinks down to
> > acceptance threshold, right? How are we still hitting the pool
> > limit...
> >
>
> Proactive shrinking should not be aggressive. With patches 4 and 6, I
> modified the global shrinker to be less aggressive against pagein/out.
> Shrinking proactively cannot avoid hitting the pool limit when memory
> pressure grows faster.
>
> > My concern is that we are knowingly (and perhaps unnecessarily)
> > creating an LRU inversion here - preferring swapping out the rejected
> > pages over the colder pages in the zswap pool. Shouldn't it be the
> > other way around? For instance, can we spiral into the following
> > scenario:
> >
> > 1. zswap pool becomes full.
> > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > keeps rejecting incoming pages, and putting a hold on the global
> > shrinker.
> > 3. The pages that are swapped out are warmer than the ones stored in
> > the zswap pool, so they will be more likely to be swapped in (which,
> > IIUC, will also further delay the global shrinker).
> >
> > and the cycle keeps going on and on?
>
> I agree this does not follow LRU, but I think the LRU priority
> inversion is unavoidable once the pool limit is hit.
> The accept_thr_percent should be lowered to reduce the probability of
> LRU inversion if it matters. (it is why I implemented proactive
> shrinker.)

And yet, in your own benchmark it fails to prevent that, no? I think
you lowered it all the way down to 50%.

>
> When the writeback throughput is slower than memory usage grows,
> zswap_store() will have to reject pages sooner or later.
> If we evict the oldest stored pages synchronously before rejecting a
> new page (rotating pool to keep LRU), it will affect latency depending
> how much writeback is required to store the new page. If the oldest
> pages were compressed well, we would have to evict too many pages to
> store a warmer page, which blocks the reclaim progress. Fragmentation
> in the zspool may also increase the required writeback amount.
> We cannot accomplish both maintaining LRU priority and maintaining
> pageout latency.

Hmm yeah, I guess this is fair. Looks like there is not a lot of
choice, if you want to maintain decent pageout latency...

I could suggest that you have a budgeted zswap writeback on zswap
store - i.e. if the pool is full, then try zswap writeback until we
have enough space or the budget is reached. But that feels like
even more engineering - the IO priority approach might even be easier
at that point LOL.

Oh well, global shrinker delay it is :)
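For concreteness, a rough sketch of that budgeted idea (purely
illustrative - shrink_memcg(), zswap_total_pages() and zswap_max_pages()
are names from this series/mainline, the budget constant and the helper
are made up):

#define ZSWAP_STORE_WB_BUDGET	16	/* made-up per-store budget */

static bool zswap_make_room(struct mem_cgroup *memcg)
{
	int budget = ZSWAP_STORE_WB_BUDGET;

	/* write back a bounded number of entries before giving up */
	while (zswap_total_pages() > zswap_max_pages()) {
		if (budget-- == 0 || shrink_memcg(memcg) < 0)
			return false;
	}
	return true;
}

zswap_store() could call it when the pool is full and only reject the
page if it returns false.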

>
> Additionally, zswap_writeback_entry() is slower than direct pageout. I
> assume this is because shrinker performs 4KB IO synchronously. I am
> seeing shrinking throughput is limited by disk IOPS * 4KB while much
> higher throughput can be achieved by disabling zswap. direct pageout
> can be faster than zswap writeback, possibly because of bio
> optimization or sequential allocation of swap.

Hah, this is interesting!

I wonder though, if the solution here is to perform some sort of
batching for zswap writeback.

BTW, what is the type of the storage device you are using for swap? Is
it SSD or HDD etc?

>
>
> > Have you experimented with synchronous reclaim in the case the pool is
> > full? All the way to the acceptance threshold is too aggressive of
> > course - you might need to find something in between :)
> >
>
> I don't get what the expected situation is.
> The benchmark of patch 6 is performing synchronous reclaim in the case
> the pool is full, since bulk memory allocation (write to mmapped
> space) is much faster than writeback throughput. The zswap pool is
> filled instantly at the beginning of benchmark runs. The
> accept_thr_percent is not significant for the benchmark, I think.

No. I meant synchronous reclaim as in triggering zswap writeback
within the zswap store path, to make space for the incoming new zswap
pages. But you already addressed it above :)


>
>
> >
> > I wonder if this contention would show up in PSI metrics
> > (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> > correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> > with IO pressure to show the pattern, i.e the contention problem was
> > there before, and is now resolved? :)
>
> Unfortunately, I could not find a reliable metric other than elapsed
> time. It seems PSI does not distinguish stalls for rejected pageout
> from stalls for shrinker writeback.
> For counters, this issue affects latency but does not increase the
> number of pagein/out. Is there any better way to observe the origin of
> contention?
>
> Thanks.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit
  2024-07-06  2:25 ` [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit Takero Funaki
@ 2024-07-12 23:18   ` Nhat Pham
  0 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-07-12 23:18 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch implements proactive shrinking of zswap pool before the max
> pool size limit is reached. This also changes zswap to accept new pages
> while the shrinker is running.
>
> To prevent zswap from rejecting new pages and incurring latency when
> zswap is full, this patch queues the global shrinker by a pool usage
> threshold between 100% and accept_thr_percent, instead of the max pool
> size.  The pool size will be controlled between 90% to 91% for the
> default accept_thr_percent=90.  Since the current global shrinker
> continues to shrink until accept_thr_percent, we do not need to maintain
> the hysteresis variable tracking the pool limit overage in
> zswap_store().
>
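> A minimal sketch of the new start condition (zswap_shrink_start_pages()
> here is illustrative shorthand, not the exact code):
>
>        static unsigned long zswap_shrink_start_pages(void)
>        {
>                /* start shrinking 1% above the acceptance threshold */
>                return zswap_max_pages() * (zswap_accept_thr_percent + 1) / 100;
>        }
>
>        /* in zswap_store(), after accounting the newly stored page */
>        if (zswap_total_pages() > zswap_shrink_start_pages())
>                queue_work(shrink_wq, &zswap_shrink_work);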
> Before this patch, zswap rejected pages while the shrinker was running
> without incrementing the zswap_pool_limit_hit counter. That could be a
> reason why zswap wrote through new pages before writing back old pages.
> With this patch, zswap accepts new pages while shrinking, and increments
> the counter if and only if zswap rejects pages due to the max pool size.
>
> Now, reclaims smaller than the proactive shrinking amount finish
> instantly and trigger background shrinking.  Admins can check if new
> pages are buffered by zswap by monitoring the pool_limit_hit counter.
>
> The name of sysfs tunable accept_thr_percent is unchanged as it is still
> the stop condition of the shrinker.
> The respective documentation is updated to describe the new behavior.
>
> Signed-off-by: Takero Funaki <flintglass@gmail.com>

Code-wise and idea-wise, I like this patch :) This is dependent on
other changes (as you have pointed out), but if the series goes
through, please feel free to include:

Reviewed-by: Nhat Pham <nphamcs@gmail.com>



* Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
  2024-07-10 22:10       ` Nhat Pham
@ 2024-07-15  7:33         ` Takero Funaki
  0 siblings, 0 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-15  7:33 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Thu, Jul 11, 2024 at 7:10 Nhat Pham <nphamcs@gmail.com> wrote:

> > > Yet another (less concerning IMHO) scenario is when a cgroup disables
> > > zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
> > > that this cgroup's data are really cold, and/or that the workload is
> > > latency-tolerant, and do not want it to take up valuable memory
> > > resources of other cgroups). Every time this cgroup reclaims memory,
> > > it would disable the global shrinker (including the new proactive
> > > behavior) for other cgroup, correct? And, when they do need to swap
> > > in, it would further delay the global shrinker. Would this break of
> > > isolation be a problem?
> > >
> > > There are other concerns I raised in the cover letter's response as
> > > well - please take a look :)
> >
> > I haven't considered these cases much, but I suppose the global
> > shrinker should be delayed in both cases as well. In general, any
> > pagein/out should be preferred over shrinker writeback throughput.
> >
> > When zswap writeback is disabled for a memcg
> > (memory.zswap.writeback=0), I suppose disabling/delaying writeback is
> > harmless.
> > If the rejection incurs no IO, there is no more memory pressure and
> > shrinking is not urgent. We can postpone the shrinker writeback. If
> > the rejection incurs IO (i.e. mm chooses another page from a memcg with
> > writeback enabled), again we should delay the shrinker.
>
> You are delaying writeback globally right? IOW, other cgroup is also affected?
>
> Say we are under memory pressure, with two cgroups under reclaim - one
> with zswap writeback disabled. The one with writeback disabled will
> constantly fail at zswap_store(), and delay the global shrinking
> action, which could have targeted the other cgroup (which should
> proceed fully because there is no contention here since the first
> cgroup is not swapping either).
>

Thanks, I think I understand your concern. Even if zswap rejected
pages, that does not mean we need IO because memory.zswap.writeback=0
also disables memory-to-disk writeback...
And yes, v2 interrupts the shrinker in this case, which is
unnecessary. I'll move the timer updates to page_io.c like this:

--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -205,6 +205,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
                folio_mark_dirty(folio);
                return AOP_WRITEPAGE_ACTIVATE;
        }
+       zswap_shrinker_delay_extend();

        __swap_writepage(folio, wbc);
        return 0;


This extends the delay only if actual I/O is necessary.
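The read side would get the same hook; sketched below with the
surrounding swap_read_folio() context omitted (the helper name is the
one used in this thread):

	/* only extend the delay when zswap misses and a disk read follows */
	if (!zswap_load(folio)) {
		zswap_shrinker_delay_extend();
		/* ... proceed with the synchronous or bio-based read ... */
	}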



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-12 23:02     ` Nhat Pham
@ 2024-07-15  8:20       ` Takero Funaki
  2024-07-26 18:13         ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Takero Funaki @ 2024-07-15  8:20 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Sat, Jul 13, 2024 at 8:02 Nhat Pham <nphamcs@gmail.com> wrote:

> >
> > I agree this does not follow LRU, but I think the LRU priority
> > inversion is unavoidable once the pool limit is hit.
> > The accept_thr_percent should be lowered to reduce the probability of
> > LRU inversion if it matters. (it is why I implemented proactive
> > shrinker.)
>
> And yet, in your own benchmark it fails to prevent that, no? I think
> you lower it all the way down to 50%.
>
> >
> > When the writeback throughput is slower than memory usage grows,
> > zswap_store() will have to reject pages sooner or later.
> > If we evict the oldest stored pages synchronously before rejecting a
> > new page (rotating pool to keep LRU), it will affect latency depending
> > how much writeback is required to store the new page. If the oldest
> > pages were compressed well, we would have to evict too many pages to
> > store a warmer page, which blocks the reclaim progress. Fragmentation
> > in the zspool may also increase the required writeback amount.
> > We cannot accomplish both maintaining LRU priority and maintaining
> > pageout latency.
>
> Hmm yeah, I guess this is fair. Looks like there is not a lot of
> choice, if you want to maintain decent pageout latency...
>
> I could suggest that you have a budgeted zswap writeback on zswap
> store - i.e if the pool is full, then try to zswap writeback until we
> have enough space or if the budget is reached. But that feels like
> even more engineering - the IO priority approach might even be easier
> at that point LOL.
>
> Oh well, global shrinker delay it is :)
>
> >
> > Additionally, zswap_writeback_entry() is slower than direct pageout. I
> > assume this is because shrinker performs 4KB IO synchronously. I am
> > seeing shrinking throughput is limited by disk IOPS * 4KB while much
> > higher throughput can be achieved by disabling zswap. direct pageout
> > can be faster than zswap writeback, possibly because of bio
> > optimization or sequential allocation of swap.
>
> Hah, this is interesting!
>
> I wonder though, if the solution here is to perform some sort of
> batching for zswap writeback.
>
> BTW, what is the type of the storage device you are using for swap? Is
> it SSD or HDD etc?
>

It was tested on an Azure VM with SSD-backed storage. The total IOPS
was capped at 4K by the VM host. The max throughput of the global
shrinker was around 16 MB/s. Proactive shrinking cannot prevent
pool_limit_hit since memory allocation can be on the order of GB/s.
(The benchmark script allocates 2 GB sequentially, which was
compressed to 1.3 GB, while the zswap pool was limited to 200 MB.)
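(4K IOPS * 4KB per synchronous writeback works out to roughly 16 MB/s,
which matches the ceiling observed above.)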


> >
> >
> > > Have you experimented with synchronous reclaim in the case the pool is
> > > full? All the way to the acceptance threshold is too aggressive of
> > > course - you might need to find something in between :)
> > >
> >
> > I don't get what the expected situation is.
> > The benchmark of patch 6 is performing synchronous reclaim in the case
> > the pool is full, since bulk memory allocation (write to mmapped
> > space) is much faster than writeback throughput. The zswap pool is
> > filled instantly at the beginning of benchmark runs. The
> > accept_thr_percent is not significant for the benchmark, I think.
>
> No. I meant synchronous reclaim as in triggering zswap writeback
> within the zswap store path, to make space for the incoming new zswap
> pages. But you already addressed it above :)
>
>
> >
> >
> > >
> > > I wonder if this contention would show up in PSI metrics
> > > (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> > > correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> > > with IO pressure to show the pattern, i.e the contention problem was
> > > there before, and is now resolved? :)
> >
> > Unfortunately, I could not find a reliable metric other than elapsed
> > time. It seems PSI does not distinguish stalls for rejected pageout
> > from stalls for shrinker writeback.
> > For counters, this issue affects latency but does not increase the
> > number of pagein/out. Is there any better way to observe the origin of
> > contention?
> >
> > Thanks.



* Re: [PATCH v2 5/6] mm: zswap: store incompressible page as-is
  2024-07-12 22:47         ` Nhat Pham
@ 2024-07-16  2:30           ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-07-16  2:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Takero Funaki, Johannes Weiner, Yosry Ahmed, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On 2024/7/13 06:47, Nhat Pham wrote:
> On Tue, Jul 9, 2024 at 6:26 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/7/8 21:44, Takero Funaki wrote:
>>> On Mon, Jul 8, 2024 at 12:56 Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>
>>>>>         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>>>>>         dlen = acomp_ctx->req->dlen;
>>>>> -     if (comp_ret)
>>>>> +
>>>>> +     /* coa_compress returns -EINVAL for errors including insufficient dlen */
>>>>> +     if (comp_ret && comp_ret != -EINVAL)
>>>>>                 goto unlock;
>>>>
>>>> Seems we don't need to care about? "comp_ret" is useless anymore.
>>>>
>>>> Just:
>>>>
>>>> if (comp_ret || dlen > PAGE_SIZE - 64)
>>>>           dlen = PAGE_SIZE;
>>>>
>>>> And remove the checkings of comp_ret at the end.
>>>>
>>>
>>>>
>>>> We actually don't need to hold mutex if we are just copying folio.
>>>>
>>>> Thanks.
>>>>
>>>
>>> Thanks for reviewing.
>>>
>>> For comp_ret, can we consolidate all possible error codes as
>>> incompressible data?
>>
>> Maybe we still want these debug counters? I'm not sure.
> 
> I'm a bit torn, but ATM I have no strong opinions on these two error
> codes. If you do decide to consolidate these two, may I ask you to
> separate it into its own patch so that we can review + discuss it
> separately?

Yeah, I also have no strong opinions on these two error codes,
it's just ok to keep them.



* Re: [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration
  2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
  2024-07-08  4:54   ` Chengming Zhou
@ 2024-07-17  1:54   ` Yosry Ahmed
  1 sibling, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-07-17  1:54 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Nhat Pham, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offline memcg and continues shrinking with the next memcg.
>
> Note that, to avoid a refcount leak of offline memcg encountered during
> the memcg tree walking, shrink_worker() must continue iterating to find
> the next online memcg.

Please do not use the word "leak" here. It is confusing. The refcount
is not leaked, we just have a long-standing ref that should eventually
be dropped (although in theory, shrink_worker() may never be called
again). Leak makes it sound like we increment the refcount but
completely forget dropping it, which is not the case here.

>
> The following minor issues in the existing code are also resolved by the
> change in the iteration logic:
>
> - A rare temporary refcount leak in the offline memcg cleaner, where the
>   next memcg of the offlined memcg is also offline.  The leaked memcg
>   cannot be freed until the next shrink_worker() releases the reference.
>
> - One memcg was skipped from shrinking when the offline memcg cleaner
>   advanced the cursor of memcg tree. It is addressed by a flag to
>   indicate that the cursor has already been advanced.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 94 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 73 insertions(+), 21 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..29944d8145af 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -171,6 +171,7 @@ static struct list_lru zswap_list_lru;
>  /* The lock protects zswap_next_shrink updates. */
>  static DEFINE_SPINLOCK(zswap_shrink_lock);
>  static struct mem_cgroup *zswap_next_shrink;
> +static bool zswap_next_shrink_changed;
>  static struct work_struct zswap_shrink_work;
>  static struct shrinker *zswap_shrinker;
>
> @@ -775,12 +776,39 @@ void zswap_folio_swapin(struct folio *folio)
>         }
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
> -       if (zswap_next_shrink == memcg)
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +       if (zswap_next_shrink == memcg) {
> +               /*
> +                * We advance the cursor to put back the offlined memcg.
> +                * shrink_worker() should not advance the cursor again.
> +                */
> +               zswap_next_shrink_changed = true;

I think this should be rare enough that it's not worth the extra complexity imo.

> +
> +               do {
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                       zswap_next_shrink, NULL);
> +               } while (zswap_next_shrink &&
> +                               !mem_cgroup_online(zswap_next_shrink));
> +               /*
> +                * We verified the next memcg is online.  Even if the next
> +                * memcg is being offlined here, another cleaner must be
> +                * waiting for our lock.  We can leave the online memcg
> +                * reference.
> +                */
> +       }
>         spin_unlock(&zswap_shrink_lock);
>  }
>
> @@ -1319,18 +1347,42 @@ static void shrink_worker(struct work_struct *w)
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> -       /* global reclaim will select cgroup in a round-robin fashion. */
> +       /* global reclaim will select cgroup in a round-robin fashion.
> +        *
> +        * We save iteration cursor memcg into zswap_next_shrink,
> +        * which can be modified by the offline memcg cleaner
> +        * zswap_memcg_offline_cleanup().
> +        *
> +        * Since the offline cleaner is called only once, we cannot leave an
> +        * offline memcg reference in zswap_next_shrink.
> +        * We can rely on the cleaner only if we get online memcg under lock.
> +        *
> +        * If we get an offline memcg, we cannot determine the cleaner has

we cannot determine if* the cleaner

> +        * already been called or will be called later. We must put back the
> +        * reference before returning from this function. Otherwise, the
> +        * offline memcg left in zswap_next_shrink will hold the reference
> +        * until the next run of shrink_worker().
> +        */
>         do {
>                 spin_lock(&zswap_shrink_lock);
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -               memcg = zswap_next_shrink;
>
>                 /*
> -                * We need to retry if we have gone through a full round trip, or if we
> -                * got an offline memcg (or else we risk undoing the effect of the
> -                * zswap memcg offlining cleanup callback). This is not catastrophic
> -                * per se, but it will keep the now offlined memcg hostage for a while.
> -                *
> +                * Start shrinking from the next memcg after zswap_next_shrink.
> +                * To not skip a memcg, do not advance the cursor when it has
> +                * already been advanced by the offline cleaner.
> +                */
> +               do {
> +                       if (zswap_next_shrink_changed) {
> +                               /* cleaner advanced the cursor */
> +                               zswap_next_shrink_changed = false;
> +                       } else {
> +                               zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                               zswap_next_shrink, NULL);
> +                       }
> +                       memcg = zswap_next_shrink;
> +               } while (memcg && !mem_cgroup_tryget_online(memcg));
> +
> +               /*
>                  * Note that if we got an online memcg, we will keep the extra
>                  * reference in case the original reference obtained by mem_cgroup_iter
>                  * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1344,17 +1396,11 @@ static void shrink_worker(struct work_struct *w)
>                         goto resched;
>                 }
>
> -               if (!mem_cgroup_tryget_online(memcg)) {
> -                       /* drop the reference from mem_cgroup_iter() */
> -                       mem_cgroup_iter_break(NULL, memcg);
> -                       zswap_next_shrink = NULL;
> -                       spin_unlock(&zswap_shrink_lock);
> -
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> -                               break;
> -
> -                       goto resched;
> -               }
> +               /*
> +                * We verified the memcg is online and got an extra memcg
> +                * reference.  Our memcg might be offlined concurrently but the
> +                * respective offline cleaner must be waiting for our lock.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
>
>                 ret = shrink_memcg(memcg);
> @@ -1368,6 +1414,12 @@ static void shrink_worker(struct work_struct *w)
>  resched:
>                 cond_resched();
>         } while (zswap_total_pages() > thr);
> +
> +       /*
> +        * We can still hold the original memcg reference.
> +        * The reference is stored in zswap_next_shrink, and then reused
> +        * by the next shrink_worker().
> +        */

This is unnecessary imo.

>  }
>
>  /*********************************
> --
> 2.43.0
>



* Re: [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic
  2024-07-06  2:25 ` [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic Takero Funaki
@ 2024-07-17  2:39   ` Yosry Ahmed
  0 siblings, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-07-17  2:39 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Nhat Pham, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes zswap global shrinker that did not shrink zpool as
> expected.
>
> The issue it addresses is that `shrink_worker()` did not distinguish
> between unexpected errors and expected error codes that should be
> skipped, such as when there is no stored page in a memcg. This led to
> the shrinking process being aborted on the expected error codes.
>
> The shrinker should ignore these cases and skip to the next memcg.
> However,  skipping all memcgs presents another problem. To address this,
> this patch tracks progress while walking the memcg tree and checks for
> progress once the tree walk is completed.
>
> To handle the empty memcg case, the helper function `shrink_memcg()` is
> modified to check if the memcg is empty and then return -ENOENT.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 29944d8145af..f092932e652b 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1317,10 +1317,10 @@ static struct shrinker *zswap_alloc_shrinker(void)
>
>  static int shrink_memcg(struct mem_cgroup *memcg)
>  {
> -       int nid, shrunk = 0;
> +       int nid, shrunk = 0, scanned = 0;
>
>         if (!mem_cgroup_zswap_writeback_enabled(memcg))
> -               return -EINVAL;
> +               return -ENOENT;
>
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
> @@ -1334,19 +1334,30 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
>                 shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
>                                             &shrink_memcg_cb, NULL, &nr_to_walk);
> +               scanned += 1 - nr_to_walk;
>         }
> +
> +       if (!scanned)
> +               return -ENOENT;
> +
>         return shrunk ? 0 : -EAGAIN;
>  }
>
>  static void shrink_worker(struct work_struct *w)
>  {
>         struct mem_cgroup *memcg;
> -       int ret, failures = 0;
> +       int ret, failures = 0, progress;
>         unsigned long thr;
>
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> +       /*
> +        * We might start from the last memcg.
> +        * That is not a failure.
> +        */
> +       progress = 1;
> +

I think this is unneeded complexity imo. Doing one less retry if we
happen to start at the last memcg should be fine.

>         /* global reclaim will select cgroup in a round-robin fashion.
>          *
>          * We save iteration cursor memcg into zswap_next_shrink,
> @@ -1390,9 +1401,12 @@ static void shrink_worker(struct work_struct *w)
>                  */
>                 if (!memcg) {
>                         spin_unlock(&zswap_shrink_lock);
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> +
> +                       /* tree walk completed but no progress */
> +                       if (!progress && ++failures == MAX_RECLAIM_RETRIES)
>                                 break;
>
> +                       progress = 0;
>                         goto resched;
>                 }
>
> @@ -1407,10 +1421,15 @@ static void shrink_worker(struct work_struct *w)
>                 /* drop the extra reference */
>                 mem_cgroup_put(memcg);
>
> -               if (ret == -EINVAL)
> -                       break;
> +               /* not a writeback candidate memcg */

Unneeded comment.

> +               if (ret == -ENOENT)
> +                       continue;
> +
>                 if (ret && ++failures == MAX_RECLAIM_RETRIES)
>                         break;
> +
> +               ++progress;
> +               /* reschedule as we performed some IO */

Unneeded comment.


>  resched:
>                 cond_resched();
>         } while (zswap_total_pages() > thr);
> --
> 2.43.0
>



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-10 22:26   ` Takero Funaki
  2024-07-12 23:02     ` Nhat Pham
@ 2024-07-17  2:53     ` Yosry Ahmed
  2024-07-17 17:49       ` Nhat Pham
  1 sibling, 1 reply; 37+ messages in thread
From: Yosry Ahmed @ 2024-07-17  2:53 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Nhat Pham, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

[..]
>
> > My concern is that we are knowingly (and perhaps unnecessarily)
> > creating an LRU inversion here - preferring swapping out the rejected
> > pages over the colder pages in the zswap pool. Shouldn't it be the
> > other way around? For instance, can we spiral into the following
> > scenario:
> >
> > 1. zswap pool becomes full.
> > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > keeps rejecting incoming pages, and putting a hold on the global
> > shrinker.
> > 3. The pages that are swapped out are warmer than the ones stored in
> > the zswap pool, so they will be more likely to be swapped in (which,
> > IIUC, will also further delay the global shrinker).
> >
> > and the cycle keeps going on and on?
>
> I agree this does not follow LRU, but I think the LRU priority
> inversion is unavoidable once the pool limit is hit.
> The accept_thr_percent should be lowered to reduce the probability of
> LRU inversion if it matters. (it is why I implemented proactive
> shrinker.)

Why?

Let's take a step back. You are suggesting that we throttle zswap
writeback to allow reclaim to swapout warmer pages to swap device. As
Nhat said, we are proliferating LRU inversion instead of fixing it.

I think I had a similar discussion with Johannes about this before,
and we discussed that if zswap becomes full, we should instead
throttle reclaim and allow zswap writeback to proceed (i.e. the
opposite of what this series is doing). This would be similar to how
we throttle reclaim today to wait for dirty pages to be written back.

This should reduce/fix the LRU inversion instead of proliferating it,
and it should reduce the total amount of IO as colder pages should go
to disk while warmer pages go to zswap. I am wondering if we can reuse
the reclaim_throttle() mechanism here.
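As a rough sketch of where such a hook could sit (VMSCAN_THROTTLE_ZSWAP
and zswap_is_full() are hypothetical here; reclaim_throttle() is the
existing helper in mm/vmscan.c):

	/* swap-out path, zswap rejected the page because the pool is full */
	if (!zswap_store(folio)) {
		if (zswap_is_full())
			reclaim_throttle(folio_pgdat(folio), VMSCAN_THROTTLE_ZSWAP);
		/* fall back to writing the folio to the swap device */
	}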

One concern I have is that we will also throttle file pages if we use
reclaim_throttle(), since I don't see per-type throttling there. This
could be fine, since we similarly throttle zswap reclaim if there are
too many dirty file pages. I am not super familiar with reclaim
throttling, so maybe I missed something obvious or there is a better
way, but I believe that from a high level this should be the right way
to go.

I actually think if we do this properly, and throttle reclaim when
zswap becomes full, we may be able to drop the acceptance hysteresis
and rely on the throttling mechanism to make sure we stop reclaim
until we free up enough space in zswap to avoid consistently hitting
the limit, but this could be a future extension.

Johannes, any thoughts here?

Anyway, since patches 1-2 are independent of the rest of the series,
feel free to send them separately, and we can continue the discussion
on the best way forward for the rest of the series.



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-17  2:53     ` Yosry Ahmed
@ 2024-07-17 17:49       ` Nhat Pham
  2024-07-17 18:05         ` Yosry Ahmed
  0 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-17 17:49 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Takero Funaki, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> >
> > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > creating an LRU inversion here - preferring swapping out the rejected
> > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > other way around? For instance, can we spiral into the following
> > > scenario:
> > >
> > > 1. zswap pool becomes full.
> > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > keeps rejecting incoming pages, and putting a hold on the global
> > > shrinker.
> > > 3. The pages that are swapped out are warmer than the ones stored in
> > > the zswap pool, so they will be more likely to be swapped in (which,
> > > IIUC, will also further delay the global shrinker).
> > >
> > > and the cycle keeps going on and on?
> >
> > I agree this does not follow LRU, but I think the LRU priority
> > inversion is unavoidable once the pool limit is hit.
> > The accept_thr_percent should be lowered to reduce the probability of
> > LRU inversion if it matters. (it is why I implemented proactive
> > shrinker.)
>
> Why?
>
> Let's take a step back. You are suggesting that we throttle zswap
> writeback to allow reclaim to swapout warmer pages to swap device. As
> Nhat said, we are proliferating LRU inversion instead of fixing it.
>
> I think I had a similar discussion with Johannes about this before,
> and we discussed that if zswap becomes full, we should instead
> throttle reclaim and allow zswap writeback to proceed (i.e. the
> opposite of what this series is doing). This would be similar to how
> we throttle reclaim today to wait for dirty pages to be written back.
>

I completely agree with this analysis and proposal - it's somewhat
similar to what I have in mind, but more fleshed out :)

> This should reduce/fix the LRU inversion instead of proliferating it,
> and it should reduce the total amount of IO as colder pages should go
> to disk while warmer pages go to zswap. I am wondering if we can reuse
> the reclaim_throttle() mechanism here.
>
> One concern I have is that we will also throttle file pages if we use
> reclaim_throttle(), since I don't see per-type throttling there. This
> could be fine, since we similarly throttle zswap reclaim if there are
> too many dirty file pages. I am not super familiar with reclaim
> throttling, so maybe I missed something obvious or there is a better
> way, but I believe that from a high level this should be the right way
> to go.

I don't think we have any infrastructure for anon-only throttling in
vmscan logic, but it sounds trivial to implement if necessary :)

>
> I actually think if we do this properly, and throttle reclaim when
> zswap becomes full, we may be able to drop the acceptance hysteresis
> and rely on the throttling mechanism to make sure we stop reclaim
> until we free up enough space in zswap to avoid consistently hitting
> the limit, but this could be a future extension.

Agree - this hysteresis heuristics needs to die.

IMHO, I think we should still have the proactive global shrinking
action that Takero is proposing in patch 3. The throttling is nice,
but it'd be even nicer if we can get ahead of that :)

>
> Johannes, any thoughts here?
>
> Anyway, since patches 1-2 are independent of the rest of the series,
> feel free to send them separately, and we can continue the discussion
> on the best way forward for the rest of the series.

+1.



* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-17 17:49       ` Nhat Pham
@ 2024-07-17 18:05         ` Yosry Ahmed
  2024-07-17 19:01           ` Nhat Pham
  2024-07-19 14:55           ` Takero Funaki
  0 siblings, 2 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-07-17 18:05 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Takero Funaki, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Wed, Jul 17, 2024 at 10:49 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> > >
> > > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > > creating an LRU inversion here - preferring swapping out the rejected
> > > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > > other way around? For instance, can we spiral into the following
> > > > scenario:
> > > >
> > > > 1. zswap pool becomes full.
> > > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > > keeps rejecting incoming pages, and putting a hold on the global
> > > > shrinker.
> > > > 3. The pages that are swapped out are warmer than the ones stored in
> > > > the zswap pool, so they will be more likely to be swapped in (which,
> > > > IIUC, will also further delay the global shrinker).
> > > >
> > > > and the cycle keeps going on and on?
> > >
> > > I agree this does not follow LRU, but I think the LRU priority
> > > inversion is unavoidable once the pool limit is hit.
> > > The accept_thr_percent should be lowered to reduce the probability of
> > > LRU inversion if it matters. (it is why I implemented proactive
> > > shrinker.)
> >
> > Why?
> >
> > Let's take a step back. You are suggesting that we throttle zswap
> > writeback to allow reclaim to swapout warmer pages to swap device. As
> > Nhat said, we are proliferating LRU inversion instead of fixing it.
> >
> > I think I had a similar discussion with Johannes about this before,
> > and we discussed that if zswap becomes full, we should instead
> > throttle reclaim and allow zswap writeback to proceed (i.e. the
> > opposite of what this series is doing). This would be similar to how
> > we throttle reclaim today to wait for dirty pages to be written back.
> >
>
> I completely agree with this analysis and proposal - it's somewhat
> similar to what I have in mind, but more fleshed out :)
>
> > This should reduce/fix the LRU inversion instead of proliferating it,
> > and it should reduce the total amout of IO as colder pages should go
> > to disk while warmer pages go to zswap. I am wondering if we can reuse
> > the reclaim_throttle() mechanism here.
> >
> > One concern I have is that we will also throttle file pages if we use
> > reclaim_throttle(), since I don't see per-type throttling there. This
> > could be fine, since we similarly throttle zswap reclaim if there are
> > too many dirty file pages. I am not super familiar with reclaim
> > throttling, so maybe I missed something obvious or there is a better
> > way, but I believe that from a high level this should be the right way
> > to go.
>
> I don't think we have any infrastructure for anon-only throttling in
> vmscan logic, but it sounds trivial to implement if necessary :)
>
> >
> > I actually think if we do this properly, and throttle reclaim when
> > zswap becomes full, we may be able to drop the acceptance hysteresis
> > and rely on the throttling mechanism to make sure we stop reclaim
> > until we free up enough space in zswap to avoid consistently hitting
> > the limit, but this could be a future extension.
>
> Agree - this hysteresis heuristic needs to die.
>
> IMHO, I think we should still have the proactive global shrinking
> action that Takero is proposing in patch 3. The throttling is nice,
> but it'd be even nicer if we can get ahead of that :)

I have always thought that the shrinker should play this role in one
way or another. Instead of an arbitrary watermark and asynchronous
work, it incrementally pushes the zswap LRU toward disk as reclaim
activity increases.

Is the point behind proactive shrinking to reduce the latency in
the reclaim path?
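
To make the reclaim-driven framing above concrete, a memcg-aware
shrinker wired into the reclaim path has roughly the shape below. This
is a schematic only, not the actual mm/zswap.c shrinker, and
zswap_lru_count()/zswap_lru_writeback() are assumed helper names:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/shrinker.h>

static unsigned long sketch_count(struct shrinker *s,
                                  struct shrink_control *sc)
{
        /* how many LRU entries this memcg/node could give back */
        return zswap_lru_count(sc->memcg, sc->nid);     /* assumed helper */
}

static unsigned long sketch_scan(struct shrinker *s,
                                 struct shrink_control *sc)
{
        /* write back up to sc->nr_to_scan entries from the LRU tail */
        return zswap_lru_writeback(sc->memcg, sc->nid, sc->nr_to_scan);
}

static int __init sketch_init(void)
{
        struct shrinker *shrinker;

        shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
                                  "zswap-lru-sketch");
        if (!shrinker)
                return -ENOMEM;
        shrinker->count_objects = sketch_count;
        shrinker->scan_objects = sketch_scan;
        shrinker_register(shrinker);
        return 0;
}

With this shape, how much gets written back scales with sc->nr_to_scan,
i.e. with reclaim pressure on that memcg and node, rather than with a
fixed pool watermark and asynchronous work.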


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-17 18:05         ` Yosry Ahmed
@ 2024-07-17 19:01           ` Nhat Pham
  2024-07-19 14:55           ` Takero Funaki
  1 sibling, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-07-17 19:01 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Takero Funaki, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Wed, Jul 17, 2024 at 11:05 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I have always thought that the shrinker should play this role in one
> way or another. Instead of an arbitrary watermark and asynchronous
> work, it incrementally pushes the zswap LRU toward disk as reclaim
> activity increases.
>
> Is the point behind proactive shrinking to reduce the latency in
> the reclaim path?

Yeah, reducing latency is the one benefit I have in mind :) I don't
feel too strongly about this though - in fact I'm more biased
towards the other shrinker in the first place.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-17 18:05         ` Yosry Ahmed
  2024-07-17 19:01           ` Nhat Pham
@ 2024-07-19 14:55           ` Takero Funaki
  1 sibling, 0 replies; 37+ messages in thread
From: Takero Funaki @ 2024-07-19 14:55 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

Thank you all for your reviews and comments.

On Thu, Jul 18, 2024 at 3:05 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jul 17, 2024 at 10:49 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > [..]
> > > >
> > > > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > > > creating an LRU inversion here - preferring swapping out the rejected
> > > > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > > > other way around? For instance, can we spiral into the following
> > > > > scenario:
> > > > >
> > > > > 1. zswap pool becomes full.
> > > > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > > > keeps rejecting incoming pages, and putting a hold on the global
> > > > > shrinker.
> > > > > 3. The pages that are swapped out are warmer than the ones stored in
> > > > > the zswap pool, so they will be more likely to be swapped in (which,
> > > > > IIUC, will also further delay the global shrinker).
> > > > >
> > > > > and the cycle keeps going on and on?
> > > >
> > > > I agree this does not follow LRU, but I think the LRU priority
> > > > inversion is unavoidable once the pool limit is hit.
> > > > The accept_thr_percent should be lowered to reduce the probability of
> > > > LRU inversion if it matters. (it is why I implemented proactive
> > > > shrinker.)
> > >
> > > Why?
> > >
> > > Let's take a step back. You are suggesting that we throttle zswap
> > > writeback to allow reclaim to swapout warmer pages to swap device. As
> > > Nhat said, we are proliferating LRU inversion instead of fixing it.
> > >
> > > I think I had a similar discussion with Johannes about this before,
> > > and we discussed that if zswap becomes full, we should instead
> > > throttle reclaim and allow zswap writeback to proceed (i.e. the
> > > opposite of what this series is doing). This would be similar to how
> > > we throttle reclaim today to wait for dirty pages to be written back.
> > >
> >
> > I completely agree with this analysis and proposal - it's somewhat
> > similar to what I have in mind, but more fleshed out :)
> >
> > > This should reduce/fix the LRU inversion instead of proliferating it,
> > > and it should reduce the total amount of IO as colder pages should go
> > > to disk while warmer pages go to zswap. I am wondering if we can reuse
> > > the reclaim_throttle() mechanism here.
> > >
> > > One concern I have is that we will also throttle file pages if we use
> > > reclaim_throttle(), since I don't see per-type throttling there. This
> > > could be fine, since we similarly throttle zswap reclaim if there are
> > > too many dirty file pages. I am not super familiar with reclaim
> > > throttling, so maybe I missed something obvious or there is a better
> > > way, but I believe that from a high level this should be the right way
> > > to go.
> >
> > I don't think we have any infrastructure for anon-only throttling in
> > vmscan logic, but it sounds trivial to implement if necessary :)
> >
> > >
> > > I actually think if we do this properly, and throttle reclaim when
> > > zswap becomes full, we may be able to drop the acceptance hysteresis
> > > and rely on the throttling mechanism to make sure we stop reclaim
> > > until we free up enough space in zswap to avoid consistently hitting
> > > the limit, but this could be a future extension.
> >
> > Agree - this hysteresis heuristic needs to die.
> >
> > IMHO, I think we should still have the proactive global shrinking
> > action that Takero is proposing in patch 3. The throttling is nice,
> > but it'd be even nicer if we can get ahead of that :)
>
> I have always thought that the shrinker should play this role in one
> way or another. Instead of an arbitrary watermark and asynchronous
> work, it incrementally pushes the zswap LRU toward disk as reclaim
> activity increases.
>
> Is the point behind proactive shrinking to reduce the latency in
> the reclaim path?

For proactive shrinking, I thought the latency and throughput of
pageout should be prioritized, on the assumption that delaying reclaim
progress through rejection or synchronous writeback is not always
acceptable. Similarly, patch 6 accepted breaking LRU priority to avoid
degrading pageout performance compared to zswap-disabled systems.

But it seems like zswap prefers the LRU heuristics and a larger pool.
The shrinker should write back synchronously once the pool limit is
hit, until the pool is back under the max pool size, and zswap should
apply backpressure to reclaim, right?  If so, my proposal is in the
opposite direction. I will submit patches 1 and 2 as v3.
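
For reference, the direction described above (synchronous writeback
with backpressure once the limit is hit) could look roughly like the
sketch below. It is a sketch only, not the actual mm/zswap.c code;
zswap_over_limit() and zswap_writeback_one() are assumed names for
"pool is above its max size" and "evict one entry from the LRU tail":

#include <linux/sched.h>
#include <linux/types.h>

static bool zswap_make_room(void)
{
        int failures = 0;

        /* Evict synchronously until the pool is back under the limit. */
        while (zswap_over_limit()) {            /* assumed helper */
                if (zswap_writeback_one() < 0 && ++failures > 16)
                        return false;   /* LRU empty or repeated IO errors */
                cond_resched();
        }
        return true;
}

zswap_store() would then call something like this where it currently
rejects the page, so a reclaimer that fills zswap is slowed down by the
writeback IO instead of spilling warm pages straight to the swap
device.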


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-15  8:20       ` Takero Funaki
@ 2024-07-26 18:13         ` Nhat Pham
  2024-07-26 18:25           ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-07-26 18:13 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Mon, Jul 15, 2024 at 1:20 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Sat, Jul 13, 2024 at 8:02 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> It was tested on an Azure VM with SSD-backed storage. The total IOPS
> was capped at 4K IOPS by the VM host. The max throughput of the global
> shrinker was around 16 MB/s. Proactive shrinking cannot prevent
> pool_limit_hit since memory allocation can be on the order of GB/s.
> (The benchmark script allocates 2 GB sequentially, which was
> compressed to 1.3 GB, while the zswap pool was limited to 200 MB.)

Hmmm I noticed that in a lot of other swap read/write paths (in
__read_swap_cache_async(), or in shrink_lruvec()), we are doing block
device plugging (blk_{start|finish}_plug()). The global shrinker path,
however, is currently not doing this - it's triggered in a workqueue,
separate from all these reclaim paths.

I wonder if there is any value in doing the same for the zswap global
shrinker. We do acquire a mutex (which can sleep) for every page,
which can unplug, but IIUC we only sleep when the mutex is currently
held by another task, and the mutex is per-CPU. The compression
algorithm is usually non-sleeping as well (for e.g, zstd). So maybe
there could be improvement in throughput here?

(Btw - friendly reminder that everyone should use zsmalloc as the default :))

Anyway, I haven't really played with this, and I don't have the right
setup that mimics your use case. If you do decide to give this a shot,
let me know :)
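
A rough sketch of what plugging the global shrinker loop could look
like - not the actual mm/zswap.c worker; shrink_one_entry() and
shrink_target_reached() are assumed stand-ins for the per-iteration
eviction helper and the stop condition the real worker uses:

#include <linux/blkdev.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

static void shrink_worker_plugged(struct work_struct *w)
{
        struct blk_plug plug;
        int failures = 0;

        /* Batch the writeback bios submitted inside the loop. */
        blk_start_plug(&plug);
        do {
                if (shrink_one_entry() < 0)     /* assumed helper */
                        failures++;
                cond_resched();
        } while (!shrink_target_reached() &&    /* assumed helper */
                 failures < 16);
        blk_finish_plug(&plug);
}

The plug is flushed whenever the task actually sleeps (e.g. blocking on
the per-CPU mutex), so the batching only helps across the iterations in
between - but if the compressor and mutex rarely sleep, that could
still let the block layer merge the swap writeback IO.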


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
  2024-07-26 18:13         ` Nhat Pham
@ 2024-07-26 18:25           ` Nhat Pham
  0 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-07-26 18:25 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm, linux-doc,
	linux-kernel

On Fri, Jul 26, 2024 at 11:13 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> __read_swap_cache_async(), or in shrink_lruvec()), we are doing block

/s/__read_swap_cache_async()/swapin_{cluster|vma}_readahead()

We're initiating the plugging at the swapin callsites.


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2024-07-26 18:25 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-06  2:25 [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
2024-07-06  2:25 ` [PATCH v2 1/6] mm: zswap: fix global shrinker memcg iteration Takero Funaki
2024-07-08  4:54   ` Chengming Zhou
2024-07-17  1:54   ` Yosry Ahmed
2024-07-06  2:25 ` [PATCH v2 2/6] mm: zswap: fix global shrinker error handling logic Takero Funaki
2024-07-17  2:39   ` Yosry Ahmed
2024-07-06  2:25 ` [PATCH v2 3/6] mm: zswap: proactive shrinking before pool size limit is hit Takero Funaki
2024-07-12 23:18   ` Nhat Pham
2024-07-06  2:25 ` [PATCH v2 4/6] mm: zswap: make writeback run in the background Takero Funaki
2024-07-06  2:25 ` [PATCH v2 5/6] mm: zswap: store incompressible page as-is Takero Funaki
2024-07-06 23:53   ` Nhat Pham
2024-07-07  9:38     ` Takero Funaki
2024-07-12 22:36       ` Nhat Pham
2024-07-08  3:56   ` Chengming Zhou
2024-07-08 13:44     ` Takero Funaki
2024-07-09 13:26       ` Chengming Zhou
2024-07-12 22:47         ` Nhat Pham
2024-07-16  2:30           ` Chengming Zhou
2024-07-06  2:25 ` [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO Takero Funaki
2024-07-08 19:17   ` Nhat Pham
2024-07-09  0:57   ` Nhat Pham
2024-07-10 21:21     ` Takero Funaki
2024-07-10 22:10       ` Nhat Pham
2024-07-15  7:33         ` Takero Funaki
2024-07-06 17:32 ` [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink Andrew Morton
2024-07-07 10:54   ` Takero Funaki
2024-07-09  0:53 ` Nhat Pham
2024-07-10 22:26   ` Takero Funaki
2024-07-12 23:02     ` Nhat Pham
2024-07-15  8:20       ` Takero Funaki
2024-07-26 18:13         ` Nhat Pham
2024-07-26 18:25           ` Nhat Pham
2024-07-17  2:53     ` Yosry Ahmed
2024-07-17 17:49       ` Nhat Pham
2024-07-17 18:05         ` Yosry Ahmed
2024-07-17 19:01           ` Nhat Pham
2024-07-19 14:55           ` Takero Funaki
