* [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff
@ 2024-01-28 13:28 Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
From: Chengming Zhou @ 2024-01-28 13:28 UTC (permalink / raw)
To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li
Cc: Chengming Zhou, Johannes Weiner, Nhat Pham, linux-kernel,
linux-mm
This is based on mm-unstable and the "mm: zswap: fix missing folio cleanup
in writeback race path" patch [1].
Changes in v2:
- Append a patch to remove list_lru_putback() since its only user
in zswap has gone, per Nhat.
- Improve the commit messages per Johannes.
- Add comments about the lru rotate in shrink_memcg_cb() per Johannes.
- Collect tags.
- Link to v1: https://lore.kernel.org/all/20240126083015.3557006-1-chengming.zhou@linux.dev/
This series mainly fixes the race between LRU writeback and swapoff,
which was spotted by Yosry [2]. Please see the commits for details.

Thanks for the review and comments!
[1] https://lore.kernel.org/all/20240125085127.1327013-1-yosryahmed@google.com/
[2] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (3):
mm/zswap: don't return LRU_SKIP if we have dropped lru lock
mm/zswap: fix race between lru writeback and swapoff
mm/list_lru: remove list_lru_putback()
include/linux/list_lru.h | 16 -------
mm/list_lru.c | 14 ------
mm/zswap.c | 120 ++++++++++++++++++++---------------------------
3 files changed, 51 insertions(+), 99 deletions(-)
---
base-commit: 13d63cb513841433b69564b07614169e61113720
change-id: 20240126-zswap-writeback-race-2773be0a3ad4
Best regards,
--
Chengming Zhou <zhouchengming@bytedance.com>
* [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock
2024-01-28 13:28 [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
@ 2024-01-28 13:28 ` Chengming Zhou
2024-01-30 0:09 ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
From: Chengming Zhou @ 2024-01-28 13:28 UTC (permalink / raw)
To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li
Cc: Chengming Zhou, Johannes Weiner, Nhat Pham, linux-kernel,
linux-mm
LRU_SKIP can only be returned if we have never dropped the lru lock;
otherwise we need to return LRU_RETRY to restart from the head of the
lru list. If we return LRU_SKIP after dropping the lock, the iteration
might continue from a cursor position that was freed while the lock
was dropped.
In the future we may need to introduce an LRU_STOP status to actually
terminate the ongoing shrinking scan when we encounter a warm page
that is already in the swap cache. The current list_lru implementation
has no way to break out of __list_lru_walk_one() early.
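To illustrate the rule, here is a minimal, hypothetical sketch of a
list_lru walk callback (not part of this patch; my_isolate() and
my_reclaim() are made-up placeholders for whatever the callback
actually does):

static enum lru_status example_walk_cb(struct list_head *item,
				       struct list_lru_one *l,
				       spinlock_t *lock, void *cb_arg)
{
	/*
	 * Could not take the item off the list: the lock was never
	 * dropped, the walk cursor is still valid, LRU_SKIP is fine.
	 */
	if (!my_isolate(l, item))
		return LRU_SKIP;

	/* Once the lock is dropped, the walk cursor may be freed. */
	spin_unlock(lock);
	my_reclaim(item);
	spin_lock(lock);

	/*
	 * Must restart from the head of the list; returning LRU_SKIP
	 * here could let the walk continue from a freed cursor.
	 */
	return LRU_RETRY;
}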
Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
mm/zswap.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 00e90b9b5417..81cb3790e0dd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -901,10 +901,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
* into the warmer region. We should terminate shrinking (if we're in the dynamic
* shrinker context).
*/
- if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
- ret = LRU_SKIP;
+ if (writeback_result == -EEXIST && encountered_page_in_swapcache)
*encountered_page_in_swapcache = true;
- }
goto put_unlock;
}
--
b4 0.10.1
* [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
2024-01-28 13:28 [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
@ 2024-01-28 13:28 ` Chengming Zhou
2024-01-30 0:22 ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
From: Chengming Zhou @ 2024-01-28 13:28 UTC (permalink / raw)
To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li
Cc: Chengming Zhou, Johannes Weiner, Nhat Pham, linux-kernel,
linux-mm
LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
CPU1                           CPU2
shrink_memcg_cb                swap_off
  list_lru_isolate               zswap_invalidate
                                   zswap_swapoff
                                     kfree(tree)
  // UAF
  spin_lock(&tree->lock)
The problem is that the entry on the lru list can't protect the tree
from being freed by swapoff, and the entry itself can also be
invalidated and freed concurrently after we unlock the lru lock.

We can fix it by moving the swap cache allocation ahead of any tree
reference, then checking for the invalidation race under the tree
lock; only after that can we safely dereference the entry. Note that
we can't dereference the entry or the tree anymore once we unlock the
folio, since we depend on the locked swap cache folio to hold off
swapoff.
So this patch moves all tree and entry usage into
zswap_writeback_entry(): we only use the swpentry copied to the stack
to allocate the swap cache folio, and once that returns with the folio
locked we can reference the tree safely. Then we check for the
invalidation race under the tree lock; the rest is much the same as
zswap_load().

Since we can't dereference the entry after zswap_writeback_entry(), we
can't use zswap_lru_putback() anymore; instead we rotate the entry on
the LRU at the beginning. If writeback succeeds, the entry will be
unlinked and freed when it is invalidated.
Another change is that we no longer update the memcg
nr_zswap_protected in the -ENOMEM and -EEXIST cases. The -EEXIST case
means we raced with swapin or with a concurrent shrinker action; since
swapin already updates the memcg nr_zswap_protected, there is no need
to double-count here, and for a concurrent shrinker the folio will be
written back and freed anyway. The -ENOMEM case is extremely rare and
doesn't happen spuriously either, so don't bother distinguishing it.
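For orientation, a condensed sketch of the ordering described above
(this is not the literal code in the diff below; alloc_swap_cache_folio()
is a placeholder for the swap cache allocation done in
zswap_writeback_entry(), and folio cleanup on the error paths is
omitted):

static int writeback_order_sketch(struct zswap_entry *entry, swp_entry_t swpentry)
{
	struct zswap_tree *tree;
	struct folio *folio;

	/*
	 * 1. Pin the swap slot first, using only the stack copy of
	 *    swpentry, by getting a locked folio in the swap cache.
	 */
	folio = alloc_swap_cache_folio(swpentry);
	if (!folio)
		return -ENOMEM;

	/*
	 * 2. Only now is the tree safe to use: the locked folio in the
	 *    swap cache holds off swapoff.
	 */
	tree = swap_zswap_tree(swpentry);
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		spin_unlock(&tree->lock);
		return -ENOMEM;		/* invalidation race */
	}

	/*
	 * 3. The entry is verified alive in the tree, so it is safe to
	 *    take a reference and dereference it for the writeback.
	 */
	zswap_entry_get(entry);
	spin_unlock(&tree->lock);
	return 0;
}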
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
1 file changed, 49 insertions(+), 65 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 81cb3790e0dd..f5cb5a46e4d7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
zpool_get_type((p)->zpools[0]))
static int zswap_writeback_entry(struct zswap_entry *entry,
- struct zswap_tree *tree);
+ swp_entry_t swpentry);
static int zswap_pool_get(struct zswap_pool *pool);
static void zswap_pool_put(struct zswap_pool *pool);
@@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
rcu_read_unlock();
}
-static void zswap_lru_putback(struct list_lru *list_lru,
- struct zswap_entry *entry)
-{
- int nid = entry_to_nid(entry);
- spinlock_t *lock = &list_lru->node[nid].lock;
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- rcu_read_lock();
- memcg = mem_cgroup_from_entry(entry);
- spin_lock(lock);
- /* we cannot use list_lru_add here, because it increments node's lru count */
- list_lru_putback(list_lru, &entry->lru, nid, memcg);
- spin_unlock(lock);
-
- lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
- /* increment the protection area to account for the LRU rotation. */
- atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
- rcu_read_unlock();
-}
-
/*********************************
* rbtree functions
**********************************/
@@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
{
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
bool *encountered_page_in_swapcache = (bool *)arg;
- struct zswap_tree *tree;
- pgoff_t swpoffset;
+ swp_entry_t swpentry;
enum lru_status ret = LRU_REMOVED_RETRY;
int writeback_result;
+ /*
+ * Rotate the entry to the tail before unlocking the LRU,
+ * so that in case of an invalidation race concurrent
+ * reclaimers don't waste their time on it.
+ *
+ * If writeback succeeds, or failure is due to the entry
+ * being invalidated by the swap subsystem, the invalidation
+ * will unlink and free it.
+ *
+ * Temporary failures, where the same entry should be tried
+ * again immediately, almost never happen for this shrinker.
+ * We don't do any trylocking; -ENOMEM comes closest,
+ * but that's extremely rare and doesn't happen spuriously
+ * either. Don't bother distinguishing this case.
+ *
+ * But since they do exist in theory, the entry cannot just
+ * be unlinked, or we could leak it. Hence, rotate.
+ */
+ list_move_tail(item, &l->list);
+
/*
* Once the lru lock is dropped, the entry might get freed. The
- * swpoffset is copied to the stack, and entry isn't deref'd again
+ * swpentry is copied to the stack, and entry isn't deref'd again
* until the entry is verified to still be alive in the tree.
*/
- swpoffset = swp_offset(entry->swpentry);
- tree = swap_zswap_tree(entry->swpentry);
- list_lru_isolate(l, item);
+ swpentry = entry->swpentry;
+
/*
* It's safe to drop the lock here because we return either
* LRU_REMOVED_RETRY or LRU_RETRY.
*/
spin_unlock(lock);
- /* Check for invalidate() race */
- spin_lock(&tree->lock);
- if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
- goto unlock;
-
- /* Hold a reference to prevent a free during writeback */
- zswap_entry_get(entry);
- spin_unlock(&tree->lock);
-
- writeback_result = zswap_writeback_entry(entry, tree);
+ writeback_result = zswap_writeback_entry(entry, swpentry);
- spin_lock(&tree->lock);
if (writeback_result) {
zswap_reject_reclaim_fail++;
- zswap_lru_putback(&entry->pool->list_lru, entry);
ret = LRU_RETRY;
/*
@@ -903,27 +889,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
*/
if (writeback_result == -EEXIST && encountered_page_in_swapcache)
*encountered_page_in_swapcache = true;
-
- goto put_unlock;
+ } else {
+ zswap_written_back_pages++;
}
- zswap_written_back_pages++;
-
- if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWPWB);
- count_vm_event(ZSWPWB);
- /*
- * Writeback started successfully, the page now belongs to the
- * swapcache. Drop the entry from zswap - unless invalidate already
- * took it out while we had the tree->lock released for IO.
- */
- zswap_invalidate_entry(tree, entry);
-
-put_unlock:
- /* Drop local reference */
- zswap_entry_put(entry);
-unlock:
- spin_unlock(&tree->lock);
spin_lock(lock);
return ret;
}
@@ -1408,9 +1377,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
* freed.
*/
static int zswap_writeback_entry(struct zswap_entry *entry,
- struct zswap_tree *tree)
+ swp_entry_t swpentry)
{
- swp_entry_t swpentry = entry->swpentry;
+ struct zswap_tree *tree;
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1426,9 +1395,11 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -ENOMEM;
/*
- * Found an existing folio, we raced with load/swapin. We generally
- * writeback cold folios from zswap, and swapin means the folio just
- * became hot. Skip this folio and let the caller find another one.
+ * Found an existing folio, we raced with swapin or concurrent
+ * shrinker. We generally writeback cold folios from zswap, and
+ * swapin means the folio just became hot, so skip this folio.
+ * For unlikely concurrent shrinker case, it will be unlinked
+ * and freed when invalidated by the concurrent shrinker anyway.
*/
if (!folio_was_allocated) {
folio_put(folio);
@@ -1442,18 +1413,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* backs (our zswap_entry reference doesn't prevent that), to
* avoid overwriting a new swap folio with old compressed data.
*/
+ tree = swap_zswap_tree(swpentry);
spin_lock(&tree->lock);
- if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
+ if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
spin_unlock(&tree->lock);
delete_from_swap_cache(folio);
folio_unlock(folio);
folio_put(folio);
return -ENOMEM;
}
+
+ /* Safe to deref entry after the entry is verified above. */
+ zswap_entry_get(entry);
spin_unlock(&tree->lock);
__zswap_load(entry, &folio->page);
+ count_vm_event(ZSWPWB);
+ if (entry->objcg)
+ count_objcg_event(entry->objcg, ZSWPWB);
+
+ spin_lock(&tree->lock);
+ zswap_invalidate_entry(tree, entry);
+ zswap_entry_put(entry);
+ spin_unlock(&tree->lock);
+
/* folio is up to date */
folio_mark_uptodate(folio);
--
b4 0.10.1
* [PATCH v2 3/3] mm/list_lru: remove list_lru_putback()
2024-01-28 13:28 [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
@ 2024-01-28 13:28 ` Chengming Zhou
2024-01-28 15:52 ` Johannes Weiner
2024-01-30 0:34 ` Yosry Ahmed
From: Chengming Zhou @ 2024-01-28 13:28 UTC (permalink / raw)
To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li
Cc: Chengming Zhou, Johannes Weiner, Nhat Pham, linux-kernel,
linux-mm
Since its only user, zswap_lru_putback(), has gone, remove
list_lru_putback() too.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
include/linux/list_lru.h | 16 ----------------
mm/list_lru.c | 14 --------------
mm/zswap.c | 2 +-
3 files changed, 1 insertion(+), 31 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index c679e6b293c4..f2882a820690 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -168,22 +168,6 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
void list_lru_isolate(struct list_lru_one *list, struct list_head *item);
void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
struct list_head *head);
-/**
- * list_lru_putback: undo list_lru_isolate
- * @lru: the lru pointer.
- * @item: the item to put back.
- * @nid: the node id of the sublist to put the item back to.
- * @memcg: the cgroup of the sublist to put the item back to.
- *
- * Put back an isolated item into its original LRU. Note that unlike
- * list_lru_add, this does not increment the node LRU count (as
- * list_lru_isolate does not originally decrement this count).
- *
- * Since we might have dropped the LRU lock in between, recompute list_lru_one
- * from the node's id and memcg.
- */
-void list_lru_putback(struct list_lru *lru, struct list_head *item, int nid,
- struct mem_cgroup *memcg);
typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
struct list_lru_one *list, spinlock_t *lock, void *cb_arg);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 158781d1d3c2..61f3b6b1134f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -162,20 +162,6 @@ void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
}
EXPORT_SYMBOL_GPL(list_lru_isolate_move);
-void list_lru_putback(struct list_lru *lru, struct list_head *item, int nid,
- struct mem_cgroup *memcg)
-{
- struct list_lru_one *list =
- list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
-
- if (list_empty(item)) {
- list_add_tail(item, &list->list);
- if (!list->nr_items++)
- set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
- }
-}
-EXPORT_SYMBOL_GPL(list_lru_putback);
-
unsigned long list_lru_count_one(struct list_lru *lru,
int nid, struct mem_cgroup *memcg)
{
diff --git a/mm/zswap.c b/mm/zswap.c
index f5cb5a46e4d7..de68a5928527 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -411,7 +411,7 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
* 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
* new entry will be added directly to memcg's parent's list_lru.
*
- * Similar reasoning holds for list_lru_del() and list_lru_putback().
+ * Similar reasoning holds for list_lru_del().
*/
rcu_read_lock();
memcg = mem_cgroup_from_entry(entry);
--
b4 0.10.1
* Re: [PATCH v2 3/3] mm/list_lru: remove list_lru_putback()
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
@ 2024-01-28 15:52 ` Johannes Weiner
2024-01-28 19:45 ` Nhat Pham
2024-01-30 0:34 ` Yosry Ahmed
From: Johannes Weiner @ 2024-01-28 15:52 UTC (permalink / raw)
To: Chengming Zhou
Cc: Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On Sun, Jan 28, 2024 at 01:28:51PM +0000, Chengming Zhou wrote:
> Since the only user zswap_lru_putback() has gone, remove
> list_lru_putback() too.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH v2 3/3] mm/list_lru: remove list_lru_putback()
2024-01-28 15:52 ` Johannes Weiner
@ 2024-01-28 19:45 ` Nhat Pham
From: Nhat Pham @ 2024-01-28 19:45 UTC (permalink / raw)
To: Johannes Weiner
Cc: Chengming Zhou, Yosry Ahmed, Andrew Morton, Chris Li,
linux-kernel, linux-mm
On Sun, Jan 28, 2024 at 7:52 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Jan 28, 2024 at 01:28:51PM +0000, Chengming Zhou wrote:
> > Since the only user zswap_lru_putback() has gone, remove
> > list_lru_putback() too.
> >
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
* Re: [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
@ 2024-01-30 0:09 ` Yosry Ahmed
2024-01-30 0:12 ` Nhat Pham
From: Yosry Ahmed @ 2024-01-30 0:09 UTC (permalink / raw)
To: Chengming Zhou
Cc: Johannes Weiner, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On Sun, Jan 28, 2024 at 01:28:49PM +0000, Chengming Zhou wrote:
> LRU_SKIP can only be returned if we don't ever dropped lru lock, or
> we need to return LRU_RETRY to restart from the head of lru list.
>
> Otherwise, the iteration might continue from a cursor position that
> was freed while the locks were dropped.
Does this warrant a stable backport?
>
> Actually we may need to introduce another LRU_STOP to really terminate
> the ongoing shrinking scan process, when we encounter a warm page
> already in the swap cache. The current list_lru implementation
> doesn't have this function to early break from __list_lru_walk_one.
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
* Re: [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock
2024-01-30 0:09 ` Yosry Ahmed
@ 2024-01-30 0:12 ` Nhat Pham
2024-01-30 0:58 ` Yosry Ahmed
From: Nhat Pham @ 2024-01-30 0:12 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, Chris Li,
linux-kernel, linux-mm
On Mon, Jan 29, 2024 at 4:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Jan 28, 2024 at 01:28:49PM +0000, Chengming Zhou wrote:
> > LRU_SKIP can only be returned if we don't ever dropped lru lock, or
> > we need to return LRU_RETRY to restart from the head of lru list.
> >
> > Otherwise, the iteration might continue from a cursor position that
> > was freed while the locks were dropped.
>
> Does this warrant a stable backport?
IIUC, the zswap shrinker was merged in 6.8, and we're still in the RCs
for 6.8, right? If this patch goes into 6.8 then there's no need?
Otherwise, yeah, it should go into 6.8 stable IMHO.
>
> >
> > Actually we may need to introduce another LRU_STOP to really terminate
> > the ongoing shrinking scan process, when we encounter a warm page
> > already in the swap cache. The current list_lru implementation
> > doesn't have this function to early break from __list_lru_walk_one.
> >
> > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
* Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
@ 2024-01-30 0:22 ` Yosry Ahmed
2024-01-30 2:30 ` Chengming Zhou
From: Yosry Ahmed @ 2024-01-30 0:22 UTC (permalink / raw)
To: Chengming Zhou
Cc: Johannes Weiner, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>
> CPU1                           CPU2
> shrink_memcg_cb                swap_off
>   list_lru_isolate               zswap_invalidate
>                                    zswap_swapoff
>                                      kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and if returned with folio locked we can reference the tree safely.
> Then we can check invalidate race with tree lock, the following things
> is much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the beginning. And it will be unlinked and freed when invalidated
> if writeback success.
You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)
>
> Another change is we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
> with swapin or concurrent shrinker action, since swapin already
> have memcg nr_zswap_protected updated, don't need double counts here.
> For concurrent shrinker, the folio will be writeback and freed anyway.
> -ENOMEM case is extremely rare and doesn't happen spuriously either,
> so don't bother distinguishing this case.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
> 1 file changed, 49 insertions(+), 65 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> zpool_get_type((p)->zpools[0]))
>
> static int zswap_writeback_entry(struct zswap_entry *entry,
> - struct zswap_tree *tree);
> + swp_entry_t swpentry);
> static int zswap_pool_get(struct zswap_pool *pool);
> static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> rcu_read_unlock();
> }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> - struct zswap_entry *entry)
> -{
> - int nid = entry_to_nid(entry);
> - spinlock_t *lock = &list_lru->node[nid].lock;
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
> -
> - rcu_read_lock();
> - memcg = mem_cgroup_from_entry(entry);
> - spin_lock(lock);
> - /* we cannot use list_lru_add here, because it increments node's lru count */
> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
> - spin_unlock(lock);
> -
> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> - /* increment the protection area to account for the LRU rotation. */
> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> - rcu_read_unlock();
> -}
> -
> /*********************************
> * rbtree functions
> **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> bool *encountered_page_in_swapcache = (bool *)arg;
> - struct zswap_tree *tree;
> - pgoff_t swpoffset;
> + swp_entry_t swpentry;
> enum lru_status ret = LRU_REMOVED_RETRY;
> int writeback_result;
>
> + /*
> + * Rotate the entry to the tail before unlocking the LRU,
> + * so that in case of an invalidation race concurrent
> + * reclaimers don't waste their time on it.
> + *
> + * If writeback succeeds, or failure is due to the entry
> + * being invalidated by the swap subsystem, the invalidation
> + * will unlink and free it.
> + *
> + * Temporary failures, where the same entry should be tried
> + * again immediately, almost never happen for this shrinker.
> + * We don't do any trylocking; -ENOMEM comes closest,
> + * but that's extremely rare and doesn't happen spuriously
> + * either. Don't bother distinguishing this case.
> + *
> + * But since they do exist in theory, the entry cannot just
> + * be unlinked, or we could leak it. Hence, rotate.
The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance with added
docs to shrink_memcg_cb().
We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
* Re: [PATCH v2 3/3] mm/list_lru: remove list_lru_putback()
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
2024-01-28 15:52 ` Johannes Weiner
@ 2024-01-30 0:34 ` Yosry Ahmed
From: Yosry Ahmed @ 2024-01-30 0:34 UTC (permalink / raw)
To: Chengming Zhou
Cc: Johannes Weiner, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On Sun, Jan 28, 2024 at 01:28:51PM +0000, Chengming Zhou wrote:
> Since the only user zswap_lru_putback() has gone, remove
> list_lru_putback() too.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
* Re: [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock
2024-01-30 0:12 ` Nhat Pham
@ 2024-01-30 0:58 ` Yosry Ahmed
From: Yosry Ahmed @ 2024-01-30 0:58 UTC (permalink / raw)
To: Nhat Pham
Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, Chris Li,
linux-kernel, linux-mm
On Mon, Jan 29, 2024 at 04:12:54PM -0800, Nhat Pham wrote:
> On Mon, Jan 29, 2024 at 4:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Sun, Jan 28, 2024 at 01:28:49PM +0000, Chengming Zhou wrote:
> > > LRU_SKIP can only be returned if we don't ever dropped lru lock, or
> > > we need to return LRU_RETRY to restart from the head of lru list.
> > >
> > > Otherwise, the iteration might continue from a cursor position that
> > > was freed while the locks were dropped.
> >
> > Does this warrant a stable backport?
>
> IUC, the zswap shrinker was merged in 6.8, and we're still in the RC's
> for 6.8, right? If this patch goes into 6.8 then no need?
> Otherwise, yeah it should go to 6.8 stable IMHO.
For some reason I thought the shrinker went into v6.7, my bad. Then I
guess it should only go into v6.8.
* Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
2024-01-30 0:22 ` Yosry Ahmed
@ 2024-01-30 2:30 ` Chengming Zhou
2024-01-30 3:17 ` Johannes Weiner
From: Chengming Zhou @ 2024-01-30 2:30 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Johannes Weiner, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On 2024/1/30 08:22, Yosry Ahmed wrote:
> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>>
>> CPU1                           CPU2
>> shrink_memcg_cb                swap_off
>>   list_lru_isolate               zswap_invalidate
>>                                    zswap_swapoff
>>                                      kfree(tree)
>>   // UAF
>>   spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and if returned with folio locked we can reference the tree safely.
>> Then we can check invalidate race with tree lock, the following things
>> is much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>> in the beginning. And it will be unlinked and freed when invalidated
>> if writeback success.
>
> You are also removing one redundant lookup from the zswap writeback path
> to check for the invalidation race, and simplifying the code. Give
> yourself full credit :)
Ah right, forgot to mention it, I will add this part in the commit message.
Thanks for your reminder!
>
>>
>> Another change is we don't update the memcg nr_zswap_protected in
>> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
>> with swapin or concurrent shrinker action, since swapin already
>> have memcg nr_zswap_protected updated, don't need double counts here.
>> For concurrent shrinker, the folio will be writeback and freed anyway.
>> -ENOMEM case is extremely rare and doesn't happen spuriously either,
>> so don't bother distinguishing this case.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Nhat Pham <nphamcs@gmail.com>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>> 1 file changed, 49 insertions(+), 65 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..f5cb5a46e4d7 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>> zpool_get_type((p)->zpools[0]))
>>
>> static int zswap_writeback_entry(struct zswap_entry *entry,
>> - struct zswap_tree *tree);
>> + swp_entry_t swpentry);
>> static int zswap_pool_get(struct zswap_pool *pool);
>> static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>> rcu_read_unlock();
>> }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> - struct zswap_entry *entry)
>> -{
>> - int nid = entry_to_nid(entry);
>> - spinlock_t *lock = &list_lru->node[nid].lock;
>> - struct mem_cgroup *memcg;
>> - struct lruvec *lruvec;
>> -
>> - rcu_read_lock();
>> - memcg = mem_cgroup_from_entry(entry);
>> - spin_lock(lock);
>> - /* we cannot use list_lru_add here, because it increments node's lru count */
>> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> - spin_unlock(lock);
>> -
>> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> - /* increment the protection area to account for the LRU rotation. */
>> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> - rcu_read_unlock();
>> -}
>> -
>> /*********************************
>> * rbtree functions
>> **********************************/
>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * Rotate the entry to the tail before unlocking the LRU,
>> + * so that in case of an invalidation race concurrent
>> + * reclaimers don't waste their time on it.
>> + *
>> + * If writeback succeeds, or failure is due to the entry
>> + * being invalidated by the swap subsystem, the invalidation
>> + * will unlink and free it.
>> + *
>> + * Temporary failures, where the same entry should be tried
>> + * again immediately, almost never happen for this shrinker.
>> + * We don't do any trylocking; -ENOMEM comes closest,
>> + * but that's extremely rare and doesn't happen spuriously
>> + * either. Don't bother distinguishing this case.
>> + *
>> + * But since they do exist in theory, the entry cannot just
>> + * be unlinked, or we could leak it. Hence, rotate.
>
> The entry cannot be unlinked because we cannot get a ref on it without
> holding the tree lock, and we cannot deref the tree before we acquire a
> swap cache ref in zswap_writeback_entry() -- or if
> zswap_writeback_entry() fails. This should be called out explicitly
> somewhere. Perhaps we can describe this whole deref dance with added
> docs to shrink_memcg_cb().
Maybe we should add some comments before or after zswap_writeback_entry()?
Or do you have some suggestions? I'm not good at this. :)
>
> We could also use a comment in zswap_writeback_entry() (or above it) to
> state that we only deref the tree *after* we get the swapcache ref.
I just noticed there are some comments in zswap_writeback_entry();
should we add more comments here?
/*
* folio is locked, and the swapcache is now secured against
* concurrent swapping to and from the slot. Verify that the
* swap entry hasn't been invalidated and recycled behind our
* backs (our zswap_entry reference doesn't prevent that), to
* avoid overwriting a new swap folio with old compressed data.
*/
* Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
2024-01-30 2:30 ` Chengming Zhou
@ 2024-01-30 3:17 ` Johannes Weiner
2024-01-30 3:31 ` Chengming Zhou
From: Johannes Weiner @ 2024-01-30 3:17 UTC (permalink / raw)
To: Chengming Zhou
Cc: Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >> {
> >> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >> bool *encountered_page_in_swapcache = (bool *)arg;
> >> - struct zswap_tree *tree;
> >> - pgoff_t swpoffset;
> >> + swp_entry_t swpentry;
> >> enum lru_status ret = LRU_REMOVED_RETRY;
> >> int writeback_result;
> >>
> >> + /*
> >> + * Rotate the entry to the tail before unlocking the LRU,
> >> + * so that in case of an invalidation race concurrent
> >> + * reclaimers don't waste their time on it.
> >> + *
> >> + * If writeback succeeds, or failure is due to the entry
> >> + * being invalidated by the swap subsystem, the invalidation
> >> + * will unlink and free it.
> >> + *
> >> + * Temporary failures, where the same entry should be tried
> >> + * again immediately, almost never happen for this shrinker.
> >> + * We don't do any trylocking; -ENOMEM comes closest,
> >> + * but that's extremely rare and doesn't happen spuriously
> >> + * either. Don't bother distinguishing this case.
> >> + *
> >> + * But since they do exist in theory, the entry cannot just
> >> + * be unlinked, or we could leak it. Hence, rotate.
> >
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
>
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)
I agree with the suggestion of a central point to document this.
How about something like this:
/*
* As soon as we drop the LRU lock, the entry can be freed by
* a concurrent invalidation. This means the following:
*
* 1. We extract the swp_entry_t to the stack, allowing
* zswap_writeback_entry() to pin the swap entry and
* then validate the zswap entry against that swap entry's
* tree using pointer value comparison. Only when that
* is successful can the entry be dereferenced.
*
* 2. Usually, objects are taken off the LRU for reclaim. In
* this case this isn't possible, because if reclaim fails
* for whatever reason, we have no means of knowing if the
* entry is alive to put it back on the LRU.
*
* So rotate it before dropping the lock. If the entry is
* written back or invalidated, the free path will unlink
* it. For failures, rotation is the right thing as well.
*
* Temporary failures, where the same entry should be tried
* again immediately, almost never happen for this shrinker.
* We don't do any trylocking; -ENOMEM comes closest,
* but that's extremely rare and doesn't happen spuriously
* either. Don't bother distinguishing this case.
*/
> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
>
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
>
> /*
> * folio is locked, and the swapcache is now secured against
> * concurrent swapping to and from the slot. Verify that the
> * swap entry hasn't been invalidated and recycled behind our
> * backs (our zswap_entry reference doesn't prevent that), to
> * avoid overwriting a new swap folio with old compressed data.
> */
The bit in () is now stale, since we're not even holding a ref ;)
Otherwise, a brief note that entry can not be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.
Can I ask a favor, though?
For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error prone to rebase. I'd like
to avoid that if at all possible.
* Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
2024-01-30 3:17 ` Johannes Weiner
@ 2024-01-30 3:31 ` Chengming Zhou
From: Chengming Zhou @ 2024-01-30 3:31 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li, linux-kernel,
linux-mm
On 2024/1/30 11:17, Johannes Weiner wrote:
> On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
>> On 2024/1/30 08:22, Yosry Ahmed wrote:
>>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>>> {
>>>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>>> bool *encountered_page_in_swapcache = (bool *)arg;
>>>> - struct zswap_tree *tree;
>>>> - pgoff_t swpoffset;
>>>> + swp_entry_t swpentry;
>>>> enum lru_status ret = LRU_REMOVED_RETRY;
>>>> int writeback_result;
>>>>
>>>> + /*
>>>> + * Rotate the entry to the tail before unlocking the LRU,
>>>> + * so that in case of an invalidation race concurrent
>>>> + * reclaimers don't waste their time on it.
>>>> + *
>>>> + * If writeback succeeds, or failure is due to the entry
>>>> + * being invalidated by the swap subsystem, the invalidation
>>>> + * will unlink and free it.
>>>> + *
>>>> + * Temporary failures, where the same entry should be tried
>>>> + * again immediately, almost never happen for this shrinker.
>>>> + * We don't do any trylocking; -ENOMEM comes closest,
>>>> + * but that's extremely rare and doesn't happen spuriously
>>>> + * either. Don't bother distinguishing this case.
>>>> + *
>>>> + * But since they do exist in theory, the entry cannot just
>>>> + * be unlinked, or we could leak it. Hence, rotate.
>>>
>>> The entry cannot be unlinked because we cannot get a ref on it without
>>> holding the tree lock, and we cannot deref the tree before we acquire a
>>> swap cache ref in zswap_writeback_entry() -- or if
>>> zswap_writeback_entry() fails. This should be called out explicitly
>>> somewhere. Perhaps we can describe this whole deref dance with added
>>> docs to shrink_memcg_cb().
>>
>> Maybe we should add some comments before or after zswap_writeback_entry()?
>> Or do you have some suggestions? I'm not good at this. :)
>
> I agree with the suggestion of a central point to document this.
>
> How about something like this:
>
> /*
> * As soon as we drop the LRU lock, the entry can be freed by
> * a concurrent invalidation. This means the following:
> *
> * 1. We extract the swp_entry_t to the stack, allowing
> * zswap_writeback_entry() to pin the swap entry and
> * then validate the zwap entry against that swap entry's
> * tree using pointer value comparison. Only when that
> * is successful can the entry be dereferenced.
> *
> * 2. Usually, objects are taken off the LRU for reclaim. In
> * this case this isn't possible, because if reclaim fails
> * for whatever reason, we have no means of knowing if the
> * entry is alive to put it back on the LRU.
> *
> * So rotate it before dropping the lock. If the entry is
> * written back or invalidated, the free path will unlink
> * it. For failures, rotation is the right thing as well.
> *
> * Temporary failures, where the same entry should be tried
> * again immediately, almost never happen for this shrinker.
> * We don't do any trylocking; -ENOMEM comes closest,
> * but that's extremely rare and doesn't happen spuriously
> * either. Don't bother distinguishing this case.
> */
>
Thanks! I think this documentation is good enough.
>>> We could also use a comment in zswap_writeback_entry() (or above it) to
>>> state that we only deref the tree *after* we get the swapcache ref.
>>
>> I just notice there are some comments in zswap_writeback_entry(), should
>> we add more comments here?
>>
>> /*
>> * folio is locked, and the swapcache is now secured against
>> * concurrent swapping to and from the slot. Verify that the
>> * swap entry hasn't been invalidated and recycled behind our
>> * backs (our zswap_entry reference doesn't prevent that), to
>> * avoid overwriting a new swap folio with old compressed data.
>> */
>
> The bit in () is now stale, since we're not even holding a ref ;)
Right.
>
> Otherwise, a brief note that entry can not be dereferenced until
> validation would be helpful in zswap_writeback_entry(). The core of
Ok.
> the scheme I'd probably describe in shrink_memcg_cb(), though.
>
> Can I ask a favor, though?
>
> For non-critical updates to this patch, can you please make them
> follow-up changes? I just sent out 20 cleanup patches on top of this
> patch which would be super painful and error prone to rebase. I'd like
> to avoid that if at all possible.
Ok, so these comment changes should be made on top of your cleanup
series and sent as a follow-up patch.
Thanks.