From: Yosry Ahmed <yosryahmed@google.com>
To: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Nhat Pham <nphamcs@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chriscli@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Date: Tue, 30 Jan 2024 00:22:14 +0000 [thread overview]
Message-ID: <ZbhBNkayw1hNlkpL@google.com> (raw)
In-Reply-To: <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com>
On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>
> CPU1 CPU2
> shrink_memcg_cb swap_off
> list_lru_isolate zswap_invalidate
> zswap_swapoff
> kfree(tree)
> // UAF
> spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and if returned with folio locked we can reference the tree safely.
> Then we can check invalidate race with tree lock, the following things
> is much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the beginning. And it will be unlinked and freed when invalidated
> if writeback success.
You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)
>
> Another change is we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
> with swapin or concurrent shrinker action, since swapin already
> have memcg nr_zswap_protected updated, don't need double counts here.
> For concurrent shrinker, the folio will be writeback and freed anyway.
> -ENOMEM case is extremely rare and doesn't happen spuriously either,
> so don't bother distinguishing this case.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
> 1 file changed, 49 insertions(+), 65 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> zpool_get_type((p)->zpools[0]))
>
> static int zswap_writeback_entry(struct zswap_entry *entry,
> - struct zswap_tree *tree);
> + swp_entry_t swpentry);
> static int zswap_pool_get(struct zswap_pool *pool);
> static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> rcu_read_unlock();
> }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> - struct zswap_entry *entry)
> -{
> - int nid = entry_to_nid(entry);
> - spinlock_t *lock = &list_lru->node[nid].lock;
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
> -
> - rcu_read_lock();
> - memcg = mem_cgroup_from_entry(entry);
> - spin_lock(lock);
> - /* we cannot use list_lru_add here, because it increments node's lru count */
> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
> - spin_unlock(lock);
> -
> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> - /* increment the protection area to account for the LRU rotation. */
> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> - rcu_read_unlock();
> -}
> -
> /*********************************
> * rbtree functions
> **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> bool *encountered_page_in_swapcache = (bool *)arg;
> - struct zswap_tree *tree;
> - pgoff_t swpoffset;
> + swp_entry_t swpentry;
> enum lru_status ret = LRU_REMOVED_RETRY;
> int writeback_result;
>
> + /*
> + * Rotate the entry to the tail before unlocking the LRU,
> + * so that in case of an invalidation race concurrent
> + * reclaimers don't waste their time on it.
> + *
> + * If writeback succeeds, or failure is due to the entry
> + * being invalidated by the swap subsystem, the invalidation
> + * will unlink and free it.
> + *
> + * Temporary failures, where the same entry should be tried
> + * again immediately, almost never happen for this shrinker.
> + * We don't do any trylocking; -ENOMEM comes closest,
> + * but that's extremely rare and doesn't happen spuriously
> + * either. Don't bother distinguishing this case.
> + *
> + * But since they do exist in theory, the entry cannot just
> + * be unlinked, or we could leak it. Hence, rotate.
The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance with added
docs to shrink_memcg_cb().
We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
next prev parent reply other threads:[~2024-01-30 0:22 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-28 13:28 [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
2024-01-30 0:09 ` Yosry Ahmed
2024-01-30 0:12 ` Nhat Pham
2024-01-30 0:58 ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-30 0:22 ` Yosry Ahmed [this message]
2024-01-30 2:30 ` Chengming Zhou
2024-01-30 3:17 ` Johannes Weiner
2024-01-30 3:31 ` Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
2024-01-28 15:52 ` Johannes Weiner
2024-01-28 19:45 ` Nhat Pham
2024-01-30 0:34 ` Yosry Ahmed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZbhBNkayw1hNlkpL@google.com \
--to=yosryahmed@google.com \
--cc=akpm@linux-foundation.org \
--cc=chriscli@google.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=zhouchengming@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.