From: Yosry Ahmed <yosryahmed@google.com>
To: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Nhat Pham <nphamcs@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Chris Li <chriscli@google.com>,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Date: Tue, 30 Jan 2024 00:22:14 +0000	[thread overview]
Message-ID: <ZbhBNkayw1hNlkpL@google.com> (raw)
In-Reply-To: <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com>

On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
> 
> CPU1			CPU2
> shrink_memcg_cb		swap_off
>   list_lru_isolate	  zswap_invalidate
> 			  zswap_swapoff
> 			    kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
> 
> The problem is that the entry in the lru list can't protect the tree
> from being freed by swapoff, and the entry can also be invalidated and
> freed concurrently after we unlock the lru lock.
> 
> We can fix it by moving the swap cache allocation ahead of referencing
> the tree, then checking for the invalidation race under the tree lock;
> only after that can we safely deref the entry. Note that we can't deref
> the entry or the tree anymore after we unlock the folio, since we
> depend on the folio lock to hold off swapoff.
> 
> So this patch moves all tree and entry usage into
> zswap_writeback_entry(); we only use the copied swpentry on the stack
> to allocate the swap cache, and once that returns with the folio locked
> we can reference the tree safely. Then we check for the invalidation
> race under the tree lock; what follows is much the same as
> zswap_load().
> 
> Since we can't deref the entry after zswap_writeback_entry(), we can't
> use zswap_lru_putback() anymore; instead we rotate the entry to the
> tail of the LRU at the beginning. It will then be unlinked and freed
> by the invalidation if writeback succeeds.

You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)
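
For readers following along, the resulting ordering on the writeback
side looks roughly like this -- a condensed, untested sketch of the
control flow, not the exact patch. The swap cache allocation is folded
into a hypothetical swap_cache_alloc() helper here; the real code uses
__read_swap_cache_async(), whose argument list I'm leaving out:

static int zswap_writeback_entry(struct zswap_entry *entry,
				 swp_entry_t swpentry)
{
	struct zswap_tree *tree;
	struct folio *folio;
	bool folio_was_allocated;

	/*
	 * 1) Allocate the swap cache folio from the on-stack swpentry
	 * alone. No tree or entry deref yet, so a concurrent swapoff
	 * freeing the tree can't bite us here.
	 */
	folio = swap_cache_alloc(swpentry, &folio_was_allocated);
	if (!folio)
		return -ENOMEM;
	if (!folio_was_allocated) {
		/* Raced with swapin or a concurrent shrinker. */
		folio_put(folio);
		return -EEXIST;
	}

	/*
	 * 2) The locked folio pins the swap entry and holds off
	 * swapoff, so the tree can be deref'd safely from here on.
	 */
	tree = swap_zswap_tree(swpentry);
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		/* Invalidation race: the entry was freed under us. */
		spin_unlock(&tree->lock);
		delete_from_swap_cache(folio);
		folio_unlock(folio);
		folio_put(folio);
		return -ENOMEM;
	}
	spin_unlock(&tree->lock);

	/* ... decompress into the folio and kick off writeback ... */
	return 0;
}

The key invariant being that nothing touches the tree until the locked
folio pins the swap entry.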

> 
> Another change is that we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. The -EEXIST case means we raced
> with swapin or a concurrent shrinker action; since swapin has already
> updated the memcg nr_zswap_protected, we don't need to double-count
> here. For a concurrent shrinker, the folio will be written back and
> freed anyway. The -ENOMEM case is extremely rare and doesn't happen
> spuriously either, so don't bother distinguishing it.
> 
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>  1 file changed, 49 insertions(+), 65 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>  		 zpool_get_type((p)->zpools[0]))
>  
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree);
> +				 swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>  
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  	rcu_read_unlock();
>  }
>  
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -		struct zswap_entry *entry)
> -{
> -	int nid = entry_to_nid(entry);
> -	spinlock_t *lock = &list_lru->node[nid].lock;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_entry(entry);
> -	spin_lock(lock);
> -	/* we cannot use list_lru_add here, because it increments node's lru count */
> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -	spin_unlock(lock);
> -
> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -	/* increment the protection area to account for the LRU rotation. */
> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -	rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
>  
> +	/*
> +	 * Rotate the entry to the tail before unlocking the LRU,
> +	 * so that in case of an invalidation race concurrent
> +	 * reclaimers don't waste their time on it.
> +	 *
> +	 * If writeback succeeds, or failure is due to the entry
> +	 * being invalidated by the swap subsystem, the invalidation
> +	 * will unlink and free it.
> +	 *
> +	 * Temporary failures, where the same entry should be tried
> +	 * again immediately, almost never happen for this shrinker.
> +	 * We don't do any trylocking; -ENOMEM comes closest,
> +	 * but that's extremely rare and doesn't happen spuriously
> +	 * either. Don't bother distinguishing this case.
> +	 *
> +	 * But since they do exist in theory, the entry cannot just
> +	 * be unlinked, or we could leak it. Hence, rotate.

The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance in added
docs for shrink_memcg_cb().
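
Strawman wording for such a comment, for instance:

/*
 * Once the lru lock is dropped, the entry might be freed by an
 * invalidation, and the tree itself might be freed by a concurrent
 * swapoff. We can't take a reference on the entry here, since that
 * would require the tree lock; and we can't safely deref the tree
 * until zswap_writeback_entry() has pinned the swap entry via the
 * locked swap cache folio. In between, only the on-stack copy of
 * the swp_entry_t may be used.
 */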

We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
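
For instance (again just strawman wording):

/*
 * The caller must only pass in the swpentry copied to the stack;
 * neither the tree nor @entry may be dereferenced until the swap
 * cache folio is allocated and locked, because the folio lock is
 * what holds off swapoff and thus the freeing of the tree.
 */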

