From: Chengming Zhou <chengming.zhou@linux.dev>
To: Nhat Pham <nphamcs@gmail.com>
Cc: hannes@cmpxchg.org, yosryahmed@google.com,
akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Chengming Zhou <zhouchengming@bytedance.com>
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
Date: Sat, 27 Jan 2024 23:12:52 +0800 [thread overview]
Message-ID: <ec428b7f-7aa4-4da5-885f-9c6de0263a55@linux.dev> (raw)
In-Reply-To: <CAKEwX=OP_32757xb6ogZ5OaDk_6bhkm--F4akhQ-tj_WTzA4CA@mail.gmail.com>
On 2024/1/27 03:50, Nhat Pham wrote:
> On Fri, Jan 26, 2024 at 12:32 AM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>>
>> CPU1 CPU2
>> shrink_memcg_cb swap_off
>> list_lru_isolate zswap_invalidate
>> zswap_swapoff
>> kfree(tree)
>> // UAF
>> spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and return with folio locked, after which we can reference the tree.
>> Then check invalidate race with tree lock, the following things is
>> much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>
> I added list_lru_putback to the list_lru API specifically for this use
> case (zswap_lru_putback()). Now that we no longer need it, maybe we
> can also remove this as well (assuming no-one else is using this?).
>
> This can be done in a separate patch though.
Right, I can append a patch to remove it since no other users.
>
>> in the LRU list so concurrent reclaimers have little chance to see it.
>> Or it will be deleted from LRU list if writeback success.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here since
>> if we raced with swapin, memcg nr_zswap_protected has already been
>> updated in zswap_folio_swapin(). So not include this part for now.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> LGTM! This is quite elegant.
> Acked-by: Nhat Pham <nphamcs@gmail.com>
>
>> ---
>> mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>> 1 file changed, 31 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..fa2bdb7ec1d8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>> zpool_get_type((p)->zpools[0]))
>>
>> static int zswap_writeback_entry(struct zswap_entry *entry,
>> - struct zswap_tree *tree);
>> + swp_entry_t swpentry);
>> static int zswap_pool_get(struct zswap_pool *pool);
>> static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>> rcu_read_unlock();
>> }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> - struct zswap_entry *entry)
>> -{
>> - int nid = entry_to_nid(entry);
>> - spinlock_t *lock = &list_lru->node[nid].lock;
>> - struct mem_cgroup *memcg;
>> - struct lruvec *lruvec;
>> -
>> - rcu_read_lock();
>> - memcg = mem_cgroup_from_entry(entry);
>> - spin_lock(lock);
>> - /* we cannot use list_lru_add here, because it increments node's lru count */
>> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> - spin_unlock(lock);
>> -
>> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> - /* increment the protection area to account for the LRU rotation. */
>> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> - rcu_read_unlock();
>> -}
>> -
>> /*********************************
>> * rbtree functions
>> **********************************/
>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * First rotate to the tail of lru list before unlocking lru lock,
>> + * so the concurrent reclaimers have little chance to see it.
>> + * It will be deleted from the lru list if writeback success.
>> + */
>> + list_move_tail(item, &l->list);
>> +
>> /*
>> * Once the lru lock is dropped, the entry might get freed. The
>> - * swpoffset is copied to the stack, and entry isn't deref'd again
>> + * swpentry is copied to the stack, and entry isn't deref'd again
>> * until the entry is verified to still be alive in the tree.
>> */
>> - swpoffset = swp_offset(entry->swpentry);
>> - tree = swap_zswap_tree(entry->swpentry);
>> - list_lru_isolate(l, item);
>
> nit: IIUC, now that we're no longer removing the entry from the
> list_lru, we protect against concurrent shrinking action via this
> check inside zswap_writeback_entry() too right:
>
> if (!folio_was_allocated) {
> folio_put(folio);
> return -EEXIST;
> }
>
> Maybe update the comment above it too?
* Found an existing folio, we raced with load/swapin. We generally
* writeback cold folios from zswap, and swapin means the folio just
* became hot. Skip this folio and let the caller find another one.
So now found an existing folio not only means load/swapin, and also concurrent
shrinking action. Yes, this comment needs to be changed a little.
Thanks.
next prev parent reply other threads:[~2024-01-27 15:13 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-26 8:30 [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock chengming.zhou
2024-01-26 8:30 ` [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff chengming.zhou
2024-01-26 15:31 ` Johannes Weiner
2024-01-26 19:31 ` Nhat Pham
2024-01-27 14:53 ` Chengming Zhou
2024-01-26 19:50 ` Nhat Pham
2024-01-27 15:12 ` Chengming Zhou [this message]
2024-01-26 14:28 ` [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Johannes Weiner
2024-01-27 14:33 ` Chengming Zhou
2024-01-26 18:01 ` Nhat Pham
2024-01-27 14:37 ` Chengming Zhou
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ec428b7f-7aa4-4da5-885f-9c6de0263a55@linux.dev \
--to=chengming.zhou@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=yosryahmed@google.com \
--cc=zhouchengming@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.