From: Chengming Zhou <zhouchengming@bytedance.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>,
Nhat Pham <nphamcs@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chriscli@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Date: Tue, 30 Jan 2024 11:31:04 +0800 [thread overview]
Message-ID: <30e949b5-9455-4053-93ef-1ca0ceb10c3f@bytedance.com> (raw)
In-Reply-To: <20240130031727.GA780575@cmpxchg.org>
On 2024/1/30 11:17, Johannes Weiner wrote:
> On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
>> On 2024/1/30 08:22, Yosry Ahmed wrote:
>>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>>> {
>>>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>>> bool *encountered_page_in_swapcache = (bool *)arg;
>>>> - struct zswap_tree *tree;
>>>> - pgoff_t swpoffset;
>>>> + swp_entry_t swpentry;
>>>> enum lru_status ret = LRU_REMOVED_RETRY;
>>>> int writeback_result;
>>>>
>>>> + /*
>>>> + * Rotate the entry to the tail before unlocking the LRU,
>>>> + * so that in case of an invalidation race concurrent
>>>> + * reclaimers don't waste their time on it.
>>>> + *
>>>> + * If writeback succeeds, or failure is due to the entry
>>>> + * being invalidated by the swap subsystem, the invalidation
>>>> + * will unlink and free it.
>>>> + *
>>>> + * Temporary failures, where the same entry should be tried
>>>> + * again immediately, almost never happen for this shrinker.
>>>> + * We don't do any trylocking; -ENOMEM comes closest,
>>>> + * but that's extremely rare and doesn't happen spuriously
>>>> + * either. Don't bother distinguishing this case.
>>>> + *
>>>> + * But since they do exist in theory, the entry cannot just
>>>> + * be unlinked, or we could leak it. Hence, rotate.
>>>
>>> The entry cannot be unlinked because we cannot get a ref on it without
>>> holding the tree lock, and we cannot deref the tree before we acquire a
>>> swap cache ref in zswap_writeback_entry() -- or if
>>> zswap_writeback_entry() fails. This should be called out explicitly
>>> somewhere. Perhaps we can describe this whole deref dance with added
>>> docs to shrink_memcg_cb().
>>
>> Maybe we should add some comments before or after zswap_writeback_entry()?
>> Or do you have some suggestions? I'm not good at this. :)
>
> I agree with the suggestion of a central point to document this.
>
> How about something like this:
>
> /*
> * As soon as we drop the LRU lock, the entry can be freed by
> * a concurrent invalidation. This means the following:
> *
> * 1. We extract the swp_entry_t to the stack, allowing
> * zswap_writeback_entry() to pin the swap entry and
> * then validate the zwap entry against that swap entry's
> * tree using pointer value comparison. Only when that
> * is successful can the entry be dereferenced.
> *
> * 2. Usually, objects are taken off the LRU for reclaim. In
> * this case this isn't possible, because if reclaim fails
> * for whatever reason, we have no means of knowing if the
> * entry is alive to put it back on the LRU.
> *
> * So rotate it before dropping the lock. If the entry is
> * written back or invalidated, the free path will unlink
> * it. For failures, rotation is the right thing as well.
> *
> * Temporary failures, where the same entry should be tried
> * again immediately, almost never happen for this shrinker.
> * We don't do any trylocking; -ENOMEM comes closest,
> * but that's extremely rare and doesn't happen spuriously
> * either. Don't bother distinguishing this case.
> */
>
Thanks! I think this document is great enough.
>>> We could also use a comment in zswap_writeback_entry() (or above it) to
>>> state that we only deref the tree *after* we get the swapcache ref.
>>
>> I just notice there are some comments in zswap_writeback_entry(), should
>> we add more comments here?
>>
>> /*
>> * folio is locked, and the swapcache is now secured against
>> * concurrent swapping to and from the slot. Verify that the
>> * swap entry hasn't been invalidated and recycled behind our
>> * backs (our zswap_entry reference doesn't prevent that), to
>> * avoid overwriting a new swap folio with old compressed data.
>> */
>
> The bit in () is now stale, since we're not even holding a ref ;)
Right.
>
> Otherwise, a brief note that entry can not be dereferenced until
> validation would be helpful in zswap_writeback_entry(). The core of
Ok.
> the scheme I'd probably describe in shrink_memcg_cb(), though.
>
> Can I ask a favor, though?
>
> For non-critical updates to this patch, can you please make them
> follow-up changes? I just sent out 20 cleanup patches on top of this
> patch which would be super painful and error prone to rebase. I'd like
> to avoid that if at all possible.
Ok, so these comments changes should be changed on top of your cleanup series
and sent as a follow-up patch.
Thanks.
next prev parent reply other threads:[~2024-01-30 3:31 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-28 13:28 [PATCH v2 0/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
2024-01-30 0:09 ` Yosry Ahmed
2024-01-30 0:12 ` Nhat Pham
2024-01-30 0:58 ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-30 0:22 ` Yosry Ahmed
2024-01-30 2:30 ` Chengming Zhou
2024-01-30 3:17 ` Johannes Weiner
2024-01-30 3:31 ` Chengming Zhou [this message]
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
2024-01-28 15:52 ` Johannes Weiner
2024-01-28 19:45 ` Nhat Pham
2024-01-30 0:34 ` Yosry Ahmed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=30e949b5-9455-4053-93ef-1ca0ceb10c3f@bytedance.com \
--to=zhouchengming@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=chriscli@google.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).