From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Hugh Dickins <hughd@google.com>, Yu Zhao <yuzhao@google.com>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<akpm@linux-foundation.org>, <willy@infradead.org>,
<david@redhat.com>, <ryan.roberts@arm.com>, <shy828301@gmail.com>
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio
Date: Fri, 21 Jul 2023 09:12:18 +0800
Message-ID: <c8ea2617-df48-a1cf-e910-71eeba353d67@intel.com>
In-Reply-To: <CAJD7tkbuU9Op_TmUET9N+Mug=AS7N3S16tZifVajVBL0yaYv4w@mail.gmail.com>
On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@google.com> wrote:
>>>>
>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>> Could this also happen with a normal 4K page? I mean, when a user tries to munlock
>>>>>>>>>>>>>> a normal 4K page while that page is isolated, does it become an unevictable page?
>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>
>>>>>>>>>>>>> cpu1                            cpu2
>>>>>>>>>>>>>                                 isolate folio
>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>>                                 putback folio // add to unevictable list
>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>>                                    folio_set_lru()
>>>>> Let's wait for the response from Hugh and Yu. :)
>>>>
>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>> fails.
>>>>
>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>> so few places try to isolate from the unevictable "list".)
>>>>
>>>> I forget what my order of development was, but it's likely that I first
>>>> wrote the version for our own internal kernel - which used our original
>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>
>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>> protect against memcg movement in this case.
>>>
>>>>
>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
>>>> its implication when adapting __munlock_page() for upstream.
>>>>
>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>>>> spin waiting for PG_lru here?
>>>
>>> +Matthew Wilcox
>>>
>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>> folio_set_lru() before checking folio_evictable(). While this is
>>> probably extraneous since folio_batch_move_lru() will set it again
>>> afterwards, it's probably harmless given that the lruvec lock is held
>>> throughout (so no one can complete the folio isolation anyway), and
>>> given that there were no problems introduced by this extra
>>> folio_set_lru() as far as I can tell.
>> After checking the related code: yes. It looks fine to move folio_set_lru()
>> before the if (folio_evictable(folio)) check in lru_add_fn(), because the
>> lru lock is held.
>>
>>>
>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>> ("mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()") to restore
>>> the strict ordering between manipulating PG_lru and PG_mlocked, I suppose
>>> we can get away without having to spin. Again, that would only be possible
>>> if reworking mlock_count [1] is acceptable. Otherwise, we can't clear
>>> PG_mlocked before PG_lru in __munlock_folio().
>> What about the following change, which moves the mlocked operation before
>> the lru check in __munlock_folio()?
>
> It seems correct to me on a high level, but I think there is a subtle problem:
>
> We clear PG_mlocked before trying to isolate to make sure that if
> someone already has the folio isolated they will put it back on an
> evictable list, then if we are able to isolate the folio ourselves and
> find that the mlock_count is > 0, we set PG_mlocked again.
>
> There is a small window where PG_mlocked might be temporarily cleared
> but the folio is not actually munlocked (i.e. we don't update the
> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> may find VM_LOCKED in a different vma, and call mlock_folio(). In
> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> the folio was already mlocked. This can cause MLOCK stats to be
> unbalanced (increments more than decrements), no?
It looks like NR_MLOCK is always tied to the PG_mlocked bit, so I don't
think it can become unbalanced.

Let's say:
  mlock_folio()                 NR_MLOCK increases, mlocked set
  mlock_folio()                 NR_MLOCK unchanged, as the folio is already mlocked
  __munlock_folio() with isolated folio
                                NR_MLOCK decreases (to 0), mlocked cleared
  folio_putback_lru()
  reclaim calls mlock_folio()   NR_MLOCK increases, mlocked set
  munlock_folio()               NR_MLOCK decreases (to 0), mlocked cleared
  munlock_folio()               NR_MLOCK unchanged, as mlocked is not set
Regards
Yin, Fengwei
>
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 0a0c996c5c21..514f0d5bfbfd 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>>  static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>>  {
>>  	int nr_pages = folio_nr_pages(folio);
>> -	bool isolated = false;
>> +	bool isolated = false, mlocked = true;
>> +
>> +	mlocked = folio_test_clear_mlocked(folio);
>>
>>  	if (!folio_test_clear_lru(folio))
>>  		goto munlock;
>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>>  		/* Then mlock_count is maintained, but might undercount */
>>  		if (folio->mlock_count)
>>  			folio->mlock_count--;
>> -		if (folio->mlock_count)
>> +		if (folio->mlock_count) {
>> +			if (mlocked)
>> +				folio_set_mlocked(folio);
>>  			goto out;
>> +		}
>>  	}
>>  	/* else assume that was the last mlock: reclaim will fix it if not */
>>
>>  munlock:
>> -	if (folio_test_clear_mlocked(folio)) {
>> +	if (mlocked) {
>>  		__zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>>  		if (isolated || !folio_test_unevictable(folio))
>>  			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>
>>
>>>
>>> I am not saying this is necessarily better than spinning, just a note
>>> (and perhaps selfishly making [1] more appealing ;)).
>>>
>>> [1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/
>>>
>>>>
>>>> Hugh