From: Yin Fengwei <fengwei.yin@intel.com>
To: Yu Zhao <yuzhao@google.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<akpm@linux-foundation.org>, <minchan@kernel.org>,
<willy@infradead.org>, <david@redhat.com>, <ryan.roberts@arm.com>,
<shy828301@gmail.com>
Subject: Re: [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries
Date: Wed, 26 Jul 2023 14:21:36 +0800 [thread overview]
Message-ID: <0843fb4d-ab0b-2766-281c-ef32b6031dd7@intel.com> (raw)
In-Reply-To: <CAOUHufa9rFS-VjbCRG6KGjb4YKOZioH=dLdTyFLWqEFePoL+wQ@mail.gmail.com>
On 7/26/23 13:40, Yu Zhao wrote:
> On Tue, Jul 25, 2023 at 10:44 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 7/26/23 11:26, Yu Zhao wrote:
>>> On Tue, Jul 25, 2023 at 8:49 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>> On 7/25/23 13:55, Yu Zhao wrote:
>>>>> On Fri, Jul 21, 2023 at 3:41 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>> Currently, in function madvise_cold_or_pageout_pte_range(), the
>>>>>> young bit of pte/pmd is cleared notify subscripter.
>>>>>>
>>>>>> Using notify-able API to make sure the subscripter is signaled about
>>>>>> the young bit clearing.
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>> ---
>>>>>> mm/madvise.c | 18 ++----------------
>>>>>> 1 file changed, 2 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index f12933ebcc24..b236e201a738 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -403,14 +403,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> - if (pmd_young(orig_pmd)) {
>>>>>> - pmdp_invalidate(vma, addr, pmd);
>>>>>> - orig_pmd = pmd_mkold(orig_pmd);
>>>>>> -
>>>>>> - set_pmd_at(mm, addr, pmd, orig_pmd);
>>>>>> - tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>>>>>> - }
>>>>>> -
>>>>>> + pmdp_clear_flush_young_notify(vma, addr, pmd);
>>>>>> folio_clear_referenced(folio);
>>>>>> folio_test_clear_young(folio);
>>>>>> if (folio_test_active(folio))
>>>>>> @@ -496,14 +489,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>
>>>>>> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>
>>>>>> - if (pte_young(ptent)) {
>>>>>> - ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> - tlb->fullmm);
>>>>>> - ptent = pte_mkold(ptent);
>>>>>> - set_pte_at(mm, addr, pte, ptent);
>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> - }
>>>>>> -
>>>>>> + ptep_clear_flush_young_notify(vma, addr, pte);
>>>>>
>>>>> These two places are tricky.
>>>>>
>>>>> I agree there is a problem here, i.e., we are not consulting the mmu
>>>>> notifier. In fact, we do pageout on VMs on ChromeOS, and it's been a
>>>>> known problem to me for a while (not a high priority one).
>>>>>
>>>>> tlb_remove_tlb_entry() is batched flush, ptep_clear_flush_young() is
>>>>> not. But, on x86, we might see a performance improvement since
>>>>> ptep_clear_flush_young() doesn't flush TLB at all. On ARM, there might
>>>>> be regressions though.
>>>>>
>>>>> I'd go with ptep_clear_young_notify(), but IIRC, Minchan mentioned he
>>>>> prefers flush. So I'll let him chime in.
>>>> I am OK with either way even no flush way here is more efficient for
>>>> arm64. Let's wait for Minchan's comment.
>>>
>>> Yes, and I don't think there would be any "negative" consequences
>>> without tlb flushes when clearing the A-bit.
>>>
>>>>> If we do end up with ptep_clear_young_notify(), please remove
>>>>> mmu_gather -- it should have been done in this patch.
>>>>
>>>> I suppose "remove mmu_gather" means to trigger flush tlb operation in
>>>> batched way to make sure no stale data in TLB for long time on arm64
>>>> platform.
>>>
>>> In madvise_cold_or_pageout_pte_range(), we only need struct
>>> mmu_gather *tlb because of tlb_remove_pmd_tlb_entry(), i.e., flushing
>>> tlb after clearing the A-bit. There is no correction, e.g., potential
>>> data corruption, involved there.
>>
>> From https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
>> the reason that arm64 didn't drop whole flush tlb in ptep_clear_flush_young()
>> is to prevent the stale data in TLB. I suppose there is no correction issue
>> there also.
>>
>> So why keep stale data in TLB in madvise_cold_or_pageout_pte_range() is fine?
>
> Sorry, I'm not sure I understand your question here.
Oh. Sorry for the confusion. I will explain my understanding and
question in detail.
>
> In this patch, you removed tlb_remove_tlb_entry(), so we don't need
> struct mmu_gather *tlb any more.
Yes. You are right.
>
> If you are asking why I prefer ptep_clear_young_notify() (no flush),
> which also doesn't need tlb_remove_tlb_entry(), then the answer is
> that the TLB size doesn't scale like DRAM does: the gap has been
> growing exponentially. So there is no way TLB can hold stale entries
> long enough to cause a measurable effect on the A-bit. This isn't a
> conjecture -- it's been proven conversely: we encountered bugs (almost
> every year) caused by missing TLB flushes and resulting in data
> corruption. They were never easy to reproduce, meaning stale entries
> never stayed long in TLB.
when I read https://lore.kernel.org/lkml/20181029105515.GD14127@arm.com/,
my understanding is that arm64 still keep something in ptep_clear_flush_young.
The reason is finishing TLB flush by next context-switch to make sure no
stale entries in TLB cross next context switch.
Should we still keep same behavior (no stable entries in TLB cross next
context switch) for madvise_cold_or_pageout_pte_range()?
So two versions work (I assume we should keep same behavior) for me:
1. using xxx_flush_xxx() version. but with possible regression on arm64.
2. using none flush version. But do batched TLB flush.
Hope this could make things clear. Thanks.
Regards
Yin, Fengwei
next prev parent reply other threads:[~2023-07-26 6:21 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-21 9:40 [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout() Yin Fengwei
2023-07-21 9:40 ` [RFC PATCH v2 1/4] madvise: not use mapcount() against large folio for sharing check Yin Fengwei
2023-07-21 18:57 ` Yu Zhao
2023-07-23 12:26 ` Yin, Fengwei
2023-07-25 5:22 ` Yu Zhao
2023-07-21 9:40 ` [RFC PATCH v2 2/4] madvise: Use notify-able API to clear and flush page table entries Yin Fengwei
2023-07-25 5:55 ` Yu Zhao
2023-07-26 2:49 ` Yin Fengwei
2023-07-26 3:26 ` Yu Zhao
2023-07-26 4:44 ` Yin Fengwei
2023-07-26 5:40 ` Yu Zhao
2023-07-26 6:21 ` Yin Fengwei [this message]
2023-07-27 3:28 ` Yu Zhao
2023-07-28 16:14 ` Yin, Fengwei
2023-07-21 9:40 ` [RFC PATCH v2 3/4] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
2023-07-25 5:42 ` Yu Zhao
2023-07-21 9:40 ` [RFC PATCH v2 4/4] madvise: avoid trying to split large folio always in cold_pageout Yin Fengwei
2023-07-25 5:26 ` Yu Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0843fb4d-ab0b-2766-281c-ef32b6031dd7@intel.com \
--to=fengwei.yin@intel.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=minchan@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=shy828301@gmail.com \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).