From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
Mark Rutland <mark.rutland@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
Date: Thu, 11 Apr 2024 10:45:38 +0100 [thread overview]
Message-ID: <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com> (raw)
In-Reply-To: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...
No problem - thanks for taking the time to think it through and reply with such
clarity. :)
>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.
Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".
Detail: the part about disabling interrupts and TLB flush syncing is
arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
you make that clear further down.
>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 13 11:41:40 2020 +0100
>
> mm/gup: Provide gup_get_pte() more generic
>
> In order to write another lockless page-table walker, we need
> gup_get_pte() exposed. While doing that, rename it to
> ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks. For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something hat we are
> blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date: Thu Oct 9 15:29:14 2014 -0700
>
> mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> /*
> * In the line below we are assuming that the pte can be read
> * atomically. If this is not the case for your architecture,
> * please wrap this in a helper function!
> *
> * for an example see gup_get_pte in arch/x86/mm/gup.c
> */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, this the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
>
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...
>>
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> - vmf->orig_pte = *vmf->pte;
>> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> - /*
>> - * some architectures can have larger ptes than wordsize,
>> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> - * accesses. The code below just needs a consistent view
>> - * for the ifs and we later double check anyway with the
>> - * ptl lock held. So here a barrier will do.
>> - */
>> - barrier();
>> if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.
Agreed.
>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
>> just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never neccessary to specialize
>> pte_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guarranteed to be self-consistent and the remaining fields which
>> are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
> are no races. No removal/addition of access/dirty bits etc.
Today's arm64 ptep_get() guarantees this.
>
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.
This isn't really a property of a ptep_get_lockless(); its a statement about a
class of users. I agree with the statement.
>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.
Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.
But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are guarranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occured".
This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.
Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so either it returns the same pte or it returns a different pte or
garbage. But that garbage could just happen to be the same as the originally
obtained pte. So in that case, it would have a false match. I think this needs
to be changed to ptep_get_lockless()?
>
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
Mark Rutland <mark.rutland@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
Date: Thu, 11 Apr 2024 10:45:38 +0100 [thread overview]
Message-ID: <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com> (raw)
In-Reply-To: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...
No problem - thanks for taking the time to think it through and reply with such
clarity. :)
>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.
Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".
Detail: the part about disabling interrupts and TLB flush syncing is
arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
you make that clear further down.
>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 13 11:41:40 2020 +0100
>
> mm/gup: Provide gup_get_pte() more generic
>
> In order to write another lockless page-table walker, we need
> gup_get_pte() exposed. While doing that, rename it to
> ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks. For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something hat we are
> blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date: Thu Oct 9 15:29:14 2014 -0700
>
> mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> /*
> * In the line below we are assuming that the pte can be read
> * atomically. If this is not the case for your architecture,
> * please wrap this in a helper function!
> *
> * for an example see gup_get_pte in arch/x86/mm/gup.c
> */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, this the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
>
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...
>>
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> - vmf->orig_pte = *vmf->pte;
>> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> - /*
>> - * some architectures can have larger ptes than wordsize,
>> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> - * accesses. The code below just needs a consistent view
>> - * for the ifs and we later double check anyway with the
>> - * ptl lock held. So here a barrier will do.
>> - */
>> - barrier();
>> if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.
Agreed.
>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
>> just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never neccessary to specialize
>> pte_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guarranteed to be self-consistent and the remaining fields which
>> are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
> are no races. No removal/addition of access/dirty bits etc.
Today's arm64 ptep_get() guarantees this.
>
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.
This isn't really a property of a ptep_get_lockless(); its a statement about a
class of users. I agree with the statement.
>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.
Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.
But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are guarranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occured".
This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.
Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so either it returns the same pte or it returns a different pte or
garbage. But that garbage could just happen to be the same as the originally
obtained pte. So in that case, it would have a false match. I think this needs
to be changed to ptep_get_lockless()?
>
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>
next prev parent reply other threads:[~2024-04-11 9:46 UTC|newest]
Thread overview: 96+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-03-26 16:27 ` David Hildenbrand
2024-03-26 16:27 ` David Hildenbrand
2024-03-26 16:39 ` Ryan Roberts
2024-03-26 16:39 ` Ryan Roberts
2024-03-27 9:28 ` David Hildenbrand
2024-03-27 9:28 ` David Hildenbrand
2024-03-27 9:57 ` Ryan Roberts
2024-03-27 9:57 ` Ryan Roberts
2024-03-27 17:02 ` David Hildenbrand
2024-03-27 17:02 ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-03-26 16:30 ` David Hildenbrand
2024-03-26 16:30 ` David Hildenbrand
2024-03-26 16:48 ` Ryan Roberts
2024-03-26 16:48 ` Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-03-26 17:02 ` David Hildenbrand
2024-03-26 17:02 ` David Hildenbrand
2024-03-26 17:27 ` Ryan Roberts
2024-03-26 17:27 ` Ryan Roberts
2024-03-26 17:38 ` David Hildenbrand
2024-03-26 17:38 ` David Hildenbrand
2024-03-26 17:48 ` Ryan Roberts
2024-03-26 17:48 ` Ryan Roberts
2024-03-26 17:58 ` David Hildenbrand
2024-03-26 17:58 ` David Hildenbrand
2024-03-27 9:51 ` Ryan Roberts
2024-03-27 9:51 ` Ryan Roberts
2024-03-27 17:05 ` David Hildenbrand
2024-03-27 17:05 ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-03-26 16:35 ` David Hildenbrand
2024-03-26 16:35 ` David Hildenbrand
2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand
2024-03-26 16:17 ` David Hildenbrand
2024-03-26 16:31 ` Ryan Roberts
2024-03-26 16:31 ` Ryan Roberts
2024-03-26 16:34 ` David Hildenbrand
2024-03-26 16:34 ` David Hildenbrand
2024-03-26 16:53 ` Ryan Roberts
2024-03-26 16:53 ` Ryan Roberts
2024-03-26 17:04 ` David Hildenbrand
2024-03-26 17:04 ` David Hildenbrand
2024-03-26 17:32 ` Ryan Roberts
2024-03-26 17:32 ` Ryan Roberts
2024-03-26 17:39 ` David Hildenbrand
2024-03-26 17:39 ` David Hildenbrand
2024-03-26 17:51 ` Ryan Roberts
2024-03-26 17:51 ` Ryan Roberts
2024-03-27 9:34 ` David Hildenbrand
2024-03-27 9:34 ` David Hildenbrand
2024-03-27 10:01 ` Ryan Roberts
2024-03-27 10:01 ` Ryan Roberts
2024-04-03 12:59 ` Ryan Roberts
2024-04-03 12:59 ` Ryan Roberts
2024-04-08 8:36 ` David Hildenbrand
2024-04-08 8:36 ` David Hildenbrand
2024-04-09 16:35 ` Ryan Roberts
2024-04-09 16:35 ` Ryan Roberts
2024-04-10 20:09 ` David Hildenbrand
2024-04-10 20:09 ` David Hildenbrand
2024-04-11 9:45 ` Ryan Roberts [this message]
2024-04-11 9:45 ` Ryan Roberts
2024-04-12 20:16 ` David Hildenbrand
2024-04-12 20:16 ` David Hildenbrand
2024-04-15 9:28 ` Ryan Roberts
2024-04-15 9:28 ` Ryan Roberts
2024-04-15 10:57 ` David Hildenbrand
2024-04-15 10:57 ` David Hildenbrand
2024-04-15 13:30 ` Ryan Roberts
2024-04-15 13:30 ` Ryan Roberts
2024-04-15 14:23 ` David Hildenbrand
2024-04-15 14:23 ` David Hildenbrand
2024-04-15 14:34 ` Ryan Roberts
2024-04-15 14:34 ` Ryan Roberts
2024-04-15 14:58 ` David Hildenbrand
2024-04-15 14:58 ` David Hildenbrand
2024-04-15 15:17 ` Ryan Roberts
2024-04-15 15:17 ` Ryan Roberts
2024-04-15 15:22 ` David Hildenbrand
2024-04-15 15:22 ` David Hildenbrand
2024-04-15 15:53 ` Ryan Roberts
2024-04-15 15:53 ` Ryan Roberts
2024-04-15 16:02 ` David Hildenbrand
2024-04-15 16:02 ` David Hildenbrand
2024-04-23 10:15 ` Ryan Roberts
2024-04-23 10:15 ` Ryan Roberts
2024-04-23 10:18 ` David Hildenbrand
2024-04-23 10:18 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com \
--to=ryan.roberts@arm.com \
--cc=adrian.hunter@intel.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.shishkin@linux.intel.com \
--cc=catalin.marinas@arm.com \
--cc=david@redhat.com \
--cc=irogers@google.com \
--cc=jolsa@kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mark.rutland@arm.com \
--cc=muchun.song@linux.dev \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.