All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte
Date: Tue, 26 Mar 2024 17:48:52 +0000	[thread overview]
Message-ID: <fb101fd1-e28a-4278-bf5f-e2dca0215a90@arm.com> (raw)
In-Reply-To: <f6996c4f-da60-4267-bcf1-09e4fd40c91b@redhat.com>

On 26/03/2024 17:38, David Hildenbrand wrote:
> On 26.03.24 18:27, Ryan Roberts wrote:
>> On 26/03/2024 17:02, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
>>>> ptep_get_lockless_norecency() to save orig_pte.
>>>>
>>>> There are a number of places that follow this model:
>>>>
>>>>       orig_pte = ptep_get_lockless(ptep)
>>>>       ...
>>>>       <lock>
>>>>       if (!pte_same(orig_pte, ptep_get(ptep)))
>>>>               // RACE!
>>>>       ...
>>>>       <unlock>
>>>>
>>>> So we need to be careful to convert all of those to use
>>>> pte_same_norecency() so that the access and dirty bits are excluded from
>>>> the comparison.
>>>>
>>>> Additionally there are a couple of places that genuinely rely on the
>>>> access and dirty bits of orig_pte, but with some careful refactoring, we
>>>> can use ptep_get() once we are holding the lock to achieve equivalent
>>>> logic.
>>>
>>> We really should document that changed behavior somewhere where it can be easily
>>> found: that orig_pte might have incomplete/stale accessed/dirty information.
>>
>> I could add it to the orig_pte definition in the `struct vm_fault`?
>>
>>>
>>>
>>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>                             vmf->address, &vmf->ptl);
>>>>            if (unlikely(!vmf->pte))
>>>>                return 0;
>>>> -        vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>>> +        vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
>>>>            vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>>>
>>>>            if (pte_none(vmf->orig_pte)) {
>>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>
>>>>        spin_lock(vmf->ptl);
>>>>        entry = vmf->orig_pte;
>>>> -    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>> +    if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
>>>>            update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>>>            goto unlock;
>>>
>>> I was wondering about the following:
>>>
>>> Assume the PTE is not dirty.
>>>
>>> Thread 1 does
>>
>> Sorry not sure what threads have to do with this? How is the vmf shared between
>> threads? What have I misunderstood...
> 
> Assume we have a HW that does not have HW-managed access/dirty bits. One that
> ends up using generic ptep_set_access_flags(). Access/dirty bits are always
> updated under PT lock.
> 
> Then, imagine two threads. One is the the fault path here. another thread
> performs some other magic that sets the PTE dirty under PTL.
> 
>>
>>>
>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>> /* not dirty */
>>>
>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */

Ahh, this comment about thread 2 is not referring to the code immediately below
it. It all makes much more sense now. :)

>>>
>>> spin_lock(vmf->ptl);
>>> entry = vmf->orig_pte;
>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>      ...
>>> }
>>> ...
>>> entry = pte_mkyoung(entry);
>>
>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
> 
> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
> unconditionally does that in handle_pte_fault().
> 
>>
>>> if (ptep_set_access_flags(vmf->vma, ...)
>>> ...
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>
>>>
>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>> "hey, there was a change!" let's update the PTE!
>>>
>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>
>> This is called from the generic ptep_set_access_flags() in your example, right?
>>
> 
> Yes.
> 
>>>
>>> would overwrite the dirty bit set by thread 2.
>>
>> I'm not really sure what you are getting at... Is your concern that there is a
>> race where the page could become dirty in the meantime and it now gets lost? I
>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>> update access/dirty we have to deal with the races.
> 
> My concern is that your patch can in subtle ways lead to use losing PTE dirty
> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)

But I think the example you give can already happen today? Thread 1 reads
orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
set dirty just after the get, then thread 1 is going to set the PTE back to (a
modified version of) orig_pte. Isn't it already broken?


> 
> Arm64 should be fine in that regard.
> 

There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
ptep_set_access_flags() can always deal with a racing update, even if that
update originates from SW.

Why do I have the feeling you're about to explain (very patiently) exactly why
I'm wrong?... :)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte
Date: Tue, 26 Mar 2024 17:48:52 +0000	[thread overview]
Message-ID: <fb101fd1-e28a-4278-bf5f-e2dca0215a90@arm.com> (raw)
In-Reply-To: <f6996c4f-da60-4267-bcf1-09e4fd40c91b@redhat.com>

On 26/03/2024 17:38, David Hildenbrand wrote:
> On 26.03.24 18:27, Ryan Roberts wrote:
>> On 26/03/2024 17:02, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
>>>> ptep_get_lockless_norecency() to save orig_pte.
>>>>
>>>> There are a number of places that follow this model:
>>>>
>>>>       orig_pte = ptep_get_lockless(ptep)
>>>>       ...
>>>>       <lock>
>>>>       if (!pte_same(orig_pte, ptep_get(ptep)))
>>>>               // RACE!
>>>>       ...
>>>>       <unlock>
>>>>
>>>> So we need to be careful to convert all of those to use
>>>> pte_same_norecency() so that the access and dirty bits are excluded from
>>>> the comparison.
>>>>
>>>> Additionally there are a couple of places that genuinely rely on the
>>>> access and dirty bits of orig_pte, but with some careful refactoring, we
>>>> can use ptep_get() once we are holding the lock to achieve equivalent
>>>> logic.
>>>
>>> We really should document that changed behavior somewhere where it can be easily
>>> found: that orig_pte might have incomplete/stale accessed/dirty information.
>>
>> I could add it to the orig_pte definition in the `struct vm_fault`?
>>
>>>
>>>
>>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>                             vmf->address, &vmf->ptl);
>>>>            if (unlikely(!vmf->pte))
>>>>                return 0;
>>>> -        vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>>> +        vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
>>>>            vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>>>
>>>>            if (pte_none(vmf->orig_pte)) {
>>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>
>>>>        spin_lock(vmf->ptl);
>>>>        entry = vmf->orig_pte;
>>>> -    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>> +    if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
>>>>            update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>>>            goto unlock;
>>>
>>> I was wondering about the following:
>>>
>>> Assume the PTE is not dirty.
>>>
>>> Thread 1 does
>>
>> Sorry not sure what threads have to do with this? How is the vmf shared between
>> threads? What have I misunderstood...
> 
> Assume we have a HW that does not have HW-managed access/dirty bits. One that
> ends up using generic ptep_set_access_flags(). Access/dirty bits are always
> updated under PT lock.
> 
> Then, imagine two threads. One is the the fault path here. another thread
> performs some other magic that sets the PTE dirty under PTL.
> 
>>
>>>
>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>> /* not dirty */
>>>
>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */

Ahh, this comment about thread 2 is not referring to the code immediately below
it. It all makes much more sense now. :)

>>>
>>> spin_lock(vmf->ptl);
>>> entry = vmf->orig_pte;
>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>      ...
>>> }
>>> ...
>>> entry = pte_mkyoung(entry);
>>
>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
> 
> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
> unconditionally does that in handle_pte_fault().
> 
>>
>>> if (ptep_set_access_flags(vmf->vma, ...)
>>> ...
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>
>>>
>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>> "hey, there was a change!" let's update the PTE!
>>>
>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>
>> This is called from the generic ptep_set_access_flags() in your example, right?
>>
> 
> Yes.
> 
>>>
>>> would overwrite the dirty bit set by thread 2.
>>
>> I'm not really sure what you are getting at... Is your concern that there is a
>> race where the page could become dirty in the meantime and it now gets lost? I
>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>> update access/dirty we have to deal with the races.
> 
> My concern is that your patch can in subtle ways lead to use losing PTE dirty
> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)

But I think the example you give can already happen today? Thread 1 reads
orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
set dirty just after the get, then thread 1 is going to set the PTE back to (a
modified version of) orig_pte. Isn't it already broken?


> 
> Arm64 should be fine in that regard.
> 

There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
ptep_set_access_flags() can always deal with a racing update, even if that
update originates from SW.

Why do I have the feeling you're about to explain (very patiently) exactly why
I'm wrong?... :)



  reply	other threads:[~2024-03-26 17:49 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts
2024-02-15 12:17 ` Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17   ` Ryan Roberts
2024-03-26 16:27   ` David Hildenbrand
2024-03-26 16:27     ` David Hildenbrand
2024-03-26 16:39     ` Ryan Roberts
2024-03-26 16:39       ` Ryan Roberts
2024-03-27  9:28       ` David Hildenbrand
2024-03-27  9:28         ` David Hildenbrand
2024-03-27  9:57         ` Ryan Roberts
2024-03-27  9:57           ` Ryan Roberts
2024-03-27 17:02           ` David Hildenbrand
2024-03-27 17:02             ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17   ` Ryan Roberts
2024-03-26 16:30   ` David Hildenbrand
2024-03-26 16:30     ` David Hildenbrand
2024-03-26 16:48     ` Ryan Roberts
2024-03-26 16:48       ` Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts
2024-02-15 12:17   ` Ryan Roberts
2024-03-26 17:02   ` David Hildenbrand
2024-03-26 17:02     ` David Hildenbrand
2024-03-26 17:27     ` Ryan Roberts
2024-03-26 17:27       ` Ryan Roberts
2024-03-26 17:38       ` David Hildenbrand
2024-03-26 17:38         ` David Hildenbrand
2024-03-26 17:48         ` Ryan Roberts [this message]
2024-03-26 17:48           ` Ryan Roberts
2024-03-26 17:58           ` David Hildenbrand
2024-03-26 17:58             ` David Hildenbrand
2024-03-27  9:51             ` Ryan Roberts
2024-03-27  9:51               ` Ryan Roberts
2024-03-27 17:05               ` David Hildenbrand
2024-03-27 17:05                 ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts
2024-02-15 12:17   ` Ryan Roberts
2024-03-26 16:35   ` David Hildenbrand
2024-03-26 16:35     ` David Hildenbrand
2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand
2024-03-26 16:17   ` David Hildenbrand
2024-03-26 16:31   ` Ryan Roberts
2024-03-26 16:31     ` Ryan Roberts
2024-03-26 16:34     ` David Hildenbrand
2024-03-26 16:34       ` David Hildenbrand
2024-03-26 16:53       ` Ryan Roberts
2024-03-26 16:53         ` Ryan Roberts
2024-03-26 17:04         ` David Hildenbrand
2024-03-26 17:04           ` David Hildenbrand
2024-03-26 17:32           ` Ryan Roberts
2024-03-26 17:32             ` Ryan Roberts
2024-03-26 17:39             ` David Hildenbrand
2024-03-26 17:39               ` David Hildenbrand
2024-03-26 17:51               ` Ryan Roberts
2024-03-26 17:51                 ` Ryan Roberts
2024-03-27  9:34                 ` David Hildenbrand
2024-03-27  9:34                   ` David Hildenbrand
2024-03-27 10:01                   ` Ryan Roberts
2024-03-27 10:01                     ` Ryan Roberts
2024-04-03 12:59                   ` Ryan Roberts
2024-04-03 12:59                     ` Ryan Roberts
2024-04-08  8:36                     ` David Hildenbrand
2024-04-08  8:36                       ` David Hildenbrand
2024-04-09 16:35                       ` Ryan Roberts
2024-04-09 16:35                         ` Ryan Roberts
2024-04-10 20:09                         ` David Hildenbrand
2024-04-10 20:09                           ` David Hildenbrand
2024-04-11  9:45                           ` Ryan Roberts
2024-04-11  9:45                             ` Ryan Roberts
2024-04-12 20:16                             ` David Hildenbrand
2024-04-12 20:16                               ` David Hildenbrand
2024-04-15  9:28                               ` Ryan Roberts
2024-04-15  9:28                                 ` Ryan Roberts
2024-04-15 10:57                                 ` David Hildenbrand
2024-04-15 10:57                                   ` David Hildenbrand
2024-04-15 13:30                                   ` Ryan Roberts
2024-04-15 13:30                                     ` Ryan Roberts
2024-04-15 14:23                                     ` David Hildenbrand
2024-04-15 14:23                                       ` David Hildenbrand
2024-04-15 14:34                                       ` Ryan Roberts
2024-04-15 14:34                                         ` Ryan Roberts
2024-04-15 14:58                                         ` David Hildenbrand
2024-04-15 14:58                                           ` David Hildenbrand
2024-04-15 15:17                                           ` Ryan Roberts
2024-04-15 15:17                                             ` Ryan Roberts
2024-04-15 15:22                                             ` David Hildenbrand
2024-04-15 15:22                                               ` David Hildenbrand
2024-04-15 15:53                                               ` Ryan Roberts
2024-04-15 15:53                                                 ` Ryan Roberts
2024-04-15 16:02                                                 ` David Hildenbrand
2024-04-15 16:02                                                   ` David Hildenbrand
2024-04-23 10:15                                                   ` Ryan Roberts
2024-04-23 10:15                                                     ` Ryan Roberts
2024-04-23 10:18                                                     ` David Hildenbrand
2024-04-23 10:18                                                       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fb101fd1-e28a-4278-bf5f-e2dca0215a90@arm.com \
    --to=ryan.roberts@arm.com \
    --cc=adrian.hunter@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=catalin.marinas@arm.com \
    --cc=david@redhat.com \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mark.rutland@arm.com \
    --cc=muchun.song@linux.dev \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.