All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alistair Popple <apopple@nvidia.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	robin.murphy@arm.com, will@kernel.org, nicolinc@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	kvmarm@lists.cs.columbia.edu, jgg@nvidia.com,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades
Date: Tue, 23 May 2023 09:50:02 +1000	[thread overview]
Message-ID: <875y8jappz.fsf@nvidia.com> (raw)
In-Reply-To: <ZGu1vsbscdO48V6h@google.com>


Sean Christopherson <seanjc@google.com> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>> 
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>> 
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>> 
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>> 
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>> 
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>  include/linux/mmu_notifier.h |  6 +++++
>>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>  3 files changed, 72 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>>   * pages in the range so to mirror those changes the user must inspect the CPU
>>   * page table (from the end callback).
>>   *
>> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
> 	 * invalidate_range_start() is called when all pages in the
> 	 * range are still mapped and have at least a refcount of one.
> 	 *
> 	 * invalidate_range_end() is called when all pages in the
> 	 * range have been unmapped and the pages have been freed by
> 	 * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
> at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
> not the start()/end() variants.

mmu_invalidate_range_end() calls the invalidate_range() callback if
the start()/end() variants aren't set.

> 	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> 		.invalidate_range	= arm_smmu_mm_invalidate_range,
> 		.release		= arm_smmu_mm_release,
> 		.free_notifier		= arm_smmu_mmu_notifier_free,
> 	};
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because it doesn't allow an event
type to be passed which would allow them to be filtered on platforms
which don't require this.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, hence couldn't be called under PTL. That's certainly true of mmu
interval notifier callbacks, but Catalin reminded me that calls such as
ptep_clear_flush_notify() already call invalidate_range() callback under
PTL so I guess we already assume drivers don't sleep in their
invalidate_range() callbacks. I will update the comments to reflect
that.

> 	 * If invalidate_range() is used to manage a non-CPU TLB with
> 	 * shared page-tables, it not necessary to implement the
> 	 * invalidate_range_start()/end() notifiers, as
> 	 * invalidate_range() already catches the points in time when an
> 	 * external TLB range needs to be flushed. For more in depth
> 	 * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need it. Note this isn't a new feature but
rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
into this issue, but I'm no KVM expert. Thanks for the feedback.

 - Alistair

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Alistair Popple <apopple@nvidia.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	robin.murphy@arm.com, will@kernel.org, nicolinc@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	kvmarm@lists.cs.columbia.edu, jgg@nvidia.com,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades
Date: Tue, 23 May 2023 09:50:02 +1000	[thread overview]
Message-ID: <875y8jappz.fsf@nvidia.com> (raw)
In-Reply-To: <ZGu1vsbscdO48V6h@google.com>


Sean Christopherson <seanjc@google.com> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>> 
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>> 
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>> 
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>> 
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>> 
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>  include/linux/mmu_notifier.h |  6 +++++
>>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>  3 files changed, 72 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>>   * pages in the range so to mirror those changes the user must inspect the CPU
>>   * page table (from the end callback).
>>   *
>> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
> 	 * invalidate_range_start() is called when all pages in the
> 	 * range are still mapped and have at least a refcount of one.
> 	 *
> 	 * invalidate_range_end() is called when all pages in the
> 	 * range have been unmapped and the pages have been freed by
> 	 * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
> at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
> not the start()/end() variants.

mmu_invalidate_range_end() calls the invalidate_range() callback if
the start()/end() variants aren't set.

> 	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> 		.invalidate_range	= arm_smmu_mm_invalidate_range,
> 		.release		= arm_smmu_mm_release,
> 		.free_notifier		= arm_smmu_mmu_notifier_free,
> 	};
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because it doesn't allow an event
type to be passed which would allow them to be filtered on platforms
which don't require this.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, hence couldn't be called under PTL. That's certainly true of mmu
interval notifier callbacks, but Catalin reminded me that calls such as
ptep_clear_flush_notify() already call invalidate_range() callback under
PTL so I guess we already assume drivers don't sleep in their
invalidate_range() callbacks. I will update the comments to reflect
that.

> 	 * If invalidate_range() is used to manage a non-CPU TLB with
> 	 * shared page-tables, it not necessary to implement the
> 	 * invalidate_range_start()/end() notifiers, as
> 	 * invalidate_range() already catches the points in time when an
> 	 * external TLB range needs to be flushed. For more in depth
> 	 * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need it. Note this isn't a new feature but
rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
into this issue, but I'm no KVM expert. Thanks for the feedback.

 - Alistair

  reply	other threads:[~2023-05-22 23:53 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-22  6:37 [PATCH] mmu_notifiers: Notify on pte permission upgrades Alistair Popple
2023-05-22  6:37 ` Alistair Popple
2023-05-22  7:15 ` Qi Zheng
2023-05-22  7:15   ` Qi Zheng
2023-05-22  7:45   ` Alistair Popple
2023-05-22  7:45     ` Alistair Popple
2023-05-22  8:28     ` Qi Zheng
2023-05-22  8:28       ` Qi Zheng
2023-05-22 15:09 ` Catalin Marinas
2023-05-22 15:09   ` Catalin Marinas
2023-05-22 23:52   ` Alistair Popple
2023-05-22 23:52     ` Alistair Popple
2023-05-22 18:34 ` Sean Christopherson
2023-05-22 18:34   ` Sean Christopherson
2023-05-22 23:50   ` Alistair Popple [this message]
2023-05-22 23:50     ` Alistair Popple
2023-05-23  0:06     ` Sean Christopherson
2023-05-23  0:06       ` Sean Christopherson
2023-05-23  0:43       ` Alistair Popple
2023-05-23  0:43         ` Alistair Popple
2023-05-23  1:13     ` John Hubbard
2023-05-23  1:13       ` John Hubbard
2023-05-23  4:35       ` Alistair Popple
2023-05-23  4:35         ` Alistair Popple
2023-05-23  0:55 ` John Hubbard
2023-05-23  0:55   ` John Hubbard
2023-05-23  1:12   ` Alistair Popple
2023-05-23  1:12     ` Alistair Popple
2023-05-23  6:45 ` Christoph Hellwig
2023-05-23  6:45   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875y8jappz.fsf@nvidia.com \
    --to=apopple@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nicolinc@nvidia.com \
    --cc=robin.murphy@arm.com \
    --cc=seanjc@google.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.