From: Cornelia Huck <cohuck@redhat.com>
To: Steven Price <steven.price@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Peter Collingbourne <pcc@google.com>
Cc: kvm@vger.kernel.org, Marc Zyngier <maz@kernel.org>,
Andy Lutomirski <luto@amacapital.net>,
Evgenii Stepanov <eugenis@google.com>,
Michael Roth <michael.roth@amd.com>,
Chao Peng <chao.p.peng@linux.intel.com>,
Will Deacon <will@kernel.org>,
kvmarm@lists.cs.columbia.edu,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
Date: Mon, 27 Jun 2022 17:55:33 +0200 [thread overview]
Message-ID: <875ykmcd8q.fsf@redhat.com> (raw)
In-Reply-To: <14f2a69e-4022-e463-1662-30032655e3d1@arm.com>
[I'm still in the process of trying to grok the issues surrounding
MTE+KVM, so apologies in advance if I'm muddying the waters]
On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
> On 24/06/2022 18:05, Catalin Marinas wrote:
>> + Steven as he added the KVM and swap support for MTE.
>>
>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>> depend on being able to map guest memory as MAP_SHARED. The current
>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>> the use of those features with MTE. Therefore, remove this restriction.
>>
>> We already have some corner cases where the PG_mte_tagged logic fails
>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>> on user page allocation, copy them on write. For swap we can scan and if
>> all tags are 0 and just skip saving them.
>>
>> Another aspect is a change in the KVM ABI with this patch. It's probably
>> not that bad since it's rather a relaxation but it has the potential to
>> confuse the VMM, especially as it doesn't know whether it's running on
>> older kernels or not (it would have to probe unless we expose this info
>> to the VMM in some other way).
Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
>>
>>> To avoid races between multiple tasks attempting to clear tags on the
>>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
>>> atomically before beginning to clear tags on a page. If the flag was not
>>> initially set, spin until the other task has finished clearing the tags.
>>
>> TBH, I can't mentally model all the corner cases, so maybe a formal
>> model would help (I can have a go with TLA+, though not sure when I find
>> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
>> this would simplify things (hopefully).
>>
>> As you noticed, the problem is that setting PG_mte_tagged and clearing
>> (or restoring) the tags is not an atomic operation. There are places
>> like mprotect() + CoW where one task can end up with stale tags. Another
>> is shared memfd mappings if more than one mapping sets PROT_MTE and
>> there's the swap restoring on top.
>>
>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>> index f6b00743c399..8f9655053a9f 100644
>>> --- a/arch/arm64/kernel/mte.c
>>> +++ b/arch/arm64/kernel/mte.c
>>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>> * the new page->flags are visible before the tags were updated.
>>> */
>>> smp_wmb();
>>> - mte_clear_page_tags(page_address(page));
>>> + mte_ensure_page_tags_cleared(page);
>>> +}
>>> +
>>> +void mte_ensure_page_tags_cleared(struct page *page)
>>> +{
>>> + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
>>> + while (!test_bit(PG_mte_tagged, &page->flags))
>>> + ;
>>> + } else {
>>> + mte_clear_page_tags(page_address(page));
>>> + set_bit(PG_mte_tagged, &page->flags);
>>> + }
>
> I'm pretty sure we need some form of barrier in here to ensure the tag
> clearing is visible to the other CPU. Otherwise I can't immediately see
> any problems with the approach of a second flag (it was something I had
> considered). But I do also think we should seriously consider Catalin's
> approach of simply zeroing tags unconditionally - it would certainly
> simplify the code.
What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
up copying zeroes?
That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
will be called? I.e. when implementing migration, it should be ok to
call it while the vm is paused, but you probably won't get a consistent
state while the vm is running?
[Postcopy needs a different interface, I guess, so that the migration
target can atomically place a received page and its metadata. I see
https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
has there been any follow-up?]
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
WARNING: multiple messages have this Message-ID (diff)
From: Cornelia Huck <cohuck@redhat.com>
To: Steven Price <steven.price@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Peter Collingbourne <pcc@google.com>
Cc: kvmarm@lists.cs.columbia.edu, Marc Zyngier <maz@kernel.org>,
kvm@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>,
linux-arm-kernel@lists.infradead.org,
Michael Roth <michael.roth@amd.com>,
Chao Peng <chao.p.peng@linux.intel.com>,
Will Deacon <will@kernel.org>,
Evgenii Stepanov <eugenis@google.com>
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
Date: Mon, 27 Jun 2022 17:55:33 +0200 [thread overview]
Message-ID: <875ykmcd8q.fsf@redhat.com> (raw)
In-Reply-To: <14f2a69e-4022-e463-1662-30032655e3d1@arm.com>
[I'm still in the process of trying to grok the issues surrounding
MTE+KVM, so apologies in advance if I'm muddying the waters]
On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
> On 24/06/2022 18:05, Catalin Marinas wrote:
>> + Steven as he added the KVM and swap support for MTE.
>>
>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>> depend on being able to map guest memory as MAP_SHARED. The current
>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>> the use of those features with MTE. Therefore, remove this restriction.
>>
>> We already have some corner cases where the PG_mte_tagged logic fails
>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>> on user page allocation, copy them on write. For swap we can scan and if
>> all tags are 0 and just skip saving them.
>>
>> Another aspect is a change in the KVM ABI with this patch. It's probably
>> not that bad since it's rather a relaxation but it has the potential to
>> confuse the VMM, especially as it doesn't know whether it's running on
>> older kernels or not (it would have to probe unless we expose this info
>> to the VMM in some other way).
Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
>>
>>> To avoid races between multiple tasks attempting to clear tags on the
>>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
>>> atomically before beginning to clear tags on a page. If the flag was not
>>> initially set, spin until the other task has finished clearing the tags.
>>
>> TBH, I can't mentally model all the corner cases, so maybe a formal
>> model would help (I can have a go with TLA+, though not sure when I find
>> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
>> this would simplify things (hopefully).
>>
>> As you noticed, the problem is that setting PG_mte_tagged and clearing
>> (or restoring) the tags is not an atomic operation. There are places
>> like mprotect() + CoW where one task can end up with stale tags. Another
>> is shared memfd mappings if more than one mapping sets PROT_MTE and
>> there's the swap restoring on top.
>>
>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>> index f6b00743c399..8f9655053a9f 100644
>>> --- a/arch/arm64/kernel/mte.c
>>> +++ b/arch/arm64/kernel/mte.c
>>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>> * the new page->flags are visible before the tags were updated.
>>> */
>>> smp_wmb();
>>> - mte_clear_page_tags(page_address(page));
>>> + mte_ensure_page_tags_cleared(page);
>>> +}
>>> +
>>> +void mte_ensure_page_tags_cleared(struct page *page)
>>> +{
>>> + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
>>> + while (!test_bit(PG_mte_tagged, &page->flags))
>>> + ;
>>> + } else {
>>> + mte_clear_page_tags(page_address(page));
>>> + set_bit(PG_mte_tagged, &page->flags);
>>> + }
>
> I'm pretty sure we need some form of barrier in here to ensure the tag
> clearing is visible to the other CPU. Otherwise I can't immediately see
> any problems with the approach of a second flag (it was something I had
> considered). But I do also think we should seriously consider Catalin's
> approach of simply zeroing tags unconditionally - it would certainly
> simplify the code.
What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
up copying zeroes?
That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
will be called? I.e. when implementing migration, it should be ok to
call it while the vm is paused, but you probably won't get a consistent
state while the vm is running?
[Postcopy needs a different interface, I guess, so that the migration
target can atomically place a received page and its metadata. I see
https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
has there been any follow-up?]
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Cornelia Huck <cohuck@redhat.com>
To: Steven Price <steven.price@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Peter Collingbourne <pcc@google.com>
Cc: kvmarm@lists.cs.columbia.edu, Marc Zyngier <maz@kernel.org>,
kvm@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>,
linux-arm-kernel@lists.infradead.org,
Michael Roth <michael.roth@amd.com>,
Chao Peng <chao.p.peng@linux.intel.com>,
Will Deacon <will@kernel.org>,
Evgenii Stepanov <eugenis@google.com>
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
Date: Mon, 27 Jun 2022 17:55:33 +0200 [thread overview]
Message-ID: <875ykmcd8q.fsf@redhat.com> (raw)
In-Reply-To: <14f2a69e-4022-e463-1662-30032655e3d1@arm.com>
[I'm still in the process of trying to grok the issues surrounding
MTE+KVM, so apologies in advance if I'm muddying the waters]
On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
> On 24/06/2022 18:05, Catalin Marinas wrote:
>> + Steven as he added the KVM and swap support for MTE.
>>
>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>> depend on being able to map guest memory as MAP_SHARED. The current
>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>> the use of those features with MTE. Therefore, remove this restriction.
>>
>> We already have some corner cases where the PG_mte_tagged logic fails
>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>> on user page allocation, copy them on write. For swap we can scan and if
>> all tags are 0 and just skip saving them.
>>
>> Another aspect is a change in the KVM ABI with this patch. It's probably
>> not that bad since it's rather a relaxation but it has the potential to
>> confuse the VMM, especially as it doesn't know whether it's running on
>> older kernels or not (it would have to probe unless we expose this info
>> to the VMM in some other way).
Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
>>
>>> To avoid races between multiple tasks attempting to clear tags on the
>>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
>>> atomically before beginning to clear tags on a page. If the flag was not
>>> initially set, spin until the other task has finished clearing the tags.
>>
>> TBH, I can't mentally model all the corner cases, so maybe a formal
>> model would help (I can have a go with TLA+, though not sure when I find
>> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
>> this would simplify things (hopefully).
>>
>> As you noticed, the problem is that setting PG_mte_tagged and clearing
>> (or restoring) the tags is not an atomic operation. There are places
>> like mprotect() + CoW where one task can end up with stale tags. Another
>> is shared memfd mappings if more than one mapping sets PROT_MTE and
>> there's the swap restoring on top.
>>
>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>> index f6b00743c399..8f9655053a9f 100644
>>> --- a/arch/arm64/kernel/mte.c
>>> +++ b/arch/arm64/kernel/mte.c
>>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>> * the new page->flags are visible before the tags were updated.
>>> */
>>> smp_wmb();
>>> - mte_clear_page_tags(page_address(page));
>>> + mte_ensure_page_tags_cleared(page);
>>> +}
>>> +
>>> +void mte_ensure_page_tags_cleared(struct page *page)
>>> +{
>>> + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
>>> + while (!test_bit(PG_mte_tagged, &page->flags))
>>> + ;
>>> + } else {
>>> + mte_clear_page_tags(page_address(page));
>>> + set_bit(PG_mte_tagged, &page->flags);
>>> + }
>
> I'm pretty sure we need some form of barrier in here to ensure the tag
> clearing is visible to the other CPU. Otherwise I can't immediately see
> any problems with the approach of a second flag (it was something I had
> considered). But I do also think we should seriously consider Catalin's
> approach of simply zeroing tags unconditionally - it would certainly
> simplify the code.
What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
up copying zeroes?
That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
will be called? I.e. when implementing migration, it should be ok to
call it while the vm is paused, but you probably won't get a consistent
state while the vm is running?
[Postcopy needs a different interface, I guess, so that the migration
target can atomically place a received page and its metadata. I see
https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
has there been any follow-up?]
next prev parent reply other threads:[~2022-06-27 15:55 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-23 23:49 [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled Peter Collingbourne
2022-06-23 23:49 ` Peter Collingbourne
2022-06-23 23:49 ` Peter Collingbourne
2022-06-24 17:05 ` Catalin Marinas
2022-06-24 17:05 ` Catalin Marinas
2022-06-24 17:05 ` Catalin Marinas
2022-06-24 21:50 ` Peter Collingbourne
2022-06-24 21:50 ` Peter Collingbourne
2022-06-24 21:50 ` Peter Collingbourne
2022-06-27 11:43 ` Catalin Marinas
2022-06-27 11:43 ` Catalin Marinas
2022-06-27 11:43 ` Catalin Marinas
2022-06-27 18:16 ` Peter Collingbourne
2022-06-27 18:16 ` Peter Collingbourne
2022-06-27 18:16 ` Peter Collingbourne
2022-06-28 17:57 ` Catalin Marinas
2022-06-28 17:57 ` Catalin Marinas
2022-06-28 17:57 ` Catalin Marinas
2022-06-28 18:54 ` Peter Collingbourne
2022-06-28 18:54 ` Peter Collingbourne
2022-06-28 18:54 ` Peter Collingbourne
2022-06-29 19:15 ` Catalin Marinas
2022-06-29 19:15 ` Catalin Marinas
2022-06-29 19:15 ` Catalin Marinas
2022-06-30 17:24 ` Catalin Marinas
2022-06-30 17:24 ` Catalin Marinas
2022-06-30 17:24 ` Catalin Marinas
2022-06-25 8:14 ` Steven Price
2022-06-25 8:14 ` Steven Price
2022-06-25 8:14 ` Steven Price
2022-06-27 15:55 ` Cornelia Huck [this message]
2022-06-27 15:55 ` Cornelia Huck
2022-06-27 15:55 ` Cornelia Huck
2022-06-29 8:45 ` Catalin Marinas
2022-06-29 8:45 ` Catalin Marinas
2022-06-29 8:45 ` Catalin Marinas
2022-07-04 9:52 ` Steven Price
2022-07-04 9:52 ` Steven Price
2022-07-04 9:52 ` Steven Price
2022-07-04 12:19 ` Cornelia Huck
2022-07-04 12:19 ` Cornelia Huck
2022-07-04 12:19 ` Cornelia Huck
2022-07-04 15:00 ` Steven Price
2022-07-04 15:00 ` Steven Price
2022-07-04 15:00 ` Steven Price
2022-07-08 13:03 ` Cornelia Huck
2022-07-08 13:03 ` Cornelia Huck
2022-07-08 13:03 ` Cornelia Huck
2022-07-08 13:58 ` Peter Xu
2022-07-08 13:58 ` Peter Xu
2022-07-08 13:58 ` Peter Xu
2022-07-14 13:30 ` Cornelia Huck
2022-07-14 13:30 ` Cornelia Huck
2022-07-14 13:30 ` Cornelia Huck
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=875ykmcd8q.fsf@redhat.com \
--to=cohuck@redhat.com \
--cc=catalin.marinas@arm.com \
--cc=chao.p.peng@linux.intel.com \
--cc=eugenis@google.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.cs.columbia.edu \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=luto@amacapital.net \
--cc=maz@kernel.org \
--cc=michael.roth@amd.com \
--cc=pcc@google.com \
--cc=steven.price@arm.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.