From: Catalin Marinas <catalin.marinas@arm.com>
To: Steven Price <steven.price@arm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
qemu-devel@nongnu.org, Marc Zyngier <maz@kernel.org>,
Juan Quintela <quintela@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
linux-kernel@vger.kernel.org, Dave Martin <Dave.Martin@arm.com>,
linux-arm-kernel@lists.infradead.org,
Thomas Gleixner <tglx@linutronix.de>,
Will Deacon <will@kernel.org>,
kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
Date: Fri, 7 May 2021 19:25:39 +0100 [thread overview]
Message-ID: <20210507182538.GF26528@arm.com> (raw)
In-Reply-To: <329286e8-a8f3-ea1a-1802-58813255a4a5@arm.com>
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > >
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > >
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > >
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > >
> > > So I guess the potential problem is:
> > >
> > > * allocate memory MAP_SHARED but !PROT_MTE
> > > * fork()
> > > * VM causes a fault in parent address space
> > > * child does a mprotect(PROT_MTE)
> > >
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> >
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> >
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> >
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
>
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.
Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.
> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
>
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts. i.e..:
>
> if (!test_bit(PG_dcache_clean, &page->flags)) {
> sync_icache_aliases(page_address(page), page_size(page));
> set_bit(PG_dcache_clean, &page->flags);
> }
>
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?
No, the fix is simple as you said.
> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
>
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).
For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on memfd/tmpfs file
may lead to the same situation but the mmap() itself won't directly
cause allocating the page. The subsequent fault via filemap_map_pages()
seems to take the page lock.
> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.
Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.
If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.
In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (pte_present(oldpte)) {
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);
+ struct page *page = NULL;
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- struct page *page;
-
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (vma->vm_flags & VM_SHARED) {
+ page = vm_normal_page(vma, addr, oldpte);
+ lock_page(page);
+ }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ if (page)
+ unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
Catalin
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
WARNING: multiple messages have this Message-ID (diff)
From: Catalin Marinas <catalin.marinas@arm.com>
To: Steven Price <steven.price@arm.com>
Cc: Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>,
James Morse <james.morse@arm.com>,
Julien Thierry <julien.thierry.kdev@gmail.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
kvmarm@lists.cs.columbia.edu,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, Dave Martin <Dave.Martin@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Thomas Gleixner <tglx@linutronix.de>,
qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
Peter Maydell <peter.maydell@linaro.org>,
Haibo Xu <Haibo.Xu@arm.com>, Andrew Jones <drjones@redhat.com>
Subject: Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
Date: Fri, 7 May 2021 19:25:39 +0100 [thread overview]
Message-ID: <20210507182538.GF26528@arm.com> (raw)
In-Reply-To: <329286e8-a8f3-ea1a-1802-58813255a4a5@arm.com>
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > >
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > >
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > >
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > >
> > > So I guess the potential problem is:
> > >
> > > * allocate memory MAP_SHARED but !PROT_MTE
> > > * fork()
> > > * VM causes a fault in parent address space
> > > * child does a mprotect(PROT_MTE)
> > >
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> >
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> >
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> >
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
>
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.
Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.
> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
>
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts. i.e..:
>
> if (!test_bit(PG_dcache_clean, &page->flags)) {
> sync_icache_aliases(page_address(page), page_size(page));
> set_bit(PG_dcache_clean, &page->flags);
> }
>
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?
No, the fix is simple as you said.
> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
>
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).
For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on memfd/tmpfs file
may lead to the same situation but the mmap() itself won't directly
cause allocating the page. The subsequent fault via filemap_map_pages()
seems to take the page lock.
> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.
Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.
If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.
In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (pte_present(oldpte)) {
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);
+ struct page *page = NULL;
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- struct page *page;
-
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (vma->vm_flags & VM_SHARED) {
+ page = vm_normal_page(vma, addr, oldpte);
+ lock_page(page);
+ }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ if (page)
+ unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Catalin Marinas <catalin.marinas@arm.com>
To: Steven Price <steven.price@arm.com>
Cc: Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>,
James Morse <james.morse@arm.com>,
Julien Thierry <julien.thierry.kdev@gmail.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
kvmarm@lists.cs.columbia.edu,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, Dave Martin <Dave.Martin@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Thomas Gleixner <tglx@linutronix.de>,
qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
Peter Maydell <peter.maydell@linaro.org>,
Haibo Xu <Haibo.Xu@arm.com>, Andrew Jones <drjones@redhat.com>
Subject: Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
Date: Fri, 7 May 2021 19:25:39 +0100 [thread overview]
Message-ID: <20210507182538.GF26528@arm.com> (raw)
In-Reply-To: <329286e8-a8f3-ea1a-1802-58813255a4a5@arm.com>
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > >
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > >
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > >
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > >
> > > So I guess the potential problem is:
> > >
> > > * allocate memory MAP_SHARED but !PROT_MTE
> > > * fork()
> > > * VM causes a fault in parent address space
> > > * child does a mprotect(PROT_MTE)
> > >
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> >
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> >
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> >
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
>
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.
Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.
> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
>
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts. i.e..:
>
> if (!test_bit(PG_dcache_clean, &page->flags)) {
> sync_icache_aliases(page_address(page), page_size(page));
> set_bit(PG_dcache_clean, &page->flags);
> }
>
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?
No, the fix is simple as you said.
> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
>
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).
For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on memfd/tmpfs file
may lead to the same situation but the mmap() itself won't directly
cause allocating the page. The subsequent fault via filemap_map_pages()
seems to take the page lock.
> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.
Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.
If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.
In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (pte_present(oldpte)) {
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);
+ struct page *page = NULL;
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- struct page *page;
-
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (vma->vm_flags & VM_SHARED) {
+ page = vm_normal_page(vma, addr, oldpte);
+ lock_page(page);
+ }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ if (page)
+ unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
Catalin
WARNING: multiple messages have this Message-ID (diff)
From: Catalin Marinas <catalin.marinas@arm.com>
To: Steven Price <steven.price@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
Peter Maydell <peter.maydell@linaro.org>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Andrew Jones <drjones@redhat.com>, Haibo Xu <Haibo.Xu@arm.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
qemu-devel@nongnu.org, Marc Zyngier <maz@kernel.org>,
Juan Quintela <quintela@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
linux-kernel@vger.kernel.org, Dave Martin <Dave.Martin@arm.com>,
James Morse <james.morse@arm.com>,
linux-arm-kernel@lists.infradead.org,
Thomas Gleixner <tglx@linutronix.de>,
Will Deacon <will@kernel.org>,
kvmarm@lists.cs.columbia.edu,
Julien Thierry <julien.thierry.kdev@gmail.com>
Subject: Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
Date: Fri, 7 May 2021 19:25:39 +0100 [thread overview]
Message-ID: <20210507182538.GF26528@arm.com> (raw)
In-Reply-To: <329286e8-a8f3-ea1a-1802-58813255a4a5@arm.com>
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > >
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > >
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > >
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > >
> > > So I guess the potential problem is:
> > >
> > > * allocate memory MAP_SHARED but !PROT_MTE
> > > * fork()
> > > * VM causes a fault in parent address space
> > > * child does a mprotect(PROT_MTE)
> > >
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> >
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> >
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> >
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
>
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.
Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.
> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
>
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts. i.e..:
>
> if (!test_bit(PG_dcache_clean, &page->flags)) {
> sync_icache_aliases(page_address(page), page_size(page));
> set_bit(PG_dcache_clean, &page->flags);
> }
>
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?
No, the fix is simple as you said.
> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
>
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).
For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on memfd/tmpfs file
may lead to the same situation but the mmap() itself won't directly
cause allocating the page. The subsequent fault via filemap_map_pages()
seems to take the page lock.
> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.
Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.
If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.
In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (pte_present(oldpte)) {
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);
+ struct page *page = NULL;
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- struct page *page;
-
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (vma->vm_flags & VM_SHARED) {
+ page = vm_normal_page(vma, addr, oldpte);
+ lock_page(page);
+ }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ if (page)
+ unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
Catalin
next prev parent reply other threads:[~2021-05-07 18:25 UTC|newest]
Thread overview: 104+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-27 17:43 ` Catalin Marinas
2021-04-27 17:43 ` Catalin Marinas
2021-04-27 17:43 ` Catalin Marinas
2021-04-27 17:43 ` Catalin Marinas
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-05-04 15:29 ` Catalin Marinas
2021-05-04 15:29 ` Catalin Marinas
2021-05-04 15:29 ` Catalin Marinas
2021-05-04 15:29 ` Catalin Marinas
2021-04-16 15:43 ` [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-28 17:07 ` Catalin Marinas
2021-04-28 17:07 ` Catalin Marinas
2021-04-28 17:07 ` Catalin Marinas
2021-04-28 17:07 ` Catalin Marinas
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-05-04 17:40 ` Catalin Marinas
2021-05-04 17:40 ` Catalin Marinas
2021-05-04 17:40 ` Catalin Marinas
2021-05-04 17:40 ` Catalin Marinas
2021-05-06 16:15 ` Steven Price
2021-05-06 16:15 ` Steven Price
2021-05-06 16:15 ` Steven Price
2021-05-06 16:15 ` Steven Price
2021-05-07 18:25 ` Catalin Marinas [this message]
2021-05-07 18:25 ` Catalin Marinas
2021-05-07 18:25 ` Catalin Marinas
2021-05-07 18:25 ` Catalin Marinas
2021-05-10 18:35 ` Catalin Marinas
2021-05-10 18:35 ` Catalin Marinas
2021-05-10 18:35 ` Catalin Marinas
2021-05-10 18:35 ` Catalin Marinas
2021-05-12 15:46 ` Steven Price
2021-05-12 15:46 ` Steven Price
2021-05-12 15:46 ` Steven Price
2021-05-12 15:46 ` Steven Price
2021-05-12 17:45 ` Catalin Marinas
2021-05-12 17:45 ` Catalin Marinas
2021-05-12 17:45 ` Catalin Marinas
2021-05-12 17:45 ` Catalin Marinas
2021-05-13 10:57 ` Steven Price
2021-05-13 10:57 ` Steven Price
2021-05-13 10:57 ` Steven Price
2021-05-13 10:57 ` Steven Price
2021-05-13 15:08 ` Catalin Marinas
2021-05-13 15:08 ` Catalin Marinas
2021-05-13 15:08 ` Catalin Marinas
2021-05-13 15:08 ` Catalin Marinas
2021-05-13 15:21 ` Catalin Marinas
2021-05-13 15:21 ` Catalin Marinas
2021-05-13 15:21 ` Catalin Marinas
2021-05-13 15:21 ` Catalin Marinas
2021-04-16 15:43 ` [PATCH v11 3/6] arm64: kvm: Save/restore MTE registers Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` [PATCH v11 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-27 17:58 ` Catalin Marinas
2021-04-27 17:58 ` Catalin Marinas
2021-04-27 17:58 ` Catalin Marinas
2021-04-27 17:58 ` Catalin Marinas
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-04-29 16:06 ` Steven Price
2021-05-04 17:44 ` Catalin Marinas
2021-05-04 17:44 ` Catalin Marinas
2021-05-04 17:44 ` Catalin Marinas
2021-05-04 17:44 ` Catalin Marinas
2021-05-07 9:44 ` Steven Price
2021-05-07 9:44 ` Steven Price
2021-05-07 9:44 ` Steven Price
2021-05-07 9:44 ` Steven Price
2021-05-07 9:59 ` David Laight
2021-05-07 9:59 ` David Laight
2021-05-07 9:59 ` David Laight
2021-05-07 9:59 ` David Laight
2021-04-16 15:43 ` [PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
2021-04-16 15:43 ` Steven Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210507182538.GF26528@arm.com \
--to=catalin.marinas@arm.com \
--cc=Dave.Martin@arm.com \
--cc=dgilbert@redhat.com \
--cc=kvmarm@lists.cs.columbia.edu \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maz@kernel.org \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=richard.henderson@linaro.org \
--cc=steven.price@arm.com \
--cc=tglx@linutronix.de \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.