* [PATCH] x86: ignore spurious faults
@ 2008-01-24 0:05 Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 6:49 ` [PATCH] " Andi Kleen
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:05 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Andi Kleen, Harvey Harrison
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
[ Changes to fault_32.c and fault_64.c are identical, and should be
easy to unify when the time comes. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
---
arch/x86/mm/fault_32.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if ((error_code & 0x02) && !pte_write(*pte))
+ return 0;
+
+#if _PAGE_NX
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+#endif
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -412,6 +459,11 @@ void __kprobes do_page_fault(struct pt_r
if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
vmalloc_fault(address) >= 0)
return;
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -275,6 +275,53 @@ static noinline void pgtable_bad(unsigne
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if ((error_code & 0x02) && !pte_write(*pte))
+ return 0;
+
+#if _PAGE_NX
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+#endif
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -406,6 +453,11 @@ asmlinkage void __kprobes do_page_fault(
if (vmalloc_fault(address) >= 0)
return;
}
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
^ permalink raw reply [flat|nested] 24+ messages in thread

* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
@ 2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 6:49 ` [PATCH] " Andi Kleen
1 sibling, 2 replies; 24+ messages in thread
From: Harvey Harrison @ 2008-01-24 0:18 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
> ===================================================================
> --- a/arch/x86/mm/fault_32.c
> +++ b/arch/x86/mm/fault_32.c
> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows
> + * us to lazily refresh the TLB when increasing the permissions of a
> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
> + * expensive since that implies doing a full cross-processor TLB
> + * flush, even if no stale TLB entries exist on other processors.
> + * There are no security implications to leaving a stale TLB when
> + * increasing the permissions on a page.
> + */
> +static int spurious_fault(unsigned long address,
> + unsigned long error_code)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + /* Reserved-bit violation or user access to kernel space? */
> + if (error_code & (PF_USER | PF_RSVD))
> + return 0;
> +
> + pgd = init_mm.pgd + pgd_index(address);
> + if (!pgd_present(*pgd))
> + return 0;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + return 0;
> +
> + pmd = pmd_offset(pud, address);
> + if (!pmd_present(*pmd))
> + return 0;
> +
> + pte = pte_offset_kernel(pmd, address);
> + if (!pte_present(*pte))
> + return 0;
> + if ((error_code & 0x02) && !pte_write(*pte))
> + return 0;
if ((error_code & PF_WRITE) && !pte_write(*pte))
return 0;
> +
> +#if _PAGE_NX
> + if ((error_code & PF_INSTR) && !pte_exec(*pte))
> + return 0;
> +#endif
> +
How about dropping the #if and relying on the !pte_exec() test always
being false when _PAGE_NX = 0? The compiler should just trim this
all away.
from pgtable.h:
static inline int pte_exec(pte_t pte) { return !(pte_val(pte)
& _PAGE_NX); }
Cheers,
Harvey
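[ Editor's note: Harvey's suggestion relies on constant folding. A minimal
standalone sketch of the idea, using hypothetical stand-in values for
_PAGE_NX and PF_INSTR rather than the kernel's real definitions: when
_PAGE_NX is 0, pte_exec() always returns true, so the guarded return is
dead code the compiler can discard, and the #if becomes unnecessary. ]

```c
#include <assert.h>

/* Hypothetical stand-in values; on 32-bit non-PAE x86, _PAGE_NX is 0. */
#define _PAGE_NX 0UL
#define PF_INSTR 0x10UL

typedef struct { unsigned long val; } pte_t;

static unsigned long pte_val(pte_t pte) { return pte.val; }

/* Mirrors the pgtable.h helper Harvey quotes: with _PAGE_NX == 0 the
 * mask is 0, so this always returns 1. */
static int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }

/* The NX check without the #if: since !pte_exec() folds to a
 * compile-time constant 0 here, the whole branch is dead code. */
static int nx_ok(pte_t pte, unsigned long error_code)
{
	if ((error_code & PF_INSTR) && !pte_exec(pte))
		return 0;
	return 1;
}
```

With a nonzero _PAGE_NX (as on 64-bit), the same source compiles to a real
test, so one version of the code serves both configurations.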
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:18 ` Harvey Harrison
@ 2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
1 sibling, 0 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:26 UTC (permalink / raw)
To: Harvey Harrison; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
Harvey Harrison wrote:
> On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
>
>> ===================================================================
>> --- a/arch/x86/mm/fault_32.c
>> +++ b/arch/x86/mm/fault_32.c
>> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>>
>>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>> + * us to lazily refresh the TLB when increasing the permissions of a
>> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
>> + * expensive since that implies doing a full cross-processor TLB
>> + * flush, even if no stale TLB entries exist on other processors.
>> + * There are no security implications to leaving a stale TLB when
>> + * increasing the permissions on a page.
>> + */
>> +static int spurious_fault(unsigned long address,
>> + unsigned long error_code)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> +
>> + /* Reserved-bit violation or user access to kernel space? */
>> + if (error_code & (PF_USER | PF_RSVD))
>> + return 0;
>> +
>> + pgd = init_mm.pgd + pgd_index(address);
>> + if (!pgd_present(*pgd))
>> + return 0;
>> +
>> + pud = pud_offset(pgd, address);
>> + if (!pud_present(*pud))
>> + return 0;
>> +
>> + pmd = pmd_offset(pud, address);
>> + if (!pmd_present(*pmd))
>> + return 0;
>> +
>> + pte = pte_offset_kernel(pmd, address);
>> + if (!pte_present(*pte))
>> + return 0;
>> + if ((error_code & 0x02) && !pte_write(*pte))
>> + return 0;
>>
>
> if ((error_code & PF_WRITE) && !pte_write(*pte))
> return 0;
>
Oops, thanks. Overlooked that one.
> How about dropping the #if and relying on the !pte_exec() test always
> being false when _PAGE_NX = 0? The compiler should just trim this
> all away.
>
> from pgtable.h:
>
> static inline int pte_exec(pte_t pte) { return !(pte_val(pte)
> & _PAGE_NX); }
>
Thanks for reminding me; I thought of this the other day, but forgot to
do it. Will post an updated patch shortly.
J
* [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
@ 2008-01-24 0:28 ` Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
2008-01-25 15:30 ` Ingo Molnar
1 sibling, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:28 UTC (permalink / raw)
To: Harvey Harrison; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
---
arch/x86/mm/fault_32.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -324,6 +324,51 @@ static int is_f00f_bug(struct pt_regs *r
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+
+ if ((error_code & PF_WRITE) && !pte_write(*pte))
+ return 0;
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -446,6 +491,11 @@ void __kprobes do_page_fault(struct pt_r
if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
vmalloc_fault(address) >= 0)
return;
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -312,6 +312,51 @@ static noinline void pgtable_bad(unsigne
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+
+ if ((error_code & PF_WRITE) && !pte_write(*pte))
+ return 0;
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -443,6 +488,11 @@ asmlinkage void __kprobes do_page_fault(
if (vmalloc_fault(address) >= 0)
return;
}
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
@ 2008-01-24 19:14 ` Matt Mackall
2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-25 15:30 ` Ingo Molnar
1 sibling, 1 reply; 24+ messages in thread
From: Matt Mackall @ 2008-01-24 19:14 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Ingo Molnar, Linux Kernel Mailing List,
Andi Kleen
On Wed, 2008-01-23 at 16:28 -0800, Jeremy Fitzhardinge wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.
There's perhaps an opportunity to do this lazy TLB trick in the mmap
path as well, where RW mappings are initially mapped as RO so we can
catch processes dirtying them and then switched to RW. If the mapping is
shared across threads on multiple cores, we can defer synchronizing the
TLBs on the others.
--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 19:14 ` Matt Mackall
@ 2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-24 23:41 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 19:21 UTC (permalink / raw)
To: Matt Mackall
Cc: Harvey Harrison, Ingo Molnar, Linux Kernel Mailing List,
Andi Kleen
Matt Mackall wrote:
> There's perhaps an opportunity to do this lazy TLB trick in the mmap
> path as well, where RW mappings are initially mapped as RO so we can
> catch processes dirtying them and then switched to RW. If the mapping is
> shared across threads on multiple cores, we can defer synchronizing the
> TLBs on the others.
>
I think spurious usermode faults are already dealt with.
handle_pte_fault() does essentially the same thing as this patch:
if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
update_mmu_cache(vma, address, entry);
} else {
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (write_access)
flush_tlb_page(vma, address);
}
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 19:21 ` Jeremy Fitzhardinge
@ 2008-01-24 23:41 ` Nick Piggin
2008-01-25 0:26 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-01-24 23:41 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen
On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > There's perhaps an opportunity to do this lazy TLB trick in the mmap
> > path as well, where RW mappings are initially mapped as RO so we can
> > catch processes dirtying them and then switched to RW. If the mapping is
> > shared across threads on multiple cores, we can defer synchronizing the
> > TLBs on the others.
>
> I think spurious usermode faults are already dealt with.
> handle_pte_fault() does essentially the same thing as this patch:
>
> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
> update_mmu_cache(vma, address, entry);
> } else {
> /*
> * This is needed only for protection faults but the arch code
> * is not yet telling us if this is a protection fault or not.
> * This still avoids useless tlb flushes for .text page faults
> * with threads.
> */
> if (write_access)
> flush_tlb_page(vma, address);
> }
I (obviously) don't know exactly how the TLB works in x86, but I
thought that on a miss, the CPU walks the pagetables first before
faulting? Maybe that's not the case if there is an RO entry
actually in the TLB?
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 23:41 ` Nick Piggin
@ 2008-01-25 0:26 ` Jeremy Fitzhardinge
2008-01-25 7:36 ` Keir Fraser
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 0:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen, Keir Fraser, Jan Beulich
Nick Piggin wrote:
> On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
>
>> Matt Mackall wrote:
>>
>>> There's perhaps an opportunity to do this lazy TLB trick in the mmap
>>> path as well, where RW mappings are initially mapped as RO so we can
>>> catch processes dirtying them and then switched to RW. If the mapping is
>>> shared across threads on multiple cores, we can defer synchronizing the
>>> TLBs on the others.
>>>
>> I think spurious usermode faults are already dealt with.
>> handle_pte_fault() does essentially the same thing as this patch:
>>
>> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>> update_mmu_cache(vma, address, entry);
>> } else {
>> /*
>> * This is needed only for protection faults but the arch code
>> * is not yet telling us if this is a protection fault or not.
>> * This still avoids useless tlb flushes for .text page faults
>> * with threads.
>> */
>> if (write_access)
>> flush_tlb_page(vma, address);
>> }
>>
>
> I (obviously) don't know exactly how the TLB works in x86, but I
> thought that on a miss, the CPU walks the pagetables first before
> faulting? Maybe that's not the case if there is an RO entry
> actually in the TLB?
>
My understanding is that it will fault immediately if there's a TLB
entry, and rewalk the tables on return from the fault before restarting
the instruction, so there's no need for an explicit TLB flush. The TLB
doesn't have a notion of negative cache entries, so any entry represents
a present page of some variety.
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 0:26 ` Jeremy Fitzhardinge
@ 2008-01-25 7:36 ` Keir Fraser
2008-01-25 8:15 ` Jan Beulich
0 siblings, 1 reply; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 7:36 UTC (permalink / raw)
To: Jeremy Fitzhardinge, Nick Piggin
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen, Jan Beulich
On 25/1/08 00:26, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>> I (obviously) don't know exactly how the TLB works in x86, but I
>> thought that on a miss, the CPU walks the pagetables first before
>> faulting? Maybe that's not the case if there is an RO entry
>> actually in the TLB?
>>
>
> My understanding is that it will fault immediately if there's a TLB
> entry, and rewalk the tables on return from the fault before restarting
> the instruction, so there's no need for an explicit TLB flush. The TLB
> doesn't have a notion of negative cache entries, so any entry represents
> a present page of some variety.
Yes, write access with a r/o TLB entry causes the TLB entry to be flushed
and an immediate #PF with no page walk. This is a hardware optimisation for
copy-on-write demand faults. Both Intel and AMD implement it.
-- Keir
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 7:36 ` Keir Fraser
@ 2008-01-25 8:15 ` Jan Beulich
2008-01-25 8:38 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2008-01-25 8:15 UTC (permalink / raw)
To: Keir Fraser, Jeremy Fitzhardinge, Nick Piggin
Cc: Ingo Molnar, Harvey Harrison, Matt Mackall, Andi Kleen,
Linux Kernel Mailing List
Actually, another thought: permitting (and handling) spurious faults for
kernel mappings conflicts with NMI handling, i.e. great care would be
needed to ensure the NMI path cannot touch any such mapping. So
even the present Xen/Linux Dom0 implementation may have some
(perhaps unlikely) problems here, and it would get worse if we added
e.g. a virtual watchdog NMI (something I am considering, which would
then extend the problem to DomU-s).
Jan
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:15 ` Jan Beulich
@ 2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Jan Beulich
0 siblings, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2008-01-25 8:38 UTC (permalink / raw)
To: Jan Beulich
Cc: Keir Fraser, Jeremy Fitzhardinge, Ingo Molnar, Harvey Harrison,
Matt Mackall, Andi Kleen, Linux Kernel Mailing List
On Friday 25 January 2008 19:15, Jan Beulich wrote:
> Actually, another thought: permitting (and handling) spurious faults for
> kernel mappings conflicts with NMI handling, i.e. great care would be
> needed to ensure the NMI path cannot touch any such mapping. So
> even the present Xen/Linux Dom0 implementation may have some
> (perhaps unlikely) problems here, and it would get worse if we added
> e.g. a virtual watchdog NMI (something I am considering, which would
> then extend the problem to DomU-s).
Can you explain how they conflict?
Thanks,
Nick
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:38 ` Nick Piggin
@ 2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:18 ` Jan Beulich
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 9:11 UTC (permalink / raw)
To: Nick Piggin
Cc: Jan Beulich, Keir Fraser, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
On Friday 25 January 2008 09:38:38 Nick Piggin wrote:
> On Friday 25 January 2008 19:15, Jan Beulich wrote:
> > Actually, another thought: permitting (and handling) spurious faults for
> > kernel mappings conflicts with NMI handling, i.e. great care would be
> > needed to ensure the NMI path cannot touch any such mapping. So
> > even the present Xen/Linux Dom0 implementation may have some
> > (perhaps unlikely) problems here, and it would get worse if we added
> > e.g. a virtual watchdog NMI (something I am considering, which would
> > then extend the problem to DomU-s).
>
> Can you explain how they conflict?
NMI is blocked by the hardware until IRET. When a page fault happens inside
the NMI handler, the early IRET unblocks it, and then NMIs can nest, which
will lead to stack corruption.
-Andi
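
[ Editor's note: the hazard Andi describes can be sketched as a toy state
machine, plain C rather than kernel code. The hardware latch masks NMI
delivery from NMI entry until the next IRET, whoever executes it, so an
IRET executed by a nested page-fault handler re-opens the window while the
first NMI handler is still running. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the hardware NMI latch: NMIs are masked from NMI
 * entry until the next IRET, regardless of who executes that IRET. */
static bool nmi_masked;
static int  nmi_depth;

static void nmi_deliver(void)
{
	if (!nmi_masked) {
		nmi_masked = true;
		nmi_depth++;
	}
}

static void iret(void) { nmi_masked = false; }

/* Depth reached when a second NMI arrives while the first handler is
 * still running.  A page fault inside the handler IRETs early,
 * unmasking NMIs and allowing the nesting Andi describes. */
static int nmi_nest_depth(bool fault_in_handler)
{
	nmi_masked = false;
	nmi_depth  = 0;

	nmi_deliver();			/* first NMI, latch set */
	if (fault_in_handler)
		iret();			/* #PF handler's early IRET */
	nmi_deliver();			/* second NMI during the handler */
	return nmi_depth;
}
```

Without the nested fault the second NMI stays latched until the handler's
own IRET; with it, the handler can be re-entered on the same stack.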
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:11 ` Andi Kleen
@ 2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
0 siblings, 2 replies; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 9:18 UTC (permalink / raw)
To: Andi Kleen, Nick Piggin
Cc: Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar, Harvey Harrison,
Matt Mackall, Linux Kernel Mailing List
On 25/1/08 09:11, "Andi Kleen" <ak@novell.com> wrote:
>>> Actually, another thought: permitting (and handling) spurious faults for
>>> kernel mappings conflicts with NMI handling, i.e. great care would be
>>> needed to ensure the NMI path cannot touch any such mapping. So
>>> even the present Xen/Linux Dom0 implementation may have some
>>> (perhaps unlikely) problems here, and it would get worse if we added
>>> e.g. a virtual watchdog NMI (something I am considering, which would
>>> then extend the problem to DomU-s).
>>
>> Can you explain how they conflict?
>
> NMI is blocked by the hardware until IRET. When a page fault happens inside
> the NMI handler, the early IRET unblocks it, and then NMIs can nest, which
> will lead to stack corruption.
Whether this is a problem in light of Xen spurious faults depends on whether
NMI handlers touch dynamically-allocated data. And if they do, it still
depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
can flush the TLB on NMI entry, or we could have Xen do that implicitly
before invoking the guest NMI handler.
Currently Xen guests do not use NMI for watchdog or oprofile, so that rather
limits the scope for problems.
-- Keir
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:18 ` Keir Fraser
@ 2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 9:51 UTC (permalink / raw)
To: Keir Fraser
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data. And if they do, it still
> depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
> can flush the TLB on NMI entry, or we could have Xen do that implicitly
> before invoking the guest NMI handler.
For PV kernels it would probably be better to just implement a truly
nesting NMI trigger method instead of the one-bit state of the real hardware.
Actually I'm not sure NMIs are really that useful for guests. Most things
done traditionally by NMIs are probably better done outside the guest
in a virtualized environment.
-Andi
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
@ 2008-01-25 10:19 ` Andi Kleen
2008-01-25 13:17 ` Keir Fraser
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 10:19 UTC (permalink / raw)
To: Keir Fraser
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data.
How do you define dynamically-allocated data?
-Andi
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 10:19 ` Andi Kleen
@ 2008-01-25 13:17 ` Keir Fraser
0 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 13:17 UTC (permalink / raw)
To: Andi Kleen
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
On 25/1/08 10:19, "Andi Kleen" <ak@suse.de> wrote:
>> Whether this is a problem in light of Xen spurious faults depends on whether
>> NMI handlers touch dynamically-allocated data.
>
> How do you define dynamically-allocated data?
Anything that could have been a read-only pte or ldt page in a previous life
with no intervening TLB flush. So get_free_page(), kmalloc(), vmalloc(), ...
Actually I think we are fine, now I think about it some more, because we
only clear the software NMI-in-flight flag if the guest executes IRET via
the hypervisor. Most Xen Linux guests only do IRET via the hypervisor when
the current context is an NMI handler (additionally x86_64 also does so when
returning to ring 3). Most importantly for this case, we will *not* IRET via
the hypervisor when returning from a #PF context nested in an NMI context.
Hence the NMI-in-flight flag will not be cleared, and guest virtual NMIs
will not nest. So that's a relief!
-- Keir
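
[ Editor's note: Keir's point is that Xen's virtual-NMI mask is a software
flag with different clearing rules than the hardware latch: it is dropped
only by an IRET that goes via the hypervisor, which a nested #PF return
does not. A toy model under that assumption, companion to the hardware
sketch above but standalone: ]

```c
#include <assert.h>
#include <stdbool.h>

/* Software NMI-in-flight flag, per the Xen behaviour Keir describes:
 * cleared only by an IRET that traps to the hypervisor. */
static bool nmi_in_flight;
static int  vnmi_depth;

static void vnmi_deliver(void)
{
	if (!nmi_in_flight) {
		nmi_in_flight = true;
		vnmi_depth++;
	}
}

static void iret_direct(void)     { /* nested #PF return: flag untouched */ }
static void iret_hypervisor(void) { nmi_in_flight = false; }

/* A spurious #PF inside the virtual NMI handler returns directly, so a
 * second virtual NMI still finds the flag set and cannot nest. */
static int vnmi_nest_depth(void)
{
	nmi_in_flight = false;
	vnmi_depth    = 0;

	vnmi_deliver();		/* first virtual NMI */
	iret_direct();		/* nested #PF returns without hypervisor */
	vnmi_deliver();		/* blocked: flag is still set */
	iret_hypervisor();	/* handler's own exit clears the flag */
	return vnmi_depth;
}
```

Unlike the hardware latch, the nested fault's return leaves the mask in
place, which is why the guest virtual NMIs do not nest.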
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
@ 2008-01-25 9:18 ` Jan Beulich
1 sibling, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2008-01-25 9:18 UTC (permalink / raw)
To: Nick Piggin
Cc: Keir Fraser, Ingo Molnar, Harvey Harrison, Jeremy Fitzhardinge,
Matt Mackall, Andi Kleen, Linux Kernel Mailing List
>>> Nick Piggin <nickpiggin@yahoo.com.au> 25.01.08 09:38 >>>
>On Friday 25 January 2008 19:15, Jan Beulich wrote:
>> Actually, another thought: permitting (and handling) spurious faults for
>> kernel mappings conflicts with NMI handling, i.e. great care would be
>> needed to ensure the NMI path cannot touch any such mapping. So
>> even the present Xen/Linux Dom0 implementation may have some
>> (perhaps unlikely) problems here, and it would get worse if we added
>> e.g. a virtual watchdog NMI (something I am considering, which would
>> then extend the problem to DomU-s).
>
>Can you explain how they conflict?
In the same way as vmalloc faults do (which is why vmalloc_sync_all()
got introduced): a page fault nested inside an NMI will, by virtue of
executing IRET, prematurely tell the processor that NMI handling is
done (and specifically unmask further NMIs).
Jan
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
@ 2008-01-25 15:30 ` Ingo Molnar
2008-01-25 15:54 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2008-01-25 15:30 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.
thanks, applied.
it would be nice to expose this ability of the architecture to the core
Linux kernel mprotect code as well, and let it skip on a TLB flush when
doing a RO->RW transition. It could speed up valgrind and the other
mprotect() users i guess? [and UML too perhaps]
Ingo
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 15:30 ` Ingo Molnar
@ 2008-01-25 15:54 ` Jeremy Fitzhardinge
2008-01-25 18:08 ` Ingo Molnar
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 15:54 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
Ingo Molnar wrote:
> thanks, applied.
>
> it would be nice to expose this ability of the architecture to the core
> Linux kernel mprotect code as well, and let it skip on a TLB flush when
> doing a RO->RW transition.
The usermode fault handler already effectively does this; this patch
just does it for kernel mode as well. I don't know if mprotect takes
advantage of this.
> It could speed up valgrind and the other
> mprotect() users i guess? [and UML too perhaps]
>
Not valgrind (it doesn't rely on mmap protections), but electric fence
perhaps.
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 15:54 ` Jeremy Fitzhardinge
@ 2008-01-25 18:08 ` Ingo Molnar
2008-01-25 18:39 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2008-01-25 18:08 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Ingo Molnar wrote:
>> thanks, applied.
>>
>> it would be nice to expose this ability of the architecture to the core
>> Linux kernel mprotect code as well, and let it skip on a TLB flush when
>> doing a RO->RW transition.
>
> The usermode fault handler already effectively does this; this patch
> just does it for kernel mode as well. I don't know if mprotect takes
> advantage of this.
spurious faults happen all the time on SMP, in the native kernel.
And what i mean is that Linux mprotect currently does not take advantage
of x86's ability to just change the ptes, because there's no structured
way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
The flush happens in mm/mprotect.c's change_protection() function:
flush_tlb_range(vma, start, end);
and that is unnecessary when we increase the protection rights, such as
in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
sure all the pte modifications are visible when the syscall returns.)
and it's a really rare case these days that you can find an area where
Linux does not make use of a hardware MMU feature - so we should fix
this ;-)
Ingo
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 18:08 ` Ingo Molnar
@ 2008-01-25 18:39 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 18:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
Ingo Molnar wrote:
> spurious faults happen all the time on SMP, in the native kernel.
>
> And what i mean is that Linux mprotect currently does not take advantage
> of x86's ability to just change the ptes, because there's no structured
> way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
>
> The flush happens in mm/mprotect.c's change_protection() function:
>
> flush_tlb_range(vma, start, end);
>
> and that is unnecessary when we increase the protection rights, such as
> in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
> sure all the pte modifications are visible when the syscall returns.)
>
> and it's a really rare case these days that you can find an area where
> Linux does not make use of a hardware MMU feature - so we should fix
> this ;-)
Well, I guess this isn't really specific to x86; we could always
legitimately not do a tlb flush after increasing permissions and leave
the fault handler to clean up the mess where needed. But I don't think
that's necessarily much of a win; it's cheaper to just do the tlb flush
rather than take a spurious fault, unless the faults are very rare. If
someone is doing an mprotect on a piece of memory (esp to make it
writable), my guess is that they're going to touch that memory in the
very near future.
The big win for this patch is avoiding cross-cpu tlb invalidation when
changing kernel mappings. mprotect doesn't attempt to do that anyway,
and so can incur spurious faults on other CPUs.
J
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
@ 2008-01-24 6:49 ` Andi Kleen
2008-01-24 7:02 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-24 6:49 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows
vmalloc_fault already has nearly the same logic. You should look
at sharing the code.
-Andi
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 6:49 ` [PATCH] " Andi Kleen
@ 2008-01-24 7:02 ` Jeremy Fitzhardinge
2008-01-24 7:11 ` Andi Kleen
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 7:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
Andi Kleen wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>>
>
> vmalloc_fault already has nearly the same logic. You should look
> at sharing the code.
Hm, I see what you mean, but it's hard to see how to share much code
there. It's a case of "two lines common, one line different" repeated 4
times or so.
J
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 7:02 ` Jeremy Fitzhardinge
@ 2008-01-24 7:11 ` Andi Kleen
0 siblings, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2008-01-24 7:11 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
On Thursday 24 January 2008 08:02:11 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> >> /*
> >> + * Handle a spurious fault caused by a stale TLB entry. This allows
> >
> > vmalloc_fault already has nearly the same logic. You should look
> > at sharing the code.
>
> Hm, I see what you mean, but it's hard to see how to share much code
> there. It's a case of "two lines common, one line different" repeated 4
> times or so.
The core common logic is checking whether the fault agrees with the page
tables. If it does not, the two paths then do different things.
-Andi
Thread overview: 24+ messages
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-24 23:41 ` Nick Piggin
2008-01-25 0:26 ` Jeremy Fitzhardinge
2008-01-25 7:36 ` Keir Fraser
2008-01-25 8:15 ` Jan Beulich
2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
2008-01-25 13:17 ` Keir Fraser
2008-01-25 9:18 ` Jan Beulich
2008-01-25 15:30 ` Ingo Molnar
2008-01-25 15:54 ` Jeremy Fitzhardinge
2008-01-25 18:08 ` Ingo Molnar
2008-01-25 18:39 ` Jeremy Fitzhardinge
2008-01-24 6:49 ` [PATCH] " Andi Kleen
2008-01-24 7:02 ` Jeremy Fitzhardinge
2008-01-24 7:11 ` Andi Kleen