* [PATCH] x86: ignore spurious faults
@ 2008-01-24 0:05 Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 6:49 ` [PATCH] " Andi Kleen
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:05 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Andi Kleen, Harvey Harrison
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
[ Changes to fault_32.c and fault_64.c are identical, and should be
easy to unify when the time comes. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
---
arch/x86/mm/fault_32.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if ((error_code & 0x02) && !pte_write(*pte))
+ return 0;
+
+#if _PAGE_NX
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+#endif
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -412,6 +459,11 @@ void __kprobes do_page_fault(struct pt_r
if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
vmalloc_fault(address) >= 0)
return;
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -275,6 +275,53 @@ static noinline void pgtable_bad(unsigne
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if ((error_code & 0x02) && !pte_write(*pte))
+ return 0;
+
+#if _PAGE_NX
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+#endif
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -406,6 +453,11 @@ asmlinkage void __kprobes do_page_fault(
if (vmalloc_fault(address) >= 0)
return;
}
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
^ permalink raw reply [flat|nested] 24+ messages in thread

* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
@ 2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 6:49 ` [PATCH] " Andi Kleen
1 sibling, 2 replies; 24+ messages in thread
From: Harvey Harrison @ 2008-01-24 0:18 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
> ===================================================================
> --- a/arch/x86/mm/fault_32.c
> +++ b/arch/x86/mm/fault_32.c
> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows
> + * us to lazily refresh the TLB when increasing the permissions of a
> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
> + * expensive since that implies doing a full cross-processor TLB
> + * flush, even if no stale TLB entries exist on other processors.
> + * There are no security implications to leaving a stale TLB when
> + * increasing the permissions on a page.
> + */
> +static int spurious_fault(unsigned long address,
> + unsigned long error_code)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + /* Reserved-bit violation or user access to kernel space? */
> + if (error_code & (PF_USER | PF_RSVD))
> + return 0;
> +
> + pgd = init_mm.pgd + pgd_index(address);
> + if (!pgd_present(*pgd))
> + return 0;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + return 0;
> +
> + pmd = pmd_offset(pud, address);
> + if (!pmd_present(*pmd))
> + return 0;
> +
> + pte = pte_offset_kernel(pmd, address);
> + if (!pte_present(*pte))
> + return 0;
> + if ((error_code & 0x02) && !pte_write(*pte))
> + return 0;
if ((error_code & PF_WRITE) && !pte_write(*pte))
return 0;
> +
> +#if _PAGE_NX
> + if ((error_code & PF_INSTR) && !pte_exec(*pte))
> + return 0;
> +#endif
> +
How about dropping the #if and relying on the !pte_exec() test always
being false when _PAGE_NX = 0? The compiler should just trim this
all away.
from pgtable.h:
static inline int pte_exec(pte_t pte) { return !(pte_val(pte)
& _PAGE_NX); }
Cheers,
Harvey
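[ Editor's note: Harvey's suggestion relies on constant folding. A minimal
standalone sketch of the idea, using hypothetical stand-in values for
_PAGE_NX and PF_INSTR rather than the kernel's real definitions: when
_PAGE_NX is 0, pte_exec() always returns true, so the guarded return is
dead code the compiler can discard, and the #if becomes unnecessary. ]

```c
#include <assert.h>

/* Hypothetical stand-in values; on 32-bit non-PAE x86, _PAGE_NX is 0. */
#define _PAGE_NX 0UL
#define PF_INSTR 0x10UL

typedef struct { unsigned long val; } pte_t;

static unsigned long pte_val(pte_t pte) { return pte.val; }

/* Mirrors the pgtable.h helper Harvey quotes: with _PAGE_NX == 0 the
 * mask is 0, so this always returns 1. */
static int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }

/* The NX check without the #if: since !pte_exec() folds to a
 * compile-time constant 0 here, the whole branch is dead code. */
static int nx_ok(pte_t pte, unsigned long error_code)
{
	if ((error_code & PF_INSTR) && !pte_exec(pte))
		return 0;
	return 1;
}
```

With a nonzero _PAGE_NX (as on 64-bit), the same source compiles to a real
test, so one version of the code serves both configurations.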
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:18 ` Harvey Harrison
@ 2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
1 sibling, 0 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:26 UTC (permalink / raw)
To: Harvey Harrison; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
Harvey Harrison wrote:
> On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
>
>> ===================================================================
>> --- a/arch/x86/mm/fault_32.c
>> +++ b/arch/x86/mm/fault_32.c
>> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>>
>>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>> + * us to lazily refresh the TLB when increasing the permissions of a
>> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
>> + * expensive since that implies doing a full cross-processor TLB
>> + * flush, even if no stale TLB entries exist on other processors.
>> + * There are no security implications to leaving a stale TLB when
>> + * increasing the permissions on a page.
>> + */
>> +static int spurious_fault(unsigned long address,
>> + unsigned long error_code)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> +
>> + /* Reserved-bit violation or user access to kernel space? */
>> + if (error_code & (PF_USER | PF_RSVD))
>> + return 0;
>> +
>> + pgd = init_mm.pgd + pgd_index(address);
>> + if (!pgd_present(*pgd))
>> + return 0;
>> +
>> + pud = pud_offset(pgd, address);
>> + if (!pud_present(*pud))
>> + return 0;
>> +
>> + pmd = pmd_offset(pud, address);
>> + if (!pmd_present(*pmd))
>> + return 0;
>> +
>> + pte = pte_offset_kernel(pmd, address);
>> + if (!pte_present(*pte))
>> + return 0;
>> + if ((error_code & 0x02) && !pte_write(*pte))
>> + return 0;
>>
>
> if ((error_code & PF_WRITE) && !pte_write(*pte))
> return 0;
>
Oops, thanks. Overlooked that one.
> How about dropping the #if and relying on the !pte_exec() test always
> being false when _PAGE_NX = 0? The compiler should just trim this
> all away.
>
> from pgtable.h:
>
> static inline int pte_exec(pte_t pte) { return !(pte_val(pte)
> & _PAGE_NX); }
>
Thanks for reminding me; I thought of this the other day, but forgot to
do it. Will post an updated patch shortly.
J
* [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
@ 2008-01-24 0:28 ` Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
2008-01-25 15:30 ` Ingo Molnar
1 sibling, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 0:28 UTC (permalink / raw)
To: Harvey Harrison; +Cc: Ingo Molnar, Linux Kernel Mailing List, Andi Kleen
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
---
arch/x86/mm/fault_32.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -324,6 +324,51 @@ static int is_f00f_bug(struct pt_regs *r
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+
+ if ((error_code & PF_WRITE) && !pte_write(*pte))
+ return 0;
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -446,6 +491,11 @@ void __kprobes do_page_fault(struct pt_r
if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
vmalloc_fault(address) >= 0)
return;
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -312,6 +312,51 @@ static noinline void pgtable_bad(unsigne
}
/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+ unsigned long error_code)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ /* Reserved-bit violation or user access to kernel space? */
+ if (error_code & (PF_USER | PF_RSVD))
+ return 0;
+
+ pgd = init_mm.pgd + pgd_index(address);
+ if (!pgd_present(*pgd))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+
+ if ((error_code & PF_WRITE) && !pte_write(*pte))
+ return 0;
+ if ((error_code & PF_INSTR) && !pte_exec(*pte))
+ return 0;
+
+ return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -443,6 +488,11 @@ asmlinkage void __kprobes do_page_fault(
if (vmalloc_fault(address) >= 0)
return;
}
+
+ /* Can handle a stale RO->RW TLB */
+ if (spurious_fault(address, error_code))
+ return;
+
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
@ 2008-01-24 19:14 ` Matt Mackall
2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-25 15:30 ` Ingo Molnar
1 sibling, 1 reply; 24+ messages in thread
From: Matt Mackall @ 2008-01-24 19:14 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Ingo Molnar, Linux Kernel Mailing List,
Andi Kleen
On Wed, 2008-01-23 at 16:28 -0800, Jeremy Fitzhardinge wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.
There's perhaps an opportunity to do this lazy TLB trick in the mmap
path as well, where RW mappings are initially mapped as RO so we can
catch processes dirtying them and then switched to RW. If the mapping is
shared across threads on multiple cores, we can defer synchronizing the
TLBs on the others.
--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 19:14 ` Matt Mackall
@ 2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-24 23:41 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 19:21 UTC (permalink / raw)
To: Matt Mackall
Cc: Harvey Harrison, Ingo Molnar, Linux Kernel Mailing List,
Andi Kleen
Matt Mackall wrote:
> There's perhaps an opportunity to do this lazy TLB trick in the mmap
> path as well, where RW mappings are initially mapped as RO so we can
> catch processes dirtying them and then switched to RW. If the mapping is
> shared across threads on multiple cores, we can defer synchronizing the
> TLBs on the others.
>
I think spurious usermode faults are already dealt with.
handle_pte_fault() does essentially the same thing as this patch:
if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
update_mmu_cache(vma, address, entry);
} else {
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (write_access)
flush_tlb_page(vma, address);
}
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 19:21 ` Jeremy Fitzhardinge
@ 2008-01-24 23:41 ` Nick Piggin
2008-01-25 0:26 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-01-24 23:41 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen
On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > There's perhaps an opportunity to do this lazy TLB trick in the mmap
> > path as well, where RW mappings are initially mapped as RO so we can
> > catch processes dirtying them and then switched to RW. If the mapping is
> > shared across threads on multiple cores, we can defer synchronizing the
> > TLBs on the others.
>
> I think spurious usermode faults are already dealt with.
> handle_pte_fault() does essentially the same thing as this patch:
>
> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
> update_mmu_cache(vma, address, entry);
> } else {
> /*
> * This is needed only for protection faults but the arch code
> * is not yet telling us if this is a protection fault or not.
> * This still avoids useless tlb flushes for .text page faults
> * with threads.
> */
> if (write_access)
> flush_tlb_page(vma, address);
> }
I (obviously) don't know exactly how the TLB works in x86, but I
thought that on a miss, the CPU walks the pagetables first before
faulting? Maybe that's not the case if there is an RO entry
actually in the TLB?
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 23:41 ` Nick Piggin
@ 2008-01-25 0:26 ` Jeremy Fitzhardinge
2008-01-25 7:36 ` Keir Fraser
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 0:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen, Keir Fraser, Jan Beulich
Nick Piggin wrote:
> On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
>
>> Matt Mackall wrote:
>>
>>> There's perhaps an opportunity to do this lazy TLB trick in the mmap
>>> path as well, where RW mappings are initially mapped as RO so we can
>>> catch processes dirtying them and then switched to RW. If the mapping is
>>> shared across threads on multiple cores, we can defer synchronizing the
>>> TLBs on the others.
>>>
>> I think spurious usermode faults are already dealt with.
>> handle_pte_fault() does essentially the same thing as this patch:
>>
>> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>> update_mmu_cache(vma, address, entry);
>> } else {
>> /*
>> * This is needed only for protection faults but the arch code
>> * is not yet telling us if this is a protection fault or not.
>> * This still avoids useless tlb flushes for .text page faults
>> * with threads.
>> */
>> if (write_access)
>> flush_tlb_page(vma, address);
>> }
>>
>
> I (obviously) don't know exactly how the TLB works in x86, but I
> thought that on a miss, the CPU walks the pagetables first before
> faulting? Maybe that's not the case if there is an RO entry
> actually in the TLB?
>
My understanding is that it will fault immediately if there's a TLB
entry, and rewalk the tables on return from the fault before restarting
the instruction, so there's no need for an explicit TLB flush. The TLB
doesn't have a notion of negative cache entries, so any entry represents
a present page of some variety.
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 0:26 ` Jeremy Fitzhardinge
@ 2008-01-25 7:36 ` Keir Fraser
2008-01-25 8:15 ` Jan Beulich
0 siblings, 1 reply; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 7:36 UTC (permalink / raw)
To: Jeremy Fitzhardinge, Nick Piggin
Cc: Matt Mackall, Harvey Harrison, Ingo Molnar,
Linux Kernel Mailing List, Andi Kleen, Jan Beulich
On 25/1/08 00:26, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>> I (obviously) don't know exactly how the TLB works in x86, but I
>> thought that on a miss, the CPU walks the pagetables first before
>> faulting? Maybe that's not the case if there is an RO entry
>> actually in the TLB?
>>
>
> My understanding is that it will fault immediately if there's a TLB
> entry, and rewalk the tables on return from the fault before restarting
> the instruction, so there's no need for an explicit TLB flush. The TLB
> doesn't have a notion of negative cache entries, so any entry represents
> a present page of some variety.
Yes, write access with a r/o TLB entry causes the TLB entry to be flushed
and an immediate #PF with no page walk. This is a hardware optimisation for
copy-on-write demand faults. Both Intel and AMD implement it.
-- Keir
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 7:36 ` Keir Fraser
@ 2008-01-25 8:15 ` Jan Beulich
2008-01-25 8:38 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2008-01-25 8:15 UTC (permalink / raw)
To: Keir Fraser, Jeremy Fitzhardinge, Nick Piggin
Cc: Ingo Molnar, Harvey Harrison, Matt Mackall, Andi Kleen,
Linux Kernel Mailing List
Actually, another thought: permitting (and handling) spurious faults for
kernel mappings conflicts with NMI handling, i.e. great care would be
needed to ensure the NMI path cannot touch any such mapping. So
even the present Xen/Linux Dom0 implementation may have some
(perhaps unlikely) problems here, and it would get worse if we added
e.g. a virtual watchdog NMI (something I am considering, which would
then extend the problem to DomU-s).
Jan
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:15 ` Jan Beulich
@ 2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Jan Beulich
0 siblings, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2008-01-25 8:38 UTC (permalink / raw)
To: Jan Beulich
Cc: Keir Fraser, Jeremy Fitzhardinge, Ingo Molnar, Harvey Harrison,
Matt Mackall, Andi Kleen, Linux Kernel Mailing List
On Friday 25 January 2008 19:15, Jan Beulich wrote:
> Actually, another thought: permitting (and handling) spurious faults for
> kernel mappings conflicts with NMI handling, i.e. great care would be
> needed to ensure the NMI path cannot touch any such mapping. So
> even the present Xen/Linux Dom0 implementation may have some
> (perhaps unlikely) problems here, and it would get worse if we added
> e.g. a virtual watchdog NMI (something I am considering, which would
> then extend the problem to DomU-s).
Can you explain how they conflict?
Thanks,
Nick
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:38 ` Nick Piggin
@ 2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:18 ` Jan Beulich
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 9:11 UTC (permalink / raw)
To: Nick Piggin
Cc: Jan Beulich, Keir Fraser, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
On Friday 25 January 2008 09:38:38 Nick Piggin wrote:
> On Friday 25 January 2008 19:15, Jan Beulich wrote:
> > Actually, another thought: permitting (and handling) spurious faults for
> > kernel mappings conflicts with NMI handling, i.e. great care would be
> > needed to ensure the NMI path cannot touch any such mapping. So
> > even the present Xen/Linux Dom0 implementation may have some
> > (perhaps unlikely) problems here, and it would get worse if we added
> > e.g. a virtual watchdog NMI (something I am considering, which would
> > then extend the problem to DomU-s).
>
> Can you explain how they conflict?
NMI is blocked by the hardware until IRET. When a page fault happens inside
the NMI handler, the early IRET unblocks it, and then NMIs can nest, which
will lead to stack corruption.
-Andi
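
[ Editor's note: the hazard Andi describes can be sketched as a toy state
machine, plain C rather than kernel code. The hardware latch masks NMI
delivery from NMI entry until the next IRET, whoever executes it, so an
IRET executed by a nested page-fault handler re-opens the window while the
first NMI handler is still running. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the hardware NMI latch: NMIs are masked from NMI
 * entry until the next IRET, regardless of who executes that IRET. */
static bool nmi_masked;
static int  nmi_depth;

static void nmi_deliver(void)
{
	if (!nmi_masked) {
		nmi_masked = true;
		nmi_depth++;
	}
}

static void iret(void) { nmi_masked = false; }

/* Depth reached when a second NMI arrives while the first handler is
 * still running.  A page fault inside the handler IRETs early,
 * unmasking NMIs and allowing the nesting Andi describes. */
static int nmi_nest_depth(bool fault_in_handler)
{
	nmi_masked = false;
	nmi_depth  = 0;

	nmi_deliver();			/* first NMI, latch set */
	if (fault_in_handler)
		iret();			/* #PF handler's early IRET */
	nmi_deliver();			/* second NMI during the handler */
	return nmi_depth;
}
```

Without the nested fault the second NMI stays latched until the handler's
own IRET; with it, the handler can be re-entered on the same stack.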
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:11 ` Andi Kleen
@ 2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
0 siblings, 2 replies; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 9:18 UTC (permalink / raw)
To: Andi Kleen, Nick Piggin
Cc: Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar, Harvey Harrison,
Matt Mackall, Linux Kernel Mailing List
On 25/1/08 09:11, "Andi Kleen" <ak@novell.com> wrote:
>>> Actually, another thought: permitting (and handling) spurious faults for
>>> kernel mappings conflicts with NMI handling, i.e. great care would be
>>> needed to ensure the NMI path cannot touch any such mapping. So
>>> even the present Xen/Linux Dom0 implementation may have some
>>> (perhaps unlikely) problems here, and it would get worse if we added
>>> e.g. a virtual watchdog NMI (something I am considering, which would
>>> then extend the problem to DomU-s).
>>
>> Can you explain how they conflict?
>
> NMI is blocked by the hardware until IRET. When a page fault happens inside
> the NMI handler, the early IRET unblocks it, and then NMIs can nest, which
> will lead to stack corruption.
Whether this is a problem in light of Xen spurious faults depends on whether
NMI handlers touch dynamically-allocated data. And if they do, it still
depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
can flush the TLB on NMI entry, or we could have Xen do that implicitly
before invoking the guest NMI handler.
Currently Xen guests do not use NMI for watchdog or oprofile, so that rather
limits the scope for problems.
-- Keir
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:18 ` Keir Fraser
@ 2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 9:51 UTC (permalink / raw)
To: Keir Fraser
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data. And if they do, it still
> depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
> can flush the TLB on NMI entry, or we could have Xen do that implicitly
> before invoking the guest NMI handler.
For PV kernels it would probably be better to just implement a truly
nesting NMI trigger method instead of the one-bit state of the real hardware.
Actually I'm not sure NMIs are really that useful for guests. Most things
done traditionally by NMIs are probably better done outside the guest
in a virtualized environment.
-Andi
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
@ 2008-01-25 10:19 ` Andi Kleen
2008-01-25 13:17 ` Keir Fraser
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-25 10:19 UTC (permalink / raw)
To: Keir Fraser
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data.
How do you define dynamically-allocated data?
-Andi
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 10:19 ` Andi Kleen
@ 2008-01-25 13:17 ` Keir Fraser
0 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2008-01-25 13:17 UTC (permalink / raw)
To: Andi Kleen
Cc: Nick Piggin, Jan Beulich, Jeremy Fitzhardinge, Ingo Molnar,
Harvey Harrison, Matt Mackall, Linux Kernel Mailing List
On 25/1/08 10:19, "Andi Kleen" <ak@suse.de> wrote:
>> Whether this is a problem in light of Xen spurious faults depends on whether
>> NMI handlers touch dynamically-allocated data.
>
> How do you define dynamically-allocated data?
Anything that could have been a read-only pte or ldt page in a previous life
with no intervening TLB flush. So get_free_page(), kmalloc(), vmalloc(), ...
Actually I think we are fine, now I think about it some more, because we
only clear the software NMI-in-flight flag if the guest executes IRET via
the hypervisor. Most Xen Linux guests only do IRET via the hypervisor when
the current context is an NMI handler (additionally x86_64 also does so when
returning to ring 3). Most importantly for this case, we will *not* IRET via
the hypervisor when returning from a #PF context nested in an NMI context.
Hence the NMI-in-flight flag will not be cleared, and guest virtual NMIs
will not nest. So that's a relief!
-- Keir
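
[ Editor's note: Keir's point is that Xen's virtual-NMI mask is a software
flag with different clearing rules than the hardware latch: it is dropped
only by an IRET that goes via the hypervisor, which a nested #PF return
does not. A toy model under that assumption, companion to the hardware
sketch above but standalone: ]

```c
#include <assert.h>
#include <stdbool.h>

/* Software NMI-in-flight flag, per the Xen behaviour Keir describes:
 * cleared only by an IRET that traps to the hypervisor. */
static bool nmi_in_flight;
static int  vnmi_depth;

static void vnmi_deliver(void)
{
	if (!nmi_in_flight) {
		nmi_in_flight = true;
		vnmi_depth++;
	}
}

static void iret_direct(void)     { /* nested #PF return: flag untouched */ }
static void iret_hypervisor(void) { nmi_in_flight = false; }

/* A spurious #PF inside the virtual NMI handler returns directly, so a
 * second virtual NMI still finds the flag set and cannot nest. */
static int vnmi_nest_depth(void)
{
	nmi_in_flight = false;
	vnmi_depth    = 0;

	vnmi_deliver();		/* first virtual NMI */
	iret_direct();		/* nested #PF returns without hypervisor */
	vnmi_deliver();		/* blocked: flag is still set */
	iret_hypervisor();	/* handler's own exit clears the flag */
	return vnmi_depth;
}
```

Unlike the hardware latch, the nested fault's return leaves the mask in
place, which is why the guest virtual NMIs do not nest.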
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
@ 2008-01-25 9:18 ` Jan Beulich
1 sibling, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2008-01-25 9:18 UTC (permalink / raw)
To: Nick Piggin
Cc: Keir Fraser, Ingo Molnar, Harvey Harrison, Jeremy Fitzhardinge,
Matt Mackall, Andi Kleen, Linux Kernel Mailing List
>>> Nick Piggin <nickpiggin@yahoo.com.au> 25.01.08 09:38 >>>
>On Friday 25 January 2008 19:15, Jan Beulich wrote:
>> Actually, another thought: permitting (and handling) spurious faults for
>> kernel mappings conflicts with NMI handling, i.e. great care would be
>> needed to ensure the NMI path cannot touch any such mapping. So
>> even the present Xen/Linux Dom0 implementation may have some
>> (perhaps unlikely) problems here, and it would get worse if we added
>> e.g. a virtual watchdog NMI (something I am considering, which would
>> then extend the problem to DomU-s).
>
>Can you explain how they conflict?
In the same way as vmalloc faults do (which is why vmalloc_sync_all()
got introduced): a page fault nested inside an NMI will, by virtue of
executing IRET, prematurely tell the processor that NMI handling is
done (and specifically unmask further NMIs).
Jan
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
@ 2008-01-25 15:30 ` Ingo Molnar
2008-01-25 15:54 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2008-01-25 15:30 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.
thanks, applied.
it would be nice to expose this ability of the architecture to the core
Linux kernel mprotect code as well, and let it skip on a TLB flush when
doing a RO->RW transition. It could speed up valgrind and the other
mprotect() users i guess? [and UML too perhaps]
Ingo
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 15:30 ` Ingo Molnar
@ 2008-01-25 15:54 ` Jeremy Fitzhardinge
2008-01-25 18:08 ` Ingo Molnar
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 15:54 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
Ingo Molnar wrote:
> thanks, applied.
>
> it would be nice to expose this ability of the architecture to the core
> Linux kernel mprotect code as well, and let it skip on a TLB flush when
> doing a RO->RW transition.
The usermode fault handler already effectively does this; this patch
just does it for kernel mode as well. I don't know if mprotect takes
advantage of this.
> It could speed up valgrind and the other
> mprotect() users i guess? [and UML too perhaps]
>
Not valgrind (it doesn't rely on mmap protections), but electric fence
perhaps.
J
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 15:54 ` Jeremy Fitzhardinge
@ 2008-01-25 18:08 ` Ingo Molnar
2008-01-25 18:39 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2008-01-25 18:08 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Ingo Molnar wrote:
>> thanks, applied.
>>
>> it would be nice to expose this ability of the architecture to the core
>> Linux kernel mprotect code as well, and let it skip on a TLB flush when
>> doing a RO->RW transition.
>
> The usermode fault handler already effectively does this; this patch
> just does it for kernel mode as well. I don't know if mprotect takes
> advantage of this.
spurious faults happen all the time on SMP, in the native kernel.
And what i mean is that Linux mprotect currently does not take advantage
of x86's ability to just change the ptes, because there's no structured
way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
The flush happens in mm/mprotect.c's change_protection() function:
flush_tlb_range(vma, start, end);
and that is unnecessary when we increase the protection rights, such as
in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
sure all the pte modifications are visible when the syscall returns.)
and it's a really rare case these days that you can find an area where
Linux does not make use of a hardware MMU feature - so we should fix
this ;-)
Ingo
* Re: [PATCH UPDATE] x86: ignore spurious faults
2008-01-25 18:08 ` Ingo Molnar
@ 2008-01-25 18:39 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-25 18:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Harvey Harrison, Linux Kernel Mailing List, Andi Kleen
Ingo Molnar wrote:
> spurious faults happen all the time on SMP, in the native kernel.
>
> And what i mean is that Linux mprotect currently does not take advantage
> of x86's ability to just change the ptes, because there's no structured
> way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
>
> The flush happens in mm/mprotect.c's change_protection() function:
>
> flush_tlb_range(vma, start, end);
>
> and that is unnecessary when we increase the protection rights, such as
> in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
> sure all the pte modifications are visible when the syscall returns.)
>
> and it's a really rare case these days that you can find an area where
> Linux does not make use of a hardware MMU feature - so we should fix
> this ;-)
Well, I guess this isn't really specific to x86; we could always
legitimately not do a tlb flush after increasing permissions and leave
the fault handler to clean up the mess where needed. But I don't think
that's necessarily much of a win; it's cheaper to just do the tlb flush
rather than take a spurious fault, unless the faults are very rare. If
someone is doing an mprotect on a piece of memory (esp to make it
writable), my guess is that they're going to touch that memory in the
very near future.
The big win for this patch is avoiding cross-cpu tlb invalidation when
changing kernel mappings. mprotect doesn't attempt to do that anyway,
and so can incur spurious faults on other CPUs.
J
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
@ 2008-01-24 6:49 ` Andi Kleen
2008-01-24 7:02 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2008-01-24 6:49 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows
vmalloc_fault already has nearly the same logic. You should look
at sharing the code.
-Andi
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 6:49 ` [PATCH] " Andi Kleen
@ 2008-01-24 7:02 ` Jeremy Fitzhardinge
2008-01-24 7:11 ` Andi Kleen
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-24 7:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
Andi Kleen wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>>
>
> vmalloc_fault already has nearly the same logic. You should look
> at sharing the code.
Hm, I see what you mean, but it's hard to see how to share much code
there. It's a case of "two lines common, one line different" repeated 4
times or so.
J
* Re: [PATCH] x86: ignore spurious faults
2008-01-24 7:02 ` Jeremy Fitzhardinge
@ 2008-01-24 7:11 ` Andi Kleen
0 siblings, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2008-01-24 7:11 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Linux Kernel Mailing List, Harvey Harrison
On Thursday 24 January 2008 08:02:11 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> >> /*
> >> + * Handle a spurious fault caused by a stale TLB entry. This allows
> >
> > vmalloc_fault already has nearly the same logic. You should look
> > at sharing the code.
>
> Hm, I see what you mean, but it's hard to see how to share much code
> there. It's a case of "two lines common, one line different" repeated 4
> times or so.
The core common logic is checking whether the fault agrees with the page
tables. If it does not, the two paths then do different things.
-Andi
Thread overview: 24+ messages
2008-01-24 0:05 [PATCH] x86: ignore spurious faults Jeremy Fitzhardinge
2008-01-24 0:18 ` Harvey Harrison
2008-01-24 0:26 ` Jeremy Fitzhardinge
2008-01-24 0:28 ` [PATCH UPDATE] " Jeremy Fitzhardinge
2008-01-24 19:14 ` Matt Mackall
2008-01-24 19:21 ` Jeremy Fitzhardinge
2008-01-24 23:41 ` Nick Piggin
2008-01-25 0:26 ` Jeremy Fitzhardinge
2008-01-25 7:36 ` Keir Fraser
2008-01-25 8:15 ` Jan Beulich
2008-01-25 8:38 ` Nick Piggin
2008-01-25 9:11 ` Andi Kleen
2008-01-25 9:18 ` Keir Fraser
2008-01-25 9:51 ` Andi Kleen
2008-01-25 10:19 ` Andi Kleen
2008-01-25 13:17 ` Keir Fraser
2008-01-25 9:18 ` Jan Beulich
2008-01-25 15:30 ` Ingo Molnar
2008-01-25 15:54 ` Jeremy Fitzhardinge
2008-01-25 18:08 ` Ingo Molnar
2008-01-25 18:39 ` Jeremy Fitzhardinge
2008-01-24 6:49 ` [PATCH] " Andi Kleen
2008-01-24 7:02 ` Jeremy Fitzhardinge
2008-01-24 7:11 ` Andi Kleen