stable.vger.kernel.org archive mirror
* [PATCH] MIPS: Make set_pte() SMP safe.
@ 2015-08-04  0:48 David Daney
  2015-08-04 19:15 ` Leonid Yegoshin
  2015-08-24  3:28 ` [PATCH] " Joshua Kinard
  0 siblings, 2 replies; 10+ messages in thread
From: David Daney @ 2015-08-04  0:48 UTC (permalink / raw)
  To: linux-mips, ralf; +Cc: David Daney, stable

From: David Daney <david.daney@cavium.com>

On MIPS the GLOBAL bit of the PTE must have the same value in any
aligned pair of PTEs.  These pairs of PTEs are referred to as
"buddies".  In an SMP system it is possible for two CPUs to be calling
set_pte() on adjacent PTEs at the same time.  There is a race between
setting the PTE and a different CPU setting the GLOBAL bit in its
buddy PTE.

This race can be observed when multiple CPUs are executing
vmap()/vfree() at the same time.

Make setting the buddy PTE's GLOBAL bit an atomic operation to close
the race condition.

The case of CONFIG_64BIT_PHYS_ADDR && CONFIG_CPU_MIPS32 is *not*
handled.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: <stable@vger.kernel.org>
---
 arch/mips/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 9d81067..ae85694 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -182,8 +182,39 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
 		 * Make sure the buddy is global too (if it's !none,
 		 * it better already be global)
 		 */
+#ifdef CONFIG_SMP
+		/*
+		 * For SMP, multiple CPUs can race, so we need to do
+		 * this atomically.
+		 */
+#ifdef CONFIG_64BIT
+#define LL_INSN "lld"
+#define SC_INSN "scd"
+#else /* CONFIG_32BIT */
+#define LL_INSN "ll"
+#define SC_INSN "sc"
+#endif
+		unsigned long page_global = _PAGE_GLOBAL;
+		unsigned long tmp;
+
+		__asm__ __volatile__ (
+			"	.set	push\n"
+			"	.set	noreorder\n"
+			"1:	" LL_INSN "	%[tmp], %[buddy]\n"
+			"	bnez	%[tmp], 2f\n"
+			"	 or	%[tmp], %[tmp], %[global]\n"
+			"	" SC_INSN "	%[tmp], %[buddy]\n"
+			"	beqz	%[tmp], 1b\n"
+			"	 nop\n"
+			"2:\n"
+			"	.set pop"
+			: [buddy] "+m" (buddy->pte),
+			  [tmp] "=&r" (tmp)
+			: [global] "r" (page_global));
+#else /* !CONFIG_SMP */
 		if (pte_none(*buddy))
 			pte_val(*buddy) = pte_val(*buddy) | _PAGE_GLOBAL;
+#endif /* CONFIG_SMP */
 	}
 #endif
 }
-- 
1.7.11.7



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04  0:48 [PATCH] MIPS: Make set_pte() SMP safe David Daney
@ 2015-08-04 19:15 ` Leonid Yegoshin
  2015-08-04 20:01   ` David Daney
  2015-08-24  3:28 ` [PATCH] " Joshua Kinard
  1 sibling, 1 reply; 10+ messages in thread
From: Leonid Yegoshin @ 2015-08-04 19:15 UTC (permalink / raw)
  To: David Daney, linux-mips, ralf; +Cc: David Daney, stable

David,

Did you observe this in a real system?
The function __get_vm_area_node() allocates a guard page unless the
VM_NO_GUARD flag is used, and I don't see any use of that flag in the
source.

In the past, vmap() allocated a guard page unconditionally.

- Leonid.



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 19:15 ` Leonid Yegoshin
@ 2015-08-04 20:01   ` David Daney
  2015-08-04 20:32     ` Leonid Yegoshin
  0 siblings, 1 reply; 10+ messages in thread
From: David Daney @ 2015-08-04 20:01 UTC (permalink / raw)
  To: Leonid Yegoshin; +Cc: David Daney, linux-mips, ralf, David Daney, stable

On 08/04/2015 12:15 PM, Leonid Yegoshin wrote:
> David,
>
> Did you observe this in real?

Yes.  It is not hypothetical.


> The function __get_vm_area_node() allocates a guard page if flag
> VM_NO_GUARD is not used and I don't see any use of it in source.
>
> In past vmap allocated a guard page even unconditionally.

It has nothing to do with guard pages per se.  The problem is if a vmap 
range (including guard page) ends on an even PFN.  The buddy code will 
clobber the PTE for PFN+1.  If another vmap operation is executing in 
set_pte() for the clobbered location, you can get corrupted page tables.


>
> - Leonid.
>



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:01   ` David Daney
@ 2015-08-04 20:32     ` Leonid Yegoshin
  2015-08-04 20:36       ` Leonid Yegoshin
  2015-08-04 20:48       ` David Daney
  0 siblings, 2 replies; 10+ messages in thread
From: Leonid Yegoshin @ 2015-08-04 20:32 UTC (permalink / raw)
  To: David Daney; +Cc: David Daney, linux-mips, ralf, David Daney, stable

David,

This is interesting, but I still don't understand the effect: if a
guard page is used, then two different vmap allocations can't share a
pair of buddy PTEs.

Yes, only one PTE of a buddy pair can be allocated and attached to a
VMA in that case, but the caller doesn't know about the additional
page, so two cases are possible.  Even map_vm_area() has no
information about the guard page.

(assume VMA1 has the lower address range and VMA2 the higher one):

a.  VMA1 (after adjustment) ends at an even PTE ==> the caller doesn't
use that PTE, and there is no collision in the last buddy pair, even
if VMA2 uses the odd PTE from that pair.
b.  VMA1 (after adjustment) ends at an odd PTE ==> again, this buddy
pair is used only by VMA1.  VMA2 starts from the next pair.

What is wrong here?

Is it possible that an access went bad and touched a page beyond the
requested size?
Is it possible that it was not vmap() but some other interface that
was used?

- Leonid.



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:32     ` Leonid Yegoshin
@ 2015-08-04 20:36       ` Leonid Yegoshin
  2015-08-04 20:38         ` David Daney
  2015-08-04 20:48       ` David Daney
  1 sibling, 1 reply; 10+ messages in thread
From: Leonid Yegoshin @ 2015-08-04 20:36 UTC (permalink / raw)
  To: David Daney; +Cc: David Daney, linux-mips, ralf, David Daney, stable

David,

Note: excuse me for bothering you, but if the guard page logic doesn't
work anymore, then it can be scrapped.  That would allow me to
optimize some cache flushes in the vmap area - instead of flushing the
full cache, just flush a range of pages.

- Leonid.



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:36       ` Leonid Yegoshin
@ 2015-08-04 20:38         ` David Daney
  2015-08-04 20:47           ` Leonid Yegoshin
  0 siblings, 1 reply; 10+ messages in thread
From: David Daney @ 2015-08-04 20:38 UTC (permalink / raw)
  To: Leonid Yegoshin; +Cc: David Daney, linux-mips, ralf, David Daney, stable

On 08/04/2015 01:36 PM, Leonid Yegoshin wrote:
> David,
>
> Note: Excuse me for bothering you but if guard page logic doesn't work
> anymore then it can be scraped. It allows me to optimize some cache
> flushes in VMAP area - instead of flushing a full cache, just flush a
> range of pages.
>

I don't know what the purpose of guard pages is.  I thought it was to 
produce an OOPs message if you accessed just beyond the end of the vmap 
area.  If people are using them for some other reason, then I am not 
aware of it.

David.



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:38         ` David Daney
@ 2015-08-04 20:47           ` Leonid Yegoshin
  0 siblings, 0 replies; 10+ messages in thread
From: Leonid Yegoshin @ 2015-08-04 20:47 UTC (permalink / raw)
  To: David Daney; +Cc: David Daney, linux-mips, ralf, David Daney, stable

On 08/04/2015 01:38 PM, David Daney wrote:
> I don't know what the purpose of guard pages is.  I thought it was to 
> produce an OOPs message if you accessed just beyond the end of the 
> vmap area.  If people are using them for some other reason, then I am 
> not aware of it.
>
>
Right, but if you have a buddy problem, then it means that in your
case the guard page IS USED... and doesn't produce an Oops.





* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:32     ` Leonid Yegoshin
  2015-08-04 20:36       ` Leonid Yegoshin
@ 2015-08-04 20:48       ` David Daney
  2015-08-04 20:58         ` Leonid Yegoshin
  1 sibling, 1 reply; 10+ messages in thread
From: David Daney @ 2015-08-04 20:48 UTC (permalink / raw)
  To: Leonid Yegoshin; +Cc: David Daney, linux-mips, ralf, David Daney, stable

On 08/04/2015 01:32 PM, Leonid Yegoshin wrote:
> David,
>
> It is interesting, I still don't understand the effect

I think the best way to think about it is to ignore vmap, and consider 
the semantics of set_pte().

When a thread calls set_pte() it must ensure that no other thread will 
crash using VA region covered by the PTE.  That is the contract of 
set_pte().

The MIPS set_pte() does something different.  In addition to setting the 
specified PTE, it has the side effect of clobbering another PTE (called 
the buddy).  There is nothing in the kernel that prevents another 
thread from using the buddy-PTE, and when that happens in the race 
window, the page tables are corrupted, and the system crashes.

The fix is to not clobber the buddy-PTE.

You can go around in circles all you want trying to indirectly avoid 
using the buddy-PTE from another thread, but I think it is best to make 
set_pte() have easily understood semantics (and semantics that match 
those of other architectures) and not clobber things in unexpected ways.

David Daney.


> - if guard page
> is used then two different VMAP allocations can't use two buddy PTEs.
>
> Yes, only one of buddy PTEs in that case can be allocated and attached
> to VMA but caller doesn't know about additional page and two cases are
> possible. Even map_vm_area has no any info about guard page.
>
> (assume VMA1 has low address range and VMA2 has higher address range):
>
> a.  VMA1 (after adjustment) ends at even PTE ==> caller doesn't use that
> PTE and there is no collision with last pair of buddy PTEs, even if VMA2
> uses odd PTE from that pair.
> b.  VMA1 (after adjustment) ends at odd PTE ==> again, this buddy pair
> is used only VMA1. Next VMA2 start from next pair.
>
> What is wrong here?
>
> Is it possible that access gone bad and touches a page beyond a
> requested size?
> Is it possible that it is not vmap() but some different interface was used?
>
> - Leonid.
>



* Re: MIPS: Make set_pte() SMP safe.
  2015-08-04 20:48       ` David Daney
@ 2015-08-04 20:58         ` Leonid Yegoshin
  0 siblings, 0 replies; 10+ messages in thread
From: Leonid Yegoshin @ 2015-08-04 20:58 UTC (permalink / raw)
  To: David Daney; +Cc: David Daney, linux-mips, ralf, David Daney, stable

David,

On 08/04/2015 01:48 PM, David Daney wrote:
> I think the best way to think about it is to ignore vmap, and consider 
> the semantics of set_pte().
> ...
> You can go around in circles all you want trying to indirectly avoid 
> using the buddy-PTE from another thread, but I think it is best to 
> make set_pte() have easily understood semantics (and semantics that 
> match those of other architectures) and not clobber things in 
> unexpected ways.

My primary interest here is not the semantics of set_pte() but the
follow-up to your finding, which I tried to test: if the guard page
logic doesn't work anymore (as I judge based on your observations),
then it brings back my old optimization in flush_cache_vmap(start,
end) and similar functions.  Right now it flushes the whole cache,
because if it tries to flush a guard page (and there IS such an
attempt in some mm/*.c code) it takes a TLB exception and I get a hard
lock in do_page_fault().

The issue is significant for some GPU/display drivers, which trigger
flushes of the VMALLOC area pretty often.


* Re: [PATCH] MIPS: Make set_pte() SMP safe.
  2015-08-04  0:48 [PATCH] MIPS: Make set_pte() SMP safe David Daney
  2015-08-04 19:15 ` Leonid Yegoshin
@ 2015-08-24  3:28 ` Joshua Kinard
  1 sibling, 0 replies; 10+ messages in thread
From: Joshua Kinard @ 2015-08-24  3:28 UTC (permalink / raw)
  To: David Daney, linux-mips, ralf; +Cc: David Daney, stable

On 08/03/2015 20:48, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
> 
> On MIPS the GLOBAL bit of the PTE must have the same value in any
> aligned pair of PTEs.  These pairs of PTEs are referred to as
> "buddies".  In an SMP system it is possible for two CPUs to be calling
> set_pte() on adjacent PTEs at the same time.  There is a race between
> setting the PTE and a different CPU setting the GLOBAL bit in its
> buddy PTE.
> 
> This race can be observed when multiple CPUs are executing
> vmap()/vfree() at the same time.

Curious, but what's the observed symptom when this race condition occurs?  I am
wondering if it may (or may not) explain the periodic hard lockups on IP27 SMP
that I see during heavy disk I/O.  On that platform, !CONFIG_SMP works fine.
It's only in SMP mode that I can reproduce the lockups, so I suspect it's a
race condition of some kind.


> Make setting the buddy PTE's GLOBAL bit an atomic operation to close
> the race condition.
> 
> The case of CONFIG_64BIT_PHYS_ADDR && CONFIG_CPU_MIPS32 is *not*
> handled.
> 
> Signed-off-by: David Daney <david.daney@cavium.com>
> Cc: <stable@vger.kernel.org>
> ---
>  arch/mips/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> index 9d81067..ae85694 100644
> --- a/arch/mips/include/asm/pgtable.h
> +++ b/arch/mips/include/asm/pgtable.h
> @@ -182,8 +182,39 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
>  		 * Make sure the buddy is global too (if it's !none,
>  		 * it better already be global)
>  		 */
> +#ifdef CONFIG_SMP
> +		/*
> +		 * For SMP, multiple CPUs can race, so we need to do
> +		 * this atomically.
> +		 */
> +#ifdef CONFIG_64BIT
> +#define LL_INSN "lld"
> +#define SC_INSN "scd"
> +#else /* CONFIG_32BIT */
> +#define LL_INSN "ll"
> +#define SC_INSN "sc"
> +#endif
> +		unsigned long page_global = _PAGE_GLOBAL;
> +		unsigned long tmp;
> +
> +		__asm__ __volatile__ (

If an R10000_LLSC_WAR case is needed here (see below), then shouldn't ".set
arch=r4000" go here, and '.set    "MIPS_ISA_ARCH_LEVEL"' go in the
!R10000_LLSC_WAR case?


> +			"	.set	push\n"
> +			"	.set	noreorder\n"
> +			"1:	" LL_INSN "	%[tmp], %[buddy]\n"
> +			"	bnez	%[tmp], 2f\n"
> +			"	 or	%[tmp], %[tmp], %[global]\n"
> +			"	" SC_INSN "	%[tmp], %[buddy]\n"

A quick look at asm/local.h and asm/bitops.h shows that they use "__LL" and
"__SC" instead of defining local variants.  Perhaps those macros are usable
here as well?


> +			"	beqz	%[tmp], 1b\n"

Does this 'beqz' insn need to have a case for R10000_LLSC_WAR, which uses
'beqzl', like several other areas in the kernel?



> +			"	 nop\n"
> +			"2:\n"
> +			"	.set pop"
> +			: [buddy] "+m" (buddy->pte),
> +			  [tmp] "=&r" (tmp)
> +			: [global] "r" (page_global));
> +#else /* !CONFIG_SMP */
>  		if (pte_none(*buddy))
>  			pte_val(*buddy) = pte_val(*buddy) | _PAGE_GLOBAL;
> +#endif /* CONFIG_SMP */
>  	}
>  #endif
>  }
> 

--J

