* [patch 1/2] read_barrier_depends fixlets
From: Nick Piggin @ 2008-05-05 11:20 UTC (permalink / raw)
To: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List, Paul McKenney
While considering the impact of read_barrier_depends, it occurred to
me that it should really be a no-op for the compiler. At least, it is
better to have every arch the same than to have a few that are slightly
different. (Does this mean SMP Alpha's read_barrier_depends could drop the
"memory" clobber too?)
--
It would be a highly unusual compiler that might try to issue a load of
data2 before it loads data1, when data2 is data-dependent on data1.
There is the problem of the compiler trying to reload data1 _after_
loading data2, and thus having a newer data1 than data2. However if the
compiler is so inclined, then it could perform such a load at any point
after the barrier, so the barrier itself will not guarantee correctness.
I think we've mostly hoped the compiler would not do that.
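To make that concrete, the pattern these barriers are about is the usual
publish/consume chain; a minimal sketch (made-up names, not code from this
patch):

	/* writer */
	p->data = 1;
	smp_wmb();			/* order the init before the publish */
	global_ptr = p;

	/* reader */
	q = global_ptr;
	smp_read_barrier_depends();	/* order the pointer load before the dependent load */
	d = q->data;			/* data-dependent on the load of global_ptr */

The reader's second load is data-dependent on its first, so the question here
is only whether the barrier has to constrain the compiler as well as the CPU.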
This brings alpha and frv into line with all other architectures.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/include/asm-alpha/barrier.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/barrier.h
+++ linux-2.6/include/asm-alpha/barrier.h
@@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() barrier()
+#define smp_read_barrier_depends() do { } while (0)
#endif
#define set_mb(var, value) \
Index: linux-2.6/include/asm-frv/system.h
===================================================================
--- linux-2.6.orig/include/asm-frv/system.h
+++ linux-2.6/include/asm-frv/system.h
@@ -179,7 +179,7 @@ do { \
#define mb() asm volatile ("membar" : : :"memory")
#define rmb() asm volatile ("membar" : : :"memory")
#define wmb() asm volatile ("membar" : : :"memory")
-#define read_barrier_depends() barrier()
+#define read_barrier_depends() do { } while (0)
#ifdef CONFIG_SMP
#define smp_mb() mb()
* [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Nick Piggin @ 2008-05-05 12:12 UTC (permalink / raw)
To: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List, Paul McKenney
I only converted x86 and powerpc. I think comments in x86 are good because
that is more or less the reference implementation and is where many VM
developers would look to understand mm/ code. Commenting all page table
walking in all other architectures is kind of beyond my skill or patience,
and maintainers might consider this weird "alpha thingy" to be beneath them ;)
But they are quite free to add smp_read_barrier_depends to their own code.
Still would like more acks on this before it is applied.
--
There is a possible data race in the page table walking code. After the split
ptlock patches, it actually seems to have been introduced to the core code, but
even before that I think it would have impacted some architectures (powerpc and
sparc64, at least, walk the page tables without taking locks eg. see
find_linux_pte()).
The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.
At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.
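As a rough timeline (simplified, not the exact call chains):

	CPU 0 (__pte_alloc)                     CPU 1 (lockless walker)

	new = pte_alloc_one(mm, address);
	/* page zeroed, page lock initialized */
	spin_lock(&mm->page_table_lock);
	pmd_populate(mm, pmd, new);
	                                        pmd_offset() sees the new pmd entry
	                                        spin_lock(pte_lockptr(mm, pmd)),
	                                        but the lock may not look initialized yet
	                                        pte_offset_map() then reads a pte that
	                                        may not have been zeroed yet

The smp_wmb() on the allocation side, paired with smp_read_barrier_depends()
between the walker's dependent loads, is what closes this window.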
There seem to be similar races in higher levels of the page tables; they
obviously don't involve the spinlock, but one could still see uninitialized memory.
Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems (regardless of whether
we have split ptes or not).
Fortunately, on x86 (except OOSTORE), nothing needs to be done, because stores
are in order, and so are loads.
I prefer to put the barriers in core code, because that's where the higher
level logic happens, but the page table accessors are per-arch, and I don't
think open-coding them everywhere is an option.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h
+++ linux-2.6/include/asm-x86/pgtable_32.h
@@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
* pgd_offset() returns a (pgd_t *)
* pgd_index() is used get the offset into the pgd page's array of pgd_t's;
*/
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
+#define pgd_offset(mm, address) \
+({ \
+ pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
/*
* a shortcut which implies the use of the kernel's pgd, instead
@@ -160,8 +165,12 @@ static inline int pud_large(pud_t pud) {
*/
#define pte_index(address) \
(((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
-#define pte_offset_kernel(dir, address) \
- ((pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)))
+#define pte_offset_kernel(dir, address) \
+({ \
+ pte_t *ret = (pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
#define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
@@ -170,16 +179,32 @@ static inline int pud_large(pud_t pud) {
#if defined(CONFIG_HIGHPTE)
#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
- pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pte_offset_map_nested(dir, address) \
- ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
- pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
#define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1)
#else
#define pte_offset_map(dir, address) \
- ((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)page_address(pmd_page(*(dir))) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pte_offset_map_nested(dir, address) pte_offset_map((dir), (address))
#define pte_unmap(pte) do { } while (0)
#define pte_unmap_nested(pte) do { } while (0)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -311,6 +311,37 @@ int __pte_alloc(struct mm_struct *mm, pm
if (!new)
return -ENOMEM;
+ /*
+ * Ensure all pte setup (eg. pte page lock and page clearing) are
+ * visible before the pte is made visible to other CPUs by being
+ * put into page tables.
+ *
+ * The other side of the story is the pointer chasing in the page
+ * table walking code (when walking the page table without locking;
+ * ie. most of the time). Fortunately, these data accesses consist
+ * of a chain of data-dependent loads, meaning most CPUs (alpha
+ * being the notable exception) will already guarantee loads are
+ * seen in-order. x86 has a "reference" implementation of
+ * smp_read_barrier_depends() barriers in its page table walking
+ * code, even though that barrier is a simple noop on that architecture.
+ * Alpha obviously also has the required barriers.
+ *
+ * It is debatable whether or not the smp_read_barrier_depends()
+ * barriers are required for kernel page tables; it could be that
+ * nowhere in the kernel do we walk those pagetables without taking
+ * init_mm's page_table_lock. However, such walks are pretty uncommon,
+ * and the only architecture that is even slightly impacted is
+ * alpha, so barriers are there to be safe. The smp_wmb()'s also may
+ * not be required in the allocation side of kernel page tables,
+ * because it is probably a bug for a thread to concurrently be
+ * accessing kva that is being set up at the same time -- however it
+ * is nice to have the wmb barriers there, because they might prevent
+ * us from reading some junk in that case. So we will get a simple
+ * page fault in the case of such a bug, rather than a possible
+ * undetected wander off into crazy space.
+ */
+ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+
spin_lock(&mm->page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
mm->nr_ptes++;
@@ -329,6 +360,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&init_mm.page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
@@ -2616,6 +2649,8 @@ int __pud_alloc(struct mm_struct *mm, pg
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
if (pgd_present(*pgd)) /* Another has populated it */
pud_free(mm, new);
@@ -2637,6 +2672,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
if (pud_present(*pud)) /* Another has populated it */
Index: linux-2.6/include/asm-x86/pgtable-3level.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable-3level.h
+++ linux-2.6/include/asm-x86/pgtable-3level.h
@@ -126,8 +126,13 @@ static inline void pud_clear(pud_t *pudp
/* Find an entry in the second-level page table.. */
-#define pmd_offset(pud, address) ((pmd_t *)pud_page(*(pud)) + \
- pmd_index(address))
+#define pmd_offset(pud, address) \
+({ \
+ pmd_t *ret = ((pmd_t *)pud_page(*(pud)) + \
+ pmd_index(address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
#ifdef CONFIG_SMP
static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h
+++ linux-2.6/include/asm-x86/pgtable_64.h
@@ -193,8 +193,20 @@ static inline unsigned long pmd_bad(pmd_
((unsigned long)__va((unsigned long)pgd_val((pgd)) & PTE_MASK))
#define pgd_page(pgd) (pfn_to_page(pgd_val((pgd)) >> PAGE_SHIFT))
#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
-#define pgd_offset_k(address) (init_level4_pgt + pgd_index((address)))
+#define pgd_offset(mm, address) \
+({ \
+ pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
+#define pgd_offset_k(address) \
+({ \
+ pgd_t *ret = (init_level4_pgt + pgd_index((address))); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
static inline int pgd_large(pgd_t pgd) { return 0; }
#define mk_kernel_pgd(address) ((pgd_t){ (address) | _KERNPG_TABLE })
@@ -206,7 +218,12 @@ static inline int pgd_large(pgd_t pgd) {
#define pud_page(pud) (pfn_to_page(pud_val((pud)) >> PAGE_SHIFT))
#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
#define pud_offset(pgd, address) \
- ((pud_t *)pgd_page_vaddr(*(pgd)) + pud_index((address)))
+({ \
+ pud_t *ret = ((pud_t *)pgd_page_vaddr(*(pgd)) + pud_index((address))); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pud_present(pud) (pud_val((pud)) & _PAGE_PRESENT)
static inline int pud_large(pud_t pte)
@@ -220,8 +237,14 @@ static inline int pud_large(pud_t pte)
#define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
#define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
-#define pmd_offset(dir, address) ((pmd_t *)pud_page_vaddr(*(dir)) + \
- pmd_index(address))
+#define pmd_offset(dir, address) \
+({ \
+ pmd_t *ret = ((pmd_t *)pud_page_vaddr(*(dir)) + \
+ pmd_index(address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
+
#define pmd_none(x) (!pmd_val((x)))
#define pmd_present(x) (pmd_val((x)) & _PAGE_PRESENT)
#define pfn_pmd(nr, prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val((prot))))
@@ -238,8 +261,13 @@ static inline int pud_large(pud_t pte)
#define mk_pte(page, pgprot) pfn_pte(page_to_pfn((page)), (pgprot))
#define pte_index(address) (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
-#define pte_offset_kernel(dir, address) ((pte_t *) pmd_page_vaddr(*(dir)) + \
- pte_index((address)))
+#define pte_offset_kernel(dir, address) \
+({ \
+ pte_t *ret = ((pte_t *) pmd_page_vaddr(*(dir)) + \
+ pte_index((address))); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
/* x86-64 always has all page tables mapped. */
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
Index: linux-2.6/include/asm-alpha/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/pgtable.h
+++ linux-2.6/include/asm-alpha/pgtable.h
@@ -285,19 +285,28 @@ extern inline pte_t pte_mkspecial(pte_t
/* to find an entry in a page-table-directory. */
#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
-#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
+#define pgd_offset(mm, address) \
+({ \
+ pgd_t *ret = ((mm)->pgd+pgd_index(address)); \
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
+ ret; \
+})
/* Find an entry in the second-level page table.. */
extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
{
- return (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+ pmd_t *ret = (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */
+ return ret;
}
/* Find an entry in the third-level page table.. */
extern inline pte_t * pte_offset_kernel(pmd_t * dir, unsigned long address)
{
- return (pte_t *) pmd_page_vaddr(*dir)
+ pte_t *ret = (pte_t *) pmd_page_vaddr(*dir)
+ ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
+ smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */
+ return ret;
}
#define pte_offset_map(dir,addr) pte_offset_kernel((dir),(addr))
* Re: [patch 1/2] read_barrier_depends fixlets
From: Paul E. McKenney @ 2008-05-05 14:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Mon, May 05, 2008 at 01:20:21PM +0200, Nick Piggin wrote:
> While considering the impact of read_barrier_depends, it occurred to
> me that it should really be a no-op for the compiler. At least, it is
> better to have every arch the same than to have a few that are slightly
> different. (Does this mean SMP Alpha's read_barrier_depends could drop the
> "memory" clobber too?)
SMP Alpha's read_barrier_depends() needs the "memory" clobber
because the compiler is otherwise free to move code across the
smp_read_barrier_depends(), which would defeat its purpose.
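For reference, the SMP case on alpha boils down to something like

	#define read_barrier_depends()	__asm__ __volatile__("mb" : : : "memory")

(visible in the hunk context above), so the "memory" clobber is what makes it
a compiler barrier in addition to a CPU barrier; dropping it would let the
compiler hoist or sink memory accesses across the mb.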
> --
> It would be a highly unusual compiler that might try to issue a load of
> data2 before it loads data1, when data2 is data-dependent on data1.
A bit unusual, perhaps, but not unprecedented. Value speculating
compilers, for example.
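Roughly the sort of transformation I have in mind, as illustrative pseudo-C
(not something any particular compiler is known to do to this code):

	q = global_ptr;
	d = q->data;

becoming

	guess = <predicted value of global_ptr>;
	d = guess->data;		/* dependent load issued early, from a guess */
	q = global_ptr;
	if (q != guess)
		d = q->data;		/* fix up on misprediction */

which destroys the data dependency that the hardware would otherwise respect.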
> There is the problem of the compiler trying to reload data1 _after_
> loading data2, and thus having a newer data1 than data2. However if the
> compiler is so inclined, then it could perform such a load at any point
> after the barrier, so the barrier itself will not guarantee correctness.
>
> I think we've mostly hoped the compiler would not do that.
Well, this does point me at one thing I missed with preemptable RCU,
namely all the open-coded sequences using smp_read_barrier_depends().
Quite embarrassing!!! But a lot easier having you point me at it than
however long it would have taken me to figure it out on my own, so thank
you very much!!!
> This brings alpha and frv into line with all other architectures.
Assuming that we apply ACCESS_ONCE() as needed to the uses of
smp_read_barrier_depends():
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> Index: linux-2.6/include/asm-alpha/barrier.h
> ===================================================================
> --- linux-2.6.orig/include/asm-alpha/barrier.h
> +++ linux-2.6/include/asm-alpha/barrier.h
> @@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> #define smp_wmb() barrier()
> -#define smp_read_barrier_depends() barrier()
> +#define smp_read_barrier_depends() do { } while (0)
> #endif
>
> #define set_mb(var, value) \
> Index: linux-2.6/include/asm-frv/system.h
> ===================================================================
> --- linux-2.6.orig/include/asm-frv/system.h
> +++ linux-2.6/include/asm-frv/system.h
> @@ -179,7 +179,7 @@ do { \
> #define mb() asm volatile ("membar" : : :"memory")
> #define rmb() asm volatile ("membar" : : :"memory")
> #define wmb() asm volatile ("membar" : : :"memory")
> -#define read_barrier_depends() barrier()
> +#define read_barrier_depends() do { } while (0)
>
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Paul E. McKenney @ 2008-05-05 14:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Mon, May 05, 2008 at 02:12:40PM +0200, Nick Piggin wrote:
> I only converted x86 and powerpc. I think comments in x86 are good because
> that is more or less the reference implementation and is where many VM
> developers would look to understand mm/ code. Commenting all page table
> walking in all other architectures is kind of beyond my skill or patience,
> and maintainers might consider this weird "alpha thingy" to be beneath them ;)
> But they are quite free to add smp_read_barrier_depends to their own code.
>
> Still would like more acks on this before it is applied.
Need ACCESS_ONCE(), as called out below. Note that without this, the
compiler would be within its rights to refetch the pointer, which could
cause later dereferences to be using different versions of the structure.
I suspect that there are not many pieces of code anticipating that sort
of abuse...
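As a sketch of the hazard (hypothetical caller, not an actual call chain):

	pgd_t *pgd = pgd_offset(mm, address);
	x = *pgd;
	...
	y = *pgd;

Instead of keeping the computed pointer, the compiler may rematerialize
(mm)->pgd + pgd_index(address) at the second use, refetching mm->pgd, so the
two dereferences need not even use the same pointer value. ACCESS_ONCE(x) is
essentially (*(volatile typeof(x) *)&(x)), which forces a single load of the
value it wraps.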
Thanx, Paul
> --
>
> There is a possible data race in the page table walking code. After the split
> ptlock patches, it actually seems to have been introduced to the core code, but
> even before that I think it would have impacted some architectures (powerpc and
> sparc64, at least, walk the page tables without taking locks eg. see
> find_linux_pte()).
>
> The race is as follows:
> The pte page is allocated, zeroed, and its struct page gets its spinlock
> initialized. The mm-wide ptl is then taken, and then the pte page is inserted
> into the pagetables.
>
> At this point, the spinlock is not guaranteed to have ordered the previous
> stores to initialize the pte page with the subsequent store to put it in the
> page tables. So another Linux page table walker might be walking down (without
> any locks, because we have split-leaf-ptls), and find that new pte we've
> inserted. It might try to take the spinlock before the store from the other
> CPU initializes it. And subsequently it might read a pte_t out before stores
> from the other CPU have cleared the memory.
>
> There seem to be similar races in higher levels of the page tables; they
> obviously don't involve the spinlock, but one could still see uninitialized memory.
>
> Arch code and hardware pagetable walkers that walk the pagetables without
> locks could see similar uninitialized memory problems (regardless of whether
> we have split ptes or not).
>
> Fortunately, on x86 (except OOSTORE), nothing needs to be done, because stores
> are in order, and so are loads.
>
> I prefer to put the barriers in core code, because that's where the higher
> level logic happens, but the page table accessors are per-arch, and I don't
> think open-coding them everywhere is an option.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> Index: linux-2.6/include/asm-x86/pgtable_32.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> +++ linux-2.6/include/asm-x86/pgtable_32.h
> @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> * pgd_offset() returns a (pgd_t *)
> * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> */
> -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> +#define pgd_offset(mm, address) \
> +({ \
> + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
+ pgd_t *ret = (ACCESS_ONCE((mm)->pgd) + pgd_index((address))); \
Without this change, the compiler could refetch mm->pgd.
It is not clear to me why we double-nest the parentheses around "address".
Also might want to either make this an inline function or use a
more-obscure name than "ret".
Similar changes are needed for the rest of these.
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
>
> /*
> * a shortcut which implies the use of the kernel's pgd, instead
> @@ -160,8 +165,12 @@ static inline int pud_large(pud_t pud) {
> */
> #define pte_index(address) \
> (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
> -#define pte_offset_kernel(dir, address) \
> - ((pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)))
> +#define pte_offset_kernel(dir, address) \
> +({ \
> + pte_t *ret = (pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
>
> #define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
>
> @@ -170,16 +179,32 @@ static inline int pud_large(pud_t pud) {
>
> #if defined(CONFIG_HIGHPTE)
> #define pte_offset_map(dir, address) \
> - ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
> - pte_index((address)))
> +({ \
> + pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pte_offset_map_nested(dir, address) \
> - ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
> - pte_index((address)))
> +({ \
> + pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
> #define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1)
> #else
> #define pte_offset_map(dir, address) \
> - ((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))
> +({ \
> + pte_t *ret = (pte_t *)page_address(pmd_page(*(dir))) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pte_offset_map_nested(dir, address) pte_offset_map((dir), (address))
> #define pte_unmap(pte) do { } while (0)
> #define pte_unmap_nested(pte) do { } while (0)
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -311,6 +311,37 @@ int __pte_alloc(struct mm_struct *mm, pm
> if (!new)
> return -ENOMEM;
>
> + /*
> + * Ensure all pte setup (eg. pte page lock and page clearing) are
> + * visible before the pte is made visible to other CPUs by being
> + * put into page tables.
> + *
> + * The other side of the story is the pointer chasing in the page
> + * table walking code (when walking the page table without locking;
> + * ie. most of the time). Fortunately, these data accesses consist
> + * of a chain of data-dependent loads, meaning most CPUs (alpha
> + * being the notable exception) will already guarantee loads are
> + * seen in-order. x86 has a "reference" implementation of
> + * smp_read_barrier_depends() barriers in its page table walking
> + * code, even though that barrier is a simple noop on that architecture.
> + * Alpha obviously also has the required barriers.
> + *
> + * It is debatable whether or not the smp_read_barrier_depends()
> + * barriers are required for kernel page tables; it could be that
> + * nowhere in the kernel do we walk those pagetables without taking
> + * init_mm's page_table_lock. However, such walks are pretty uncommon,
> + * and the only architecture that is even slightly impacted is
> + * alpha, so barriers are there to be safe. The smp_wmb()'s also may
> + * not be required in the allocation side of kernel page tables,
> + * because it is probably a bug for a thread to concurrently be
> + * accessing kva that is being set up at the same time -- however it
> + * is nice to have the wmb barriers there, because they might prevent
> + * us from reading some junk in that case. So we will get a simple
> + * page fault in the case of such a bug, rather than a possible
> + * undetected wander off into crazy space.
> + */
> + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> +
> spin_lock(&mm->page_table_lock);
> if (!pmd_present(*pmd)) { /* Has another populated it ? */
> mm->nr_ptes++;
> @@ -329,6 +360,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&init_mm.page_table_lock);
> if (!pmd_present(*pmd)) { /* Has another populated it ? */
> pmd_populate_kernel(&init_mm, pmd, new);
> @@ -2616,6 +2649,8 @@ int __pud_alloc(struct mm_struct *mm, pg
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&mm->page_table_lock);
> if (pgd_present(*pgd)) /* Another has populated it */
> pud_free(mm, new);
> @@ -2637,6 +2672,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&mm->page_table_lock);
> #ifndef __ARCH_HAS_4LEVEL_HACK
> if (pud_present(*pud)) /* Another has populated it */
> Index: linux-2.6/include/asm-x86/pgtable-3level.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable-3level.h
> +++ linux-2.6/include/asm-x86/pgtable-3level.h
> @@ -126,8 +126,13 @@ static inline void pud_clear(pud_t *pudp
>
>
> /* Find an entry in the second-level page table.. */
> -#define pmd_offset(pud, address) ((pmd_t *)pud_page(*(pud)) + \
> - pmd_index(address))
> +#define pmd_offset(pud, address) \
> +({ \
> + pmd_t *ret = ((pmd_t *)pud_page(*(pud)) + \
> + pmd_index(address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
>
> #ifdef CONFIG_SMP
> static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
> Index: linux-2.6/include/asm-x86/pgtable_64.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable_64.h
> +++ linux-2.6/include/asm-x86/pgtable_64.h
> @@ -193,8 +193,20 @@ static inline unsigned long pmd_bad(pmd_
> ((unsigned long)__va((unsigned long)pgd_val((pgd)) & PTE_MASK))
> #define pgd_page(pgd) (pfn_to_page(pgd_val((pgd)) >> PAGE_SHIFT))
> #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
> -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> -#define pgd_offset_k(address) (init_level4_pgt + pgd_index((address)))
> +#define pgd_offset(mm, address) \
> +({ \
> + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> +#define pgd_offset_k(address) \
> +({ \
> + pgd_t *ret = (init_level4_pgt + pgd_index((address))); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
> static inline int pgd_large(pgd_t pgd) { return 0; }
> #define mk_kernel_pgd(address) ((pgd_t){ (address) | _KERNPG_TABLE })
> @@ -206,7 +218,12 @@ static inline int pgd_large(pgd_t pgd) {
> #define pud_page(pud) (pfn_to_page(pud_val((pud)) >> PAGE_SHIFT))
> #define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> #define pud_offset(pgd, address) \
> - ((pud_t *)pgd_page_vaddr(*(pgd)) + pud_index((address)))
> +({ \
> + pud_t *ret = ((pud_t *)pgd_page_vaddr(*(pgd)) + pud_index((address))); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pud_present(pud) (pud_val((pud)) & _PAGE_PRESENT)
>
> static inline int pud_large(pud_t pte)
> @@ -220,8 +237,14 @@ static inline int pud_large(pud_t pte)
> #define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
>
> #define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> -#define pmd_offset(dir, address) ((pmd_t *)pud_page_vaddr(*(dir)) + \
> - pmd_index(address))
> +#define pmd_offset(dir, address) \
> +({ \
> + pmd_t *ret = ((pmd_t *)pud_page_vaddr(*(dir)) + \
> + pmd_index(address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
> +
> #define pmd_none(x) (!pmd_val((x)))
> #define pmd_present(x) (pmd_val((x)) & _PAGE_PRESENT)
> #define pfn_pmd(nr, prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val((prot))))
> @@ -238,8 +261,13 @@ static inline int pud_large(pud_t pte)
> #define mk_pte(page, pgprot) pfn_pte(page_to_pfn((page)), (pgprot))
>
> #define pte_index(address) (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
> -#define pte_offset_kernel(dir, address) ((pte_t *) pmd_page_vaddr(*(dir)) + \
> - pte_index((address)))
> +#define pte_offset_kernel(dir, address) \
> +({ \
> + pte_t *ret = ((pte_t *) pmd_page_vaddr(*(dir)) + \
> + pte_index((address))); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
>
> /* x86-64 always has all page tables mapped. */
> #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
> Index: linux-2.6/include/asm-alpha/pgtable.h
> ===================================================================
> --- linux-2.6.orig/include/asm-alpha/pgtable.h
> +++ linux-2.6/include/asm-alpha/pgtable.h
> @@ -285,19 +285,28 @@ extern inline pte_t pte_mkspecial(pte_t
>
> /* to find an entry in a page-table-directory. */
> #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
> -#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
> +#define pgd_offset(mm, address) \
> +({ \
> + pgd_t *ret = ((mm)->pgd+pgd_index(address)); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
>
> /* Find an entry in the second-level page table.. */
> extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
> {
> - return (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
> + pmd_t *ret = (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */
> + return ret;
> }
>
> /* Find an entry in the third-level page table.. */
> extern inline pte_t * pte_offset_kernel(pmd_t * dir, unsigned long address)
> {
> - return (pte_t *) pmd_page_vaddr(*dir)
> + pte_t *ret = (pte_t *) pmd_page_vaddr(*dir)
> + ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */
> + return ret;
> }
>
> #define pte_offset_map(dir,addr) pte_offset_kernel((dir),(addr))
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Linus Torvalds @ 2008-05-05 15:32 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Mon, 5 May 2008, Nick Piggin wrote:
>
> Index: linux-2.6/include/asm-x86/pgtable_32.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> +++ linux-2.6/include/asm-x86/pgtable_32.h
> @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> * pgd_offset() returns a (pgd_t *)
> * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> */
> -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> +#define pgd_offset(mm, address) \
> +({ \
> + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
> + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> + ret; \
> +})
Is there some fundamental reason this needs to be a macro?
It is really ugly, and it would be much nicer to make this an inline
function if at all possible.
Yeah, maybe it requires some more #include's, but ..
(Especially since it apparently gets worse, and the pgd load needs an
ACCESS_ONCE() too - the code generated is the same, but the source gets
more and more involved)
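Something like this, just as a sketch of the shape (ignoring the extra
includes it would need):

	static inline pgd_t *pgd_offset(struct mm_struct *mm, unsigned long address)
	{
		pgd_t *pgd = mm->pgd + pgd_index(address);

		smp_read_barrier_depends();	/* see mm/memory.c:__pte_alloc */
		return pgd;
	}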
That said, I *also* think that it's sad that you do this at all, since
smp_read_barrier_depends() is a no-op on x86, so why should we have it in
an x86-specific header file?
In short, I think the fixes are real, but the patch itself is really just
confusing things for no apparent good reason.
Linus
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Hugh Dickins @ 2008-05-05 16:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, linux-arch, Linux Memory Management List,
Paul McKenney
On Mon, 5 May 2008, Linus Torvalds wrote:
> On Mon, 5 May 2008, Nick Piggin wrote:
> >
> > Index: linux-2.6/include/asm-x86/pgtable_32.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> > +++ linux-2.6/include/asm-x86/pgtable_32.h
> > @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> > * pgd_offset() returns a (pgd_t *)
> > * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> > */
> > -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> > +#define pgd_offset(mm, address) \
> > +({ \
> > + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
> > + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> > + ret; \
> > +})
>
> Is there some fundamental reason this needs to be a macro?
>
> It is really ugly, and it would be much nicer to make this an inline
> function if at all possible.
>
> Yeah, maybe it requires some more #include's, but ..
My recollection is that it gets difficult once you reach
pte_offset_map(), which involved kmap_atomic() (now kmap_atomic_pte()),
which gets twisty. Or perhaps my problems arose from doing
pte_offset_map_lock() in the generic linux/mm.h: harder to
herd that safely through the different arch builds.
Hugh
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Hugh Dickins @ 2008-05-05 16:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Paul McKenney
On Mon, 5 May 2008, Nick Piggin wrote:
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -311,6 +311,37 @@ int __pte_alloc(struct mm_struct *mm, pm
> if (!new)
> return -ENOMEM;
>
> + /*
> + * Ensure all pte setup (eg. pte page lock and page clearing) are
> + * visible before the pte is made visible to other CPUs by being
> + * put into page tables.
> + *
> + * The other side of the story is the pointer chasing in the page
> + * table walking code (when walking the page table without locking;
> + * ie. most of the time). Fortunately, these data accesses consist
> + * of a chain of data-dependent loads, meaning most CPUs (alpha
> + * being the notable exception) will already guarantee loads are
> + * seen in-order. x86 has a "reference" implementation of
> + * smp_read_barrier_depends() barriers in its page table walking
> + * code, even though that barrier is a simple noop on that architecture.
> + * Alpha obviously also has the required barriers.
> + *
> + * It is debatable whether or not the smp_read_barrier_depends()
> + * barriers are required for kernel page tables; it could be that
> + * nowhere in the kernel do we walk those pagetables without taking
> + * init_mm's page_table_lock.
Just delete "; it could be that ... init_mm's page_table_lock":
in general (if not everywhere) the architectures do not nowadays
take init_mm's page_table_lock to walk down those pagetables (blame
me for removing it, if anyone thinks that was wrong: I stand by it).
Hugh
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: David Miller @ 2008-05-06 7:08 UTC (permalink / raw)
To: npiggin; +Cc: torvalds, hugh, linux-arch, linux-mm, paulmck
> I only converted x86 and powerpc. I think comments in x86 are good because
> that is more or less the reference implementation and is where many VM
> developers would look to understand mm/ code. Commenting all page table
> walking in all other architectures is kind of beyond my skill or patience,
> and maintainers might consider this weird "alpha thingy" to be beneath them ;)
> But they are quite free to add smp_read_barrier_depends to their own code.
>
> Still would like more acks on this before it is applied.
I've read this over a few times, I think it's OK:
Acked-by: David S. Miller <davem@davemloft.net>
* Re: [patch 1/2] read_barrier_depends fixlets
From: Nick Piggin @ 2008-05-06 9:01 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Mon, May 05, 2008 at 07:27:46AM -0700, Paul E. McKenney wrote:
> On Mon, May 05, 2008 at 01:20:21PM +0200, Nick Piggin wrote:
> > While considering the impact of read_barrier_depends, it occurred to
> > me that it should really be a no-op for the compiler. At least, it is
> > better to have every arch the same than to have a few that are slightly
> > different. (Does this mean SMP Alpha's read_barrier_depends could drop the
> > "memory" clobber too?)
>
> SMP Alpha's read_barrier_depends() needs the "memory" clobber
> because the compiler is otherwise free to move code across the
> smp_read_barrier_depends(), which would defeat its purpose.
Oh that's what does it. I was thinking of volatile, but I guess that is
to prevent the statement from being eliminated.
> > It would be a highly unusual compiler that might try to issue a load of
> > data2 before it loads data1, when data2 is data-dependent on data1.
>
> A bit unusual, perhaps, but not unprecedented. Value speculating
> compilers, for example.
Yes, very true. Actually I guess it may not even be far off if we ever
used gcc's __builtin_expect for predicting data rather than control values.
OTOH, would it help significantly over simply prefetching and then having
the compiler issue the (non-speculative) loads in the correct order? You
would avoid speculation and fixup code in the generated code that way.
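ie. for a list walk, a sketch of the kind of thing I mean (using the existing
prefetch() hint rather than any speculative architectural load):

	for (p = head; p; p = p->next) {
		prefetch(p->next);	/* start the likely next cache miss early */
		do_something(p);	/* the real loads stay in dependency order */
	}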
> > There is the problem of the compiler trying to reload data1 _after_
> > loading data2, and thus having a newer data1 than data2. However if the
> > compiler is so inclined, then it could perform such a load at any point
> > after the barrier, so the barrier itself will not guarantee correctness.
> >
> > I think we've mostly hoped the compiler would not do that.
>
> Well, this does point me at one thing I missed with preemptable RCU,
> namely all the open-coded sequences using smp_read_barrier_depends().
> Quite embarrassing!!! But a lot easier having you point me at it than
> however long it would have taken me to figure it out on my own, so thank
> you very much!!!
Heh, glad to be of help ;)
> > This brings alpha and frv into line with all other architectures.
>
> Assuming that we apply ACCESS_ONCE() as needed to the uses of
> smp_read_barrier_depends():
Hmm, more on this in the next mail... (but I think it is important to
bring other archs into line with the common case, even if the common
case may have some issues that need sorting out).
>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> >
> > Index: linux-2.6/include/asm-alpha/barrier.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-alpha/barrier.h
> > +++ linux-2.6/include/asm-alpha/barrier.h
> > @@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
> > #define smp_mb() barrier()
> > #define smp_rmb() barrier()
> > #define smp_wmb() barrier()
> > -#define smp_read_barrier_depends() barrier()
> > +#define smp_read_barrier_depends() do { } while (0)
> > #endif
> >
> > #define set_mb(var, value) \
> > Index: linux-2.6/include/asm-frv/system.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-frv/system.h
> > +++ linux-2.6/include/asm-frv/system.h
> > @@ -179,7 +179,7 @@ do { \
> > #define mb() asm volatile ("membar" : : :"memory")
> > #define rmb() asm volatile ("membar" : : :"memory")
> > #define wmb() asm volatile ("membar" : : :"memory")
> > -#define read_barrier_depends() barrier()
> > +#define read_barrier_depends() do { } while (0)
> >
> > #ifdef CONFIG_SMP
> > #define smp_mb() mb()
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Nick Piggin @ 2008-05-06 9:38 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Mon, May 05, 2008 at 07:35:47AM -0700, Paul E. McKenney wrote:
> On Mon, May 05, 2008 at 02:12:40PM +0200, Nick Piggin wrote:
> > I only converted x86 and powerpc. I think comments in x86 are good because
> > that is more or less the reference implementation and is where many VM
> > developers would look to understand mm/ code. Commenting all page table
> > walking in all other architectures is kind of beyond my skill or patience,
> > and maintainers might consider this weird "alpha thingy" to be beneath them ;)
> > But they are quite free to add smp_read_barrier_depends to their own code.
> >
> > Still would like more acks on this before it is applied.
>
> Need ACCESS_ONCE(), as called out below. Note that without this, the
> compiler would be within its rights to refetch the pointer, which could
> cause later dereferences to be using different versions of the structure.
> I suspect that there are not many pieces of code anticipating that sort
> of abuse...
I'm wondering about this... the problem does not only exist in
memory ordering situations, but also simply when a single loaded
value is used many times.
I'd be slightly worried about requiring this of threaded code. Even
the regular memory ordering bugs we have in core mm code are kind of
annoying (and it is by no means just this current bug).
Is it such an improvement to refetch a pointer versus spilling to stack?
Can we just ask gcc for a -multithreading-for-dummies mode?
> > --
> >
> > There is a possible data race in the page table walking code. After the split
> > ptlock patches, it actually seems to have been introduced to the core code, but
> > even before that I think it would have impacted some architectures (powerpc and
> > sparc64, at least, walk the page tables without taking locks eg. see
> > find_linux_pte()).
> >
> > The race is as follows:
> > The pte page is allocated, zeroed, and its struct page gets its spinlock
> > initialized. The mm-wide ptl is then taken, and then the pte page is inserted
> > into the pagetables.
> >
> > At this point, the spinlock is not guaranteed to have ordered the previous
> > stores to initialize the pte page with the subsequent store to put it in the
> > page tables. So another Linux page table walker might be walking down (without
> > any locks, because we have split-leaf-ptls), and find that new pte we've
> > inserted. It might try to take the spinlock before the store from the other
> > CPU initializes it. And subsequently it might read a pte_t out before stores
> > from the other CPU have cleared the memory.
> >
> > There seem to be similar races in higher levels of the page tables; they
> > obviously don't involve the spinlock, but one could still see uninitialized memory.
> >
> > Arch code and hardware pagetable walkers that walk the pagetables without
> > locks could see similar uninitialized memory problems (regardless of whether
> > we have split ptes or not).
> >
> > Fortunately, on x86 (except OOSTORE), nothing needs to be done, because stores
> > are in order, and so are loads.
> >
> > I prefer to put the barriers in core code, because that's where the higher
> > level logic happens, but the page table accessors are per-arch, and I don't
> > think open-coding them everywhere is an option.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> >
> > Index: linux-2.6/include/asm-x86/pgtable_32.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> > +++ linux-2.6/include/asm-x86/pgtable_32.h
> > @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> > * pgd_offset() returns a (pgd_t *)
> > * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> > */
> > -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> > +#define pgd_offset(mm, address) \
> > +({ \
> > + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
>
> + pgd_t *ret = (ACCESS_ONCE((mm)->pgd) + pgd_index((address))); \
>
> Without this change, the compiler could refetch mm->pgd.
You mean:

	ret = pgd_offset();
	pgd = *ret;
	x = pgd;
	y = pgd;

x might != y because pgd might have been recalculated?
In that case it isn't really an ordering issue between two variables,
but an issue within a single variable. And I'm not exactly sure we want
to go down the path of trying to handle this. At least it probably belongs
in a different patch.
> It is not clear to me why we double-nest the parentheses around "address".
Typo maybe. I didn't want to make unnecessary changes.
> Also might want to either make this an inline function or use a
> more-obscure name than "ret".
Hmm, inlines might be difficult as Hugh points out. I'll add a couple of
underscores though ;)
> Similar changes are needed for the rest of these.
Thanks. I'll try to improve them.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Nick Piggin @ 2008-05-06 9:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Mon, May 05, 2008 at 08:32:30AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 5 May 2008, Nick Piggin wrote:
> >
> > Index: linux-2.6/include/asm-x86/pgtable_32.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> > +++ linux-2.6/include/asm-x86/pgtable_32.h
> > @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> > * pgd_offset() returns a (pgd_t *)
> > * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> > */
> > -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> > +#define pgd_offset(mm, address) \
> > +({ \
> > + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
> > + smp_read_barrier_depends(); /* see mm/memory.c:__pte_alloc */ \
> > + ret; \
> > +})
>
> Is there some fundamental reason this needs to be a macro?
>
> It is really ugly, and it would be much nicer to make this an inline
> function if at all possible.
>
> Yeah, maybe it requires some more #include's, but ..
>
> (Especially since it apparently gets worse, and the pgd load needs an
> ACCESS_ONCE() too - the code generated is the same, but the source gets
> more and more involved)
Hmm, I remember trying this a while back (though not for this exact
patch) and running into include dependency issues. Seems like Hugh did as well.
And include dependency problems are not trivial to test for, so I didn't
want to introduce bugs with the fix.
> That said, I *also* think that it's sad that you do this at all, since
> smp_read_barrier_depends() is a no-op on x86, so why should we have it in
> an x86-specific header file?
>
> In short, I think the fixes are real, but the patch itself is really just
> confusing things for no apparent good reason.
Right. As the comment says, the x86 stuff is kind of a "reference"
implementation, although if you prefer that it isn't there, then I can
easily just make it alpha-only.
The x86 code (and all other archs) would, I guess, still need the ACCESS_ONCE
modifications, if we agree that this pointer reloading issue is one that
must be handled in our C code... I don't know if I really like that idea.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
From: Nick Piggin @ 2008-05-06 9:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Paul McKenney
On Mon, May 05, 2008 at 05:57:20PM +0100, Hugh Dickins wrote:
> On Mon, 5 May 2008, Nick Piggin wrote:
> > Index: linux-2.6/mm/memory.c
> > ===================================================================
> > --- linux-2.6.orig/mm/memory.c
> > +++ linux-2.6/mm/memory.c
> > @@ -311,6 +311,37 @@ int __pte_alloc(struct mm_struct *mm, pm
> > if (!new)
> > return -ENOMEM;
> >
> > + /*
> > + * Ensure all pte setup (eg. pte page lock and page clearing) are
> > + * visible before the pte is made visible to other CPUs by being
> > + * put into page tables.
> > + *
> > + * The other side of the story is the pointer chasing in the page
> > + * table walking code (when walking the page table without locking;
> > + * ie. most of the time). Fortunately, these data accesses consist
> > + * of a chain of data-dependent loads, meaning most CPUs (alpha
> > + * being the notable exception) will already guarantee loads are
> > + * seen in-order. x86 has a "reference" implementation of
> > + * smp_read_barrier_depends() barriers in its page table walking
> > + * code, even though that barrier is a simple noop on that architecture.
> > + * Alpha obviously also has the required barriers.
> > + *
> > + * It is debatable whether or not the smp_read_barrier_depends()
> > + * barriers are required for kernel page tables; it could be that
> > + * nowhere in the kernel do we walk those pagetables without taking
> > + * init_mm's page_table_lock.
>
> Just delete "; it could be that ... init_mm's page_table_lock":
> in general (if not everywhere) the architectures do not nowadays
> take init_mm's page_table_lock to walk down those pagetables (blame
> me for removing it, if anyone thinks that was wrong: I stand by it).
OK. That's fine IMO and I'll remove that line.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 7:08 ` David Miller, Nick Piggin
@ 2008-05-06 9:56 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-06 9:56 UTC (permalink / raw)
To: David Miller; +Cc: torvalds, hugh, linux-arch, linux-mm, paulmck
On Tue, May 06, 2008 at 12:08:03AM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Mon, 5 May 2008 14:12:40 +0200
>
> > I only converted x86 and powerpc. I think comments in x86 are good because
> > that is more or less the reference implementation and is where many VM
> > developers would look to understand mm/ code. Commenting all page table
> > walking in all other architectures is kind of beyond my skill or patience,
> > and maintainers might consider this weird "alpha thingy" is below them ;)
> > But they are quite free to add smp_read_barrier_depends to their own code.
> >
> > Still would like more acks on this before it is applied.
>
> I've read this over a few times, I think it's OK:
>
> Acked-by: David S. Miller <davem@davemloft.net>
Thanks a lot for that (and the others who reviewed). Gives me more
confidence.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 9:38 ` Nick Piggin
@ 2008-05-06 13:32 ` Paul E. McKenney
2008-05-13 7:55 ` Nick Piggin
0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-06 13:32 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Tue, May 06, 2008 at 11:38:24AM +0200, Nick Piggin wrote:
> On Mon, May 05, 2008 at 07:35:47AM -0700, Paul E. McKenney wrote:
> > On Mon, May 05, 2008 at 02:12:40PM +0200, Nick Piggin wrote:
> > > I only converted x86 and powerpc. I think comments in x86 are good because
> > > that is more or less the reference implementation and is where many VM
> > > developers would look to understand mm/ code. Commenting all page table
> > > walking in all other architectures is kind of beyond my skill or patience,
> > > and maintainers might consider this weird "alpha thingy" is below them ;)
> > > But they are quite free to add smp_read_barrier_depends to their own code.
> > >
> > > Still would like more acks on this before it is applied.
> >
> > Need ACCESS_ONCE(), as called out below. Note that without this, the
> > compiler would be within its rights to refetch the pointer, which could
> > cause later dereferences to be using different versions of the structure.
> > I suspect that there are not many pieces of code anticipating that sort
> > of abuse...
>
> I'm wondering about this... and the problem does not only exist in
> memory ordering situations, but also just when using a single loaded
> value many times.
>
> I'd be slightly worried about requiring this of threaded code. Even
> the regular memory ordering bugs we have in core mm code are kind of
> annoying (and it is by no means just this current bug).
>
> Is it such an improvement to refetch a pointer versus spilling to stack?
> Can we just ask gcc for a -multithreading-for-dummies mode?
I have thus far not been successful on this one in the general case.
It would be nice to be able to tell gcc that you really mean it when
you assign to a local variable...
> > > --
> > >
> > > There is a possible data race in the page table walking code. After the split
> > > ptlock patches, it actually seems to have been introduced to the core code, but
> > > even before that I think it would have impacted some architectures (powerpc and
> > > sparc64, at least, walk the page tables without taking locks eg. see
> > > find_linux_pte()).
> > >
> > > The race is as follows:
> > > The pte page is allocated, zeroed, and its struct page gets its spinlock
> > > initialized. The mm-wide ptl is then taken, and then the pte page is inserted
> > > into the pagetables.
> > >
> > > At this point, the spinlock is not guaranteed to have ordered the previous
> > > stores to initialize the pte page with the subsequent store to put it in the
> > > page tables. So another Linux page table walker might be walking down (without
> > > any locks, because we have split-leaf-ptls), and find that new pte we've
> > > inserted. It might try to take the spinlock before the store from the other
> > > CPU initializes it. And subsequently it might read a pte_t out before stores
> > > from the other CPU have cleared the memory.
> > >
> > > There seem to be similar races in higher levels of the page tables, but they
> > > obviously don't involve the spinlock, but one could see uninitialized memory.
> > >
> > > Arch code and hardware pagetable walkers that walk the pagetables without
> > > locks could see similar uninitialized memory problems (regardless of whether
> > > we have split ptes or not).
> > >
> > > Fortunately, on x86 (except OOSTORE), nothing needs to be done, because stores
> > > are in order, and so are loads.
> > >
> > > I prefer to put the barriers in core code, because that's where the higher
> > > level logic happens, but the page table accessors are per-arch, and open-coding
> > > them everywhere I don't think is an option.
> > >
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > >
> > > Index: linux-2.6/include/asm-x86/pgtable_32.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> > > +++ linux-2.6/include/asm-x86/pgtable_32.h
> > > @@ -133,7 +133,12 @@ extern int pmd_bad(pmd_t pmd);
> > > * pgd_offset() returns a (pgd_t *)
> > > * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
> > > */
> > > -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
> > > +#define pgd_offset(mm, address) \
> > > +({ \
> > > + pgd_t *ret = ((mm)->pgd + pgd_index((address))); \
> >
> > + pgd_t *ret = (ACCESS_ONCE((mm)->pgd) + pgd_index((address))); \
> >
> > Without this change, the compiler could refetch mm->pgd.
>
> You mean:
> ret = pgd_offset();
> pgd = *ret;
> x = pgd;
> y = pgd;
> x might != y because pgd might have been recalculated?
The example would need to be more complex, but in general, yes.
Something like the following:
ret = pgd_offset();
pgd = *ret;
x = pgd;
/* code with lots of register pressure. */
y = pgd;
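The ACCESS_ONCE() being asked for pins that first load; a rough sketch (the
macro definition is the kernel's volatile cast; mm_like, walk() and the index
type are illustrative stand-ins, not real kernel types):

    #include <stddef.h>

    /* the kernel's ACCESS_ONCE(): a volatile cast forcing exactly one access */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    struct mm_like { unsigned long *pgd; };         /* stand-in for mm_struct */

    unsigned long *walk(struct mm_like *mm, size_t index)
    {
            unsigned long *dir = ACCESS_ONCE(mm->pgd);  /* one load of mm->pgd, not refetchable */
            /* ... code with lots of register pressure ... */
            return dir + index;                         /* still derived from that single load */
    }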
> In that case it isn't really an ordering issue between two variables,
> but an issue within a single variable. And I'm not exactly sure we want
> to go down the path of trying to handle this. At least it probably belongs
> in a different patch.
Well, I have seen this sort of thing in real life with gcc, so I can't say
that I agree... I was quite surprised the first time around!
> > It is not clear to me why we double-nest the parentheses around "address".
>
> Typo maybe. I didn't want to make unnecessary changes.
Fair enough.
> > Also might want to either make this an inline function or use a
> > more-obscure name than "ret".
>
> Hmm, inlines might be difficult as Hugh points out. I'll add a couple of
> underscores though ;)
Works for me!!!
> > Similar changes are needed for the rest of these.
>
> Thanks. I'll try to improve them.
Sounds good!
Thanx, Paul
* Re: [patch 1/2] read_barrier_depends fixlets
2008-05-06 9:01 ` Nick Piggin
@ 2008-05-06 14:06 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-06 14:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Tue, May 06, 2008 at 11:01:56AM +0200, Nick Piggin wrote:
> On Mon, May 05, 2008 at 07:27:46AM -0700, Paul E. McKenney wrote:
> > On Mon, May 05, 2008 at 01:20:21PM +0200, Nick Piggin wrote:
> > > While considering the impact of read_barrier_depends, it occurred to
> > > me that it should really be a noop for the compiler. At least, it is
> > > better to have every arch the same than to have a few that are slightly
> > > different. (Does this mean SMP Alpha's read_barrier_depends could drop the
> > > "memory" clobber too?)
> >
> > SMP Alpha's read_barrier_depends() needs the "memory" clobber
> > because the compiler is otherwise free to move code across the
> > smp_read_barrier_depends(), which would defeat its purpose.
>
> Oh that's what does it. I was thinking of volatile, but I guess that is
> to prevent the statement from being eliminated.
Yep!!!
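For reference, the three flavours being contrasted here (the first two are,
roughly, the existing kernel definitions; the third is what the patch turns the
UP-alpha/frv/other-arch case into):

    /* pure compiler barrier: no instruction, but volatile + "memory" clobber
     * stop the compiler from moving, merging or eliminating accesses across it */
    #define barrier()                   __asm__ __volatile__("" : : : "memory")

    /* SMP alpha read_barrier_depends(): a real mb instruction, with the same
     * compiler-level constraints */
    #define read_barrier_depends()      __asm__ __volatile__("mb" : : : "memory")

    /* everywhere else (and UP alpha/frv after this patch): nothing at all */
    #define smp_read_barrier_depends()  do { } while (0)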
> > > It would be a highly unusual compiler that might try to issue a load of
> > > data1 before it loads a data2 which is data-dependent on data1.
> >
> > A bit unusual, perhaps, but not unprecedented. Value speculating
> > compilers, for example.
>
> Yes, very true. Actually I guess it may not even be far off if we ever
> used gcc's builtin_expect for predicting data rather than control values.
> OTOH, would it help significantly over simply prefetching and then having
> the compiler issue the (non-speculative) loads in the correct order? You
> would avoid speculation and fixup code in the generated code that way.
You would still need to control the ordering in the case of failed
speculation -- so there would need to be some additional built-ins to
handle this. Might take some time...
> > > There is the problem of the compiler trying to reload data1 _after_
> > > loading data2, and thus having a newer data1 than data2. However if the
> > > compiler is so inclined, then it could perform such a load at any point
> > > after the barrier, so the barrier itself will not guarantee correctness.
> > >
> > > I think we've mostly hoped the compiler would not do that.
> >
> > Well, this does point me at one thing I missed with preemptable RCU,
> > namely all the open-coded sequences using smp_read_barrier_depends().
> > Quite embarrassing!!! But a lot easier having you point me at it than
> > however long it would have taken me to figure it out on my own, so thank
> > you very much!!!
>
> Heh, glad to be of help ;)
And the first couple I have looked at seem to need some help...
> > > This brings alpha and frv into line with all other architectures.
> >
> > Assuming that we apply ACCESS_ONCE() as needed to the uses of
> > smp_read_barrier_depends():
>
> Hmm, more on this in the next mail... (but I think it is important to
> bring other archs into line with the common case, even if the common
> case may have some issues that need sorting out).
I am not hung up on the order that the patches happen, as long as they
all happen. ;-)
Thanx, Paul
> > Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > >
> > > Index: linux-2.6/include/asm-alpha/barrier.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/asm-alpha/barrier.h
> > > +++ linux-2.6/include/asm-alpha/barrier.h
> > > @@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
> > > #define smp_mb() barrier()
> > > #define smp_rmb() barrier()
> > > #define smp_wmb() barrier()
> > > -#define smp_read_barrier_depends() barrier()
> > > +#define smp_read_barrier_depends() do { } while (0)
> > > #endif
> > >
> > > #define set_mb(var, value) \
> > > Index: linux-2.6/include/asm-frv/system.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/asm-frv/system.h
> > > +++ linux-2.6/include/asm-frv/system.h
> > > @@ -179,7 +179,7 @@ do { \
> > > #define mb() asm volatile ("membar" : : :"memory")
> > > #define rmb() asm volatile ("membar" : : :"memory")
> > > #define wmb() asm volatile ("membar" : : :"memory")
> > > -#define read_barrier_depends() barrier()
> > > +#define read_barrier_depends() do { } while (0)
> > >
> > > #ifdef CONFIG_SMP
> > > #define smp_mb() mb()
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 9:51 ` Nick Piggin
@ 2008-05-06 14:53 ` Linus Torvalds
2008-05-06 19:11 ` Paul E. McKenney
2008-05-13 8:01 ` Nick Piggin
0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2008-05-06 14:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Tue, 6 May 2008, Nick Piggin wrote:
>
> Right. As the comment says, the x86 stuff is kind of a "reference"
> implementation, although if you prefer it isn't there, then I can
> easily just make it alpha only.
If there really was a point in teaching people about
"read_barrier_depends()", I'd agree that it's probably good to have it as
a reference in the x86 implementation.
But since alpha is the only one that needs it, and is likely to remain so,
it's not like we ever want to copy that code to anything else, and it
really is better to make it alpha-only if the code is so much uglier.
Maybe just a comment?
As to the ACCESS_ONCE() thing, thinking about it some more, I doubt it
really matters. We're never going to change pgd anyway, so who cares if we
access it once or a hundred times?
Linus
* Re: [patch 1/2] read_barrier_depends fixlets
2008-05-05 11:20 [patch 1/2] read_barrier_depends fixlets Nick Piggin
2008-05-05 12:12 ` [patch 2/2] fix SMP data race in pagetable setup vs walking Nick Piggin
2008-05-05 14:27 ` [patch 1/2] read_barrier_depends fixlets Paul E. McKenney
@ 2008-05-06 15:29 ` David Howells
2008-05-06 19:09 ` Paul E. McKenney
2008-05-13 8:05 ` Nick Piggin
2 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2008-05-06 15:29 UTC (permalink / raw)
To: Nick Piggin
Cc: dhowells, Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List, Paul McKenney
Nick Piggin <npiggin@suse.de> wrote:
> While considering the impact of read_barrier_depends, it occurred to
> me that it should really be a noop for the compiler.
If you're defining it so, then you need to adjust memory-barriers.txt too.
========================
EXPLICIT KERNEL BARRIERS
========================
...
CPU MEMORY BARRIERS
-------------------
The Linux kernel has eight basic CPU memory barriers:
TYPE MANDATORY SMP CONDITIONAL
=============== ======================= ===========================
GENERAL mb() smp_mb()
WRITE wmb() smp_wmb()
READ rmb() smp_rmb()
DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
All CPU memory barriers unconditionally imply compiler barriers.
That last line needs modification, perhaps to say:
General, read and write memory barriers unconditionally imply general
compiler barriers; data dependency barriers, however, imply a barrier
only for the specific access being performed due to the fact that the
instructions must be performed in a specific order.
David
* Re: [patch 1/2] read_barrier_depends fixlets
2008-05-06 15:29 ` David Howells
@ 2008-05-06 19:09 ` Paul E. McKenney
2008-05-13 8:05 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-06 19:09 UTC (permalink / raw)
To: David Howells
Cc: Nick Piggin, Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
> Nick Piggin <npiggin@suse.de> wrote:
>
> > While considering the impact of read_barrier_depends, it occurred to
> > me that it should really be a noop for the compiler.
>
> If you're defining it so, then you need to adjust memory-barriers.txt too.
>
> ========================
> EXPLICIT KERNEL BARRIERS
> ========================
> ...
> CPU MEMORY BARRIERS
> -------------------
>
> The Linux kernel has eight basic CPU memory barriers:
>
> TYPE MANDATORY SMP CONDITIONAL
> =============== ======================= ===========================
> GENERAL mb() smp_mb()
> WRITE wmb() smp_wmb()
> READ rmb() smp_rmb()
> DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
>
>
> All CPU memory barriers unconditionally imply compiler barriers.
>
> That last line needs modification, perhaps to say:
>
> General, read and write memory barriers unconditionally imply general
> compiler barriers; data dependency barriers, however, imply a barrier
> only for the specific access being performed due to the fact that the
> instructions must be performed in a specific order.
And to make sure the compiler preserves the ordering, you also need
the ACCESS_ONCE() in the general case.
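Put together, the pattern this wording plus the ACCESS_ONCE() requirement
describe looks like the sketch below (kernel headers assumed for ACCESS_ONCE()
and smp_read_barrier_depends(); struct foo and gp are illustrative, and gp is
assumed to have been published non-NULL):

    struct foo { int a; };
    struct foo *gp;

    int reader(void)
    {
            struct foo *q = ACCESS_ONCE(gp);    /* compiler: exactly one load of gp */
            smp_read_barrier_depends();         /* CPU (alpha): *q ordered after the load of gp */
            return q->a;                        /* dependent load sees initialized contents */
    }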
Thanx, Paul
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 14:53 ` Linus Torvalds
@ 2008-05-06 19:11 ` Paul E. McKenney
2008-05-14 4:27 ` Nick Piggin
2008-05-13 8:01 ` Nick Piggin
1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-06 19:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, linux-arch,
Linux Memory Management List
On Tue, May 06, 2008 at 07:53:23AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 6 May 2008, Nick Piggin wrote:
> >
> > Right. As the comment says, the x86 stuff is kind of a "reference"
> > implementation, although if you prefer it isn't there, then I can
> > easily just make it alpha only.
>
> If there really was a point in teaching people about
> "read_barrier_depends()", I'd agree that it's probably good to have it as
> a reference in the x86 implementation.
>
> But since alpha is the only one that needs it, and is likely to remain so,
> it's not like we ever want to copy that code to anything else, and it
> really is better to make it alpha-only if the code is so much uglier.
>
> Maybe just a comment?
>
> As to the ACCESS_ONCE() thing, thinking about it some more, I doubt it
> really matters. We're never going to change pgd anyway, so who cares if we
> access it once or a hundred times?
If we are never going to change mm->pgd, then why do we need the
smp_read_barrier_depends()? Is this handling the initialization
case or some such?
Thanx, Paul
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 13:32 ` Paul E. McKenney
@ 2008-05-13 7:55 ` Nick Piggin
2008-05-13 13:26 ` Paul E. McKenney
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-05-13 7:55 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
Sorry for the delay, was busy or away from keyboard for various reasons...
On Tue, May 06, 2008 at 06:32:24AM -0700, Paul E. McKenney wrote:
> On Tue, May 06, 2008 at 11:38:24AM +0200, Nick Piggin wrote:
> >
> > I'm wondering about this... and the problem does not only exist in
> > memory ordering situations, but also just when using a single loaded
> > value many times.
> >
> > I'd be slightly worried about requiring this of threaded code. Even
> > the regular memory ordering bugs we have in core mm code are kind of
> > annoying (and it is by no means just this current bug).
> >
> > Is it such an improvement to refetch a pointer versus spilling to stack?
> > Can we just ask gcc for a -multithreading-for-dummies mode?
>
> I have thus far not been successful on this one in the general case.
> It would be nice to be able to tell gcc that you really mean it when
> you assign to a local variable...
Yes, exactly...
> > In that case it isn't really an ordering issue between two variables,
> > but an issue within a single variable. And I'm not exactly sure we want
> > to go down the path of trying to handle this. At least it probably belongs
> > in a different patch.
>
> Well, I have seen this sort of thing in real life with gcc, so I can't say
> that I agree... I was quite surprised the first time around!
I didn't intend to suggest that you are incorrect, or that ACCESS_ONCE
is not technically required for correctness. But I question whether it
is better to try fixing this throughout our source code, or in gcc's.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 14:53 ` Linus Torvalds
2008-05-06 19:11 ` Paul E. McKenney
@ 2008-05-13 8:01 ` Nick Piggin
2008-05-13 15:45 ` Linus Torvalds
1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-05-13 8:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Tue, May 06, 2008 at 07:53:23AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 6 May 2008, Nick Piggin wrote:
> >
> > Right. As the comment says, the x86 stuff is kind of a "reference"
> > implementation, although if you prefer it isn't there, then I can
> > easily just make it alpha only.
>
> If there really was a point in teaching people about
> "read_barrier_depends()", I'd agree that it's probably good to have it as
> a reference in the x86 implementation.
>
> But since alpha is the only one that needs it, and is likely to remain so,
> it's not like we ever want to copy that code to anything else, and it
> really is better to make it alpha-only if the code is so much uglier.
No, *everyone* (except arch-only non-alpha developer) needs to know about
it.
x86 especially is a reference and often is a proving ground for code that
becomes generic, so I'd say even x86 developers should need to know about
it too.
> Maybe just a comment?
At the end of the day I don't care that much. I'm surprised you do,
but I'll do whatever it takes to get merged ;)
> As to the ACCESS_ONCE() thing, thinking about it some more, I doubt it
> really matters. We're never going to change pgd anyway, so who cares if we
> access it once or a hundred times?
I will just re-review that I have my pointer-following sequence correct;
it could be that I have one too many... but anyway it is needed for lower
levels I guess (as a general pattern -- in the actual case of pagetable
walking, I don't think it matters anywhere if a pointer gets refetched
after being dereferenced).
* Re: [patch 1/2] read_barrier_depends fixlets
2008-05-06 15:29 ` David Howells
2008-05-06 19:09 ` Paul E. McKenney
@ 2008-05-13 8:05 ` Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-13 8:05 UTC (permalink / raw)
To: David Howells
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List, Paul McKenney
On Tue, May 06, 2008 at 04:29:13PM +0100, David Howells wrote:
> Nick Piggin <npiggin@suse.de> wrote:
>
> > While considering the impact of read_barrier_depends, it occurred to
> > me that it should really be a noop for the compiler.
>
> If you're defining it so, then you need to adjust memory-barriers.txt too.
_I'm_ not defining it so; it has always been defined so (code speaks
louder than words). So this document has always been wrong.
>
> ========================
> EXPLICIT KERNEL BARRIERS
> ========================
> ...
> CPU MEMORY BARRIERS
> -------------------
>
> The Linux kernel has eight basic CPU memory barriers:
>
> TYPE MANDATORY SMP CONDITIONAL
> =============== ======================= ===========================
> GENERAL mb() smp_mb()
> WRITE wmb() smp_wmb()
> READ rmb() smp_rmb()
> DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
>
>
> All CPU memory barriers unconditionally imply compiler barriers.
>
> That last line needs modification, perhaps to say:
>
> General, read and write memory barriers unconditionally imply general
> compiler barriers; data dependency barriers, however, imply a barrier
> only for the specific access being performed due to the fact that the
> instructions must be performed in a specific order.
>
> David
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-13 7:55 ` Nick Piggin
@ 2008-05-13 13:26 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-13 13:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Tue, May 13, 2008 at 09:55:32AM +0200, Nick Piggin wrote:
> Sorry for the delay, was busy or away from keyboard for various reasons...
>
> On Tue, May 06, 2008 at 06:32:24AM -0700, Paul E. McKenney wrote:
> > On Tue, May 06, 2008 at 11:38:24AM +0200, Nick Piggin wrote:
> > >
> > > I'm wondering about this... and the problem does not only exist in
> > > memory ordering situations, but also just when using a single loaded
> > > value many times.
> > >
> > > I'd be slightly worried about requiring this of threaded code. Even
> > > the regular memory ordering bugs we have in core mm code are kind of
> > > annoying (and it is by no means just this current bug).
> > >
> > > Is it such an improvement to refetch a pointer versus spilling to stack?
> > > Can we just ask gcc for a -multithreading-for-dummies mode?
> >
> > I have thus far not been successful on this one in the general case.
> > It would be nice to be able to tell gcc that you really mean it when
> > you assign to a local variable...
>
> Yes, exactly...
>
> > > In that case it isn't really an ordering issue between two variables,
> > > but an issue within a single variable. And I'm not exactly sure we want
> > > to go down the path of trying to handle this. At least it probably belongs
> > > in a different patch.
> >
> > Well, I have seen this sort of thing in real life with gcc, so I can't say
> > that I agree... I was quite surprised the first time around!
>
> I didn't intend to suggest that you are incorrect, or that ACCESS_ONCE
> is not technically required for correctness. But I question whether it
> is better to try fixing this throughout our source code, or in gcc's.
And it turns out that there are some features being proposed for the
upcoming c++0x standard that would have this effect. "Relaxed access to
atomic variables" covers the "ACCESS_ONCE" piece, and is in the current
working draft. We are also proposing something that captures the ordering
constraints of rcu_dereference(), which prohibits the compiler from doing
things like value speculation based on future dereferences of the variable
in question ("dependency ordering"). This has been through several
stages of review, and hopefully will get into the working draft soon.
Those sufficiently masochistic to dig through standardese might be
interested in:
http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2556.html
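The kernel-side pattern whose ordering requirement that proposal tries to
capture is rcu_dereference(); a small sketch (linux/rcupdate.h assumed; struct
foo, gp and reader() are illustrative):

    struct foo { int a; };
    struct foo *gp;         /* writers publish with rcu_assign_pointer(gp, new) */

    int reader(void)
    {
            struct foo *p;
            int val = 0;

            rcu_read_lock();
            p = rcu_dereference(gp);    /* pinned load + smp_read_barrier_depends() */
            if (p)
                    val = p->a;         /* dependent load, ordered after the load of gp */
            rcu_read_unlock();
            return val;
    }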
Thanx, Paul
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-13 8:01 ` Nick Piggin
@ 2008-05-13 15:45 ` Linus Torvalds
2008-05-14 0:34 ` Nick Piggin
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-05-13 15:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Tue, 13 May 2008, Nick Piggin wrote:
>
> No, *everyone* (except arch-only non-alpha developer) needs to know about
> it.
Umm. In architecture files, by definition, only alpha needs to know about
it.
That was very much an architecture-specific file: we're talking about
asm-x86/pgtable_32.h here.
> x86 especially is a reference and often is a proving ground for code that
> becomes generic, so I'd say even x86 developers should need to know about
> it too.
And in reference files that are architecture-specific, there is absolutely
*no point* in ever having read_barrier_depends(). Because even if another
architecture copies it, it's better off without it.
Linus
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-13 15:45 ` Linus Torvalds
@ 2008-05-14 0:34 ` Nick Piggin
2008-05-14 0:55 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-05-14 0:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Tue, May 13, 2008 at 08:45:51AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 13 May 2008, Nick Piggin wrote:
> >
> > No, *everyone* (except arch-only non-alpha developer) needs to know about
> > it.
>
> Umm. In architecture files, by definition, only alpha needs to know about
> it.
>
> That was very much an architecture-specific file: we're talking about
> asm-x86/pgtable_32.h here.
>
> > x86 especially is a reference and often is a proving ground for code that
> > becomes generic, so I'd say even x86 developers should need to know about
> > it too.
>
> And in reference files that are architecture-specific, there is absolutely
> *no point* in ever having read_barrier_depends(). Because even if another
> architecture copies it, it's better off without it.
Uh, I don't follow your logic. The "reference" Linux memory model
requires it, so I don't see how you can justify saying it is wrong
just because a *specific* architecture doesn't need it.
I think that regardless of whether it is required or not, it is good
to have in order to prompt the reader to think about memory ordering.
I also think it is a good idea to use smp_rmb/smp_wmb in x86-only
code even though those are noops there too.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-14 0:34 ` Nick Piggin
@ 2008-05-14 0:55 ` Linus Torvalds
2008-05-14 1:18 ` Nick Piggin
2008-05-14 4:35 ` [patch 1/2] read_barrier_depends arch fixlets Nick Piggin
0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2008-05-14 0:55 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Wed, 14 May 2008, Nick Piggin wrote:
>
> Uh, I don't follow your logic. The "reference" Linux memory model
> requires it, so I don't see how you can justify saying it is wrong
> just because a *specific* architecture doesn't need it.
You're thinking about it the wrong way.
NO specific architecture requires it except for alpha, and alpha already
has it.
Nobody else is *ever* likely to want it ever again.
In other words, it's not a "reference model". It's an "alpha hack". We do
not want to copy it in code that doesn't need or want it.
And that's especially true when it's not needed at all, and adding it just
makes a really simple macro much more complex and totally unreadable.
If it was about adding something to a function that was already a real
function, it would be different.
If you could write it as a nice inline function, it would be different.
But when that alpha hack turns a regular (simple) #define into a thing of
horror, the downside is much *much* bigger than any (non-existent) upside.
Linus
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-14 0:55 ` Linus Torvalds
@ 2008-05-14 1:18 ` Nick Piggin
2008-05-14 4:35 ` [patch 1/2] read_barrier_depends arch fixlets Nick Piggin
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-14 1:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
On Tue, May 13, 2008 at 05:55:50PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 14 May 2008, Nick Piggin wrote:
> >
> > Uh, I don't follow your logic. The "reference" Linux memory model
> > requires it, so I don't see how you can justify saying it is wrong
> > just because a *specific* architecture doesn't need it.
>
> You're thinking about it the wrong way.
>
> NO specific architecture requires it except for alpha, and alpha already
> has it.
>
> Nobody else is *ever* likely to want it ever again.
Sure, I understand that. But unless you're hinting about removing
SMP support for those alphas that require it, all generic code
obviously still has to care about it. And not many people would
say they only ever look at x86 code and nothing else. I just think
it is a good exercise to _always_ be thinking about barriers, and
be thinking about them in terms of the Linux memory consistency
standard. Maybe I'm biased because I don't do much arch work...
> In other words, it's not a "reference model". It's an "alpha hack". We do
> not want to copy it in code that doesn't need or want it.
>
> And that's especially true when it's not needed at all, and adding it just
> makes a really simple macro much more complex and totally unreadable.
>
> If it was about adding something to a function that was already a real
> function, it would be different.
>
> If you could write it as a nice inline function, it would be different.
>
> But when that alpha hack turns a regular (simple) #define into a thing of
> horror, the downside is much *much* bigger than any (non-existent) upside.
Oh that's the main thing you're worried about. Then what about the
ACCESS_ONCE that might be required? That would require the same kind
of thing as I did. And actually smp_read_barrier_depends is a pretty
good indicator for (one of the) cases where we need that macro (although
as I said, I prefer gcc to be fixed).
Anyway, what I will do is send a patch which does the core, and the
alpha bits. At least then the actual bugs will be fixed.
* Re: [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-06 19:11 ` Paul E. McKenney
@ 2008-05-14 4:27 ` Nick Piggin
0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-14 4:27 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Tue, May 06, 2008 at 12:11:53PM -0700, Paul E. McKenney wrote:
> On Tue, May 06, 2008 at 07:53:23AM -0700, Linus Torvalds wrote:
> >
> >
> > On Tue, 6 May 2008, Nick Piggin wrote:
> > >
> > > Right. As the comment says, the x86 stuff is kind of a "reference"
> > > implementation, although if you prefer it isn't there, then I can
> > > easily just make it alpha only.
> >
> > If there really was a point in teaching people about
> > "read_barrier_depends()", I'd agree that it's probably good to have it as
> > a reference in the x86 implementation.
> >
> > But since alpha is the only one that needs it, and is likely to remain so,
> > it's not like we ever want to copy that code to anything else, and it
> > really is better to make it alpha-only if the code is so much uglier.
> >
> > Maybe just a comment?
> >
> > As to the ACCESS_ONCE() thing, thinking about it some more, I doubt it
> > really matters. We're never going to change pgd anyway, so who cares if we
> > access it once or a hundred times?
>
> If we are never going to change mm->pgd, then why do we need the
> smp_read_barrier_depends()? Is this handling the initialization
> case or some such?
No, I had another look and I think Linus is correct. We don't need it for
mm->pgd.
* [patch 1/2] read_barrier_depends arch fixlets
2008-05-14 0:55 ` Linus Torvalds
2008-05-14 1:18 ` Nick Piggin
@ 2008-05-14 4:35 ` Nick Piggin
2008-05-14 4:37 ` [patch 2/2] fix SMP data race in pagetable setup vs walking Nick Piggin
2008-05-14 13:26 ` [patch 1/2] read_barrier_depends arch fixlets Paul E. McKenney
1 sibling, 2 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-14 4:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
read_barrier_depends has always been a noop (not a compiler barrier) on all
architectures except SMP alpha. This brings UP alpha and frv into line with all
other architectures, and fixes incorrect documentation.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
Documentation/memory-barriers.txt | 12 +++++++++++-
include/asm-alpha/barrier.h | 2 +-
include/asm-frv/system.h | 2 +-
3 files changed, 13 insertions(+), 3 deletions(-)
Index: linux-2.6/include/asm-alpha/barrier.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/barrier.h
+++ linux-2.6/include/asm-alpha/barrier.h
@@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() barrier()
+#define smp_read_barrier_depends() do { } while (0)
#endif
#define set_mb(var, value) \
Index: linux-2.6/include/asm-frv/system.h
===================================================================
--- linux-2.6.orig/include/asm-frv/system.h
+++ linux-2.6/include/asm-frv/system.h
@@ -179,7 +179,7 @@ do { \
#define mb() asm volatile ("membar" : : :"memory")
#define rmb() asm volatile ("membar" : : :"memory")
#define wmb() asm volatile ("membar" : : :"memory")
-#define read_barrier_depends() barrier()
+#define read_barrier_depends() do { } while (0)
#ifdef CONFIG_SMP
#define smp_mb() mb()
Index: linux-2.6/Documentation/memory-barriers.txt
===================================================================
--- linux-2.6.orig/Documentation/memory-barriers.txt
+++ linux-2.6/Documentation/memory-barriers.txt
@@ -994,7 +994,17 @@ The Linux kernel has eight basic CPU mem
DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
-All CPU memory barriers unconditionally imply compiler barriers.
+All memory barriers except the data dependency barriers imply a compiler
+barrier. Data dependencies do not impose any additional compiler ordering.
+
+Aside: In the case of data dependencies, the compiler would be expected to
+issue the loads in the correct order (eg. `a[b]` would have to load the value
+of b before loading a[b]), however there is no guarantee in the C specification
+that the compiler may not speculate the value of b (eg. is equal to 1) and load
+a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the
+problem of a compiler reloading b after having loaded a[b], thus having a newer
+copy of b than a[b]. A consensus has not yet been reached about these problems,
+however the ACCESS_ONCE macro is a good place to start looking.
SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
systems because it is assumed that a CPU will appear to be self-consistent,
* [patch 2/2] fix SMP data race in pagetable setup vs walking
2008-05-14 4:35 ` [patch 1/2] read_barrier_depends arch fixlets Nick Piggin
@ 2008-05-14 4:37 ` Nick Piggin
2008-05-14 13:26 ` [patch 1/2] read_barrier_depends arch fixlets Paul E. McKenney
1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-05-14 4:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Paul McKenney
There is a possible data race in the page table walking code. After the split
ptlock patches, it actually seems to have been introduced to the core code, but
even before that I think it would have impacted some architectures (powerpc
and sparc64, at least, walk the page tables without taking locks eg. see
find_linux_pte()).
The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.
At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.
There are also similar races in higher levels of the page tables. They
obviously don't involve the spinlock, but a walker could still see
uninitialized memory.
Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems, regardless of whether
split ptes are enabled or not.
I prefer to put the barriers in core code, because that's where the higher
level logic happens, but the page table accessors are per-arch, and I don't
think open-coding them everywhere is an option. I'll put the read-side barriers
in alpha arch code for now (other architectures perform data-dependent loads
in order).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/asm-alpha/pgtable.h | 18 ++++++++++++++++--
mm/memory.c | 21 +++++++++++++++++++++
2 files changed, 37 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -311,6 +311,21 @@ int __pte_alloc(struct mm_struct *mm, pm
if (!new)
return -ENOMEM;
+ /*
+ * Ensure all pte setup (eg. pte page lock and page clearing) are
+ * visible before the pte is made visible to other CPUs by being
+ * put into page tables.
+ *
+ * The other side of the story is the pointer chasing in the page
+ * table walking code (when walking the page table without locking;
+ * ie. most of the time). Fortunately, these data accesses consist
+ * of a chain of data-dependent loads, meaning most CPUs (alpha
+ * being the notable exception) will already guarantee loads are
+ * seen in-order. See the alpha page table accessors for the
+ * smp_read_barrier_depends() barriers in page table walking code.
+ */
+ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+
spin_lock(&mm->page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
mm->nr_ptes++;
@@ -329,6 +344,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&init_mm.page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
@@ -2616,6 +2633,8 @@ int __pud_alloc(struct mm_struct *mm, pg
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
if (pgd_present(*pgd)) /* Another has populated it */
pud_free(mm, new);
@@ -2637,6 +2656,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
if (pud_present(*pud)) /* Another has populated it */
Index: linux-2.6/include/asm-alpha/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/pgtable.h
+++ linux-2.6/include/asm-alpha/pgtable.h
@@ -287,17 +287,34 @@ extern inline pte_t pte_mkspecial(pte_t
#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
+/*
+ * The smp_read_barrier_depends() in the following functions are required to
+ * order the load of *dir (the pointer in the top level page table) with any
+ * subsequent load of the returned pmd_t *ret (ret is data dependent on *dir).
+ *
+ * If this ordering is not enforced, the CPU might load an older value of
+ * *ret, which may be uninitialized data. See mm/memory.c:__pte_alloc for
+ * more details.
+ *
+ * Note that we never change the mm->pgd pointer after the task is running, so
+ * pgd_offset does not require such a barrier.
+ */
+
/* Find an entry in the second-level page table.. */
extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
{
- return (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+ pmd_t *ret = (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+ smp_read_barrier_depends(); /* see above */
+ return ret;
}
/* Find an entry in the third-level page table.. */
extern inline pte_t * pte_offset_kernel(pmd_t * dir, unsigned long address)
{
- return (pte_t *) pmd_page_vaddr(*dir)
+ pte_t *ret = (pte_t *) pmd_page_vaddr(*dir)
+ ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
+ smp_read_barrier_depends(); /* see above */
+ return ret;
}
#define pte_offset_map(dir,addr) pte_offset_kernel((dir),(addr))
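How the two sides of the patch pair up, as an illustrative summary (names as
in the hunks above; the pmd_populate() step is the context just below the
__pte_alloc hunk):

    /*
     *   CPU 0: __pte_alloc()                    CPU 1: lockless walker (alpha)
     *   ---------------------                   ------------------------------
     *   allocate + zero pte page,
     *   initialize its spinlock
     *   smp_wmb();                              dir = pgd_offset(mm, addr);
     *   spin_lock(&mm->page_table_lock);        pmd = pmd_offset(dir, addr);
     *   pmd_populate(mm, pmd, new);               -> smp_read_barrier_depends()
     *   spin_unlock(&mm->page_table_lock);      pte = pte_offset_kernel(pmd, addr);
     *                                             -> smp_read_barrier_depends()
     */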
* Re: [patch 1/2] read_barrier_depends arch fixlets
2008-05-14 4:35 ` [patch 1/2] read_barrier_depends arch fixlets Nick Piggin
2008-05-14 4:37 ` [patch 2/2] fix SMP data race in pagetable setup vs walking Nick Piggin
@ 2008-05-14 13:26 ` Paul E. McKenney
1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2008-05-14 13:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch,
Linux Memory Management List
On Wed, May 14, 2008 at 06:35:11AM +0200, Nick Piggin wrote:
> read_barrier_depends has always been a noop (not a compiler barrier) on all
> architectures except SMP alpha. This brings UP alpha and frv into line with all
> other architectures, and fixes incorrect documentation.
One update for the documentation update.
Thanx, Paul
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> ---
> Documentation/memory-barriers.txt | 12 +++++++++++-
> include/asm-alpha/barrier.h | 2 +-
> include/asm-frv/system.h | 2 +-
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/include/asm-alpha/barrier.h
> ===================================================================
> --- linux-2.6.orig/include/asm-alpha/barrier.h
> +++ linux-2.6/include/asm-alpha/barrier.h
> @@ -24,7 +24,7 @@ __asm__ __volatile__("mb": : :"memory")
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> #define smp_wmb() barrier()
> -#define smp_read_barrier_depends() barrier()
> +#define smp_read_barrier_depends() do { } while (0)
> #endif
>
> #define set_mb(var, value) \
> Index: linux-2.6/include/asm-frv/system.h
> ===================================================================
> --- linux-2.6.orig/include/asm-frv/system.h
> +++ linux-2.6/include/asm-frv/system.h
> @@ -179,7 +179,7 @@ do { \
> #define mb() asm volatile ("membar" : : :"memory")
> #define rmb() asm volatile ("membar" : : :"memory")
> #define wmb() asm volatile ("membar" : : :"memory")
> -#define read_barrier_depends() barrier()
> +#define read_barrier_depends() do { } while (0)
>
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
> Index: linux-2.6/Documentation/memory-barriers.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/memory-barriers.txt
> +++ linux-2.6/Documentation/memory-barriers.txt
> @@ -994,7 +994,17 @@ The Linux kernel has eight basic CPU mem
> DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
>
>
> -All CPU memory barriers unconditionally imply compiler barriers.
> +All memory barriers except the data dependency barriers imply a compiler
> +barrier. Data dependencies do not impose any additional compiler ordering.
> +
> +Aside: In the case of data dependencies, the compiler would be expected to
> +issue the loads in the correct order (eg. `a[b]` would have to load the value
> +of b before loading a[b]), however there is no guarantee in the C specification
> +that the compiler may not speculate the value of b (eg. is equal to 1) and load
> +a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the
> +problem of a compiler reloading b after having loaded a[b], thus having a newer
> +copy of b than a[b]. A consensus has not yet been reached about these problems,
> +however the ACCESS_ONCE macro is a good place to start looking.
Please add something like:
"For example, b_local = b; smp_read_barrier_depends(); tmp = a[b_local];"
> SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
> systems because it is assumed that a CPU will appear to be self-consistent,