[patch 2/2] fix SMP data race in pagetable setup vs walking

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hugh@veritas.com>,
	linux-arch@vger.kernel.org,
	Linux Memory Management List <linux-mm@kvack.org>,
	Paul McKenney <paulmck@us.ibm.com>
Subject: [patch 2/2] fix SMP data race in pagetable setup vs walking
Date: Wed, 14 May 2008 06:37:36 +0200	[thread overview]
Message-ID: <20080514043736.GE23578@wotan.suse.de> (raw)
In-Reply-To: <20080514043511.GD23578@wotan.suse.de>

There is a possible data race in the page table walking code. After the split
ptlock patches, it actually seems to have been introduced to the core code, but
even before that I think it would have impacted some architectures (powerpc
and sparc64, at least, walk the page tables without taking locks eg. see
find_linux_pte()).

The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.

At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.

There are also similar races in higher levels of the page tables. They
obviously don't involve the spinlock, but could see uninitialized memory.

Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems, regardless of whether
split ptes are enabled or not.

I prefer to put the barriers in core code, because that's where the higher
level logic happens, but the page table accessors are per-arch, and open-coding
them everywhere I don't think is an option. I'll put the read-side barriers
in alpha arch code for now (other architectures perform data-dependent loads
in order).

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 include/asm-alpha/pgtable.h |   18 ++++++++++++++++--
 mm/memory.c                 |   21 +++++++++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -311,6 +311,21 @@ int __pte_alloc(struct mm_struct *mm, pm
 	if (!new)
 		return -ENOMEM;
 
+	/*
+	 * Ensure all pte setup (eg. pte page lock and page clearing) are
+	 * visible before the pte is made visible to other CPUs by being
+	 * put into page tables.
+	 *
+	 * The other side of the story is the pointer chasing in the page
+	 * table walking code (when walking the page table without locking;
+	 * ie. most of the time). Fortunately, these data accesses consist
+	 * of a chain of data-dependent loads, meaning most CPUs (alpha
+	 * being the notable exception) will already guarantee loads are
+	 * seen in-order. See the alpha page table accessors for the
+	 * smp_read_barrier_depends() barriers in page table walking code.
+	 */
+	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+
 	spin_lock(&mm->page_table_lock);
 	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
 		mm->nr_ptes++;
@@ -329,6 +344,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	if (!new)
 		return -ENOMEM;
 
+	smp_wmb(); /* See comment in __pte_alloc */
+
 	spin_lock(&init_mm.page_table_lock);
 	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
@@ -2616,6 +2633,8 @@ int __pud_alloc(struct mm_struct *mm, pg
 	if (!new)
 		return -ENOMEM;
 
+	smp_wmb(); /* See comment in __pte_alloc */
+
 	spin_lock(&mm->page_table_lock);
 	if (pgd_present(*pgd))		/* Another has populated it */
 		pud_free(mm, new);
@@ -2637,6 +2656,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
 	if (!new)
 		return -ENOMEM;
 
+	smp_wmb(); /* See comment in __pte_alloc */
+
 	spin_lock(&mm->page_table_lock);
 #ifndef __ARCH_HAS_4LEVEL_HACK
 	if (pud_present(*pud))		/* Another has populated it */
Index: linux-2.6/include/asm-alpha/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/pgtable.h
+++ linux-2.6/include/asm-alpha/pgtable.h
@@ -287,17 +287,34 @@ extern inline pte_t pte_mkspecial(pte_t 
 #define pgd_index(address)	(((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, address)	((mm)->pgd+pgd_index(address))
 
+/*
+ * The smp_read_barrier_depends() in the following functions are required to
+ * order the load of *dir (the pointer in the top level page table) with any
+ * subsequent load of the returned pmd_t *ret (ret is data dependent on *dir).
+ *
+ * If this ordering is not enforced, the CPU might load an older value of
+ * *ret, which may be uninitialized data. See mm/memory.c:__pte_alloc for
+ * more details.
+ *
+ * Note that we never change the mm->pgd pointer after the task is running, so
+ * pgd_offset does not require such a barrier.
+ */
+
 /* Find an entry in the second-level page table.. */
 extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
 {
-	return (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+	pmd_t *ret = (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+	smp_read_barrier_depends(); /* see above */
+	return ret;
 }
 
 /* Find an entry in the third-level page table.. */
 extern inline pte_t * pte_offset_kernel(pmd_t * dir, unsigned long address)
 {
-	return (pte_t *) pmd_page_vaddr(*dir)
+	pte_t *ret = (pte_t *) pmd_page_vaddr(*dir)
 		+ ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
+	smp_read_barrier_depends(); /* see above */
+	return ret;
 }
 
 #define pte_offset_map(dir,addr)	pte_offset_kernel((dir),(addr))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-05-14  4:37 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-05 11:20 [patch 1/2] read_barrier_depends fixlets Nick Piggin
2008-05-05 12:12 ` [patch 2/2] fix SMP data race in pagetable setup vs walking Nick Piggin
2008-05-05 14:35   ` Paul E. McKenney
2008-05-06  9:38     ` Nick Piggin
2008-05-06 13:32       ` Paul E. McKenney
2008-05-13  7:55         ` Nick Piggin
2008-05-13 13:26           ` Paul E. McKenney
2008-05-05 15:32   ` Linus Torvalds
2008-05-05 16:37     ` Hugh Dickins
2008-05-06  9:51     ` Nick Piggin
2008-05-06 14:53       ` Linus Torvalds
2008-05-06 19:11         ` Paul E. McKenney
2008-05-14  4:27           ` Nick Piggin
2008-05-13  8:01         ` Nick Piggin
2008-05-13 15:45           ` Linus Torvalds
2008-05-14  0:34             ` Nick Piggin
2008-05-14  0:55               ` Linus Torvalds
2008-05-14  1:18                 ` Nick Piggin
2008-05-14  4:35                 ` [patch 1/2] read_barrier_depends arch fixlets Nick Piggin
2008-05-14  4:37                   ` Nick Piggin [this message]
2008-05-14 13:26                   ` Paul E. McKenney
2008-05-05 16:57   ` [patch 2/2] fix SMP data race in pagetable setup vs walking Hugh Dickins
2008-05-06  9:52     ` Nick Piggin
2008-05-06  7:08   ` David Miller, Nick Piggin
2008-05-06  9:56     ` Nick Piggin
2008-05-05 14:27 ` [patch 1/2] read_barrier_depends fixlets Paul E. McKenney
2008-05-06  9:01   ` Nick Piggin
2008-05-06 14:06     ` Paul E. McKenney
2008-05-06 15:29 ` David Howells
2008-05-06 19:09   ` Paul E. McKenney
2008-05-13  8:05   ` Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080514043736.GE23578@wotan.suse.de \
    --to=npiggin@suse.de \
    --cc=hugh@veritas.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=paulmck@us.ibm.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).