* [PATCH 0/2] Fix migration races in rmap_walk() V7
@ 2010-05-06 23:20 Mel Gorman
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
` (2 more replies)
0 siblings, 3 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 23:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra,
Mel Gorman
No major change since V6, just looks nicer.
Changelog since V6
o Make the nested anon_vma lock taking prettier
Changelog since V5
o Have rmap_walk take anon_vma locks in order starting from the "root"
o Ensure that mm_take_all_locks locks VMAs in the same order
Changelog since V4
o Switch back anon_vma locking to put bulk of locking in rmap_walk
o Fix anon_vma lock ordering in exec vs migration race
Changelog since V3
o Rediff against the latest upstream tree
o Improve the patch changelog a little (thanks Peterz)
Changelog since V2
o Drop fork changes
o Avoid pages in temporary stacks during exec instead of migration pte
lazy cleanup
o Drop locking-related patch and replace with Rik's
Changelog since V1
o Handle the execve race
o Be sure that rmap_walk() releases the correct VMA lock
o Hold the anon_vma lock for the address lookup and the page remap
o Add reviewed-bys
Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page and
remapping the page, removing the old migration PTE, before unlocking the page.
If a fault occurs, the faulting process waits until migration completes.
The problem is that there are some races that allow migration PTEs to be
left behind. Migration still completes and the page is unlocked but a later
fault will call migration_entry_to_page() and BUG() because the page is not
locked. It's not possible to simply clean up the migration PTE because the
page it points to has potentially been freed and reused. This series aims
to close the races.
Patch 1 of this series is about the locking of anon_vma in migration versus
vma_adjust. While I am not aware of any reproduction cases, it is potentially
racy. This patch is an alternative to Rik's "heavy lock" approach posted
at http://lkml.org/lkml/2010/5/3/155. With the patch, rmap_walk finds the
"root" anon_vma and starts locking from there, locking each new anon_vma
as it finds it. As long as the order is preserved, there is no deadlock.
In vma_adjust, the anon_vma locks are acquired under similar conditions
to 2.6.33 so that walkers will block until VMA changes are complete. The
rmap_walk changes potentially slow down migration and aspects of page
reclaim a little, but those are the less important paths.
Patch 2 of this series addresses the reported swapops bug, a race between
migration and execve in which pages get migrated from the temporary
stack before it is moved. To avoid migration PTEs being left behind,
a temporary VMA is put in place so that a migration PTE in either the
temporary stack or the relocated stack can be found.
The reproduction case for the races was as follows;
1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. Each stresses a
different aspect of the problem. The operations they undertake are;
a) Forks a hundred children, each of which faults the mapping
Purpose: stress tests migration pte removal
b) Forks a hundred children, each which punches a hole in the mapping
and faults what remains
Purpose: stress test VMA manipulations during migration
c) Forks a hundred children, each of which execs and calls echo
Purpose: stress test the execve race
d) Size the mapping to be 1.5 times physical memory. Constantly
memset it
Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
is active all the time. In theory, you could also force this using
sys_move_pages or memory hot-remove but it'd be nowhere near as easy
to test.
Compaction, which will not be in 2.6.34, is the easiest way to trigger
these bugs, but in theory the problem also affects memory hot-remove.
There were some concerns with patch 2 that performance would be impacted. To
check if this was the case I ran kernbench, aim9 and sysbench. AIM9 in
particular was of interest as it has an exec microbenchmark.
kernbench-vanilla fixraces-v5r1
Elapsed mean 103.40 ( 0.00%) 103.35 ( 0.05%)
Elapsed stddev 0.09 ( 0.00%) 0.13 (-55.72%)
User mean 313.50 ( 0.00%) 313.15 ( 0.11%)
User stddev 0.61 ( 0.00%) 0.20 (66.70%)
System mean 55.50 ( 0.00%) 55.85 (-0.64%)
System stddev 0.48 ( 0.00%) 0.15 (68.98%)
CPU mean 356.25 ( 0.00%) 356.50 (-0.07%)
CPU stddev 0.43 ( 0.00%) 0.50 (-15.47%)
Nothing special there and kernbench is fork+exec heavy. The patched kernel
is slightly faster on wall time but it's well within the noise. System time
is slightly slower but again, it's within the noise.
AIM9
aim9-vanilla fixraces-v5r1
creat-clo 116813.86 ( 0.00%) 117980.34 ( 0.99%)
page_test 270923.33 ( 0.00%) 268668.56 (-0.84%)
brk_test 2551558.07 ( 0.00%) 2649450.00 ( 3.69%)
signal_test 279866.67 ( 0.00%) 279533.33 (-0.12%)
exec_test 226.67 ( 0.00%) 232.67 ( 2.58%)
fork_test 4261.91 ( 0.00%) 4110.98 (-3.67%)
link_test 53534.78 ( 0.00%) 54076.49 ( 1.00%)
So, exec and fork aren't showing any major worries here. exec is faster but
these tests can be so sensitive to starting conditions that I tend not to
read much into them unless there are major differences.
SYSBENCH
sysbench-vanilla fixraces-v5r1
1 14177.73 ( 0.00%) 14218.41 ( 0.29%)
2 27647.23 ( 0.00%) 27774.14 ( 0.46%)
3 31395.69 ( 0.00%) 31499.95 ( 0.33%)
4 49866.54 ( 0.00%) 49713.49 (-0.31%)
5 49919.58 ( 0.00%) 49524.21 (-0.80%)
6 49532.97 ( 0.00%) 49397.60 (-0.27%)
7 49465.79 ( 0.00%) 49384.14 (-0.17%)
8 49483.33 ( 0.00%) 49186.49 (-0.60%)
These figures also show no differences worth talking about.
While the extra allocation in patch 2 would appear to slow down exec somewhat,
it's not by any amount that matters. As it happens during exec, anon_vmas
have likely been freed very recently so the allocation will be cache-hot and
CPU-local. It would be possible to special-case migration to avoid migrating
pages in the temporary stack, but fixing it in exec is a more maintainable
approach.
fs/exec.c | 37 ++++++++++++++++++++--
include/linux/rmap.h | 2 +
mm/ksm.c | 20 ++++++++++--
mm/mmap.c | 14 +++++++-
mm/rmap.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 137 insertions(+), 17 deletions(-)
Andrea Arcangeli (1):
mm,migration: Fix race between shift_arg_pages and rmap_walk by
guaranteeing rmap_walk finds PTEs created within the temporary
stack
Mel Gorman (1):
mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA
information
fs/exec.c | 37 +++++++++++++++++--
include/linux/rmap.h | 4 ++
mm/ksm.c | 13 ++++++-
mm/mmap.c | 14 ++++++-
mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++----
5 files changed, 151 insertions(+), 14 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 23:20 [PATCH 0/2] Fix migration races in rmap_walk() V7 Mel Gorman
@ 2010-05-06 23:20 ` Mel Gorman
2010-05-07 0:56 ` KAMEZAWA Hiroyuki
2010-05-08 15:39 ` Andrea Arcangeli
2010-05-06 23:20 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
2010-05-07 8:13 ` [PATCH 0/2] Fix migration races in rmap_walk() V7 KAMEZAWA Hiroyuki
2 siblings, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 23:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra,
Mel Gorman
vma_adjust() updates anon VMA information without any locks being taken;
in contrast, file-backed mappings use the i_mmap_lock. This lack of locking
can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
to take when walking a list of anon_vma_chains but as the order of anon_vmas
cannot be guaranteed, rmap_walk cannot take multiple locks without
potentially deadlocking.
To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
list but always starting from the "root" anon_vma which is the oldest
anon_vma in the list. It starts by locking the anon_vma lock associated
with a page. It then finds the "root" anon_vma using the anon_vma_chains
"same_vma" list as it is strictly ordered. The root anon_vma lock is taken
and rmap_walk traverses the list. This allows multiple locks to be taken
as the list is always traversed in the same direction.
As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order that
anon_vmas is locked in by mm_take_all_locks is reversed by this patch so that
both rmap_walk and mm_take_all_locks lock anon_vmas in the order of old->new.
For vma_adjust(), the locking behaviour prior to the anon_vma changes is
restored so that rmap_walk() can be sure of the integrity of the VMA
information and lists when the anon_vma lock is held. With this patch,
the vma->anon_vma->lock is taken if
a) there is any overlap with the next VMA due to the adjustment
b) a new VMA is being inserted into the address space
c) the start of the VMA is being changed so that the
relationship between vm_start and vm_pgoff is preserved
for vma_address()
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 4 ++
mm/ksm.c | 13 ++++++-
mm/mmap.c | 14 ++++++-
mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 118 insertions(+), 10 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..1dc949f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,10 @@ int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+ struct anon_vma *next, struct anon_vma *root);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
void __anon_vma_link(struct vm_area_struct *);
void anon_vma_free(struct anon_vma *);
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..1db8656 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
{
struct stable_node *stable_node;
struct hlist_node *hlist;
+ struct anon_vma *nested_anon_vma = NULL;
struct rmap_item *rmap_item;
int ret = SWAP_AGAIN;
int search_new_forks = 0;
@@ -1671,9 +1672,16 @@ again:
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
- spin_lock(&anon_vma->lock);
+ anon_vma = anon_vma_lock_root(anon_vma);
+ if (nested_anon_vma) {
+ spin_unlock(&nested_anon_vma->lock);
+ nested_anon_vma = NULL;
+ }
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+ nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+ vma->anon_vma, anon_vma);
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
@@ -1697,6 +1705,9 @@ again:
if (!search_new_forks++)
goto again;
out:
+ if (nested_anon_vma)
+ spin_unlock(&nested_anon_vma->lock);
+
return ret;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..b447d5b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *next = vma->vm_next;
struct vm_area_struct *importer = NULL;
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
long adjust_next = 0;
@@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+ anon_vma = vma->anon_vma;
+ spin_lock(&anon_vma->lock);
+ }
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (anon_vma)
+ spin_unlock(&anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
@@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
+ /* Lock the anon_vmas in the same order rmap_walk would */
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_lock_anon_vma(mm, avc->anon_vma);
}
@@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_unlock_anon_vma(avc->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..2e65a75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,81 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
return -ENOMEM;
}
+/*
+ * When walking an anon_vma_chain and locking each anon_vma encountered,
+ * this function is responsible for checking whether the next anon_vma
+ * is the same as the root, locking it if not, and releasing the
+ * previous lock if necessary.
+ *
+ * It is assumed the caller has locked the root anon_vma
+ */
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+ struct anon_vma *next, struct anon_vma *root)
+{
+ if (prev)
+ spin_unlock(&prev->lock);
+ if (next == root)
+ return NULL;
+ spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
+ return next;
+}
+
+/*
+ * Given an anon_vma, find the root of the chain, lock it and return the
+ * root. This must be called with the rcu_read_lock held
+ */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+ struct anon_vma *root_anon_vma;
+ struct anon_vma_chain *avc, *root_avc;
+ struct vm_area_struct *vma;
+
+ /* Lock the same_anon_vma list and make sure we are on a chain */
+ spin_lock(&anon_vma->lock);
+ if (list_empty(&anon_vma->head)) {
+ spin_unlock(&anon_vma->lock);
+ return NULL;
+ }
+
+ /*
+ * Get the root anon_vma on the list by depending on the ordering
+ * of the same_vma list setup by __page_set_anon_rmap. Basically
+ * we are doing
+ *
+ * local anon_vma -> local vma -> root vma -> root anon_vma
+ */
+ avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+ vma = avc->vma;
+ root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ root_anon_vma = root_avc->anon_vma;
+
+ /* Get the lock of the root anon_vma */
+ if (anon_vma != root_anon_vma) {
+ VM_BUG_ON(!rcu_read_lock_held());
+ spin_unlock(&anon_vma->lock);
+ spin_lock(&root_anon_vma->lock);
+ }
+
+ return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+ struct anon_vma *anon_vma;
+
+ /* Get the local anon_vma */
+ anon_vma = page_anon_vma(page);
+ if (!anon_vma)
+ return NULL;
+
+ return anon_vma_lock_root(anon_vma);
+}
+
static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
{
struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -326,7 +401,7 @@ void page_unlock_anon_vma(struct anon_vma *anon_vma)
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+static noinline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1359,6 +1434,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
struct anon_vma *anon_vma;
+ struct anon_vma *nested_anon_vma = NULL;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
@@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
- anon_vma = page_anon_vma(page);
+ anon_vma = page_anon_vma_lock_root(page);
if (!anon_vma)
return ret;
- spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+ vma->anon_vma, anon_vma);
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
if (ret != SWAP_AGAIN)
break;
}
+
+ if (nested_anon_vma)
+ spin_unlock(&nested_anon_vma->lock);
+
spin_unlock(&anon_vma->lock);
return ret;
}
--
1.6.5
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-07 0:56 ` KAMEZAWA Hiroyuki
2010-05-07 16:26 ` Mel Gorman
2010-05-08 15:39 ` Andrea Arcangeli
1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-07 0:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra
On Fri, 7 May 2010 00:20:52 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> vma_adjust() is updating anon VMA information without locks being taken.
> In contrast, file-backed mappings use the i_mmap_lock and this lack of
> locking can result in races with users of rmap_walk such as page migration.
> vma_address() can return -EFAULT for an address that will soon be valid.
> For migration, this potentially leaves a dangling migration PTE behind
> which can later cause a BUG_ON to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> to take when walking a list of anon_vma_chains but as the order of anon_vmas
> cannot be guaranteed, rmap_walk cannot take multiple locks without
> potentially deadlocking.
>
> To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
> list but always starting from the "root" anon_vma which is the oldest
> anon_vma in the list. It starts by locking the anon_vma lock associated
> with a page. It then finds the "root" anon_vma using the anon_vma_chains
> "same_vma" list as it is strictly ordered. The root anon_vma lock is taken
> and rmap_walk traverses the list. This allows multiple locks to be taken
> as the list is always traversed in the same direction.
>
> As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order that
> anon_vmas is locked in by mm_take_all_locks is reversed by this patch so that
> both rmap_walk and mm_take_all_locks lock anon_vmas in the order of old->new.
>
> For vma_adjust(), the locking behaviour prior to the anon_vma is restored
> so that rmap_walk() can be sure of the integrity of the VMA information and
> lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
> is taken if
>
> a) If there is any overlap with the next VMA due to the adjustment
> b) If there is a new VMA is being inserted into the address space
> c) If the start of the VMA is being changed so that the
> relationship between vm_start and vm_pgoff is preserved
> for vma_address()
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/rmap.h | 4 ++
> mm/ksm.c | 13 ++++++-
> mm/mmap.c | 14 ++++++-
> mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 118 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7721674..1dc949f 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -121,6 +121,10 @@ int anon_vma_prepare(struct vm_area_struct *);
> void unlink_anon_vmas(struct vm_area_struct *);
> int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
> +struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
> + struct anon_vma *next, struct anon_vma *root);
> +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
> +struct anon_vma *page_anon_vma_lock_root(struct page *page);
> void __anon_vma_link(struct vm_area_struct *);
> void anon_vma_free(struct anon_vma *);
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 3666d43..1db8656 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
> {
> struct stable_node *stable_node;
> struct hlist_node *hlist;
> + struct anon_vma *nested_anon_vma = NULL;
> struct rmap_item *rmap_item;
> int ret = SWAP_AGAIN;
> int search_new_forks = 0;
> @@ -1671,9 +1672,16 @@ again:
> struct anon_vma_chain *vmac;
> struct vm_area_struct *vma;
>
> - spin_lock(&anon_vma->lock);
> + anon_vma = anon_vma_lock_root(anon_vma);
> + if (nested_anon_vma) {
> + spin_unlock(&nested_anon_vma->lock);
> + nested_anon_vma = NULL;
> + }
> list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> vma = vmac->vma;
> + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> + vma->anon_vma, anon_vma);
> +
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> continue;
> @@ -1697,6 +1705,9 @@ again:
> if (!search_new_forks++)
> goto again;
> out:
> + if (nested_anon_vma)
> + spin_unlock(&nested_anon_vma->lock);
> +
> return ret;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..b447d5b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> struct vm_area_struct *next = vma->vm_next;
> struct vm_area_struct *importer = NULL;
> struct address_space *mapping = NULL;
> + struct anon_vma *anon_vma = NULL;
> struct prio_tree_root *root = NULL;
> struct file *file = vma->vm_file;
> long adjust_next = 0;
> @@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
> + anon_vma = vma->anon_vma;
> + spin_lock(&anon_vma->lock);
> + }
> +
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> + if (anon_vma)
> + spin_unlock(&anon_vma->lock);
> +
> if (remove_next) {
> if (file) {
> fput(file);
> @@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> if (signal_pending(current))
> goto out_unlock;
> + /* Lock the anon_vmas in the same order rmap_walk would */
> if (vma->anon_vma)
> - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> + list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
> vm_lock_anon_vma(mm, avc->anon_vma);
> }
>
> @@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
>
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> if (vma->anon_vma)
> - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> + list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
> vm_unlock_anon_vma(avc->anon_vma);
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 85f203e..2e65a75 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -236,6 +236,81 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> return -ENOMEM;
> }
>
> +/*
> + * When walking an anon_vma_chain and locking each anon_vma encountered,
> + * this function is responsible for checking if the next VMA is the
> + * same as the root, locking it if not and released the previous lock
> + * if necessary.
> + *
> + * It is assumed the caller has locked the root anon_vma
> + */
> +struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
> + struct anon_vma *next, struct anon_vma *root)
> +{
> + if (prev)
> + spin_unlock(&prev->lock);
> + if (next == root)
> + return NULL;
> + spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
> + return next;
> +}
> +
> +/*
> + * Given an anon_vma, find the root of the chain, lock it and return the
> + * root. This must be called with the rcu_read_lock held
> + */
> +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
> +{
> + struct anon_vma *root_anon_vma;
> + struct anon_vma_chain *avc, *root_avc;
> + struct vm_area_struct *vma;
> +
> + /* Lock the same_anon_vma list and make sure we are on a chain */
> + spin_lock(&anon_vma->lock);
> + if (list_empty(&anon_vma->head)) {
> + spin_unlock(&anon_vma->lock);
> + return NULL;
> + }
> +
> + /*
> + * Get the root anon_vma on the list by depending on the ordering
> + * of the same_vma list setup by __page_set_anon_rmap. Basically
> + * we are doing
> + *
> + * local anon_vma -> local vma -> root vma -> root anon_vma
> + */
> + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> + vma = avc->vma;
> + root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> + root_anon_vma = root_avc->anon_vma;
> +
> + /* Get the lock of the root anon_vma */
> + if (anon_vma != root_anon_vma) {
> + VM_BUG_ON(!rcu_read_lock_held());
> + spin_unlock(&anon_vma->lock);
> + spin_lock(&root_anon_vma->lock);
> + }
I'm sorry but I don't think I understand this. Could you help me?
IIUC, anon_vma_chain is linked as a 2D mesh
anon_vma1 anon_vma2 anon_vma3
| | |
vma1 ----- 1 -------- 2 --------- 3 -----
| | |
vma2 ----- 4 -------- 5 ---------- 6 -----
| | |
vma3 ----- 7 -------- 8 ---------- 9 -----
Here,
* vertical link is anon_vma->head, avc->same_anon_vma link.
* horizontal link is vma->anon_vma_chain, avc->same_vma link.
* 1-9 are avcs.
When scanning pages, we may see a page whose anon_vma is anon_vma1
or anon_vma2 or anon_vma3.
When we see anon_vma3 in page->mapping, we lock anon_vma1 and chase
avc1->avc4->avc7. Then we start from vma1. Next, we visit vma2 and lock
anon_vma2. Last, we visit vma3 and lock anon_vma3... and all are done
under anon_vma1->lock. Right?
Hmm, one concern is
anon_vma3 -> avc3 -> vma1 -> avc1 -> anon_vma1 chasing.
What will prevent vma1 from disappearing right after releasing anon_vma3->lock?
ex)
a1) As we chase anon_vma3 -> avc3 -> vma1 -> anon_vma1, the link was the following.
anon_vma1 anon_vma2 anon_vma3
| | |
vma1 ----- 1 -------- 2 --------- 3 -----
| | |
vma2 ----- 4 -------- 5 ---------- 6 -----
| | |
vma3 ----- 7 -------- 8 ---------- 9 -----
We hold lock on anon_vma3.
a2) After releasing anon_vma3 lock. vma1 can be unlinked.
anon_vma1 anon_vma2 anon_vma3
| | |
vma1 removed.
| | |
vma2 ----- 4 -------- 5 ---------- 6 -----
| | |
vma3 ----- 7 -------- 8 ---------- 9 -----
But we know anon_vma1->head is not empty, and it's accessible.
Then, no problem for our purpose. Right?
b1) Another thought.
As we chase anon_vma3 -> avc3 -> vma1 -> anon_vma1, the link was the following.
anon_vma1 anon_vma2 anon_vma3
| | |
vma1 ----- 1 -------- 2 --------- 3 -----
| | |
vma2 ----- 4 -------- 5 ---------- 6 -----
| | |
vma3 ----- 7 -------- 8 ---------- 9 -----
We hold lock on anon_vma3. So,
anon_vma1 anon_vma2 anon_vma3
| | |
vma1 ----removed -----removed ------ 3 -----
| | |
vma2 ----- 4 -------- 5 ---------- 6 -----
| | |
vma3 ----- 7 -------- 8 ---------- 9 -----
we may see a half-broken link while we hold anon_vma3->lock. In this case,
anon_vma1 can be caught.
Don't we need this ?
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
/* Unlink each anon_vma chained to the VMA. */
- list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) {
+ list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) {
anon_vma_unlink(avc);
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}
}
The head avc should be removed last... Hmm? I'm sorry if everything is
already in the correct order.
Thanks,
-Kame
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-07 0:56 ` KAMEZAWA Hiroyuki
@ 2010-05-07 16:26 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-07 16:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra
On Fri, May 07, 2010 at 09:56:54AM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 7 May 2010 00:20:52 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > vma_adjust() is updating anon VMA information without locks being taken.
> > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > locking can result in races with users of rmap_walk such as page migration.
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > For migration, this potentially leaves a dangling migration PTE behind
> > which can later cause a BUG_ON to trigger when the page is faulted in.
> >
> > <SNIP>
>
> I'm sorry but I don't think I understand this. Could you help me ?
>
Hopefully.
> IIUC, anon_vma_chain is linked as 2D-mesh
>
> anon_vma1 anon_vma2 anon_vma3
> | | |
> vma1 ----- 1 -------- 2 --------- 3 -----
> | | |
> vma2 ----- 4 -------- 5 ---------- 6 -----
> | | |
> vma3 ----- 7 -------- 8 ---------- 9 -----
>
>
> Here,
> * vertical link is anon_vma->head, avc->same_anon_vma link.
> * horizontal link is vma->anon_vma_chain, avc->same_vma link.
> * 1-9 are avcs.
>
I don't think this is quite right for how the "root" anon_vma is
discovered. The ordering of same_vma is such that the prev pointer
points to the root anon_vma as described in __page_set_anon_rmap() but
even so...
> When scanning pages, we may see a page whose anon_vma is anon_vma1
> or anon_vma2 or anon_vma3.
>
When we are walking the list for the anon_vma, we also hold the page
lock and what we're really interested in are ptes mapping that page.
> When we see anon_vma3 in page->mapping, we lock anon_vma1 and chase
> avc1->avc4->avc7. Then, start from vma1. Next, we visit vma2, we lock anon_vma2.
> At the last, we visit vma3 and lock anon_vma3.....And all are done under
> anon_vma1->lock. Right ?
>
assuming it's the root lock, sure.
> Hmm, one concern is
> anon_vma3 -> avc3 -> vma1 -> avc1 -> anon_vma1 chasing.
>
> What will prevent vma1 disappear right after releasling anon_vma3->lock ?
>
What does it matter if it disappeared? If it did, it was because it was torn
down, the PTEs are also gone and a user of rmap_walk should have stopped
caring. Right?
> ex)
> a1) As we chase anon_vma3 -> avc3 -> vma1 -> anon_vma1, the links were as follows.
>
> anon_vma1 anon_vma2 anon_vma3
> | | |
> vma1 ----- 1 -------- 2 --------- 3 -----
> | | |
> vma2 ----- 4 -------- 5 ---------- 6 -----
> | | |
> vma3 ----- 7 -------- 8 ---------- 9 -----
>
> We hold lock on anon_vma3.
>
> a2) After releasing the anon_vma3 lock, vma1 can be unlinked.
>
> anon_vma1 anon_vma2 anon_vma3
> | | |
> vma1 removed.
> | | |
> vma2 ----- 4 -------- 5 ---------- 6 -----
> | | |
> vma3 ----- 7 -------- 8 ---------- 9 -----
>
> But we know anon_vma1->head is not empty, and it's accessible.
> Then, no problem for our purpose. Right?
>
As the PTEs are also gone, I'm not seeing the problem.
> b1) Another thinking.
>
> As we chase anon_vma3 -> avc3 -> vma1 -> anon_vma1, the links were as follows.
>
> anon_vma1 anon_vma2 anon_vma3
> | | |
> vma1 ----- 1 -------- 2 --------- 3 -----
> | | |
> vma2 ----- 4 -------- 5 ---------- 6 -----
> | | |
> vma3 ----- 7 -------- 8 ---------- 9 -----
>
> We hold lock on anon_vma3. So,
>
> anon_vma1 anon_vma2 anon_vma3
> | | |
> vma1 ----removed -----removed ------ 3 -----
> | | |
> vma2 ----- 4 -------- 5 ---------- 6 -----
> | | |
> vma3 ----- 7 -------- 8 ---------- 9 -----
>
> we may see a half-broken link while we take anon_vma3->lock. In this case,
> anon_vma1 can still be caught.
>
> Don't we need this ?
>
>
> void unlink_anon_vmas(struct vm_area_struct *vma)
> {
> struct anon_vma_chain *avc, *next;
>
> /* Unlink each anon_vma chained to the VMA. */
> - list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) {
This was meant to be list_for_each_entry_safe(....)
> + list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) {
> anon_vma_unlink(avc);
> list_del(&avc->same_vma);
> anon_vma_chain_free(avc);
> }
> }
>
> The head avc should be removed last... Hmm? I'm sorry if all are
> in the correct order already.
>
I think the ordering is ok. The rmap_walk may find a situation where the
anon_vmas are being cleaned up but again as the page tables are going
away at this point, the contents of the PTEs are no longer important.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-07 0:56 ` KAMEZAWA Hiroyuki
@ 2010-05-08 15:39 ` Andrea Arcangeli
2010-05-08 17:02 ` Linus Torvalds
2010-05-09 19:23 ` Mel Gorman
1 sibling, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2010-05-08 15:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel, Linus Torvalds, Peter Zijlstra
On Fri, May 07, 2010 at 12:20:52AM +0100, Mel Gorman wrote:
> @@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
> {
> struct stable_node *stable_node;
> struct hlist_node *hlist;
> + struct anon_vma *nested_anon_vma = NULL;
> struct rmap_item *rmap_item;
> int ret = SWAP_AGAIN;
> int search_new_forks = 0;
> @@ -1671,9 +1672,16 @@ again:
> struct anon_vma_chain *vmac;
> struct vm_area_struct *vma;
>
> - spin_lock(&anon_vma->lock);
> + anon_vma = anon_vma_lock_root(anon_vma);
> + if (nested_anon_vma) {
> + spin_unlock(&nested_anon_vma->lock);
> + nested_anon_vma = NULL;
> + }
> list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> vma = vmac->vma;
> + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> + vma->anon_vma, anon_vma);
> +
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> continue;
> @@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> * are holding mmap_sem. Users without mmap_sem are required to
> * take a reference count to prevent the anon_vma disappearing
> */
> - anon_vma = page_anon_vma(page);
> + anon_vma = page_anon_vma_lock_root(page);
> if (!anon_vma)
> return ret;
> - spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> - continue;
> - ret = rmap_one(page, vma, address, arg);
> + unsigned long address;
> +
> + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> + vma->anon_vma, anon_vma);
> + address = vma_address(page, vma);
> + if (address != -EFAULT)
> + ret = rmap_one(page, vma, address, arg);
> +
> if (ret != SWAP_AGAIN)
> break;
> }
> +
> + if (nested_anon_vma)
> + spin_unlock(&nested_anon_vma->lock);
> +
> spin_unlock(&anon_vma->lock);
> return ret;
> }
I already told Mel by PM. This degrades the new-anon_vma code to an
even _slower_ mode than the old anon-vma code in 2.6.32 (the same in
math complexity terms but slower in practice) for migrate. Furthermore
page_referenced() may now return true even if there are young ptes
that simply get lost in the rmap walk.
The new anon-vma code is mostly relevant for migrate and memory
compaction and transparent hugepage support where it gets invoked even
if there's plenty of free memory and no I/O load at all. So whatever
you save during swap, you'll lose while transparent hugepage support
allocates the pages. So the above fix renders the whole effort
pointless as far as I'm concerned.
I think Rik's patch is the only sensible solution that won't
invalidate the whole effort for transparent hugepage.
About how to adapt split_huge_page to the root anon_vma I didn't even
think about it yet. All I can tell you right now is that
wait_split_huge_page can be changed to wait on the pmd_trans_splitting
(or alternatively the pmd_trans_huge bit) bit to go away in a
cpu_relax() barrier() loop. But the page->mapping/anon_vma->lock is
also used to serialize against parallel split_huge_page but I guess
taking the root anon_vma lock in split_huge_page() should work
fine. Just I'm not going to do that except maybe in a for-mainline
branch, but I'll keep master branch with the old-anon-vma 2.6.32 code
and the anon_vma_branch with Rik's fix that allows us to take advantage
of the new anon-vma code (so it's not purely gratuitous complexity
added for nothing) also in migrate.c from memory compaction (that runs
24/7 on all my systems and it's much more frequent than the swap rmap
walks that in fact never ever happen here), and in the rmap walks in
split_huge_page too (which are not so frequent especially after
Johannes implemented native mprotect on hugepages but it's obviously
still more frequent than swap).
I'm simply not going to support the degradation to the root anon_vma
complexity in aa.git, except for strict merging requirements, but I'll
keep backing it out in aa.git or I'd rather stick to old-anon-vma
code which at least is much simpler and saves memory too (as there are
many fewer anon-vma and no avc, and less useless locks).
What I instead already mentioned once was to add a _shared_ lock so
you share the spinlock across the whole forest but you keep walking
the right page->anon_vma->same_anon_vma! The moment you walk the
page->anon_vma->root_anon_vma->same_anon_vma you lost my support as it
makes the whole effort pointless compared to 2.6.32 as far as 99% of my
workloads are concerned.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-08 15:39 ` Andrea Arcangeli
@ 2010-05-08 17:02 ` Linus Torvalds
2010-05-08 18:04 ` Andrea Arcangeli
2010-05-09 19:23 ` Mel Gorman
1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-08 17:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel,
Peter Zijlstra
On Sat, 8 May 2010, Andrea Arcangeli wrote:
>
> I'm simply not going to support the degradation to the root anon_vma
> complexity in aa.git, except for strict merging requirements [ ..]
Goodie. That makes things easier for me - there's never going to be any
issue of whether I need to even _think_ about merging the piece of crap.
In other words - if people want anything like that merged, you had better
work on cleaning up the locking. Because I absolutely WILL NOT apply any
of the f*cking broken locking patches that I've seen, when I've personally
told people how to fix the thing.
In other words, it's all _your_ problem.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-08 17:02 ` Linus Torvalds
@ 2010-05-08 18:04 ` Andrea Arcangeli
2010-05-08 19:51 ` Linus Torvalds
0 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2010-05-08 18:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel,
Peter Zijlstra
On Sat, May 08, 2010 at 10:02:50AM -0700, Linus Torvalds wrote:
>
>
> On Sat, 8 May 2010, Andrea Arcangeli wrote:
> >
> > I'm simply not going to support the degradation to the root anon_vma
> > complexity in aa.git, except for strict merging requirements [ ..]
>
> Goodie. That makes things easier for me - there's never going to be any
> issue of whether I need to even _think_ about merging the piece of crap.
>
> In other words - if people want anything like that merged, you had better
> work on cleaning up the locking. Because I absolutely WILL NOT apply any
> of the f*cking broken locking patches that I've seen, when I've personally
> told people how to fix the thing.
There is no broken (as in kernel crashing) locking in my THP
patches. It is more stable than mainline as far as migration is
concerned.
And the patch I think you're calling "broken" to fix the anon-vma
locking wasn't written by me (I only fixed the exec vs migrate race in
the way _I_ prefer and I still prefer, which is another bug not
related to the new anon-vma changes).
If the other patch you call broken is this one:
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=f42cfe59ed4ea474ea22aeec033aad408e1fb9d2
then go figure how broken it is to do all these anon-vma changes, with all
complexity and risk involved with the only benefit of running the rmap
walks faster, and in the end the very function called rmap_walk() runs
even slower than 2.6.32. _That_ is broken in my view, and I'm not
saying I like the patch in the above link (I suggested a shared lock
from the start if you check) but it surely is less broken than what
you're discussing here about the root anon-vma.
> In other words, it's all _your_ problem.
And I already said in the very previous email I sent in this thread, I
will also maintain a for-mainline tree trying to support this
root-anon-vma thing in case you merge it by mistake, just to avoid
this new anon-vma locking changes to be involved in the merging
issues of THP.
How you mix the merging of THP with how to solve this migrate crash
that only affects mainline new anon-vma code without THP, I've no
idea, especially considering I mentioned I will support whatever
anon-vma locking that will go in mainline in a separate branch
(clearly it won't be the one I recommend to use for the very reasons
explained above but I'll provide it for merging).
Also note, if you refuse to merge THP it's not my problem.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-08 18:04 ` Andrea Arcangeli
@ 2010-05-08 19:51 ` Linus Torvalds
0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-08 19:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel,
Peter Zijlstra
On Sat, 8 May 2010, Andrea Arcangeli wrote:
>
> There is no broken (as in kernel crashing) locking in my THP
> patches.
It's not about crashing. It's about being a totally unmaintainable mess,
with ad-hoc temporary allocations, and loops over an unknown number of
spinlocks.
That's _broken_. B. R. O. K. E. N.
And in all cases there are fixes that I've pointed out. If you can't see
that, then that's _your_ problem. If you (or others) want your code
merged, then it had better _fix_ the total disaster before merging. It's
that simple.
The "lock root" thing you complain should also be easily fixable, by
keeping the root lock a separate issue from walking the actual anon_vma
(ie walk the anon_vma, but lock the root). You still don't have to lock
the whole list.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-08 15:39 ` Andrea Arcangeli
2010-05-08 17:02 ` Linus Torvalds
@ 2010-05-09 19:23 ` Mel Gorman
1 sibling, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-09 19:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel, Linus Torvalds, Peter Zijlstra
On Sat, May 08, 2010 at 05:39:22PM +0200, Andrea Arcangeli wrote:
> On Fri, May 07, 2010 at 12:20:52AM +0100, Mel Gorman wrote:
> > @@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
> > {
> > struct stable_node *stable_node;
> > struct hlist_node *hlist;
> > + struct anon_vma *nested_anon_vma = NULL;
> > struct rmap_item *rmap_item;
> > int ret = SWAP_AGAIN;
> > int search_new_forks = 0;
> > @@ -1671,9 +1672,16 @@ again:
> > struct anon_vma_chain *vmac;
> > struct vm_area_struct *vma;
> >
> > - spin_lock(&anon_vma->lock);
> > + anon_vma = anon_vma_lock_root(anon_vma);
> > + if (nested_anon_vma) {
> > + spin_unlock(&nested_anon_vma->lock);
> > + nested_anon_vma = NULL;
> > + }
> > list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> > vma = vmac->vma;
> > + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> > + vma->anon_vma, anon_vma);
> > +
> > if (rmap_item->address < vma->vm_start ||
> > rmap_item->address >= vma->vm_end)
> > continue;
> > @@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> > * are holding mmap_sem. Users without mmap_sem are required to
> > * take a reference count to prevent the anon_vma disappearing
> > */
> > - anon_vma = page_anon_vma(page);
> > + anon_vma = page_anon_vma_lock_root(page);
> > if (!anon_vma)
> > return ret;
> > - spin_lock(&anon_vma->lock);
> > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> > struct vm_area_struct *vma = avc->vma;
> > - unsigned long address = vma_address(page, vma);
> > - if (address == -EFAULT)
> > - continue;
> > - ret = rmap_one(page, vma, address, arg);
> > + unsigned long address;
> > +
> > + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> > + vma->anon_vma, anon_vma);
> > + address = vma_address(page, vma);
> > + if (address != -EFAULT)
> > + ret = rmap_one(page, vma, address, arg);
> > +
> > if (ret != SWAP_AGAIN)
> > break;
> > }
> > +
> > + if (nested_anon_vma)
> > + spin_unlock(&nested_anon_vma->lock);
> > +
> > spin_unlock(&anon_vma->lock);
> > return ret;
> > }
>
> I already told Mel by PM. This degrades the new-anon_vma code to an
> even _slower_ mode than the old anon-vma code in 2.6.32 (the same in
> math complexity terms but slower in practice) for migrate. Furthermore
> page_referenced() may now return true even if there are young ptes
> that simply get lost in the rmap walk.
>
Yes, it does degrade it for migration but I don't consider that to be
the critical path at the moment. If it's important that it's faster,
then the thing to do is to take the anon_vma reference counting stuff
from Peter's patches, remove the RCU locking and always lock the root.
> The new anon-vma code is mostly relevant for migrate and memory
> compaction and transparent hugepage support where it gets invoked even
> if there's plenty of free memory and no I/O load at all. So whatever
> you save during swap, you'll lose while transparent hugepage support
> allocate the pages. So the above fix renders the whole effort
> pointless as far as I'm concerned.
>
But this patch can be incrementally progressed to remove RCU and always
lock the root, right?
> I think Rik's patch is the only sensible solution that won't
> invalidate the whole effort for transparent hugepage.
>
But it slows the mmap/munmap/etc path, right? And I don't recall seeing
a way that the cost of it could be reduced.
> About how to adapt split_huge_page to the root anon_vma I didn't even
> think about it yet. All I can tell you right now is that
> wait_split_huge_page can be changed to wait on the pmd_trans_splitting
> (or alternatively the pmd_trans_huge bit) bit to go away in a
> cpu_relax() barrier() loop. But the page->mapping/anon_vma->lock is
> also used to serialize against parallel split_huge_page but I guess
> taking the root anon_vma lock in split_huge_page() should work
> fine.
I would expect so. It would be preferable from your perspective to
always take the root lock on all paths but without the reference
counting, it requires RCU in the main paths.
> Just I'm not going to do that except maybe in a for-mainline
> branch, but I'll keep master branch with the old-anon-vma 2.6.32 code
> and the anon_vma_branch with Rik's fix that allows to take advantage
> of the new anon-vma code (so it's not purely gratuitous complexity
> added for nothing) also in migrate.c from memory compaction (that runs
> 24/7 on all my systems and it's much more frequent than the swap rmap
> walks that in fact never ever happens here), and in the rmap walks in
> split_huge_page too (which are not so frequent especially after
> Johannes implemented native mprotect on hugepages but it's obviously
> still more frequent than swap).
>
> I'm simply not going to support the degradation to the root anon_vma
> complexity in aa.git, except for strict merging requirements but I'll
> keep backing it out in aa.git or I'd rather stick to old-anon-vma
> code which at least is much simpler and saves memory too (as there are
> many fewer anon-vma and no avc, and less useless locks).
>
> What I instead already mentioned once was to add a _shared_ lock so
> you share the spinlock across the whole forest but you keep walking
> the right page->anon_vma->same_anon_vma!
Is the root lock not such a lock, as long as it is always used?
> The moment you walk the
> page->anon_vma->root_anon_vma->same_anon_vma you lost my support as it
> makes the whole effort pointless compared to 2.6.32 as far as 99% of my
> workloads are concerned.
>
Can we agree on this patch at least? Even on its own, I cannot reproduce
the migration-related problems racing with exec with this applied. Yes,
it adds some magic to exec, but the cost of exec remains the same.
==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise a migration entry can be instantiated but left
behind if the second rmap_walk fails to find the page. If this page is
later faulted, migration_entry_to_page() will BUG() because the page is
not locked, indicating that the page was migrated but the migration PTE
was not cleaned up. For example:
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---
There is a race between shift_arg_pages and migration that triggers this bug.
A temporary stack is setup during exec and later moved. If migration moves
a page in the temporary stack and the VMA is then removed before migration
completes, the migration PTE may not be found leading to a BUG when the
stack is faulted.
Ideally, shift_arg_pages would run atomically with respect to rmap_walk
by holding the anon_vma lock, but this is problematic as pages must be
allocated for page tables, which cannot happen with a spinlock held. Instead,
this patch skips such processes by assuming that a VMA with stack flags
set and a map_count == 1 belongs to an exec() that has not yet finalised
the temporary stack, and does not migrate its pages. Memory hot-remove will
try again, sys_move_pages() wouldn't be operating during exec() time, and
memory compaction will just continue to another page without concern.
[kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/rmap.c | 30 +++++++++++++++++++++++++++++-
1 files changed, 29 insertions(+), 1 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index f95b66d..3bb6c9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1143,6 +1143,20 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ /* If only the stack is mapped, assume exec is in progress */
+ if (vma->vm_mm->map_count == 1)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1171,7 +1185,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_temporary_stack(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
--
1.6.5
* [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-06 23:20 [PATCH 0/2] Fix migration races in rmap_walk() V7 Mel Gorman
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-06 23:20 ` Mel Gorman
2010-05-07 1:40 ` Linus Torvalds
2010-05-07 8:13 ` [PATCH 0/2] Fix migration races in rmap_walk() V7 KAMEZAWA Hiroyuki
2 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 23:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra,
Mel Gorman
From: Andrea Arcangeli <aarcange@redhat.com>
Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example:
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---
This particular BUG_ON is caused by a race between shift_arg_pages and
migration. During exec, a temporary stack is created and later moved to
its final location. If migration selects a page within the temporary stack,
the page tables and migration PTE can be copied to the new location before
rmap_walk is able to find the copy. This leaves a dangling migration PTE
behind that later triggers the bug.
This patch fixes the problem by using two VMAs - one which covers the
temporary stack and the other which covers the new location. This guarantees
that rmap can always find the migration PTE even if it is copied while
rmap_walk is taking place.
[mel@csn.ul.ie: Tested and rewrote changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
fs/exec.c | 37 +++++++++++++++++++++++++++++++++----
1 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 725d7ef..fd0abff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+ INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+ /*
* cover the whole range: [new_start, old_end)
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+ tmp_vma->vm_start = new_start;
+ /*
+ * The tmp_vma destination of the copy (with the new vm_start)
+ * has to be at the end of the anon_vma list for the rmap_walk
+ * to find the moved pages at all times.
+ */
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
return -ENOMEM;
+ }
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ vma->vm_start = new_start;
+ /* rmap walk will already find all pages using the new_start */
+ unlink_anon_vmas(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;
lru_add_drain();
@@ -551,7 +580,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
return 0;
}
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-06 23:20 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
@ 2010-05-07 1:40 ` Linus Torvalds
2010-05-07 1:57 ` KAMEZAWA Hiroyuki
2010-05-07 9:16 ` Mel Gorman
0 siblings, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-07 1:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Fri, 7 May 2010, Mel Gorman wrote:
>
> Page migration requires rmap to be able to find all migration ptes
> created by migration. If the second rmap_walk clearing migration PTEs
> misses an entry, it is left dangling causing a BUG_ON to trigger during
> fault. For example;
So I still absolutely detest this patch.
Why didn't the other - much simpler - patch work? The one Rik pointed to:
http://lkml.org/lkml/2010/4/30/198
and didn't do that _disgusting_ temporary anon_vma?
Alternatively, why don't we just take the anon_vma lock over this region,
so that rmap can't _walk_ the damn thing?
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 1:40 ` Linus Torvalds
@ 2010-05-07 1:57 ` KAMEZAWA Hiroyuki
2010-05-07 2:12 ` Linus Torvalds
2010-05-07 9:16 ` Mel Gorman
1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-07 1:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, 6 May 2010 18:40:39 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 7 May 2010, Mel Gorman wrote:
> >
> > Page migration requires rmap to be able to find all migration ptes
> > created by migration. If the second rmap_walk clearing migration PTEs
> > misses an entry, it is left dangling causing a BUG_ON to trigger during
> > fault. For example;
>
> So I still absolutely detest this patch.
>
> Why didn't the other - much simpler - patch work? The one Rik pointed to:
>
> http://lkml.org/lkml/2010/4/30/198
>
> and didn't do that _disgusting_ temporary anon_vma?
>
I vote for the simple one rather than the temporary anon_vma. IIUC, this patch was
selected so as not to leak exec's problem out into mm/ via a magical check.
> Alternatively, why don't we just take the anon_vma lock over this region,
> so that rmap can't _walk_ the damn thing?
>
IIUC, move_page_tables() may need to do page table allocation, and that cannot be
done under a spinlock.
Thanks,
-Kame
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 1:57 ` KAMEZAWA Hiroyuki
@ 2010-05-07 2:12 ` Linus Torvalds
2010-05-07 4:19 ` KAMEZAWA Hiroyuki
2010-05-09 19:21 ` Mel Gorman
0 siblings, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-07 2:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Fri, 7 May 2010, KAMEZAWA Hiroyuki wrote:
>
> IIUC, move_page_tables() may call "page table allocation" and it cannot be
> done under spinlock.
Bah. It only does a "alloc_new_pmd()", and we could easily move that out
of the loop and pre-allocate the pmd's.
If that's the only reason, then it's a really weak one, methinks.
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 2:12 ` Linus Torvalds
@ 2010-05-07 4:19 ` KAMEZAWA Hiroyuki
2010-05-07 14:18 ` Linus Torvalds
2010-05-09 19:21 ` Mel Gorman
1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-07 4:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, 6 May 2010 19:12:59 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 7 May 2010, KAMEZAWA Hiroyuki wrote:
> >
> > IIUC, move_page_tables() may call "page table allocation" and it cannot be
> > done under spinlock.
>
> Bah. It only does a "alloc_new_pmd()", and we could easily move that out
> of the loop and pre-allocate the pmd's.
>
> If that's the only reason, then it's a really weak one, methinks.
>
Hmm, is this too slow? This is the simplest one I have.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
exec()'s shift_arg_pages() calls vma_adjust() and move_page_tables() directly.
While this runs, the rmap information (i.e. the page <-> pte <-> address <-> vma
relationship) is inconsistent. This causes a bug in rmap_walk(), which
cannot find valid ptes it must visit.
move_vma() handles this race correctly. So, making use of
move_vma() instead of bare move_page_tables() is one option.
Pros.
- all races are handled under mm/; very simple.
Cons.
- This may make exec() a bit slower.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/exec.c | 64 +++++++++++++++--------------------------------------
include/linux/mm.h | 6 ++--
mm/mremap.c | 4 +--
3 files changed, 24 insertions(+), 50 deletions(-)
Index: linux-2.6.34-rc5-mm1/fs/exec.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/fs/exec.c
+++ linux-2.6.34-rc5-mm1/fs/exec.c
@@ -486,14 +486,7 @@ EXPORT_SYMBOL(copy_strings_kernel);
/*
* During bprm_mm_init(), we create a temporary stack at STACK_TOP_MAX. Once
* the binfmt code determines where the new stack should reside, we shift it to
- * its final location. The process proceeds as follows:
- *
- * 1) Use shift to calculate the new vma endpoints.
- * 2) Extend vma to cover both the old and new ranges. This ensures the
- * arguments passed to subsequent functions are consistent.
- * 3) Move vma's page tables to the new range.
- * 4) Free up any cleared pgd range.
- * 5) Shrink the vma to cover only the new range.
+ * its final location.
*/
static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
{
@@ -503,7 +496,7 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
- struct mmu_gather *tlb;
+ unsigned long ret;
BUG_ON(new_start > new_end);
@@ -514,45 +507,23 @@ static int shift_arg_pages(struct vm_are
if (vma != find_vma(mm, new_start))
return -EFAULT;
- /*
- * cover the whole range: [new_start, old_end)
- */
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
- return -ENOMEM;
-
- /*
- * move the page tables downwards, on failure we rely on
- * process cleanup to remove whatever mess we made.
- */
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
- return -ENOMEM;
+ if (new_end > old_start) { /* overlap */
+ unsigned long part_len = new_end - old_start;
+ ret = move_vma(vma, old_start, part_len, part_len, new_start);
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- if (new_end > old_start) {
- /*
- * when the old and new regions overlap clear from new_end.
- */
- free_pgd_range(tlb, new_end, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : 0);
+ if (ret != new_start)
+ return -ENOMEM;
+ /* old_vma is split.. */
+ vma = find_vma(mm, old_start + part_len);
+ ret = move_vma(vma, old_start + part_len, length - part_len,
+ length - part_len, new_start + part_len);
+ if (ret != new_start + part_len)
+ return -ENOMEM;
} else {
- /*
- * otherwise, clean from old_start; this is done to not touch
- * the address space in [new_end, old_start) some architectures
- * have constraints on va-space that make this illegal (IA64) -
- * for the others its just a little faster.
- */
- free_pgd_range(tlb, old_start, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : 0);
+ ret = move_vma(vma, old_start, length, length, new_start);
+ if (ret != new_start)
+ return -ENOMEM;
}
- tlb_finish_mmu(tlb, new_end, old_end);
-
- /*
- * Shrink the vma to just the new range. Always succeeds.
- */
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
-
return 0;
}
@@ -625,9 +596,12 @@ int setup_arg_pages(struct linux_binprm
/* Move stack pages down in memory. */
if (stack_shift) {
+ unsigned long new_start = vma->vm_start - stack_shift;
ret = shift_arg_pages(vma, stack_shift);
if (ret)
goto out_unlock;
+ vma = find_vma(mm, new_start);
+ bprm->vma = vma;
}
stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
Index: linux-2.6.34-rc5-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.34-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.34-rc5-mm1/include/linux/mm.h
@@ -856,9 +856,9 @@ int set_page_dirty_lock(struct page *pag
int set_page_dirty_notag(struct page *page);
int clear_page_dirty_for_io(struct page *page);
-extern unsigned long move_page_tables(struct vm_area_struct *vma,
- unsigned long old_addr, struct vm_area_struct *new_vma,
- unsigned long new_addr, unsigned long len);
+extern unsigned long move_vma(struct vm_area_struct *vma,
+ unsigned long old_addr, unsigned long old_length,
+ unsigned long new_length, unsigned long new_addr);
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
Index: linux-2.6.34-rc5-mm1/mm/mremap.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/mm/mremap.c
+++ linux-2.6.34-rc5-mm1/mm/mremap.c
@@ -128,7 +128,7 @@ static void move_ptes(struct vm_area_str
#define LATENCY_LIMIT (64 * PAGE_SIZE)
-unsigned long move_page_tables(struct vm_area_struct *vma,
+static unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len)
{
@@ -162,7 +162,7 @@ unsigned long move_page_tables(struct vm
return len + old_addr - old_end; /* how much done */
}
-static unsigned long move_vma(struct vm_area_struct *vma,
+unsigned long move_vma(struct vm_area_struct *vma,
unsigned long old_addr, unsigned long old_len,
unsigned long new_len, unsigned long new_addr)
{
--
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 4:19 ` KAMEZAWA Hiroyuki
@ 2010-05-07 14:18 ` Linus Torvalds
0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-07 14:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Fri, 7 May 2010, KAMEZAWA Hiroyuki wrote:
>
> Hmm, is this too slow ? This is the simplest one I have.
Well, it may be as slow (or slower) than Andrea's, but at least it is
_clean_ and actually removes code. So if we can't do it better, I'd
certainly prefer this to the horribly hacky one.
That said, I still think we could easily just split up
"move_page_tables()" into two functions - one that just does the page
table allocation, and one that actually moves the entries.
In fact, I think that would even clean up the error case for move_vma()
too - the page table entry movement itself could never fail, so you never
end up with that insane "move back" case.
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 2:12 ` Linus Torvalds
2010-05-07 4:19 ` KAMEZAWA Hiroyuki
@ 2010-05-09 19:21 ` Mel Gorman
2010-05-09 19:56 ` Linus Torvalds
2010-05-10 0:32 ` KAMEZAWA Hiroyuki
1 sibling, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-09 19:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, May 06, 2010 at 07:12:59PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 7 May 2010, KAMEZAWA Hiroyuki wrote:
> >
> > IIUC, move_page_tables() may call "page table allocation" and it cannot be
> > done under spinlock.
>
> Bah. It only does a "alloc_new_pmd()", and we could easily move that out
> of the loop and pre-allocate the pmd's.
>
> If that's the only reason, then it's a really weak one, methinks.
>
It turns out not to be easy to preallocate the PUDs, PMDs and PTEs that
move_page_tables() needs. To avoid overallocating, the caller has to follow the
same logic as move_page_tables(), duplicating some code in the process. The ugliest
aspect of all is passing those pre-allocated pages back into move_page_tables(),
where they need to be passed down to functions such as __pte_alloc(). It gets
extremely messy.
I stopped working on it about half way through as it was already too ugly
to live and would have similar cost to Kamezawa's much more straight-forward
approach of using move_vma().
While using move_vma is straight-forward and solves the problem, it's
not as cheap as Andrea's solution. Andrea allocates a temporary VMA and
puts it on a list and very little else. It didn't show up any problems
in microbenchmarks. Calling move_vma() does a lot more work, particularly in
copy_vma(), and this slows down exec.
With Kamezawa's patch, kernbench was fine on wall time, but on System Time
it slowed down by 1.48% in comparison to Andrea's slowing down by 0.64%[1].
aim9 slowed as well: Kamezawa's slowed by 2.77% whereas Andrea's reported
2.58% faster. While AIM9 is flaky and these figures are barely outside
the noise, calling move_vma() is obviously more expensive.
While my solution at http://lkml.org/lkml/2010/4/30/198 is cheapest as it
does not touch exec() at all, is_vma_temporary_stack() could be broken in
the future if any of the assumptions it makes change.
So what you have is an inverse relationship between magic and
performance. Mine has the most magic and is fastest. Kamezawa's has the
least magic but is slowest, and Andrea's has the goldilocks factor. Which do
you prefer?
[1] One caveat of the performance tests was that a lot of debugging such
as lockdep was enabled. Disabling these would give different results
but it should still be the case that calling move_vma is more expensive
than calling kmem_cache_alloc.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:21 ` Mel Gorman
@ 2010-05-09 19:56 ` Linus Torvalds
2010-05-09 20:06 ` Linus Torvalds
` (3 more replies)
2010-05-10 0:32 ` KAMEZAWA Hiroyuki
1 sibling, 4 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-09 19:56 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010, Mel Gorman wrote:
>
> It turns out not to be easy to the preallocating of PUDs, PMDs and PTEs
> move_page_tables() needs. To avoid overallocating, it has to follow the same
> logic as move_page_tables duplicating some code in the process. The ugliest
> aspect of all is passing those pre-allocated pages back into move_page_tables
> where they need to be passed down to such functions as __pte_alloc. It turns
> extremely messy.
Umm. What?
That's crazy talk. I'm not talking about preallocating stuff in order to
pass it in to move_page_tables(). I'm talking about just _creating_ the
dang page tables early - preallocating them IN THE PROCESS VM SPACE.
IOW, a patch like this (this is a pseudo-patch, totally untested, won't
compile, yadda yadda - you need to actually make the people who call
"move_page_tables()" call that prepare function first etc etc)
Yeah, if we care about holes in the page tables, we can certainly copy
more of the move_page_tables() logic, but it certainly doesn't matter for
execve(). This just makes sure that the destination page tables exist
first.
Linus
---
mm/mremap.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index cde56ee..c14505c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -128,6 +128,26 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
#define LATENCY_LIMIT (64 * PAGE_SIZE)
+/*
+ * Preallocate the page tables, so that we can do the actual move
+ * without any allocations, and thus no error handling etc.
+ */
+int prepare_move_page_tables(struct vm_area_struct *vma,
+ unsigned long old_addr, struct vm_area_struct *new_vma,
+ unsigned long new_addr, unsigned long len)
+{
+ unsigned long end_addr = new_addr + len;
+
+ while (new_addr < end_addr) {
+ pmd_t *new_pmd;
+ new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+ if (!new_pmd)
+ return -ENOMEM;
+ new_addr = (new_addr + PMD_SIZE) & PMD_MASK;
+ }
+ return 0;
+}
+
unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len)
@@ -147,7 +167,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
if (!old_pmd)
continue;
- new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+ new_pmd = get_old_pmd(vma->vm_mm, new_addr);
if (!new_pmd)
break;
next = (new_addr + PMD_SIZE) & PMD_MASK;
--
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:56 ` Linus Torvalds
@ 2010-05-09 20:06 ` Linus Torvalds
2010-05-09 20:20 ` Linus Torvalds
2010-05-10 0:40 ` KAMEZAWA Hiroyuki
` (2 subsequent siblings)
3 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-09 20:06 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010, Linus Torvalds wrote:
>
> IOW, a patch like this (this is a pseudo-patch, totally untested, won't
> compile, yadda yadda - you need to actually make the people who call
> "move_page_tables()" call that prepare function first etc etc)
Btw, that pseudo-patch is also likely the least efficient way to do this,
since it re-walks the page directories for each pmd it creates.
Also - the thing is, we can easily choose to do this _just_ for the case
of shifting argument pages, which really is a special case (the generic
case of "mremap()" is a lot more complicated, and doesn't have the issue
with rmap because it's doing all the "new vma" setup etc).
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 20:06 ` Linus Torvalds
@ 2010-05-09 20:20 ` Linus Torvalds
0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-09 20:20 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010, Linus Torvalds wrote:
>
> Also - the thing is, we can easily choose to do this _just_ for the case
> of shifting argument pages, which really is a special case (the generic
> case of "mremap()" is a lot more complicated, and doesn't have the issue
> with rmap because it's doing all the "new vma" setup etc).
Looking at the stack setup code, we have other things we probably might
want to look at.
For example, we do that "mprotect_fixup()" to fix up possible vm_flags
issues (notably whether execute bit is set or not), and that's absolutely
something that we probably should do at the same time as moving the stack,
so that we don't end up walking - and changing - the page tables _twice_.
So the stack setup really is a total special case, and it looks like we do
some rather stupid things there. Having a specialized routine that does
just:
- if we're moving things, fill in the new page tables (we know they are
dense, so my "stupid" routine actually does the right thing despite
being pretty simplistic) before moving.
- if we're moving _or_ changing protections, do a
for_each_page()
move_andor_fix_protection()
which kind of looks like the current "move_page_tables()", except we
know it doesn't need new allocations.
Hmm?
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:56 ` Linus Torvalds
2010-05-09 20:06 ` Linus Torvalds
@ 2010-05-10 0:40 ` KAMEZAWA Hiroyuki
2010-05-10 1:30 ` Linus Torvalds
2010-05-10 0:42 ` KAMEZAWA Hiroyuki
2010-05-10 13:49 ` Mel Gorman
3 siblings, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-10 0:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010 12:56:49 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sun, 9 May 2010, Mel Gorman wrote:
> >
> > It turns out not to be easy to the preallocating of PUDs, PMDs and PTEs
> > move_page_tables() needs. To avoid overallocating, it has to follow the same
> > logic as move_page_tables duplicating some code in the process. The ugliest
> > aspect of all is passing those pre-allocated pages back into move_page_tables
> > where they need to be passed down to such functions as __pte_alloc. It turns
> > extremely messy.
>
> Umm. What?
>
> That's crazy talk. I'm not talking about preallocating stuff in order to
> pass it in to move_page_tables(). I'm talking about just _creating_ the
> dang page tables early - preallocating them IN THE PROCESS VM SPACE.
>
> IOW, a patch like this (this is a pseudo-patch, totally untested, won't
> compile, yadda yadda - you need to actually make the people who call
> "move_page_tables()" call that prepare function first etc etc)
>
> Yeah, if we care about holes in the page tables, we can certainly copy
> more of the move_page_tables() logic, but it certainly doesn't matter for
> execve(). This just makes sure that the destination page tables exist
> first.
>
IMHO, I think move_page_tables() itself should be implemented as in your patch.
But move_page_tables() failing is not a big problem. On failure,
exec will abort and no page fault will occur later. What we have to do in
this migration patch series is avoid inconsistent updates of the set of
[page, vma->vm_start, vma->vm_pgoff, ptes], or "don't migrate pages in exec's
stack".
Considering cost, as Mel shows, "don't migrate pages in exec's stack" seems
reasonable. But I still have doubts about this check.
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ /* If only the stack is mapped, assume exec is in progress */
+ if (vma->vm_mm->map_count == 1) -------------------(*)
+ return true;
+
+ return false;
+}
+
Mel, can (*) be safe even for the a.out format (formats other than ELF)?
Thanks,
-Kame
> Linus
>
> ---
> mm/mremap.c | 22 +++++++++++++++++++++-
> 1 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cde56ee..c14505c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -128,6 +128,26 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
>
> +/*
> + * Preallocate the page tables, so that we can do the actual move
> + * without any allocations, and thus no error handling etc.
> + */
> +int prepare_move_page_tables(struct vm_area_struct *vma,
> + unsigned long old_addr, struct vm_area_struct *new_vma,
> + unsigned long new_addr, unsigned long len)
> +{
> + unsigned long end_addr = new_addr + len;
> +
> + while (new_addr < end_addr) {
> + pmd_t *new_pmd;
> + new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + if (!new_pmd)
> + return -ENOMEM;
> + new_addr = (new_addr + PMD_SIZE) & PMD_MASK;
> + }
> + return 0;
> +}
> +
> unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len)
> @@ -147,7 +167,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> if (!old_pmd)
> continue;
> - new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + new_pmd = get_old_pmd(vma->vm_mm, new_addr);
> if (!new_pmd)
> break;
> next = (new_addr + PMD_SIZE) & PMD_MASK;
>
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 0:40 ` KAMEZAWA Hiroyuki
@ 2010-05-10 1:30 ` Linus Torvalds
2010-05-10 1:32 ` Linus Torvalds
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-10 1:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Mon, 10 May 2010, KAMEZAWA Hiroyuki wrote:
>
> But, move_page_tables()'s failure is not a big problem.
Well, yes and no.
It's not a problem because it fails, but because it does the allocation.
Which means that we can't protect the thing with the (natural) anon_vma
locking.
> Considering cost, as Mel shows, "don't migrate apges in exec's stack" seems
> reasonable. But, I still doubt this check.
Well, I actually always liked Mel's patch, the one he calls "magic". I
think it's _less_ magic than the crazy "let's create another vma and
anon_vma chain just because migration has its thumb up its ass".
So I never disliked that patch. I'm perfectly happy with a "don't migrate
these pages at all, because they are in a half-way state in the middle of
execve stack magic".
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 1:30 ` Linus Torvalds
@ 2010-05-10 1:32 ` Linus Torvalds
2010-05-10 1:40 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-10 1:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010, Linus Torvalds wrote:
>
> So I never disliked that patch. I'm perfectly happy with a "don't migrate
> these pages at all, because they are in a half-way state in the middle of
> execve stack magic".
Btw, I also think that Mel's patch could be made a lot _less_ magic by
just marking that initial stack vma with a VM_STACK_INCOMPLETE_SETUP bit,
instead of doing that "maybe_stack" thing. We could easily make that
initial vma setup very explicit indeed, and then just clear that bit when
we've moved the stack to its final position.
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 1:32 ` Linus Torvalds
@ 2010-05-10 1:40 ` KAMEZAWA Hiroyuki
2010-05-10 1:49 ` Linus Torvalds
2010-05-10 13:24 ` Mel Gorman
0 siblings, 2 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-10 1:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010 18:32:32 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sun, 9 May 2010, Linus Torvalds wrote:
> >
> > So I never disliked that patch. I'm perfectly happy with a "don't migrate
> > these pages at all, because they are in a half-way state in the middle of
> > execve stack magic".
>
> Btw, I also think that Mel's patch could be made a lot _less_ magic by
> just marking that initial stack vma with a VM_STACK_INCOMPLETE_SETUP bit,
> instead of doing that "maybe_stack" thing. We could easily make that
> initial vma setup very explicit indeed, and then just clear that bit when
> we've moved the stack to its final position.
>
Hmm. vm_flags is still 32-bit... (I think it should be long long)
Using a combination of existing flags...
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
Can be used instead of checking mapcount, I think.
Thanks,
-Kame
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 1:40 ` KAMEZAWA Hiroyuki
@ 2010-05-10 1:49 ` Linus Torvalds
2010-05-10 13:24 ` Mel Gorman
1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-10 1:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Mon, 10 May 2010, KAMEZAWA Hiroyuki wrote:
>
> Hmm. vm_flags is still 32-bit... (I think it should be long long)
>
> Using a combination of existing flags...
>
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
>
> Can be used instead of checking mapcount, I think.
Ahh, yes. We can also do things like not having VM_MAYREAD/VM_MAYWRITE set.
That's impossible on a real mapping - even if it's not readable, it is
always something you could mprotect to _be_ readable.
The point being, we can make the tests more explicit, and less "magic that
happens to work". As long as it's ok to just say "don't migrate pages in
this mapping yet, because we're still setting it up".
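The "impossible combination" idea can be sketched in plain C. The flag values below are illustrative stand-ins, not the actual kernel definitions (which live in include/linux/mm.h and vary by configuration), and the helper name is hypothetical:

```c
#include <stdbool.h>

/* Illustrative flag values; the real definitions are in include/linux/mm.h. */
#define VM_GROWSDOWN 0x00000100UL
#define VM_GROWSUP   0x00000200UL
#define VM_SEQ_READ  0x00008000UL
#define VM_RAND_READ 0x00010000UL

/* VM_RAND_READ and VM_SEQ_READ are mutually exclusive madvise() hints,
 * so no ordinary VMA ever carries both set at once: the pair can be
 * borrowed as a "stack setup incomplete" marker without a new bit. */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

/* Hypothetical helper: true only for a stack-like VMA still carrying
 * the otherwise-impossible marker combination. */
static bool is_vma_temporary_stack_flags(unsigned long vm_flags)
{
	if (!(vm_flags & (VM_GROWSDOWN | VM_GROWSUP)))
		return false;
	return (vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
	       VM_STACK_INCOMPLETE_SETUP;
}
```

Because the test requires both bits, a VMA with only one of the hints set (a legitimate madvise'd mapping) is never misidentified.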
Linus
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 1:40 ` KAMEZAWA Hiroyuki
2010-05-10 1:49 ` Linus Torvalds
@ 2010-05-10 13:24 ` Mel Gorman
2010-05-10 23:55 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-10 13:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Mon, May 10, 2010 at 10:40:39AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 9 May 2010 18:32:32 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> >
> >
> > On Sun, 9 May 2010, Linus Torvalds wrote:
> > >
> > > So I never disliked that patch. I'm perfectly happy with a "don't migrate
> > > these pages at all, because they are in a half-way state in the middle of
> > > execve stack magic".
> >
> > Btw, I also think that Mel's patch could be made a lot _less_ magic by
> > just marking that initial stack vma with a VM_STACK_INCOMPLETE_SETUP bit,
> > instead of doing that "maybe_stack" thing. We could easily make that
> > initial vma setup very explicit indeed, and then just clear that bit when
> > we've moved the stack to its final position.
> >
>
> Hmm. vm_flags is still 32bit..(I think it should be long long)
>
This is why I didn't use a bit in vm_flags and I didn't want to increase
the size of any structure. I had thought of using a combination of flags
but thought people would prefer a test based on existing information such as
map_count. That said, I also made terrible choices for possible combinations
of flags :)
> Using a combination of existing flags...
>
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
>
> Can be used instead of checking mapcount, I think.
>
It can, and this is a very good combination of flags to base the test on.
How does the following look?
==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is not locked, indicating the page was migrated
but the migration PTE was not cleaned up. For example
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---
There is a race between shift_arg_pages and migration that triggers this bug.
A temporary stack is set up during exec and later moved. If migration moves
a page in the temporary stack and the VMA is then removed before migration
completes, the migration PTE may not be found, leading to a BUG when the
stack is faulted.
This patch causes pages within the temporary stack during exec to be skipped
by migration. It does this by marking the VMA covering the temporary stack
with an otherwise impossible combination of VMA flags. These flags are
cleared when the temporary stack is moved to its final location.
[kamezawa.hiroyu@jp.fujitsu.com: Idea for identifying and skipping temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
fs/exec.c | 7 ++++++-
include/linux/mm.h | 3 +++
mm/rmap.c | 30 +++++++++++++++++++++++++++++-
3 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 725d7ef..13f8e7f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -242,9 +242,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
* use STACK_TOP because that can depend on attributes which aren't
* configured yet.
*/
+ BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
vma->vm_end = STACK_TOP_MAX;
vma->vm_start = vma->vm_end - PAGE_SIZE;
- vma->vm_flags = VM_STACK_FLAGS;
+ vma->vm_flags = VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
INIT_LIST_HEAD(&vma->anon_vma_chain);
err = insert_vm_struct(mm, vma);
@@ -616,6 +617,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
else if (executable_stack == EXSTACK_DISABLE_X)
vm_flags &= ~VM_EXEC;
vm_flags |= mm->def_flags;
+ vm_flags |= VM_STACK_INCOMPLETE_SETUP;
ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
vm_flags);
@@ -630,6 +632,9 @@ int setup_arg_pages(struct linux_binprm *bprm,
goto out_unlock;
}
+ /* mprotect_fixup is overkill to remove the temporary stack flags */
+ vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
+
stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
stack_size = vma->vm_end - vma->vm_start;
/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index eb21256..925f5bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+/* Bits set in the VMA until the stack is in its final location */
+#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
+
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..e96565f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,20 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
+ VM_STACK_INCOMPLETE_SETUP)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1169,7 +1183,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_temporary_stack(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 13:24 ` Mel Gorman
@ 2010-05-10 23:55 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-10 23:55 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Mon, 10 May 2010 14:24:31 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, May 10, 2010 at 10:40:39AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Sun, 9 May 2010 18:32:32 -0700 (PDT)
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > >
> > >
> > > On Sun, 9 May 2010, Linus Torvalds wrote:
> > > >
> > > > So I never disliked that patch. I'm perfectly happy with a "don't migrate
> > > > these pages at all, because they are in a half-way state in the middle of
> > > > execve stack magic".
> > >
> > > Btw, I also think that Mel's patch could be made a lot _less_ magic by
> > > just marking that initial stack vma with a VM_STACK_INCOMPLETE_SETUP bit,
> > > instead of doing that "maybe_stack" thing. We could easily make that
> > > initial vma setup very explicit indeed, and then just clear that bit when
> > > we've moved the stack to its final position.
> > >
> >
> > Hmm. vm_flags is still 32bit..(I think it should be long long)
> >
>
> This is why I didn't use a bit in vm_flags and I didn't want to increase
> the size of any structure. I had thought of using a combination of flags
> but thought people would prefer a test based on existing information such as
> map_count. That said, I also made terrible choices for possible combination
> of flags :)
>
> > Using a combination of existing flags...
> >
> > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
> >
> > Can be used instead of checking mapcount, I think.
> >
>
> It can and this is a very good combination of flags to base the test on.
> How does the following look?
>
Seems nice to me.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> <SNIP>
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:56 ` Linus Torvalds
2010-05-09 20:06 ` Linus Torvalds
2010-05-10 0:40 ` KAMEZAWA Hiroyuki
@ 2010-05-10 0:42 ` KAMEZAWA Hiroyuki
2010-05-10 14:02 ` Mel Gorman
2010-05-10 13:49 ` Mel Gorman
3 siblings, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-10 0:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010 12:56:49 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sun, 9 May 2010, Mel Gorman wrote:
> >
> > It turns out not to be easy to do the preallocation of PUDs, PMDs and PTEs
> > move_page_tables() needs. To avoid overallocating, it has to follow the same
> > logic as move_page_tables, duplicating some code in the process. The ugliest
> > aspect of all is passing those pre-allocated pages back into move_page_tables
> > where they need to be passed down to such functions as __pte_alloc. It gets
> > extremely messy.
>
> Umm. What?
>
> That's crazy talk. I'm not talking about preallocating stuff in order to
> pass it in to move_page_tables(). I'm talking about just _creating_ the
> dang page tables early - preallocating them IN THE PROCESS VM SPACE.
>
> IOW, a patch like this (this is a pseudo-patch, totally untested, won't
> compile, yadda yadda - you need to actually make the people who call
> "move_page_tables()" call that prepare function first etc etc)
>
> Yeah, if we care about holes in the page tables, we can certainly copy
> more of the move_page_tables() logic, but it certainly doesn't matter for
> execve(). This just makes sure that the destination page tables exist
> first.
>
IMHO, I think move_page_tables() itself should be implemented as your patch.
But, move_page_tables()'s failure is not a big problem. At failure,
exec will abort and no page fault will occur later. What we have to do in
this migration-patch-series is avoiding inconsistent updates of sets of
[page, vma->vm_start, vma->pg_off, ptes] or "don't migrate pages in exec's
stack".
Considering cost, as Mel shows, "don't migrate pages in exec's stack" seems
reasonable. But, I still doubt this check.
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ /* If only the stack is mapped, assume exec is in progress */
+ if (vma->vm_mm->map_count == 1) -------------------(*)
+ return true;
+
+ return false;
+}
+
Mel, can (*) be safe even on a.out format (format other than ELFs) ?
Thanks,
-Kame
> Linus
>
> ---
> mm/mremap.c | 22 +++++++++++++++++++++-
> 1 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cde56ee..c14505c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -128,6 +128,26 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
>
> +/*
> + * Preallocate the page tables, so that we can do the actual move
> + * without any allocations, and thus no error handling etc.
> + */
> +int prepare_move_page_tables(struct vm_area_struct *vma,
> + unsigned long old_addr, struct vm_area_struct *new_vma,
> + unsigned long new_addr, unsigned long len)
> +{
> + unsigned long end_addr = new_addr + len;
> +
> + while (new_addr < end_addr) {
> + pmd_t *new_pmd;
> + new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + if (!new_pmd)
> + return -ENOMEM;
> + new_addr = (new_addr + PMD_SIZE) & PMD_MASK;
> + }
> + return 0;
> +}
> +
> unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len)
> @@ -147,7 +167,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> if (!old_pmd)
> continue;
> - new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + new_pmd = get_old_pmd(vma->vm_mm, new_addr);
> if (!new_pmd)
> break;
> next = (new_addr + PMD_SIZE) & PMD_MASK;
>
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-10 0:42 ` KAMEZAWA Hiroyuki
@ 2010-05-10 14:02 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-10 14:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Mon, May 10, 2010 at 09:42:38AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 9 May 2010 12:56:49 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> >
> >
> > On Sun, 9 May 2010, Mel Gorman wrote:
> > >
> > > It turns out not to be easy to do the preallocation of PUDs, PMDs and PTEs
> > > move_page_tables() needs. To avoid overallocating, it has to follow the same
> > > logic as move_page_tables, duplicating some code in the process. The ugliest
> > > aspect of all is passing those pre-allocated pages back into move_page_tables
> > > where they need to be passed down to such functions as __pte_alloc. It gets
> > > extremely messy.
> >
> > Umm. What?
> >
> > That's crazy talk. I'm not talking about preallocating stuff in order to
> > pass it in to move_page_tables(). I'm talking about just _creating_ the
> > dang page tables early - preallocating them IN THE PROCESS VM SPACE.
> >
> > IOW, a patch like this (this is a pseudo-patch, totally untested, won't
> > compile, yadda yadda - you need to actually make the people who call
> > "move_page_tables()" call that prepare function first etc etc)
> >
> > Yeah, if we care about holes in the page tables, we can certainly copy
> > more of the move_page_tables() logic, but it certainly doesn't matter for
> > execve(). This just makes sure that the destination page tables exist
> > first.
> >
> IMHO, I think move_page_tables() itself should be implemented as in your patch.
>
> But, move_page_tables()'s failure is not a big problem. At failure,
> exec will abort and no page fault will occur later. What we have to do in
> this migration patch series is avoid inconsistent updates of the set of
> [page, vma->vm_start, vma->vm_pgoff, ptes], or "don't migrate pages in exec's
> stack".
>
> Considering cost, as Mel shows, "don't migrate pages in exec's stack" seems
> reasonable. But I still doubt this check.
>
> +static bool is_vma_temporary_stack(struct vm_area_struct *vma)
> +{
> + int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
> +
> + if (!maybe_stack)
> + return false;
> +
> + /* If only the stack is mapped, assume exec is in progress */
> + if (vma->vm_mm->map_count == 1) -------------------(*)
> + return true;
> +
> + return false;
> +}
> +
>
> Mel, is (*) safe even for the a.out format (formats other than ELF)?
>
I felt it was safe because this happens before search_binary_handler is
called to find a handler to load the binary. Still, the suggestion to
use an impossible combination of VMA flags is more robust against any
future change.
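For reference, the more robust variant could key off an otherwise-impossible combination of VMA flags set while exec builds the temporary stack, rather than off map_count. A self-contained userspace sketch of that idea follows; the flag values and the VM_STACK_INCOMPLETE_SETUP name are illustrative stand-ins, not necessarily what any kernel release uses:

```c
#include <stdbool.h>

/* Illustrative stand-ins for vm_flags bits */
#define VM_GROWSDOWN 0x0100UL
#define VM_GROWSUP   0x0200UL
#define VM_RAND_READ 0x0010UL
#define VM_SEQ_READ  0x0008UL

/*
 * A mapping is never hinted both sequential-read and random-read, so the
 * combination can safely mark "stack whose setup is still in progress".
 */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

struct vm_area_struct {
	unsigned long vm_flags;
};

/* Variant keyed off the impossible flag combination instead of map_count */
static bool is_vma_temporary_stack(const struct vm_area_struct *vma)
{
	unsigned long maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);

	if (!maybe_stack)
		return false;

	return (vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
	       VM_STACK_INCOMPLETE_SETUP;
}
```

Exec would set the marker bits when creating the temporary stack and clear them once shift_arg_pages() completes, so the check cannot be confused by binary formats that map more than one VMA early.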
> <SNIP>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:56 ` Linus Torvalds
` (2 preceding siblings ...)
2010-05-10 0:42 ` KAMEZAWA Hiroyuki
@ 2010-05-10 13:49 ` Mel Gorman
3 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-10 13:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, May 09, 2010 at 12:56:49PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 9 May 2010, Mel Gorman wrote:
> >
> > It turns out not to be easy to preallocate the PUDs, PMDs and PTEs that
> > move_page_tables() needs. To avoid overallocating, it has to follow the same
> > logic as move_page_tables(), duplicating some code in the process. The ugliest
> > aspect of all is passing those pre-allocated pages back into move_page_tables(),
> > where they need to be passed down to functions such as __pte_alloc. It gets
> > extremely messy.
>
> Umm. What?
>
> That's crazy talk. I'm not talking about preallocating stuff in order to
> pass it in to move_page_tables(). I'm talking about just _creating_ the
> dang page tables early - preallocating them IN THE PROCESS VM SPACE.
>
Ok, I took the totally wrong approach by pre-allocating the pages and passing
them in to move_page_tables. Pages from the list were then taken in preference
to calling the page allocator. At the time, I thought the cleanup for ENOMEM
would be easier as well as avoiding complications with overlapping temporary
stack and new stack location. It was a bad choice, wrong and damn fugly.
> IOW, a patch like this (this is a pseudo-patch, totally untested, won't
> compile, yadda yadda - you need to actually make the people who call
> "move_page_tables()" call that prepare function first etc etc)
>
Sounds reasonable as a general approach. I'll tinker with it for a bit to
see if exec can be made any faster when combined with your other suggestion
to avoid mprotect_fixup. In the meantime though, can the patch that avoids
migrating within the temporary stack be picked up? It closes the worst of
the migrate vs exec race without impacting performance. At least, I'm
no longer able to cause an oops with it applied.
> Yeah, if we care about holes in the page tables, we can certainly copy
> more of the move_page_tables() logic, but it certainly doesn't matter for
> execve(). This just makes sure that the destination page tables exist
> first.
>
> Linus
>
> ---
> mm/mremap.c | 22 +++++++++++++++++++++-
> 1 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cde56ee..c14505c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -128,6 +128,26 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
>
> +/*
> + * Preallocate the page tables, so that we can do the actual move
> + * without any allocations, and thus no error handling etc.
> + */
> +int prepare_move_page_tables(struct vm_area_struct *vma,
> + unsigned long old_addr, struct vm_area_struct *new_vma,
> + unsigned long new_addr, unsigned long len)
> +{
> + unsigned long end_addr = new_addr + len;
> +
> + while (new_addr < end_addr) {
> + pmd_t *new_pmd;
> + new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + if (!new_pmd)
> + return -ENOMEM;
> + new_addr = (new_addr + PMD_SIZE) & PMD_MASK;
> + }
> + return 0;
> +}
> +
> unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len)
> @@ -147,7 +167,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> if (!old_pmd)
> continue;
> - new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> + new_pmd = get_old_pmd(vma->vm_mm, new_addr);
> if (!new_pmd)
> break;
> next = (new_addr + PMD_SIZE) & PMD_MASK;
>
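The quoted pseudo-patch boils down to a general two-phase pattern: do every allocation that can fail in a prepare step, so the move step itself has no error paths and could run with locks held. A minimal userspace sketch of the pattern, with a toy table standing in for the page-table levels (all names here are hypothetical, not kernel code):

```c
#include <stdbool.h>
#include <stdlib.h>

#define NSLOTS 8

/* Toy "page table": slot i either has storage allocated or is NULL */
static int *table[NSLOTS];

/* Lookup-only helper: never allocates, mirrors get_old_pmd() in spirit */
static int *get_slot(int i)
{
	return table[i];
}

/* Phase 1: allocate everything up front; the only step that can fail */
static bool prepare_move(int first, int last)
{
	for (int i = first; i <= last; i++) {
		if (!table[i]) {
			table[i] = calloc(1, sizeof(int));
			if (!table[i])
				return false;	/* -ENOMEM equivalent */
		}
	}
	return true;
}

/* Phase 2: the move proper only looks slots up, so it cannot fail */
static void do_move(int first, int last, int value)
{
	for (int i = first; i <= last; i++) {
		int *slot = get_slot(i);	/* guaranteed by prepare_move() */
		if (slot)
			*slot = value;
	}
}
```

The design point is that a failed prepare step leaves nothing half-moved: the caller can simply abort, which is exactly the property exec wants.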
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-09 19:21 ` Mel Gorman
2010-05-09 19:56 ` Linus Torvalds
@ 2010-05-10 0:32 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-10 0:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Sun, 9 May 2010 20:21:45 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, May 06, 2010 at 07:12:59PM -0700, Linus Torvalds wrote:
> >
> >
> > On Fri, 7 May 2010, KAMEZAWA Hiroyuki wrote:
> > >
> > > IIUC, move_page_tables() may call "page table allocation" and it cannot be
> > > done under spinlock.
> >
> > Bah. It only does a "alloc_new_pmd()", and we could easily move that out
> > of the loop and pre-allocate the pmd's.
> >
> > If that's the only reason, then it's a really weak one, methinks.
> >
>
> It turns out not to be easy to preallocate the PUDs, PMDs and PTEs that
> move_page_tables() needs. To avoid overallocating, it has to follow the same
> logic as move_page_tables(), duplicating some code in the process. The ugliest
> aspect of all is passing those pre-allocated pages back into move_page_tables(),
> where they need to be passed down to functions such as __pte_alloc. It gets
> extremely messy.
>
> I stopped working on it about half way through as it was already too ugly
> to live and would have similar cost to Kamezawa's much more straight-forward
> approach of using move_vma().
>
> While using move_vma is straight-forward and solves the problem, it's
> not as cheap as Andrea's solution. Andrea allocates a temporary VMA and
> puts it on a list and very little else. It didn't show up any problems
> in microbenchmarks. Calling move_vma does a lot more work particularly in
> copy_vma and this slows down exec.
>
> With Kamezawa's patch, kernbench was fine on wall time but in System Time,
> it slowed by 1.48% in comparison to Andrea's slowing by 0.64%[1].
>
> aim9 was slowed as well. Kamezawa's slowed by 2.77% whereas Andrea's reported
> 2.58% faster. While AIM9 is flaky and these figures are barely outside
> the noise, calling move_vma() is obviously more expensive.
>
Thank you for testing.
> While my solution at http://lkml.org/lkml/2010/4/30/198 is cheapest as it
> does not touch exec() at all, is_vma_temporary_stack() could be broken in
> the future if any of the assumptions it makes change.
>
> So what you have is an inverse relationship between magic and
> performance. Mine has the most magic and is fastest. Kamezawa's has the
> least magic but is slowest, and Andrea's has the goldilocks factor. Which do
> you prefer?
>
I like the fastest one ;)
Thanks,
-Kame
* Re: [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
2010-05-07 1:40 ` Linus Torvalds
2010-05-07 1:57 ` KAMEZAWA Hiroyuki
@ 2010-05-07 9:16 ` Mel Gorman
1 sibling, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-07 9:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, May 06, 2010 at 06:40:39PM -0700, Linus Torvalds wrote:
> On Fri, 7 May 2010, Mel Gorman wrote:
> >
> > Page migration requires rmap to be able to find all migration ptes
> > created by migration. If the second rmap_walk clearing migration PTEs
> > misses an entry, it is left dangling causing a BUG_ON to trigger during
> > fault. For example;
>
> So I still absolutely detest this patch.
>
> Why didn't the other - much simpler - patch work? The one Rik pointed to:
>
> http://lkml.org/lkml/2010/4/30/198
>
Oh, it works, but it depends on a magic check, is_vma_temporary_stack().
Kamezawa had a variant that did not depend on magic but it increased the
size of vm_area_struct. That magic check was just something that would
be easy to break and not spot, hence the temporary VMA instead.
> and didn't do that _disgusting_ temporary anon_vma?
>
> Alternatively, why don't we just take the anon_vma lock over this region,
> so that rmap can't _walk_ the damn thing?
>
Because move_page_tables() calls into the page allocator. I'll create a
version that allocates the PMDs in advance and see what it looks like.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V7
2010-05-06 23:20 [PATCH 0/2] Fix migration races in rmap_walk() V7 Mel Gorman
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-06 23:20 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
@ 2010-05-07 8:13 ` KAMEZAWA Hiroyuki
2 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-07 8:13 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra
On Fri, 7 May 2010 00:20:51 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> No major change since V6, just looks nicer.
>
It seems to work well in my tests.
I think there are no unknown pitfalls. I'd be glad if you would double-check
my concern about patch [1/2].
Thanks,
-Kame
* [PATCH 0/2] Fix migration races in rmap_walk() V6
@ 2010-05-06 15:33 Mel Gorman
2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 15:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra,
Mel Gorman
Patch 1 of this series is the biggest change. Instead of the trylock+retry
logic, it finds the "root" anon_vma and locks all anon_vmas encountered. As
long as walkers taking multiple locks use the same order, there is no
deadlock. Stress-tests based on compaction have been running a while with
these patches applied without problems.
Changelog since V5
o Have rmap_walk take anon_vma locks in order starting from the "root"
o Ensure that mm_take_all_locks locks VMAs in the same order
Changelog since V4
o Switch back anon_vma locking to put bulk of locking in rmap_walk
o Fix anon_vma lock ordering in exec vs migration race
Changelog since V3
o Rediff against the latest upstream tree
o Improve the patch changelog a little (thanks Peterz)
Changelog since V2
o Drop fork changes
o Avoid pages in temporary stacks during exec instead of migration pte
lazy cleanup
o Drop locking-related patch and replace with Rik's
Changelog since V1
o Handle the execve race
o Be sure that rmap_walk() releases the correct VMA lock
o Hold the anon_vma lock for the address lookup and the page remap
o Add reviewed-bys
Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page and
remapping the page removing the old migration PTE before unlocking the page.
If a fault occurs, the faulting process waits until migration completes.
The problem is that there are some races that allow migration PTEs to be
left behind. Migration still completes and the page is unlocked
but later a fault will call migration_entry_to_page() and BUG() because the
page is not locked. It's not possible to just clean up the migration PTE
because the page it points to has been potentially freed and reused. This
series aims to close the races.
Patch 1 of this series is about the locking of anon_vma in migration versus
vma_adjust. While I am not aware of any reproduction cases, it is potentially
racy. This patch is an alternative to Rik's "heavy lock" approach posted
at http://lkml.org/lkml/2010/5/3/155. With the patch, rmap_walk finds the
"root" anon_vma and starts locking from there, locking each new anon_vma
as it finds it. As long as the order is preserved, there is no deadlock.
In vma_adjust, the anon_vma locks are acquired under similar conditions
to 2.6.33 so that walkers will block until VMA changes are complete. The
rmap_walk changes potentially slow down migration and aspects of page
reclaim a little, but those are the less important paths.
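The deadlock-avoidance argument above is the standard lock-ordering rule: walkers may each hold several anon_vma locks safely as long as every walker acquires them in the same global order, because no cycle of waiters can then form. A self-contained userspace illustration of the rule, with boolean flags standing in for the chain of anon_vma locks (names are illustrative only):

```c
#include <stdbool.h>

#define NLOCKS 4

/* Stand-ins for a chain of anon_vma locks; index 0 is the oldest ("root") */
static bool held[NLOCKS];
static int last_locked = -1;	/* highest index currently held */

/*
 * Every walker locks from the root towards the tail. Taking a lock with a
 * lower index than one already held would violate the global order and is
 * exactly the situation that can deadlock two concurrent walkers.
 */
static bool lock_in_order(int idx)
{
	if (idx <= last_locked)
		return false;	/* ordering violation: refuse */
	held[idx] = true;
	last_locked = idx;
	return true;
}

static void unlock_all(void)
{
	for (int i = NLOCKS - 1; i >= 0; i--)
		held[i] = false;
	last_locked = -1;
}
```

In the kernel the ordering is enforced by always starting from the root anon_vma and walking the strictly ordered same_vma list, with spin_lock_nested() used only to keep lockdep happy about the nesting.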
Patch 2 of this series addresses the swapops bug reported that is a race
between migration and execve where pages get migrated from the temporary
stack before it is moved. To avoid migration PTEs being left behind,
a temporary VMA is put in place so that a migration PTE in either the
temporary stack or the relocated stack can be found.
The reproduction case for the races was as follows;
1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. These stress
different aspects of the problem. The operations they undertake are:
a) Forks a hundred children, each of which faults the mapping
Purpose: stress tests migration pte removal
b) Forks a hundred children, each of which punches a hole in the mapping
and faults what remains
Purpose: stress test VMA manipulations during migration
c) Forks a hundred children, each of which execs and calls echo
Purpose: stress test the execve race
d) Size the mapping to be 1.5 times physical memory. Constantly
memset it
Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
is active all the time. In theory, you could also force this using
sys_move_pages or memory hot-remove but it'd be nowhere near as easy
to test.
Compaction, which is not going to be in 2.6.34, is the easiest way to trigger
these bugs, but in theory the problem also affects memory hot-remove.
There were some concerns with patch 2 that performance would be impacted. To
check if this was the case I ran kernbench, aim9 and sysbench. AIM9 in
particular was of interest as it has an exec microbenchmark.
kernbench-vanilla fixraces-v5r1
Elapsed mean 103.40 ( 0.00%) 103.35 ( 0.05%)
Elapsed stddev 0.09 ( 0.00%) 0.13 (-55.72%)
User mean 313.50 ( 0.00%) 313.15 ( 0.11%)
User stddev 0.61 ( 0.00%) 0.20 (66.70%)
System mean 55.50 ( 0.00%) 55.85 (-0.64%)
System stddev 0.48 ( 0.00%) 0.15 (68.98%)
CPU mean 356.25 ( 0.00%) 356.50 (-0.07%)
CPU stddev 0.43 ( 0.00%) 0.50 (-15.47%)
Nothing special there and kernbench is fork+exec heavy. The patched kernel
is slightly faster on wall time but it's well within the noise. System time
is slightly slower but again, it's within the noise.
AIM9
aim9-vanilla fixraces-v5r1
creat-clo 116813.86 ( 0.00%) 117980.34 ( 0.99%)
page_test 270923.33 ( 0.00%) 268668.56 (-0.84%)
brk_test 2551558.07 ( 0.00%) 2649450.00 ( 3.69%)
signal_test 279866.67 ( 0.00%) 279533.33 (-0.12%)
exec_test 226.67 ( 0.00%) 232.67 ( 2.58%)
fork_test 4261.91 ( 0.00%) 4110.98 (-3.67%)
link_test 53534.78 ( 0.00%) 54076.49 ( 1.00%)
So, here exec and fork aren't showing up major worries. exec is faster but
these tests can be so sensitive to starting conditions that I tend not to
read much into them unless there are major differences.
SYSBENCH
sysbench-vanilla fixraces-v5r1
1 14177.73 ( 0.00%) 14218.41 ( 0.29%)
2 27647.23 ( 0.00%) 27774.14 ( 0.46%)
3 31395.69 ( 0.00%) 31499.95 ( 0.33%)
4 49866.54 ( 0.00%) 49713.49 (-0.31%)
5 49919.58 ( 0.00%) 49524.21 (-0.80%)
6 49532.97 ( 0.00%) 49397.60 (-0.27%)
7 49465.79 ( 0.00%) 49384.14 (-0.17%)
8 49483.33 ( 0.00%) 49186.49 (-0.60%)
These figures also show no differences worth talking about.
While the extra allocation in patch 2 would appear to slow down exec somewhat,
it's not by any amount that matters. As it is in exec, it means that anon_vmas
have likely been freed very recently so the allocation will be cache-hot and
cpu-local. It is possible to special-case migration to avoid migrating pages
in the temporary stack, but fixing it in exec is a more maintainable approach.
fs/exec.c | 37 ++++++++++++++++++++--
include/linux/rmap.h | 2 +
mm/ksm.c | 20 ++++++++++--
mm/mmap.c | 14 +++++++-
mm/rmap.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 137 insertions(+), 17 deletions(-)
* [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 15:33 [PATCH 0/2] Fix migration races in rmap_walk() V6 Mel Gorman
@ 2010-05-06 15:33 ` Mel Gorman
2010-05-06 15:44 ` Rik van Riel
2010-05-06 15:59 ` Linus Torvalds
0 siblings, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 15:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter,
Andrea Arcangeli, Rik van Riel, Linus Torvalds, Peter Zijlstra,
Mel Gorman
vma_adjust() updates anon VMA information without locks being taken,
unlike in 2.6.33. In contrast, file-backed mappings use
the i_mmap_lock, and this lack of locking can result in races with users of
rmap_walk such as page migration. vma_address() can return -EFAULT for an
address that will soon be valid. For migration, this potentially leaves
a dangling migration PTE behind which can later cause a BUG_ON to trigger
when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
to take when walking a list of anon_vma_chains. As rmap_walk begins with
the anon_vma a page is associated with, the order of anon_vmas cannot be
guaranteed and multiple locks cannot be taken without potentially deadlocking.
To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
list but always starting from the "root" anon_vma which is the oldest
anon_vma in the list. It starts by locking the anon_vma lock associated
with a page. It then finds the "root" anon_vma using the anon_vma_chains
"same_vma" list as it is strictly ordered. The root anon_vma lock is taken
and rmap_walk traverses the list. This allows multiple locks to be taken
as the list is always traversed in the same direction.
As spotted by Rik, to avoid any deadlocks versus mmu_notify, this patch
reverses the order in which mm_take_all_locks locks anon_vmas so that
both rmap_walk and mm_take_all_locks lock anon_vmas in order from
old to new.
For vma_adjust(), the locking behaviour prior to the anon_vma changes is
restored so that rmap_walk() can be sure of the integrity of the VMA
information and lists when the anon_vma lock is held. With this patch,
vma->anon_vma->lock is taken if:
a) there is any overlap with the next VMA due to the adjustment
b) a new VMA is being inserted into the address space
c) the start of the VMA is being changed so that the
relationship between vm_start and vm_pgoff is preserved
for vma_address()
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 2 +
mm/ksm.c | 20 ++++++++++--
mm/mmap.c | 14 +++++++-
mm/rmap.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 104 insertions(+), 13 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..6d4d5f7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,8 @@ int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
void __anon_vma_link(struct vm_area_struct *);
void anon_vma_free(struct anon_vma *);
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..d16b459 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,24 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma *locked_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
- spin_lock(&anon_vma->lock);
+ anon_vma = anon_vma_lock_root(anon_vma);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1693,14 @@ again:
* were forked from the original since ksmd passed.
*/
if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
- continue;
+ goto next_vma;
ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN) {
spin_unlock(&anon_vma->lock);
goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..b447d5b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *next = vma->vm_next;
struct vm_area_struct *importer = NULL;
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
long adjust_next = 0;
@@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+ anon_vma = vma->anon_vma;
+ spin_lock(&anon_vma->lock);
+ }
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (anon_vma)
+ spin_unlock(&anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
@@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
+ /* Lock the anon_vmas in the same order rmap_walk would */
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_lock_anon_vma(mm, avc->anon_vma);
}
@@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_unlock_anon_vma(avc->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..b511710 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,62 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
return -ENOMEM;
}
+/*
+ * Given an anon_vma, find the root of the chain, lock it and return it.
+ * This must be called with the rcu_read_lock held
+ */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+ struct anon_vma *root_anon_vma;
+ struct anon_vma_chain *avc, *root_avc;
+ struct vm_area_struct *vma;
+
+ /* Lock the same_anon_vma list and make sure we are on a chain */
+ spin_lock(&anon_vma->lock);
+ if (list_empty(&anon_vma->head)) {
+ spin_unlock(&anon_vma->lock);
+ return NULL;
+ }
+
+ /*
+ * Get the root anon_vma on the list by depending on the ordering
+ * of the same_vma list setup by __page_set_anon_rmap. Basically
+ * we are doing
+ *
+ * local anon_vma -> local vma -> root vma -> root anon_vma
+ */
+ avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+ vma = avc->vma;
+ root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ root_anon_vma = root_avc->anon_vma;
+
+ /* Get the lock of the root anon_vma */
+ if (anon_vma != root_anon_vma) {
+ VM_BUG_ON(!rcu_read_lock_held());
+ spin_unlock(&anon_vma->lock);
+ spin_lock(&root_anon_vma->lock);
+ }
+
+ return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+ struct anon_vma *anon_vma;
+
+ /* Get the local anon_vma */
+ anon_vma = page_anon_vma(page);
+ if (!anon_vma)
+ return NULL;
+
+ return anon_vma_lock_root(anon_vma);
+}
+
static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
{
struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -1358,7 +1414,7 @@ int try_to_munlock(struct page *page)
static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma, *locked_vma;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
@@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
- anon_vma = page_anon_vma(page);
+ anon_vma = page_anon_vma_lock_root(page);
if (!anon_vma)
return ret;
- spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+ }
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN)
break;
}
--
1.6.5
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-06 15:44 ` Rik van Riel
2010-05-06 15:51 ` Mel Gorman
2010-05-06 15:59 ` Linus Torvalds
1 sibling, 1 reply; 72+ messages in thread
From: Rik van Riel @ 2010-05-06 15:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Linus Torvalds,
Peter Zijlstra
On 05/06/2010 11:33 AM, Mel Gorman wrote:
> @@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> * are holding mmap_sem. Users without mmap_sem are required to
> * take a reference count to prevent the anon_vma disappearing
> */
> - anon_vma = page_anon_vma(page);
> + anon_vma = page_anon_vma_lock_root(page);
> if (!anon_vma)
> return ret;
> - spin_lock(&anon_vma->lock);
> list_for_each_entry(avc,&anon_vma->head, same_anon_vma) {
One conceptual problem here. By taking the oldest anon_vma,
instead of the anon_vma of the page, we may end up searching
way too many processes.
Eg. if the page is the page of a child process in a forking
server workload, the above code will end up searching the
parent and all of the siblings - even for a private page, in
the child process's private anon_vma.
For an Apache or Oracle system with 1000 clients (and child
processes), that could be quite a drag - searching 1000 times
as many processes as we should.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 15:44 ` Rik van Riel
@ 2010-05-06 15:51 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 15:51 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Linus Torvalds,
Peter Zijlstra
On Thu, May 06, 2010 at 11:44:57AM -0400, Rik van Riel wrote:
> On 05/06/2010 11:33 AM, Mel Gorman wrote:
>
>> @@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
>> * are holding mmap_sem. Users without mmap_sem are required to
>> * take a reference count to prevent the anon_vma disappearing
>> */
>> - anon_vma = page_anon_vma(page);
>> + anon_vma = page_anon_vma_lock_root(page);
>> if (!anon_vma)
>> return ret;
>> - spin_lock(&anon_vma->lock);
>> list_for_each_entry(avc,&anon_vma->head, same_anon_vma) {
>
> One conceptual problem here. By taking the oldest anon_vma,
> instead of the anon_vma of the page, we may end up searching
> way too many processes.
>
> E.g. if the page belongs to a child process in a forking
> server workload, the above code will end up searching the
> parent and all of the siblings - even for a private page in
> the child process's private anon_vma.
>
> For an Apache or Oracle system with 1000 clients (and child
> processes), that could be quite a drag - searching 1000 times
> as many processes as we should.
>
That does indeed suck. If we were always locking the root anon_vma, we'd get
away with it but that would involve introducing RCU into the munmap/mmap/etc
path. Is there any way around this problem or will migration just have to
take it on the chin until anon_vma is reference counted and we can
cheaply lock the root anon_vma?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-06 15:44 ` Rik van Riel
@ 2010-05-06 15:59 ` Linus Torvalds
2010-05-06 17:07 ` Mel Gorman
1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-06 15:59 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, 6 May 2010, Mel Gorman wrote:
> + anon_vma = anon_vma_lock_root(anon_vma);
> list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> vma = vmac->vma;
> +
> + locked_vma = NULL;
> + if (anon_vma != vma->anon_vma) {
> + locked_vma = vma->anon_vma;
> + spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
> + }
> +
> if (rmap_item->address < vma->vm_start ||
> rmap_item->address >= vma->vm_end)
> + goto next_vma;
> +
> /*
> * Initially we examine only the vma which covers this
> * rmap_item; but later, if there is still work to do,
> @@ -1684,9 +1693,14 @@ again:
> * were forked from the original since ksmd passed.
> */
> if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
> + goto next_vma;
>
> ret = rmap_one(page, vma, rmap_item->address, arg);
> +
> +next_vma:
> + if (locked_vma)
> + spin_unlock(&locked_vma->lock);
> +
> if (ret != SWAP_AGAIN) {
> spin_unlock(&anon_vma->lock);
> goto out;
[ Removed '-' lines to show the actual end result ]
That loop is f*cked up.
In the "goto next_vma" case, it will then test the 'ret' from the
_previous_ iteration after having unlocked the anon_vma. Which may not
even exist, if this is the first one.
Yes, yes, 'ret' is initialized to SWAP_AGAIN, so it will work, but it's
still screwed up. It's just _waiting_ for bugs to be introduced.
Just make the "goto out" case unlock things properly. Have a real exclusive
error return case that does
/* normal return */
return SWAP_AGAIN;
out:
if (locked_anon_vma)
spin_unlock(&locked_anon_vma->lock);
spin_unlock(&anon_vma->lock);
return ret;
rather than that horrible crud in the loop itself.
Also, wouldn't it be nicer to make the whole "locked_vma" be something you
do at the head of the loop, so that you can use "continue" instead of
"goto next_vma". And then you can do it like this:
locked_anon_vma = lock_nested_anon_vma(locked_anon_vma, vma->anon_vma, anon_vma);
where we have
static struct anon_vma *lock_nested_anon_vma(struct anon_vma *prev,
struct anon_vma *next, struct anon_vma *root)
{
if (prev)
spin_unlock(&prev->lock);
if (next == root)
return NULL;
spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
return next;
}
isn't that _much_ nicer? You get to split the locking off into a function
of its own, and you unlock the old one before you (potentially) lock the
new one, _and_ you can just use "continue" to go to the next iteration.
Yes, yes, it means that after the loop you have to unlock that
'locked_anon_vma', but you have to do that for the early exit case
_anyway_, so that won't look all that odd. It will certainly look less odd
than using a status variable from the previous iteration and depending on
it having a special value.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 15:59 ` Linus Torvalds
@ 2010-05-06 17:07 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 17:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Peter Zijlstra
On Thu, May 06, 2010 at 08:59:52AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 6 May 2010, Mel Gorman wrote:
> > + anon_vma = anon_vma_lock_root(anon_vma);
> > list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> > vma = vmac->vma;
> > +
> > + locked_vma = NULL;
> > + if (anon_vma != vma->anon_vma) {
> > + locked_vma = vma->anon_vma;
> > + spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
> > + }
> > +
> > if (rmap_item->address < vma->vm_start ||
> > rmap_item->address >= vma->vm_end)
> > + goto next_vma;
> > +
> > /*
> > * Initially we examine only the vma which covers this
> > * rmap_item; but later, if there is still work to do,
> > @@ -1684,9 +1693,14 @@ again:
> > * were forked from the original since ksmd passed.
> > */
> > if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
> > + goto next_vma;
> >
> > ret = rmap_one(page, vma, rmap_item->address, arg);
> > +
> > +next_vma:
> > + if (locked_vma)
> > + spin_unlock(&locked_vma->lock);
> > +
> > if (ret != SWAP_AGAIN) {
> > spin_unlock(&anon_vma->lock);
> > goto out;
>
> [ Removed '-' lines to show the actual end result ]
>
> That loop is f*cked up.
>
> In the "goto next_vma" case, it will then test the 'ret' from the
> _previous_ iteration after having unlocked the anon_vma. Which may not
> even exist, if this is the first one.
>
> Yes, yes, 'ret' is initialized to SWAP_AGAIN, so it will work, but it's
> still screwed up.
Yes, it works but ...
> It's just _waiting_ for bugs to be introduced.
>
This is true too.
> Just make the "goto out" case unlock things properly. Have a real exclusive
> error return case that does
>
> /* normal return */
> return SWAP_AGAIN;
>
> out:
> if (locked_anon_vma)
> spin_unlock(&locked_anon_vma->lock);
> spin_unlock(&anon_vma->lock);
> return ret;
>
> rather than that horrible crud in the loop itself.
>
> Also, wouldn't it be nicer to make the whole "locked_vma" be something you
> do at the head of the loop, so that you can use "continue" instead of
> "goto next_vma". And then you can do it like this:
>
> locked_anon_vma = lock_nested_anon_vma(locked_anon_vma, vma->anon_vma, anon_vma);
>
It obscures the unlocking slightly but it does look neater in the main
functions that call lock_nested_anon_vma. I considered for a while whether
there was some macro magic that could be applied but it would be a
delicate obscenity at best.
> where we have
>
> static struct anon_vma *lock_nested_anon_vma(struct anon_vma *prev,
> struct anon_vma *next, struct anon_vma *root)
> {
> if (prev)
> spin_unlock(&prev->lock);
> if (next == root)
> return NULL;
> spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
> return next;
> }
>
> isn't that _much_ nicer? You get to split the locking off into a function
> of its own, and you unlock the old one before you (potentially) lock the
> new one, _and_ you can just use "continue" to go to the next iteration.
>
> Yes, yes, it means that after the loop you have to unlock that
> 'locked_anon_vma', but you have to do that for the early exit case
> _anyway_, so that won't look all that odd. It will certainly look less odd
> than using a status variable from the previous iteration and depending on
> it having a special value.
>
Can't argue with the logic and it does look a lot neater. This is what the
revised version looks like with that scheme. I changed the name of the locking
function slightly to be similar to the other anon_vma functions but that's
about it. rmap_walk_ksm still looks somewhat tortured but rmap_walk_anon
is much neater.
==== CUT HERE ====
From: Mel Gorman <mel@csn.ul.ie>
Subject: [PATCH] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
to take when walking a list of anon_vma_chains but as the order of anon_vmas
cannot be guaranteed, rmap_walk cannot take multiple locks without
potentially deadlocking.
To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
list but always starting from the "root" anon_vma which is the oldest
anon_vma in the list. It starts by locking the anon_vma lock associated
with a page. It then finds the "root" anon_vma using the anon_vma_chains
"same_vma" list as it is strictly ordered. The root anon_vma lock is taken
and rmap_walk traverses the list. This allows multiple locks to be taken
as the list is always traversed in the same direction.
As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order in
which anon_vmas are locked by mm_take_all_locks is reversed by this patch so
that both rmap_walk and mm_take_all_locks lock anon_vmas in old->new order.
For vma_adjust(), the locking behaviour prior to the anon_vma changes is restored
so that rmap_walk() can be sure of the integrity of the VMA information and
lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
is taken if
a) There is any overlap with the next VMA due to the adjustment
b) A new VMA is being inserted into the address space
c) The start of the VMA is being changed so that the
relationship between vm_start and vm_pgoff is preserved
for vma_address()
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 4 ++
mm/ksm.c | 13 ++++++-
mm/mmap.c | 14 ++++++-
mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 118 insertions(+), 10 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..1dc949f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,10 @@ int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+ struct anon_vma *next, struct anon_vma *root);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
void __anon_vma_link(struct vm_area_struct *);
void anon_vma_free(struct anon_vma *);
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..1db8656 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
{
struct stable_node *stable_node;
struct hlist_node *hlist;
+ struct anon_vma *nested_anon_vma = NULL;
struct rmap_item *rmap_item;
int ret = SWAP_AGAIN;
int search_new_forks = 0;
@@ -1671,9 +1672,16 @@ again:
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
- spin_lock(&anon_vma->lock);
+ anon_vma = anon_vma_lock_root(anon_vma);
+ if (nested_anon_vma) {
+ spin_unlock(&nested_anon_vma->lock);
+ nested_anon_vma = NULL;
+ }
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+ nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+ vma->anon_vma, anon_vma);
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
@@ -1697,6 +1705,9 @@ again:
if (!search_new_forks++)
goto again;
out:
+ if (nested_anon_vma)
+ spin_unlock(&nested_anon_vma->lock);
+
return ret;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..b447d5b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *next = vma->vm_next;
struct vm_area_struct *importer = NULL;
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
long adjust_next = 0;
@@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+ anon_vma = vma->anon_vma;
+ spin_lock(&anon_vma->lock);
+ }
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (anon_vma)
+ spin_unlock(&anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
@@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
+ /* Lock the anon_vmas in the same order rmap_walk would */
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_lock_anon_vma(mm, avc->anon_vma);
}
@@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma)
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
vm_unlock_anon_vma(avc->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..2e65a75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,81 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
return -ENOMEM;
}
+/*
+ * When walking an anon_vma_chain and locking each anon_vma encountered,
+ * this function checks whether the next anon_vma is the same as
+ * the root, locks it if not, and releases the previous lock
+ * if necessary.
+ *
+ * It is assumed the caller has locked the root anon_vma
+ */
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+ struct anon_vma *next, struct anon_vma *root)
+{
+ if (prev)
+ spin_unlock(&prev->lock);
+ if (next == root)
+ return NULL;
+ spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
+ return next;
+}
+
+/*
+ * Given an anon_vma, find the root of the chain, lock it and return the
+ * root. This must be called with the rcu_read_lock held
+ */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+ struct anon_vma *root_anon_vma;
+ struct anon_vma_chain *avc, *root_avc;
+ struct vm_area_struct *vma;
+
+ /* Lock the same_anon_vma list and make sure we are on a chain */
+ spin_lock(&anon_vma->lock);
+ if (list_empty(&anon_vma->head)) {
+ spin_unlock(&anon_vma->lock);
+ return NULL;
+ }
+
+ /*
+ * Get the root anon_vma on the list by depending on the ordering
+ * of the same_vma list setup by __page_set_anon_rmap. Basically
+ * we are doing
+ *
+ * local anon_vma -> local vma -> root vma -> root anon_vma
+ */
+ avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+ vma = avc->vma;
+ root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ root_anon_vma = root_avc->anon_vma;
+
+ /* Get the lock of the root anon_vma */
+ if (anon_vma != root_anon_vma) {
+ VM_BUG_ON(!rcu_read_lock_held());
+ spin_unlock(&anon_vma->lock);
+ spin_lock(&root_anon_vma->lock);
+ }
+
+ return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+ struct anon_vma *anon_vma;
+
+ /* Get the local anon_vma */
+ anon_vma = page_anon_vma(page);
+ if (!anon_vma)
+ return NULL;
+
+ return anon_vma_lock_root(anon_vma);
+}
+
static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
{
struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -326,7 +401,7 @@ void page_unlock_anon_vma(struct anon_vma *anon_vma)
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+static noinline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1359,6 +1434,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
struct anon_vma *anon_vma;
+ struct anon_vma *nested_anon_vma = NULL;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
@@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
- anon_vma = page_anon_vma(page);
+ anon_vma = page_anon_vma_lock_root(page);
if (!anon_vma)
return ret;
- spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+ vma->anon_vma, anon_vma);
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
if (ret != SWAP_AGAIN)
break;
}
+
+ if (nested_anon_vma)
+ spin_unlock(&nested_anon_vma->lock);
+
spin_unlock(&anon_vma->lock);
return ret;
}
* [PATCH 0/2] Fix migration races in rmap_walk() V5
@ 2010-05-05 13:14 Mel Gorman
2010-05-05 13:14 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 13:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Linus Torvalds, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Mel Gorman
So, V4 wasn't to everyone's liking so here is another variation of the fix
needed for migration-related races on VMA adjustments. From V4, patch 1 is
a different approach whereas patch 2 is the same except that a minor bug in
anon_vma ordering is fixed. If these can be agreed upon, it'd be nice to
get a fix in for 2.6.34.
Changelog since V4
o Switch back anon_vma locking to put bulk of locking in rmap_walk
o Fix anon_vma lock ordering in exec vs migration race
Changelog since V3
o Rediff against the latest upstream tree
o Improve the patch changelog a little (thanks Peterz)
Changelog since V2
o Drop fork changes
o Avoid pages in temporary stacks during exec instead of migration pte
lazy cleanup
o Drop locking-related patch and replace with Rik's
Changelog since V1
o Handle the execve race
o Be sure that rmap_walk() releases the correct VMA lock
o Hold the anon_vma lock for the address lookup and the page remap
o Add reviewed-bys
Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page and
remapping the page removing the old migration PTE before unlocking the page.
If a fault occurs, the faulting process waits until migration completes.
The problem is that there are some races that allow migration PTEs
to be left behind. Migration still completes and the page is unlocked
but later a fault will call migration_entry_to_page() and BUG() because the
page is not locked. It's not possible to just clean up the migration PTE
because the page it points to has been potentially freed and reused. This
series aims to close the races.
Patch 1 of this series is about the locking of anon_vma in migration
versus vma_adjust. While I am not aware of any reproduction cases, it is
potentially racy. This patch is an alternative to Rik's more comprehensive
locking approach posted at http://lkml.org/lkml/2010/5/3/155 and uses
trylock-and-retry logic in rmap_walk until it can lock all the anon_vmas
without contention. In vma_adjust, the anon_vma locks are acquired under
similar conditions to 2.6.33. The rmap_walk changes potentially slow
down migration and aspects of page reclaim a little, but those are the
less important paths.
Patch 2 of this series addresses the swapops bug reported that is a race
between migration and execve where pages get migrated from the temporary
stack before it is moved. To avoid migration PTEs being left behind,
a temporary VMA is put in place so that a migration PTE in either the
temporary stack or the relocated stack can be found.
The reproduction case for the races was as follows;
1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. The processes
stress different aspects of the problem. The operations they undertake are;
a) Forks a hundred children, each of which faults the mapping
Purpose: stress tests migration pte removal
b) Forks a hundred children, each which punches a hole in the mapping
and faults what remains
Purpose: stress test VMA manipulations during migration
c) Forks a hundred children, each of which execs and calls echo
Purpose: stress test the execve race
d) Size the mapping to be 1.5 times physical memory. Constantly
memset it
Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
is active all the time. In theory, you could also force this using
sys_move_pages or memory hot-remove but it'd be nowhere near as easy
to test.
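Step 3 can be driven by a trivial loop; /proc/sys/vm/compact_memory only
exists on kernels with the (not yet merged) compaction patches applied:

```shell
#!/bin/sh
# Keep compaction - and therefore page migration - running constantly
# while the stress processes from step 2 are executing.
# Assumes a compaction-enabled kernel exposing this procfs trigger.
while true; do
	echo 1 > /proc/sys/vm/compact_memory
	sleep 1
done
```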
Compaction, which is not going to be in 2.6.34, is the easiest way to
trigger these bugs, but in theory the problem also affects memory hot-remove.
There were some concerns with patch 2 that performance would be impacted. To
check if this was the case I ran kernbench, aim9 and sysbench. AIM9 in
particular was of interest as it has an exec microbenchmark.
kernbench-vanilla fixraces-v5r1
Elapsed mean 103.40 ( 0.00%) 103.35 ( 0.05%)
Elapsed stddev 0.09 ( 0.00%) 0.13 (-55.72%)
User mean 313.50 ( 0.00%) 313.15 ( 0.11%)
User stddev 0.61 ( 0.00%) 0.20 (66.70%)
System mean 55.50 ( 0.00%) 55.85 (-0.64%)
System stddev 0.48 ( 0.00%) 0.15 (68.98%)
CPU mean 356.25 ( 0.00%) 356.50 (-0.07%)
CPU stddev 0.43 ( 0.00%) 0.50 (-15.47%)
Nothing special there and kernbench is fork+exec heavy. The patched kernel
is slightly faster on wall time but it's well within the noise. System time
is slightly slower but again, it's within the noise.
AIM9
aim9-vanilla fixraces-v5r1
creat-clo 116813.86 ( 0.00%) 117980.34 ( 0.99%)
page_test 270923.33 ( 0.00%) 268668.56 (-0.84%)
brk_test 2551558.07 ( 0.00%) 2649450.00 ( 3.69%)
signal_test 279866.67 ( 0.00%) 279533.33 (-0.12%)
exec_test 226.67 ( 0.00%) 232.67 ( 2.58%)
fork_test 4261.91 ( 0.00%) 4110.98 (-3.67%)
link_test 53534.78 ( 0.00%) 54076.49 ( 1.00%)
So, here exec and fork aren't showing up major worries. exec is faster but
these tests can be so sensitive to starting conditions that I tend not to
read much into them unless there are major differences.
SYSBENCH
sysbench-vanilla fixraces-v5r1
1 14177.73 ( 0.00%) 14218.41 ( 0.29%)
2 27647.23 ( 0.00%) 27774.14 ( 0.46%)
3 31395.69 ( 0.00%) 31499.95 ( 0.33%)
4 49866.54 ( 0.00%) 49713.49 (-0.31%)
5 49919.58 ( 0.00%) 49524.21 (-0.80%)
6 49532.97 ( 0.00%) 49397.60 (-0.27%)
7 49465.79 ( 0.00%) 49384.14 (-0.17%)
8 49483.33 ( 0.00%) 49186.49 (-0.60%)
These figures also show no differences worth talking about.
While the extra allocation in patch 2 would appear to slow down exec somewhat,
it's not by any amount that matters. As it is in exec, it means that anon_vmas
have likely been freed very recently so the allocation will be cache-hot and
cpu-local. It is possible to special-case migration to avoid migrating pages
in the temporary stack, but fixing it in exec is a more maintainable approach.
fs/exec.c | 37 +++++++++++++++++++++++++++++++++----
mm/ksm.c | 22 ++++++++++++++++++++--
mm/mmap.c | 9 +++++++++
mm/rmap.c | 28 +++++++++++++++++++++++-----
4 files changed, 85 insertions(+), 11 deletions(-)
* [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 13:14 [PATCH 0/2] Fix migration races in rmap_walk() V5 Mel Gorman
@ 2010-05-05 13:14 ` Mel Gorman
2010-05-05 14:34 ` Linus Torvalds
2010-05-06 7:38 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 13:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Linus Torvalds, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel, Mel Gorman
vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
to take in an anon_vma_chain, but a second lock cannot be spun on without
risking deadlock. The rmap walker tries to take the locks of different
anon_vmas but if the attempt fails, locks are released and the operation
is restarted.
For vma_adjust(), the locking behaviour prior to the anon_vma changes is restored
so that rmap_walk() can be sure of the integrity of the VMA information and
lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
is taken if
a) There is any overlap with the next VMA due to the adjustment
b) A new VMA is being inserted into the address space
c) The start of the VMA is being changed so that the
relationship between vm_start and vm_pgoff is preserved
for vma_address()
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 22 ++++++++++++++++++++--
mm/mmap.c | 9 +++++++++
mm/rmap.c | 28 +++++++++++++++++++++++-----
3 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..0c09927 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,28 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma *locked_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
spin_lock(&anon_vma->lock);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ if (!spin_trylock(&locked_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto again;
+ }
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1697,14 @@ again:
* were forked from the original since ksmd passed.
*/
if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
- continue;
+ goto next_vma;
ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN) {
spin_unlock(&anon_vma->lock);
goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..d635132 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *next = vma->vm_next;
struct vm_area_struct *importer = NULL;
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
long adjust_next = 0;
@@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+ anon_vma = vma->anon_vma;
+ spin_lock(&anon_vma->lock);
+ }
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (anon_vma)
+ spin_unlock(&anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..f7ed89f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1358,7 +1358,7 @@ int try_to_munlock(struct page *page)
static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma, *locked_vma;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
@@ -1368,16 +1368,34 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
+retry:
anon_vma = page_anon_vma(page);
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ /*
+ * Guard against deadlocks by not spinning against
+ * vma->anon_vma->lock. On contention release and retry
+ */
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ if (!spin_trylock(&locked_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto retry;
+ }
+ }
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN)
break;
}
--
1.6.5
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 13:14 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-05 14:34 ` Linus Torvalds
2010-05-05 14:56 ` Mel Gorman
2010-05-06 7:38 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 14:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Mel Gorman wrote:
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> to take in an anon_vma_chain but a second lock cannot be spun upon in case
> of deadlock. The rmap walker tries to take locks of different anon_vma's
> but if the attempt fails, locks are released and the operation is restarted.
Btw, is this really needed?
Nobody else takes two anon_vma locks at the same time, so in order to
avoid ABBA deadlocks all we need to guarantee is that rmap_walk_ksm() and
rmap_walk_anon() always lock the anon_vma's in the same order.
And they do, as far as I can tell. How could we ever get a deadlock when
we have both cases doing the locking by walking the same_anon_vma list?
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
So I think the "retry" logic looks unnecessary, and actually opens us up
to a possible livelock bug (imagine a long chain, and heavy page fault
activity elsewhere that ends up locking some anon_vma in the chain, and
just the right behavior that gets us into a lockstep situation), rather
than fixing an ABBA deadlock.
Now, if it's true that somebody else _does_ do nested anon_vma locking,
I'm obviously wrong. But I don't see such usage.
Comments?
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 14:34 ` Linus Torvalds
@ 2010-05-05 14:56 ` Mel Gorman
2010-05-05 15:31 ` Linus Torvalds
0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 14:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 07:34:37AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > to take in an anon_vma_chain but a second lock cannot be spun upon in case
> > of deadlock. The rmap walker tries to take locks of different anon_vma's
> > but if the attempt fails, locks are released and the operation is restarted.
>
> Btw, is this really needed?
>
I could not convince myself that it wasn't. lockdep throws a fit if you try
but it can be taught about the situation if necessary.
> Nobody else takes two anon_vma locks at the same time, so in order to
> avoid ABBA deadlocks all we need to guarantee is that rmap_walk_ksm() and
> rmap_walk_anon() always lock the anon_vma's in the same order.
>
rmap_walk() appears to be the only one that takes multiple locks, but it itself
is not serialised. If more than one process is calling rmap_walk()
on different processes sharing the same VMAs, is there a guarantee they walk
it in the same order? I didn't think so at the time of the patch because the
anon_vma the walk starts from is based on the page being migrated rather
than any idea of starting from a parent or primary anon_vma.
> And they do, as far as I can tell. How could we ever get a deadlock when
> we have both cases doing the locking by walking the same_anon_vma list?
>
If we always started the list walk in the same place then it'd be fine but
if they start in different places, it could deadlock.
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>
> So I think the "retry" logic looks unnecessary, and actually opens us up
> to a possible livelock bug (imagine a long chain, and heavy page fault
> activity elsewhere that ends up locking some anon_vma in the chain, and
> just the right behavior that gets us into a lockstep situation),
I imagined it and I'm not super-happy about it. It's one of the reasons Rik
called it "fragile".
> rather than fixing an ABBA deadlock.
>
> Now, if it's true that somebody else _does_ do nested anon_vma locking,
> I'm obviously wrong. But I don't see such usage.
>
> Comments?
>
Just what I have above. I couldn't convince myself that two callers to
rmap_walk from pages based on different VMAs on the same_anon_vma list would
always start in the same place.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 14:56 ` Mel Gorman
@ 2010-05-05 15:31 ` Linus Torvalds
2010-05-05 15:54 ` Mel Gorman
2010-05-05 17:53 ` Mel Gorman
0 siblings, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 15:31 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Mel Gorman wrote:
>
> rmap_walk() appears to be the only one that takes multiple locks but it itself
> is not serialised. If there are more than one process calling rmap_walk()
> on different processes sharing the same VMAs, is there a guarantee they walk
> it in the same order?
So I had this notion of the list always getting deeper and us guaranteeing
the order in it, but you're right - that's not the 'same_anon_vma' list,
it's the 'same_vma' one.
Damn. So yeah, I don't see us making any ordering guarantees. My
bad.
That said, I do wonder if we could _make_ the ordering reliable. I did
that for the 'same_vma' one, because I wanted to be able to verify that
chains were consistent (and we also needed to be able to find the "oldest
anon_vma" for the case of re-instantiating pages that might exist in
multiple different anon_vma's).
Any ideas?
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 15:31 ` Linus Torvalds
@ 2010-05-05 15:54 ` Mel Gorman
2010-05-05 16:13 ` Andrea Arcangeli
2010-05-05 17:34 ` Linus Torvalds
2010-05-05 17:53 ` Mel Gorman
1 sibling, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 15:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 08:31:42AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > rmap_walk() appears to be the only one that takes multiple locks but it itself
> > is not serialised. If there are more than one process calling rmap_walk()
> > on different processes sharing the same VMAs, is there a guarantee they walk
> > it in the same order?
>
> So I had this notion of the list always getting deeper and us guaranteeing
> the order in it, but you're right - that's not the 'same_anon_vma' list,
> it's the 'same_vma' one.
>
> Damn. So yeah, I don't see us guaranteeing any ordering guarantees. My
> bad.
>
> That said, I do wonder if we could _make_ the ordering reliable.
I'm still thinking of the ordering but one possibility would be to use a mutex
similar to mm_all_locks_mutex to force the serialisation of rmap_walk instead
of the trylock-and-retry. That way, the ordering wouldn't matter. It would
slow migration if multiple processes are migrating pages by some unknowable
quantity but it would avoid livelocking.
> I did
> that for the 'same_vma' one, because I wanted to be able to verify that
> chains were consistent (and we also needed to be able to find the "oldest
> anon_vma" for the case of re-instantiating pages that migth exist in
> multiple different anon_vma's).
>
> Any ideas?
>
Not yet.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 15:54 ` Mel Gorman
@ 2010-05-05 16:13 ` Andrea Arcangeli
2010-05-05 19:11 ` Peter Zijlstra
2010-05-06 10:37 ` Mel Gorman
2010-05-05 17:34 ` Linus Torvalds
1 sibling, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2010-05-05 16:13 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Wed, May 05, 2010 at 04:54:54PM +0100, Mel Gorman wrote:
> I'm still thinking of the ordering but one possibility would be to use a mutex
I can't take a mutex in split_huge_page... so I'd need to use another solution.
> Not yet.
Rik's patch that takes the locks in the fast path is preferable to
me; it's just simpler. You know the really "strong" lock is the
page->mapping/anon_vma->lock and nothing else. You have a page, you take
that lock, you're done for that very page.
Sure, that means updating vm_start/vm_pgoff then requires locking all
the anon_vmas that the vma is registered in, but that's conceptually
simpler and it doesn't alter the page_lock_anon_vma semantics. Now,
since you said the same_anon_vma list is in order but the same_vma list
is not, I wonder if it's safe to lock the same_vma list in order in
anon_vma_lock; I didn't experience problems on the anon_vma_chain
branch, but anon_vma_lock disables all lockdep lock-inversion checking.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 16:13 ` Andrea Arcangeli
@ 2010-05-05 19:11 ` Peter Zijlstra
2010-05-05 19:57 ` Andrea Arcangeli
2010-05-21 0:27 ` Andrea Arcangeli
2010-05-06 10:37 ` Mel Gorman
1 sibling, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2010-05-05 19:11 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Wed, 2010-05-05 at 18:13 +0200, Andrea Arcangeli wrote:
> On Wed, May 05, 2010 at 04:54:54PM +0100, Mel Gorman wrote:
> > I'm still thinking of the ordering but one possibility would be to use a mutex
>
> I can't take a mutex in split_huge_page... so I'd need to use another solution.
So how's that going to work out for my make anon_vma->lock a mutex
patches?
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 19:11 ` Peter Zijlstra
@ 2010-05-05 19:57 ` Andrea Arcangeli
2010-05-21 0:27 ` Andrea Arcangeli
1 sibling, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2010-05-05 19:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Wed, May 05, 2010 at 09:11:25PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-05 at 18:13 +0200, Andrea Arcangeli wrote:
> > On Wed, May 05, 2010 at 04:54:54PM +0100, Mel Gorman wrote:
> > > I'm still thinking of the ordering but one possibility would be to use a mutex
> >
> > I can't take a mutex in split_huge_page... so I'd need to use another solution.
>
> So how's that going to work out for my make anon_vma->lock a mutex
> patches?
I'm not seeing much of a problem after all, even if you only switch the
anon_vma->lock (you switch both, so it's quite different). unmap_vmas
may end up calling split_huge_page_pmd in zap_pmd_range only if the
vma is fully anonymous (I don't allow hugepages in MAP_PRIVATE yet), so
there would be no i_mmap_lock held. But clearly if you switch _both_
it's even safer. In any case, when we make that change it'll require
calling split_huge_page_pmd and split_huge_page only at preemptible
points; there is no such requirement today, and clearly when all
VM locking goes preemptible it'll be much more natural and lower risk to
remove that requirement from split_huge_page too.
Also, I think if we start taking a mutex in anon_vma, the i_mmap_lock
should switch too at the same time. I suspect it's an arbitrary choice
that we always have to take the i_mmap_lock before the anon_vma locks in
mmap.c, so it makes sense for them to move in tandem.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 19:11 ` Peter Zijlstra
2010-05-05 19:57 ` Andrea Arcangeli
@ 2010-05-21 0:27 ` Andrea Arcangeli
1 sibling, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2010-05-21 0:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Wed, May 05, 2010 at 09:11:25PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-05 at 18:13 +0200, Andrea Arcangeli wrote:
> > On Wed, May 05, 2010 at 04:54:54PM +0100, Mel Gorman wrote:
> > > I'm still thinking of the ordering but one possibility would be to use a mutex
> >
> > I can't take a mutex in split_huge_page... so I'd need to use another solution.
>
> So how's that going to work out for my make anon_vma->lock a mutex
> patches?
If you're interested I can include your patchset after memory
compaction in aa.git; far from the ideal path for merging, but ideal if
you want to test it together with the full thing (memory compaction,
split_huge_page as you wondered just above, etc.) and hopefully give
it more testing.
Note: I'm not sure it's the right way to go; in fact I'm quite
skeptical, not because it won't work, but ironically because the main
reason I'm interested is to close the XPMEM requirements the right way
(not with page pins and deferred async invalidates), and as long as we
have users asking for rescheduling in mmu notifier methods this is the
only way to go. Initially I thought it had to be a build-time option,
but seeing you doing it by default and for totally different reasons,
I'm slightly more optimistic it can be the default, and surely XPMEM
will love it... the fact these locks are smarter helps a lot too.
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 16:13 ` Andrea Arcangeli
2010-05-05 19:11 ` Peter Zijlstra
@ 2010-05-06 10:37 ` Mel Gorman
1 sibling, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 10:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Wed, May 05, 2010 at 06:13:19PM +0200, Andrea Arcangeli wrote:
> On Wed, May 05, 2010 at 04:54:54PM +0100, Mel Gorman wrote:
> > I'm still thinking of the ordering but one possibility would be to use a mutex
>
> I can't take a mutex in split_huge_page... so I'd need to use another solution.
>
> > Not yet.
>
> Rik's patch that takes the locks in the faster path is preferable to
> me, it's just simpler, you know the really "strong" lock is the
> page->mapping/anon_vma->lock and nothing else.
The hatchet-job mutex is off the table so it's down to:
1. start-with-root-anon_vma-and-lock-in-order-when-walking-list (what I last posted)
2. take-all-anon_vma-locks-when-changing-vmas (Rik's)
3. use-seq-counter-to-spot-changes-to-VMAs-when-walking-list (Kamezawa-san's approach)
Any strong preference?
I still haven't read the other comments Linus made so I don't have a strong
preference yet. Either Rik's or the patch I posted should be enough for
migration to not get tripped up as far as I can see.
> You've a page, you take
> that lock, you're done for that very page.
>
> Sure that means updating vm_start/vm_pgoff then requires locking all
> anon_vmas that the vma registered into, but that's conceptually
> simpler and it doesn't alter the page_lock_anon_vma semantics. Now I
> wonder if you said the same_anon_vma is in order, but the same_vma is
> not, if it's safe to lock the same_vma in list order in anon_vma_lock,
> I didn't experience problems on the anon_vma_chain branch but
> anon_vma_lock disables all lockdep lock inversion checking.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 15:54 ` Mel Gorman
2010-05-05 16:13 ` Andrea Arcangeli
@ 2010-05-05 17:34 ` Linus Torvalds
2010-05-05 17:57 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 17:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Mel Gorman wrote:
>
> I'm still thinking of the ordering but one possibility would be to use a mutex
> similar to mm_all_locks_mutex to force the serialisation of rmap_walk instead
> of the trylock-and-retry. That way, the ordering wouldn't matter. It would
> slow migration if multiple processes are migrating pages by some unknowable
> quantity but it would avoid livelocking.
Hmm.. An idea is starting to take form..
How about something like this?
- the lock is per-anon_vma
BUT
- you always lock the _deepest_ anon_vma you can find.
That means just a single lock. And the "deepest" anon_vma is well-defined
for all anon_vma's, because each same_anon_vma chain is always rooted in
the original anon_vma that caused it.
From the vma, it's simply
avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
anon_vma = avc->anon_vma;
and once you take that lock, you know you've gotten the lock for all
chains related to that page. We _know_ that every single vma that is
associated with that anon_vma must have a chain that eventually ends in
that entry.
So I wonder if the locking can't be just something like this:
struct anon_vma *lock_anon_vma_root(struct page *page)
{
struct anon_vma *anon_vma, *root;
rcu_read_lock();
anon_vma = page_anon_vma(page);
if (!anon_vma) {
rcu_read_unlock();
return NULL;
}
/* Make sure the anon_vma 'same_anon_vma' list is stable! */
spin_lock(&anon_vma->lock);
root = NULL;
if (!list_empty(&anon_vma->head)) {
struct anon_vma_chain *avc;
struct vm_area_struct *vma;
avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
vma = avc->vma;
avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
root = avc->anon_vma;
}
/* We already locked it - anon_vma _was_ the root */
if (root == anon_vma)
return root;
spin_unlock(&anon_vma->lock);
if (root) {
spin_lock(&root->lock);
return root;
}
rcu_read_unlock();
return NULL;
}
and
void unlock_anon_vma_root(struct anon_vma *root)
{
spin_unlock(&root->lock);
rcu_read_unlock();
}
or something. I agree that the above is not _beautiful_, and it's not
exactly simple, but it does seem to have the absolutely huge advantage
that it is a nice O(1) thing that only ever takes a single lock and has no
nesting. And while the code looks complicated, it's based on a pretty
simple constraint on the anon_vma's that we already require (ie that all
related anon_vma chains have to end up at the same root anon_vma).
In other words: _any_ vma that is associated with _any_ related anon_vma
will always end up feeding up to the same root anon_vma.
I do think other people should think this through. And it needs a comment
that really explains all this.
(And the code above is written in my email editor - it has not been
tested, compiled, or anything else. It may _look_ like real code, but
think of it as pseudo-code where the explanation for the code is more
important than the exact details.)
NOTE NOTE NOTE! In particular, I think that the 'rcu_read_lock()' and the
actual lookup of the anon_vma (ie the "anon_vma = page_anon_vma(page)")
part should probably be in the callers. I put it in the pseudo-code itself
to just show how you go from a 'struct page' to the "immediate" anon_vma
it is associated with, and from that to the "root" anon_vma of the whole
chain.
And maybe I'm too clever for myself, and I've made some fundamental
mistake that means that the above doesn't work.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 17:34 ` Linus Torvalds
@ 2010-05-05 17:57 ` Linus Torvalds
2010-05-05 18:14 ` Mel Gorman
2010-05-06 13:40 ` Rik van Riel
2 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 17:57 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Linus Torvalds wrote:
>
> From the vma, it's simply
> avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> anon_vma = avc->anon_vma;
>
> and once you take that lock, you know you've gotten the lock for all
> chains related to that page. We _know_ that every single vma that is
> associated with that anon_vma must have a chain that eventually ends in
> that entry.
To clarify: here "is associated with that anon_vma" is basically about the
whole forest of anon_vma/vma links. Different vma's can be associated with
different anon_vma's, and pages can be associated with anon_vma's that can
in turn reach other anon_vma's and many other vma's.
But regardless of _how_ you walk the chains between anon_vma's and vma's
(including walking "back-pointers" that don't even exist except
conceptually for the pointer going the other way), any relationship will
have started at _some_ root vma.
IOW, the root anon_vma is directly 1:1 related to "do these vma/anon_vma's
relate in _any_ way". If it's the same root anon_vma, then there is a
historical relationship. And if the root anon_vma's are different, then
there cannot be any relationship at all.
So locking the root anon_vma is both minimal _and_ sufficient. Any locking
scheme (traversing the lists) would eventually end up hitting that root
entry (minimal locking), and at the same time that root entry is also
guaranteed to be the same for all related entities (ie it's sufficient to
lock the root entry if everybody else also looks up their root and locks
it).
I think. Tell me if I'm wrong.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 17:34 ` Linus Torvalds
2010-05-05 17:57 ` Linus Torvalds
@ 2010-05-05 18:14 ` Mel Gorman
2010-05-05 18:34 ` Linus Torvalds
2010-05-06 13:40 ` Rik van Riel
2 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 18:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 10:34:03AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > I'm still thinking of the ordering but one possibility would be to use a mutex
> > similar to mm_all_locks_mutex to force the serialisation of rmap_walk instead
> > of the trylock-and-retry. That way, the ordering wouldn't matter. It would
> > slow migration if multiple processes are migrating pages by some unknowable
> > quantity but it would avoid livelocking.
>
> Hmm.. An idea is starting to take form..
>
> How about something like this?
>
> - the lock is per-anon_vma
>
> BUT
>
> - you always lock the _deepest_ anon_vma you can find.
>
Maybe I should have read my mail before sending off a work-in-progress patch :/
> That means just a single lock. And the "deepest" anon_vma is well-defined
> for all anon_vma's, because each same_anon_vma chain is always rooted in
> the original anon_vma that caused it.
>
In the direction I was taking, only rmap_walk took the deepest lock (I called
it oldest but hey) and would take other anon_vma locks as well. The objective
was to make sure the order the locks were taken in was correct.
I think you are suggesting that any anon_vma lock that is taken should always
take the deepest lock. Am I right, and is that necessary? The downside is that
there is a single lock that is hotter. The upside is that rmap_walk no longer
has different semantics from vma_adjust and friends because it's the same lock.
> From the vma, it's simply
> avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> anon_vma = avc->anon_vma;
>
I think I got that right at least.
> and once you take that lock, you know you've gotten the lock for all
> chains related to that page. We _know_ that every single vma that is
> associated with that anon_vma must have a chain that eventually ends in
> that entry.
>
That was my understanding but I'm still not sure of my footing on the
anon_vma changes so doubted myself.
> So I wonder if the locking can't be just something like this:
>
> struct anon_vma *lock_anon_vma_root(struct page *page)
> {
> struct anon_vma *anon_vma, *root;
>
> rcu_read_lock();
Not sure if this is necessary. In the case of rmap_walk(), it's already
held. In the other cases, a semaphore is held which should prevent the first
anon_vma disappearing.
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> return ret;
> /* Make sure the anon_vma 'same_anon_vma' list is stable! */
> spin_lock(&anon_vma->lock);
> root = NULL;
> if (!list_empty(&anon_vma->head)) {
Can it be empty? I didn't think it was possible as the anon_vma must
have at least its own chain.
> struct anon_vma_chain *avc;
> struct vm_area_struct *vma;
> struct anon_vma *root;
> avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> vma = avc->vma;
> avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> root = avc->anon_vma;
> }
> /* We already locked it - anon_vma _was_ the root */
> if (root == anon_vma)
> return root;
> spin_unlock(&anon_vma->lock);
> if (root) {
> spin_lock(&root->lock);
> return root;
> }
> rcu_read_unlock();
> return NULL;
> }
Other than terminology and minor implementation details, this is more or
less what I came up with as well.
>
> and
>
> void unlock_anon_vma_root(struct anon_vma *root)
> {
> spin_unlock(&root->lock);
> rcu_read_unlock();
> }
>
> or something. I agree that the above is not _beautiful_, and it's not
> exactly simple, but it does seem to have the absolutely huge advantage
> that it is a nice O(1) thing that only ever takes a single lock and has no
> nesting.
This is the only significant point we differ on. Do we:
1. Always take the deepest anon_vma lock, using the local anon_vma lock
just to protect the list long enough to find it?
or
2. Only have rmap_walk use the deepest anon_vma lock just so it takes
multiple locks in the correct order?
I am not seeing a killer advantage of one over the other. While (1)
would mean the same lock is always taken, it's a double lock to get it:
"lock local, find deepest, lock deepest, release local", which adds a
small overhead to the common path. With (2), the searching around is
confined to migration.
Andrea, is there an advantage of one over the other for you or is Rik's
approach just better overall?
Rik, any opinions?
> And while the code looks complicated, it's based on a pretty
> simple constraint on the anon_vma's that we already require (ie that all
> related anon_vma chains have to end up at the same root anon_vma).
>
> In other words: _any_ vma that is associated with _any_ related anon_vma
> will always end up feeding up to the same root anon_vma.
>
> I do think other people should think this through. And it needs a comment
> that really explains all this.
>
> (And the code above is written in my email editor - it has not been
> tested, compiled, or anythign else. It may _look_ like real code, but
> think of it as pseudo-code where the explanation for the code is more
> important than the exact details.
>
Well, for what it's worth, the basic principle appears to work. At least,
the machine I'm testing my own patch on hasn't managed to deadlock yet.
> NOTE NOTE NOTE! In particular, I think that the 'rcu_read_lock()' and the
> actual lookup of the anon_vma (ie the "anon_vma = page_anon_vma(page)")
> part should probably be in the callers. I put it in the pseudo-code itself
> to just show how you go from a 'struct page' to the "immediate" anon_vma
> it is associated with, and from that to the "root" anon_vma of the whole
> chain.
>
> And maybe I'm too clever for myself, and I've made some fundamental
> mistake that means that the above doesn't work.
>
Considering that we came up with more or less the same thing, the basic
idea of "lock the deepest anon_vma" must be sound :/
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 18:14 ` Mel Gorman
@ 2010-05-05 18:34 ` Linus Torvalds
2010-05-06 11:03 ` Mel Gorman
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 18:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Mel Gorman wrote:
>
> In the direction I was taking, only rmap_walk took the deepest lock (I called
> it oldest but hey) and would take other anon_vma locks as well. The objective
> was to make sure the order the locks were taken in was correct.
>
> I think you are suggesting that any anon_vma lock that is taken should always
> take the deepest lock. Am I right, and is that necessary? The downside is that
> there is a single lock that is hotter. The upside is that rmap_walk no longer
> has different semantics than vma_adjust and friends because it's the same lock.
I could personally go either way, I don't really care that deeply.
I think you could easily just take the root lock in the rmap_walk_anon/ksm
paths, and _also_ take the individual locks as you walk it (safe, since
now the root lock avoids the ABBA issue - you only need to compare the
individual lock against the root lock to not take it twice, of course).
Or you could take the "heavy lock" approach that Andrea was arguing for,
but rather than iterating you'd just take the root lock.
I absolutely _hated_ the "iterate over all locks in the normal case" idea,
but with the root lock it's much more targeted and no longer is about
nested locks of the same type.
So the things I care about are just:
- I hate that "retry" logic that made things more complex and had the
livelock problem.
The "root lock" helper function certainly wouldn't be any fewer lines
than your retry version, but it's a clearly separate locking function,
rather than mixed in with the walking code. And it doesn't do livelock.
- I detest "take all locks" in normal paths. I'm ok with it for special
case code (and I think the migrate code counts as special case), but I
think it was really horribly and fundamentally wrong in that "mm: Take
all anon_vma locks in anon_vma_lock" patch I saw.
but whether we want to take the root lock in "anon_vma_lock()" or not is
just a "detail" as far as I'm concerned. It's no longer "horribly wrong".
It might have scalability issues etc, of course, but likely only under
insane loads.
So either way works for me.
> > if (!list_empty(&anon_vma->head)) {
>
> Can it be empty? I didn't think it was possible as the anon_vma must
> have at least its own chain.
Ok, so that was answered in the other email - I think it's necessary in
the general case, although depending on exactly _how_ the page was looked
up, that may not be true.
If you have guarantees that the page is still mapped (thanks to the page
table lock or something) and the anon_vma can't go away (just a read lock
on the mmap_sem that was used to look up the page would also be sufficient),
that list_empty() check is unnecessary.
So it's a bit context-dependent.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 18:34 ` Linus Torvalds
@ 2010-05-06 11:03 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 11:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 11:34:05AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > In the direction I was taking, only rmap_walk took the deepest lock (I called
> > it oldest but hey) and would take other anon_vma locks as well. The objective
> > was to make sure the order the locks were taken in was correct.
> >
> > I think you are suggesting that any anon_vma lock that is taken should always
> > take the deepest lock. Am I right, and is that necessary? The downside is that
> > there is a single lock that is hotter. The upside is that rmap_walk no longer
> > has different semantics than vma_adjust and friends because it's the same lock.
>
> I could personally go either way, I don't really care that deeply.
>
> I think you could easily just take the root lock in the rmap_walk_anon/ksm
> paths, and _also_ take the individual locks as you walk it (safe, since
> now the root lock avoids the ABBA issue - you only need to compare the
> individual lock against the root lock to not take it twice, of course).
>
This is what I'm currently doing.
> Or you could take the "heavy lock" approach that Andrea was arguing for,
> but rather than iterating you'd just take the root lock.
>
Initially, I thought the problem with this was making the root anon_vma
lock hotter. It didn't seem that big of a deal but it was there. The greater
problem was that the RCU lock is needed to exchange the local anon_vma lock
with the root anon_vma lock. So the "heavy lock" approach is actually quite
heavy because it involves two spinlocks, the RCU lock and the root lock
being hotter.
Right now, I'm thinking that only rmap_walk taking the root anon_vma
lock and taking multiple locks as it walks is nicer. I believe it's
sufficient for migration but it also needs to be sufficient for
transparent hugepage support.
> I absolutely _hated_ the "iterate over all locks in the normal case" idea,
> but with the root lock it's much more targeted and no longer is about
> nested locks of the same type.
>
> So the things I care about are just:
>
> - I hate that "retry" logic that made things more complex and had the
> livelock problem.
>
Dumped.
> The "root lock" helper function certainly wouldn't be any fewer lines
> than your retry version, but it's a clearly separate locking function,
> rather than mixed in with the walking code. And it doesn't do livelock.
>
Agreed.
> - I detest "take all locks" in normal paths. I'm ok with it for special
> case code (and I think the migrate code counts as special case), but I
> think it was really horribly and fundamentally wrong in that "mm: Take
> all anon_vma locks in anon_vma_lock" patch I saw.
>
> but whether we want to take the root lock in "anon_vma_lock()" or not is
> just a "detail" as far as I'm concerned. It's no longer "horribly wrong".
> It might have scalability issues etc, of course, but likely only under
> insane loads.
>
I think this approach of always taking the root lock would be neater in a
number of respects because from a page, there would be the "one true lock".
If RCU was not involved, it would be particularly nice.
Part of PeterZ's "replace anon_vma lock with mutex" involves proper
reference counting of anon_vma. If even just the reference-counting part
were polished, it would allow us to always take the root anon_vma lock
without RCU, because it would be
lock local_anon_vma
find root_anon_vma
get root_anon_vma
unlock local_anon_vma
lock root anon_vma
put root_anon_vma
So maybe when anon_vma is reference counted, it'd be best to switch to
always locking the root anon_vma.
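The lock/get/unlock/lock/put sequence above can be sketched as a userspace model, assuming a refcount pins the root across the swap (all names illustrative, not the kernel API; a plain int stands in for atomic_t, pthread mutexes for the spinlocks):

```c
/* Sketch of the refcount-based lock swap: pin the root with a
 * reference before dropping the local lock, so the root cannot be
 * freed in the window between the two lock operations. */
#include <assert.h>
#include <pthread.h>

struct anon_vma {
	struct anon_vma *root;
	int refcount;		/* would be atomic_t in the kernel */
	pthread_mutex_t lock;
};

static void get_anon_vma(struct anon_vma *av) { av->refcount++; }
static void put_anon_vma(struct anon_vma *av) { av->refcount--; }

/* lock local -> find root -> get root -> unlock local -> lock root -> put root */
struct anon_vma *lock_root_anon_vma(struct anon_vma *local)
{
	struct anon_vma *root;

	pthread_mutex_lock(&local->lock);
	root = local->root;	/* stable while the local lock is held */
	get_anon_vma(root);	/* pin it across the lock swap */
	pthread_mutex_unlock(&local->lock);

	pthread_mutex_lock(&root->lock);
	put_anon_vma(root);	/* lock now held; the pin is no longer needed */
	return root;
}

/* Returns 1 if the handoff ends with the root locked and its
 * refcount back where it started. */
int handoff_demo(void)
{
	struct anon_vma root  = { .root = &root, .refcount = 1,
				  .lock = PTHREAD_MUTEX_INITIALIZER };
	struct anon_vma local = { .root = &root, .refcount = 1,
				  .lock = PTHREAD_MUTEX_INITIALIZER };
	struct anon_vma *locked = lock_root_anon_vma(&local);

	pthread_mutex_unlock(&locked->lock);
	return locked == &root && root.refcount == 1;
}
```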
For the moment though, I reckon it's best to only have rmap_walk
concerned with the root anon_vma and have it take multiple locks.
> So either way works for me.
>
> > > if (!list_empty(&anon_vma->head)) {
> >
> > Can it be empty? I didn't think it was possible as the anon_vma must
> > have at least its own chain.
>
> Ok, so that was answered in the other email - I think it's necessary in
> the general case, although depending on exactly _how_ the page was looked
> up, that may not be true.
>
> If you have guarantees that the page is still mapped (thanks to the page
> table lock or something) and the anon_vma can't go away (just a read lock
> on the mmap_sem that was used to look up the page would also be sufficient),
> that list_empty() check is unnecessary.
>
> So it's a bit context-dependent.
>
> Linus
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 17:34 ` Linus Torvalds
2010-05-05 17:57 ` Linus Torvalds
2010-05-05 18:14 ` Mel Gorman
@ 2010-05-06 13:40 ` Rik van Riel
2010-05-06 13:45 ` Mel Gorman
2 siblings, 1 reply; 72+ messages in thread
From: Rik van Riel @ 2010-05-06 13:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Andrea Arcangeli
On 05/05/2010 01:34 PM, Linus Torvalds wrote:
> - you always lock the _deepest_ anon_vma you can find.
The emphasis should be on "always" :)
> That means just a single lock. And the "deepest" anon_vma is well-defined
> for all anon_vma's, because each same_anon_vma chain is always rooted in
> the original anon_vma that caused it.
It should work, but only if we always take the deepest
anon_vma lock.
Not just in the migration code, but also in mmap, munmap,
mprotect (for split_vma), expand_stack, etc...
Otherwise we will still not provide exclusion of migrate
vs. those events.
I'm guessing that means changing both anon_vma_lock and
page_lock_anon_vma to always take the deepest anon_vma
lock - not introducing a new function that is only called
by the migration code.
--
All rights reversed
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 13:40 ` Rik van Riel
@ 2010-05-06 13:45 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 13:45 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Andrea Arcangeli
On Thu, May 06, 2010 at 09:40:56AM -0400, Rik van Riel wrote:
> On 05/05/2010 01:34 PM, Linus Torvalds wrote:
>
>> - you always lock the _deepest_ anon_vma you can find.
>
> The emphasis should be on "always" :)
>
>> That means just a single lock. And the "deepest" anon_vma is well-defined
>> for all anon_vma's, because each same_anon_vma chain is always rooted in
>> the original anon_vma that caused it.
>
> It should work, but only if we always take the deepest
> anon_vma lock.
>
> Not just in the migration code, but also in mmap, munmap,
> mprotect (for split_vma), expand_stack, etc...
>
> Otherwise we will still not provide exclusion of migrate
> vs. those events.
>
Are you sure?
I thought this as well but considered a situation something like
root anon_vma <--- rmap_walk starts here
anon_vma a
anon_vma b
anon_vma c <--- an munmap/mmap/mprotect/etc here
anon_vma d
anon_vma e
The rmap_walk takes the root lock and then locks a, b, c, d and e as it
walks along.
The mSomething event happens on c and takes that lock.
If rmap_walk gets there first, it takes the lock and the mSomething
event waits until the full rmap_walk is complete (delayed slightly but
no biggie).
If mSomething gets there first, rmap_walk will wait on taking the lock.
Again, there could be some delays but no biggie.
What am I missing?
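The scenario in the diagram above can be modelled in userspace (hypothetical, pthread-based sketch): one thread locks the root and then each chain lock in list order, while the other takes only a single mid-chain lock. Because the second side holds at most one lock at a time, no ABBA cycle can form and both always finish.

```c
/* Toy model of rmap_walk vs a single mmap/munmap-style event
 * ("mSomething") contending on one anon_vma chain. */
#include <assert.h>
#include <pthread.h>

#define NLOCKS 6		/* root + anon_vmas a..e */
static pthread_mutex_t chain[NLOCKS];
static int walk_done, event_done;

static void *rmap_walk_side(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&chain[0]);		/* root first */
	for (int i = 1; i < NLOCKS; i++) {	/* then a..e in list order */
		pthread_mutex_lock(&chain[i]);
		pthread_mutex_unlock(&chain[i]);
	}
	pthread_mutex_unlock(&chain[0]);
	walk_done = 1;
	return NULL;
}

static void *msomething_side(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&chain[3]);		/* just anon_vma c */
	pthread_mutex_unlock(&chain[3]);
	event_done = 1;
	return NULL;
}

/* Returns 1 if both sides ran to completion, i.e. no deadlock. */
int no_deadlock_demo(void)
{
	pthread_t a, b;

	for (int i = 0; i < NLOCKS; i++)
		pthread_mutex_init(&chain[i], NULL);
	pthread_create(&a, NULL, rmap_walk_side, NULL);
	pthread_create(&b, NULL, msomething_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return walk_done && event_done;
}
```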
> I'm guessing that means changing both anon_vma_lock and
> page_lock_anon_vma to always take the deepest anon_vma
> lock - not introducing a new function that is only called
> by the migration code.
>
That would be the case all right but I'd prefer to have PeterZ's patches
that do full reference counting of anon_vma first instead of introducing
RCU to those paths.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 15:31 ` Linus Torvalds
2010-05-05 15:54 ` Mel Gorman
@ 2010-05-05 17:53 ` Mel Gorman
2010-05-05 18:02 ` Linus Torvalds
1 sibling, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 17:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 08:31:42AM -0700, Linus Torvalds wrote:
> That said, I do wonder if we could _make_ the ordering reliable. I did
> that for the 'same_vma' one, because I wanted to be able to verify that
> chains were consistent (and we also needed to be able to find the "oldest
> anon_vma" for the case of re-instantiating pages that might exist in
> multiple different anon_vma's).
>
> Any ideas?
>
If the same_vma list is properly ordered then maybe something like the
following is allowed?
(This patch is on top of mm,migration: Prevent rmap_walk_[anon|ksm]
seeing the wrong VMA information)
At least it booted and did not immediately kill itself during migration.
It's less clear what to do for KSM but I'm ignoring it for the moment.
==== CUT HERE ====
mm,migration: In rmap_walk, always take the locks in the same order
---
include/linux/rmap.h | 32 ++++++++++++++++++++++++++++++++
mm/ksm.c | 5 +----
mm/rmap.c | 13 ++-----------
3 files changed, 35 insertions(+), 15 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..749aaca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,6 +99,38 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
return page_rmapping(page);
}
+static inline struct anon_vma *page_anon_vma_lock_oldest(struct page *page)
+{
+ struct anon_vma *anon_vma, *oldest_anon_vma;
+ struct anon_vma_chain *avc, *oldest_avc;
+
+ /* Get the page's anon_vma */
+ if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
+ PAGE_MAPPING_ANON)
+ return NULL;
+ anon_vma = page_rmapping(page);
+ if (!anon_vma)
+ return NULL;
+
+ spin_lock(&anon_vma->lock);
+
+ /*
+ * Get the oldest anon_vma on the list by depending on the ordering
+ * of the same_vma list setup by __page_set_anon_rmap
+ */
+ avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+ oldest_avc = list_entry(avc->vma->anon_vma_chain.prev,
+ struct anon_vma_chain, same_vma);
+ oldest_anon_vma = oldest_avc->anon_vma;
+
+ if (anon_vma != oldest_anon_vma) {
+ spin_lock(&oldest_anon_vma->lock);
+ spin_unlock(&anon_vma->lock);
+ }
+
+ return oldest_anon_vma;
+}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
diff --git a/mm/ksm.c b/mm/ksm.c
index 0c09927..113f972 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1680,10 +1680,7 @@ again:
locked_vma = NULL;
if (anon_vma != vma->anon_vma) {
locked_vma = vma->anon_vma;
- if (!spin_trylock(&locked_vma->lock)) {
- spin_unlock(&anon_vma->lock);
- goto again;
- }
+ spin_lock(&locked_vma->lock);
}
if (rmap_item->address < vma->vm_start ||
diff --git a/mm/rmap.c b/mm/rmap.c
index f7ed89f..ae37a63 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1368,26 +1368,17 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
-retry:
- anon_vma = page_anon_vma(page);
+ anon_vma = page_anon_vma_lock_oldest(page);
if (!anon_vma)
return ret;
- spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
unsigned long address;
- /*
- * Guard against deadlocks by not spinning against
- * vma->anon_vma->lock. On contention release and retry
- */
locked_vma = NULL;
if (anon_vma != vma->anon_vma) {
locked_vma = vma->anon_vma;
- if (!spin_trylock(&locked_vma->lock)) {
- spin_unlock(&anon_vma->lock);
- goto retry;
- }
+ spin_lock(&locked_vma->lock);
}
address = vma_address(page, vma);
if (address != -EFAULT)
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 17:53 ` Mel Gorman
@ 2010-05-05 18:02 ` Linus Torvalds
2010-05-05 18:17 ` Mel Gorman
2010-05-06 0:22 ` Mel Gorman
0 siblings, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2010-05-05 18:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010, Mel Gorman wrote:
>
> If the same_vma list is properly ordered then maybe something like the
> following is allowed?
Heh. This is the same logic I just sent out. However:
> + anon_vma = page_rmapping(page);
> + if (!anon_vma)
> + return NULL;
> +
> + spin_lock(&anon_vma->lock);
RCU should guarantee that this spin_lock() is valid, but:
> + /*
> + * Get the oldest anon_vma on the list by depending on the ordering
> + * of the same_vma list setup by __page_set_anon_rmap
> + */
> + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
We're not guaranteed that the 'anon_vma->head' list is non-empty.
Somebody could have freed the list and the anon_vma and we have a stale
'page->anon_vma' (that has just not been _released_ yet).
And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
How did that line actually work for you? Or was it just a "it boots", but
no actual testing of the rmap walk?
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 18:02 ` Linus Torvalds
@ 2010-05-05 18:17 ` Mel Gorman
2010-05-06 0:22 ` Mel Gorman
1 sibling, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-05 18:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > If the same_vma list is properly ordered then maybe something like the
> > following is allowed?
>
> Heh. This is the same logic I just sent out. However:
>
> > + anon_vma = page_rmapping(page);
> > + if (!anon_vma)
> > + return NULL;
> > +
> > + spin_lock(&anon_vma->lock);
>
> RCU should guarantee that this spin_lock() is valid, but:
>
> > + /*
> > + * Get the oldest anon_vma on the list by depending on the ordering
> > + * of the same_vma list setup by __page_set_anon_rmap
> > + */
> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>
> We're not guaranteed that the 'anon_vma->head' list is non-empty.
>
> Somebody could have freed the list and the anon_vma and we have a stale
> 'page->anon_vma' (that has just not been _released_ yet).
>
Ahh, fair point. I asked in the other mail whether the empty-list check
was necessary - it is. Thanks
> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
>
Should have been list_first_entry.
> How did that line actually work for you? Or was it just a "it boots", but
> no actual testing of the rmap walk?
>
It's currently running a migration-related stress test without any
deadlocks or lockdep wobblies so far.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 18:02 ` Linus Torvalds
2010-05-05 18:17 ` Mel Gorman
@ 2010-05-06 0:22 ` Mel Gorman
2010-05-06 0:42 ` Linus Torvalds
2010-05-06 9:47 ` Minchan Kim
1 sibling, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 0:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 5 May 2010, Mel Gorman wrote:
> >
> > If the same_vma list is properly ordered then maybe something like the
> > following is allowed?
>
> Heh. This is the same logic I just sent out. However:
>
> > + anon_vma = page_rmapping(page);
> > + if (!anon_vma)
> > + return NULL;
> > +
> > + spin_lock(&anon_vma->lock);
>
> RCU should guarantee that this spin_lock() is valid, but:
>
> > + /*
> > + * Get the oldest anon_vma on the list by depending on the ordering
> > + * of the same_vma list setup by __page_set_anon_rmap
> > + */
> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>
> We're not guaranteed that the 'anon_vma->head' list is non-empty.
>
> Somebody could have freed the list and the anon_vma and we have a stale
> 'page->anon_vma' (that has just not been _released_ yet).
>
> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
>
> How did that line actually work for you? Or was it just a "it boots", but
> no actual testing of the rmap walk?
>
This is what I just started testing on a 4-core machine. Lockdep didn't
complain but there are two potential sources of badness in anon_vma_lock_root
marked with XXX. The second is the most important because I can't see how the
local and root anon_vma locks can be safely swapped - i.e. release local and
get the root without the root disappearing. I haven't considered the other
possibilities yet such as always locking the root anon_vma. Going to
sleep on it.
Any comments?
==== CUT HERE ====
mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock to
take in an anon_vma_chain but, as the order of anon_vmas cannot be guaranteed,
rmap_walk cannot take multiple locks. This patch has rmap_walk start
by locking the anon_vma lock associated with a page. It then finds the
"root" anon_vma using the anon_vma_chain's same_vma list, as it is strictly
ordered. The root anon_vma lock is taken and rmap_walk traverses the
list. This allows multiple locks to be taken as the list is always traversed
in the same direction.
For vma_adjust(), the locking behaviour prior to the anon_vma changes is
restored so that rmap_walk() can be sure of the integrity of the VMA
information and lists when the anon_vma lock is held. With this patch,
the vma->anon_vma->lock is taken if
a) there is any overlap with the next VMA due to the adjustment
b) a new VMA is being inserted into the address space
c) the start of the VMA is being changed so that the
relationship between vm_start and vm_pgoff is preserved
for vma_address()
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 2 +
mm/ksm.c | 20 ++++++++++--
mm/mmap.c | 9 +++++
mm/rmap.c | 88 +++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 108 insertions(+), 11 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..6d4d5f7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,8 @@ int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
void __anon_vma_link(struct vm_area_struct *);
void anon_vma_free(struct anon_vma *);
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..d16b459 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,24 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma *locked_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
- spin_lock(&anon_vma->lock);
+ anon_vma = anon_vma_lock_root(anon_vma);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1693,14 @@ again:
* were forked from the original since ksmd passed.
*/
if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
- continue;
+ goto next_vma;
ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN) {
spin_unlock(&anon_vma->lock);
goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..d635132 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *next = vma->vm_next;
struct vm_area_struct *importer = NULL;
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
struct prio_tree_root *root = NULL;
struct file *file = vma->vm_file;
long adjust_next = 0;
@@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+ anon_vma = vma->anon_vma;
+ spin_lock(&anon_vma->lock);
+ }
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (anon_vma)
+ spin_unlock(&anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..0d8db6d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,69 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
return -ENOMEM;
}
+/* Given an anon_vma, find the root of the chain, lock it and return the root */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+ struct anon_vma *root_anon_vma;
+ struct anon_vma_chain *avc, *root_avc;
+ struct vm_area_struct *vma;
+
+ /* Lock the same_anon_vma list and make sure we are on a chain */
+ spin_lock(&anon_vma->lock);
+ if (list_empty(&anon_vma->head)) {
+ spin_unlock(&anon_vma->lock);
+ return NULL;
+ }
+
+ /*
+ * Get the root anon_vma on the list by depending on the ordering
+ * of the same_vma list setup by __page_set_anon_rmap. Basically
+ * we are doing
+ *
+ * local anon_vma -> local vma -> deepest vma -> anon_vma
+ */
+ avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+ vma = avc->vma;
+ root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ root_anon_vma = root_avc->anon_vma;
+ if (!root_anon_vma) {
+ /* XXX: Can this happen? Don't think so but get confirmation */
+ WARN_ON_ONCE(1);
+ return anon_vma;
+ }
+
+ /* Get the lock of the root anon_vma */
+ if (anon_vma != root_anon_vma) {
+ /*
+ * XXX: This doesn't seem safe. What prevents root_anon_vma
+ * getting freed from underneath us? Not much but if
+ * we take the second lock first, there is a deadlock
+ * possibility if there are multiple callers of rmap_walk
+ */
+ spin_unlock(&anon_vma->lock);
+ spin_lock(&root_anon_vma->lock);
+ }
+
+ return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+ struct anon_vma *anon_vma;
+
+ /* Get the local anon_vma */
+ anon_vma = page_anon_vma(page);
+ if (!anon_vma)
+ return NULL;
+
+ return anon_vma_lock_root(anon_vma);
+}
+
static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
{
struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -326,7 +389,7 @@ void page_unlock_anon_vma(struct anon_vma *anon_vma)
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+static noinline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1358,7 +1421,7 @@ int try_to_munlock(struct page *page)
static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma, *locked_vma;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
@@ -1368,16 +1431,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
- anon_vma = page_anon_vma(page);
+ anon_vma = page_anon_vma_lock_root(page);
if (!anon_vma)
return ret;
- spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
- ret = rmap_one(page, vma, address, arg);
+ unsigned long address;
+
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+ }
+ address = vma_address(page, vma);
+ if (address != -EFAULT)
+ ret = rmap_one(page, vma, address, arg);
+
+ if (locked_vma)
+ spin_unlock(&locked_vma->lock);
+
if (ret != SWAP_AGAIN)
break;
}
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 0:22 ` Mel Gorman
@ 2010-05-06 0:42 ` Linus Torvalds
2010-05-06 10:02 ` Mel Gorman
2010-05-06 9:47 ` Minchan Kim
1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-06 0:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, 6 May 2010, Mel Gorman wrote:
> + /*
> + * Get the root anon_vma on the list by depending on the ordering
> + * of the same_vma list setup by __page_set_anon_rmap. Basically
> + * we are doing
> + *
> + * local anon_vma -> local vma -> deepest vma -> anon_vma
> + */
> + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> + vma = avc->vma;
> + root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> + root_anon_vma = root_avc->anon_vma;
> + if (!root_anon_vma) {
> + /* XXX: Can this happen? Don't think so but get confirmation */
> + WARN_ON_ONCE(1);
> + return anon_vma;
> + }
No, that can't happen. If you find an avc struct, it _will_ have a
anon_vma pointer. So there's no point in testing for NULL. If some bug
happens, you're much better off with the oops than with the warning.
> + /* Get the lock of the root anon_vma */
> + if (anon_vma != root_anon_vma) {
> + /*
> + * XXX: This doesn't seem safe. What prevents root_anon_vma
> + * getting freed from underneath us? Not much but if
> + * we take the second lock first, there is a deadlock
> + * possibility if there are multiple callers of rmap_walk
> + */
> + spin_unlock(&anon_vma->lock);
> + spin_lock(&root_anon_vma->lock);
> + }
What makes this ok is the fact that it must be running under the RCU read
lock, and anon_vma's thus cannot be released. My version of the code made
that explicit. Yours does not, and doesn't even have comments about the
fact that it needs to be called RCU read-locked. Tssk, tssk.
Please don't just assume locking. Either lock it, or say "this must be
called with so-and-so held". Not just a silent "this would be buggy if
anybody ever called it without the RCU lock".
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 0:42 ` Linus Torvalds
@ 2010-05-06 10:02 ` Mel Gorman
2010-05-06 14:15 ` Linus Torvalds
0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 10:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, May 05, 2010 at 05:42:19PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 6 May 2010, Mel Gorman wrote:
> > + /*
> > + * Get the root anon_vma on the list by depending on the ordering
> > + * of the same_vma list setup by __page_set_anon_rmap. Basically
> > + * we are doing
> > + *
> > + * local anon_vma -> local vma -> deepest vma -> anon_vma
> > + */
> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> > + vma = avc->vma;
> > + root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
> > + root_anon_vma = root_avc->anon_vma;
> > + if (!root_anon_vma) {
> > + /* XXX: Can this happen? Don't think so but get confirmation */
> > + WARN_ON_ONCE(1);
> > + return anon_vma;
> > + }
>
> No, that can't happen. If you find an avc struct, it _will_ have a
> anon_vma pointer. So there's no point in testing for NULL. If some bug
> happens, you're much better off with the oops than with the warning.
>
Good. If this returns NULL, it should oops when spin_lock(NULL->lock)
is called.
> > + /* Get the lock of the root anon_vma */
> > + if (anon_vma != root_anon_vma) {
> > + /*
> > + * XXX: This doesn't seem safe. What prevents root_anon_vma
> > + * getting freed from underneath us? Not much but if
> > + * we take the second lock first, there is a deadlock
> > + * possibility if there are multiple callers of rmap_walk
> > + */
> > + spin_unlock(&anon_vma->lock);
> > + spin_lock(&root_anon_vma->lock);
> > + }
>
> What makes this ok is the fact that it must be running under the RCU read
> lock, and anon_vma's thus cannot be released.
This is very subtle in itself. RCU guarantees that the anon_vma exists
but does it guarantee that it's the same one we expect and that it
hasn't been freed and reused?
> My version of the code made
> that explicit. Yours does not, and doesn't even have comments about the
> fact that it needs to be called RCU read-locked. Tssk, tssk.
>
I added a comment.
> Please don't just assume locking. Either lock it, or say "this must be
> called with so-and-so held". Not just a silent "this would be buggy if
> anybody ever called it without the RCU lock".
>
Sure. It was an oversight when merging what I had with what you posted
up.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 10:02 ` Mel Gorman
@ 2010-05-06 14:15 ` Linus Torvalds
2010-05-06 14:25 ` Mel Gorman
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-06 14:15 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, 6 May 2010, Mel Gorman wrote:
> >
> > What makes this ok is the fact that it must be running under the RCU read
> > lock, and anon_vma's thus cannot be released.
>
> This is very subtle in itself. RCU guarantees that the anon_vma exists
> but does it guarantee that it's the same one we expect and that it
> hasn't been freed and reused?
Nothing. And we shouldn't care.
If it's been freed and re-used, then all the anon_vma's (and vma's)
associated with the original anon_vma (and page) have been free'd.
And that, in turn, means that we don't really need to lock anything at
all. The fact that we end up locking an anon_vma that _used_ to be the
root anon_vma is immaterial - the lock won't _help_, but it shouldn't hurt
either, since it's still a valid spinlock.
Now, the above is only true as far as the anon_vma itself is concerned.
It's entirely possible that any _other_ data structures would need to be
double-checked after getting the lock. For example, is the _page_ still
associated with that anon_vma? But that's an external issue as far as the
anon_vma locking is concerned - presumably the 'rmap_walk()' caller will
have made sure that the page itself is stable somehow.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 14:15 ` Linus Torvalds
@ 2010-05-06 14:25 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 14:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 06, 2010 at 07:15:31AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 6 May 2010, Mel Gorman wrote:
> > >
> > > What makes this ok is the fact that it must be running under the RCU read
> > > lock, and anon_vma's thus cannot be released.
> >
> > This is very subtle in itself. RCU guarantees that the anon_vma exists
> > but does it guarantee that it's the same one we expect and that it
> > hasn't been freed and reused?
>
> Nothing. And we shouldn't care.
>
> If it's been freed and re-used, then all the anon_vma's (and vma's)
> associated with the original anon_vma (and page) have been free'd.
>
> And that, in turn, means that we don't really need to lock anything at
> all. The fact that we end up locking an anon_vma that _used_ to be the
> root anon_vma is immaterial - the lock won't _help_, but it shouldn't hurt
> either, since it's still a valid spinlock.
>
I can't see any problem with the logic.
> Now, the above is only true as far as the anon_vma itself is concerned.
> It's entirely possible that any _other_ data structures would need to be
> double-checked after getting the lock. For example, is the _page_ still
> associated with that anon_vma? But that's an external issue as far as the
> anon_vma locking is concerned - presumably the 'rmap_walk()' caller will
> have made sure that the page itself is stable somehow.
>
It does, by having the page locked as it performs the walk.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 0:22 ` Mel Gorman
2010-05-06 0:42 ` Linus Torvalds
@ 2010-05-06 9:47 ` Minchan Kim
2010-05-06 9:54 ` Mel Gorman
2010-05-06 14:06 ` Linus Torvalds
1 sibling, 2 replies; 72+ messages in thread
From: Minchan Kim @ 2010-05-06 9:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 6, 2010 at 9:22 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
>>
>>
>> On Wed, 5 May 2010, Mel Gorman wrote:
>> >
>> > If the same_vma list is properly ordered then maybe something like the
>> > following is allowed?
>>
>> Heh. This is the same logic I just sent out. However:
>>
>> > + anon_vma = page_rmapping(page);
>> > + if (!anon_vma)
>> > + return NULL;
>> > +
>> > + spin_lock(&anon_vma->lock);
>>
>> RCU should guarantee that this spin_lock() is valid, but:
>>
>> > + /*
>> > + * Get the oldest anon_vma on the list by depending on the ordering
>> > + * of the same_vma list setup by __page_set_anon_rmap
>> > + */
>> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>>
>> We're not guaranteed that the 'anon_vma->head' list is non-empty.
>>
>> Somebody could have freed the list and the anon_vma and we have a stale
>> 'page->anon_vma' (that has just not been _released_ yet).
>>
>> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
>>
>> How did that line actually work for you? Or was it just a "it boots", but
>> no actual testing of the rmap walk?
>>
>
> This is what I just started testing on a 4-core machine. Lockdep didn't
> complain but there are two potential sources of badness in anon_vma_lock_root
> marked with XXX. The second is the most important because I can't see how the
> local and root anon_vma locks can be safely swapped - i.e. release local and
> get the root without the root disappearing. I haven't considered the other
> possibilities yet such as always locking the root anon_vma. Going to
> sleep on it.
>
> Any comments?
<snip>
> +/* Given an anon_vma, find the root of the chain, lock it and return the root */
> +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
> +{
> + struct anon_vma *root_anon_vma;
> + struct anon_vma_chain *avc, *root_avc;
> + struct vm_area_struct *vma;
> +
> + /* Lock the same_anon_vma list and make sure we are on a chain */
> + spin_lock(&anon_vma->lock);
> + if (list_empty(&anon_vma->head)) {
> + spin_unlock(&anon_vma->lock);
> + return NULL;
> + }
> +
> + /*
> + * Get the root anon_vma on the list by depending on the ordering
> + * of the same_vma list setup by __page_set_anon_rmap. Basically
> + * we are doing
> + *
> + * local anon_vma -> local vma -> deepest vma -> anon_vma
> + */
> + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
Dumb question.
I can't understand why we should use list_first_entry.
I looked over the code.
anon_vma_chain_link uses list_add_tail so I think that's right.
But anon_vma_prepare uses list_add. So it's not consistent.
How do we make sure list_first_entry returns deepest vma?
Sorry if I am missing.
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 9:47 ` Minchan Kim
@ 2010-05-06 9:54 ` Mel Gorman
2010-05-06 10:01 ` Minchan Kim
2010-05-06 14:06 ` Linus Torvalds
1 sibling, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 9:54 UTC (permalink / raw)
To: Minchan Kim
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 06, 2010 at 06:47:12PM +0900, Minchan Kim wrote:
> On Thu, May 6, 2010 at 9:22 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
> >>
> >>
> >> On Wed, 5 May 2010, Mel Gorman wrote:
> >> >
> >> > If the same_vma list is properly ordered then maybe something like the
> >> > following is allowed?
> >>
> >> Heh. This is the same logic I just sent out. However:
> >>
> >> > + anon_vma = page_rmapping(page);
> >> > + if (!anon_vma)
> >> > + return NULL;
> >> > +
> >> > + spin_lock(&anon_vma->lock);
> >>
> >> RCU should guarantee that this spin_lock() is valid, but:
> >>
> >> > + /*
> >> > + * Get the oldest anon_vma on the list by depending on the ordering
> >> > + * of the same_vma list setup by __page_set_anon_rmap
> >> > + */
> >> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> >>
> >> We're not guaranteed that the 'anon_vma->head' list is non-empty.
> >>
> >> Somebody could have freed the list and the anon_vma and we have a stale
> >> 'page->anon_vma' (that has just not been _released_ yet).
> >>
> >> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
> >>
> >> How did that line actually work for you? Or was it just a "it boots", but
> >> no actual testing of the rmap walk?
> >>
> >
> > This is what I just started testing on a 4-core machine. Lockdep didn't
> > complain but there are two potential sources of badness in anon_vma_lock_root
> > marked with XXX. The second is the most important because I can't see how the
> > local and root anon_vma locks can be safely swapped - i.e. release local and
> > get the root without the root disappearing. I haven't considered the other
> > possibilities yet such as always locking the root anon_vma. Going to
> > sleep on it.
> >
> > Any comments?
>
> <snip>
> > +/* Given an anon_vma, find the root of the chain, lock it and return the root */
> > +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
> > +{
> > + struct anon_vma *root_anon_vma;
> > + struct anon_vma_chain *avc, *root_avc;
> > + struct vm_area_struct *vma;
> > +
> > + /* Lock the same_anon_vma list and make sure we are on a chain */
> > + spin_lock(&anon_vma->lock);
> > + if (list_empty(&anon_vma->head)) {
> > + spin_unlock(&anon_vma->lock);
> > + return NULL;
> > + }
> > +
> > + /*
> > + * Get the root anon_vma on the list by depending on the ordering
> > + * of the same_vma list setup by __page_set_anon_rmap. Basically
> > + * we are doing
> > + *
> > + * local anon_vma -> local vma -> deepest vma -> anon_vma
> > + */
> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>
> Dumb question.
>
> I can't understand why we should use list_first_entry.
>
> I looked over the code.
> anon_vma_chain_link uses list_add_tail so I think that's right.
> But anon_vma_prepare uses list_add. So it's not consistent.
> How do we make sure list_first_entry returns deepest vma?
>
list_first_entry is not getting the root (what you called deepest but let's
pick a name and stick with it or this will be worse than it already is). That
list_first_entry is what gets us from
local anon_vma -> avc for the local anon_vma -> local vma
It's the ordering of the same_vma list hanging off the local_vma that is
important according to this comment in __page_set_anon_rmap
/*
* The page may be shared between multiple processes.
* We must use the _oldest_ possible anon_vma for the
* page mapping! That anon_vma is guaranteed to be
* present in all processes that could share this page.
*
* So take the last AVC chain entry in the vma, which is the
* deepest ancestor, and use the anon_vma from that.
*/
avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
anon_vma = avc->anon_vma;
}
> Sorry if I am missing.
>
Not at all. The more people that look at this the better.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 9:54 ` Mel Gorman
@ 2010-05-06 10:01 ` Minchan Kim
2010-05-06 10:10 ` Mel Gorman
0 siblings, 1 reply; 72+ messages in thread
From: Minchan Kim @ 2010-05-06 10:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 6, 2010 at 6:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, May 06, 2010 at 06:47:12PM +0900, Minchan Kim wrote:
>> On Thu, May 6, 2010 at 9:22 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
>> >>
>> >>
>> >> On Wed, 5 May 2010, Mel Gorman wrote:
>> >> >
>> >> > If the same_vma list is properly ordered then maybe something like the
>> >> > following is allowed?
>> >>
>> >> Heh. This is the same logic I just sent out. However:
>> >>
>> >> > + anon_vma = page_rmapping(page);
>> >> > + if (!anon_vma)
>> >> > + return NULL;
>> >> > +
>> >> > + spin_lock(&anon_vma->lock);
>> >>
>> >> RCU should guarantee that this spin_lock() is valid, but:
>> >>
>> >> > + /*
>> >> > + * Get the oldest anon_vma on the list by depending on the ordering
>> >> > + * of the same_vma list setup by __page_set_anon_rmap
>> >> > + */
>> >> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>> >>
>> >> We're not guaranteed that the 'anon_vma->head' list is non-empty.
>> >>
>> >> Somebody could have freed the list and the anon_vma and we have a stale
>> >> 'page->anon_vma' (that has just not been _released_ yet).
>> >>
>> >> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
>> >>
>> >> How did that line actually work for you? Or was it just a "it boots", but
>> >> no actual testing of the rmap walk?
>> >>
>> >
>> > This is what I just started testing on a 4-core machine. Lockdep didn't
>> > complain but there are two potential sources of badness in anon_vma_lock_root
>> > marked with XXX. The second is the most important because I can't see how the
>> > local and root anon_vma locks can be safely swapped - i.e. release local and
>> > get the root without the root disappearing. I haven't considered the other
>> > possibilities yet such as always locking the root anon_vma. Going to
>> > sleep on it.
>> >
>> > Any comments?
>>
>> <snip>
>> > +/* Given an anon_vma, find the root of the chain, lock it and return the root */
>> > +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
>> > +{
>> > + struct anon_vma *root_anon_vma;
>> > + struct anon_vma_chain *avc, *root_avc;
>> > + struct vm_area_struct *vma;
>> > +
>> > + /* Lock the same_anon_vma list and make sure we are on a chain */
>> > + spin_lock(&anon_vma->lock);
>> > + if (list_empty(&anon_vma->head)) {
>> > + spin_unlock(&anon_vma->lock);
>> > + return NULL;
>> > + }
>> > +
>> > + /*
>> > + * Get the root anon_vma on the list by depending on the ordering
>> > + * of the same_vma list setup by __page_set_anon_rmap. Basically
>> > + * we are doing
>> > + *
>> > + * local anon_vma -> local vma -> deepest vma -> anon_vma
>> > + */
>> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>>
>> Dumb question.
>>
>> I can't understand why we should use list_first_entry.
>>
>> I looked over the code.
>> anon_vma_chain_link uses list_add_tail so I think that's right.
>> But anon_vma_prepare uses list_add. So it's not consistent.
>> How do we make sure list_first_entry returns deepest vma?
>>
>
> list_first_entry is not getting the root (what you called deepest but lets
> pick a name and stick with it or this will be worse than it already is). That
> list_first entry is what gets us from
>
> local anon_vma -> avc for the local anon_vma -> local vma
>
Yes. Sorry for the confusing word. :)
Let me ask again: why do we have to use list_first_entry rather than
list_entry to get the local vma?
>> Sorry if I am missing.
>>
>
> Not at all. The more people that look at this the better.
Thanks. Mel.
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 10:01 ` Minchan Kim
@ 2010-05-06 10:10 ` Mel Gorman
0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 10:10 UTC (permalink / raw)
To: Minchan Kim
Cc: Linus Torvalds, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 06, 2010 at 07:01:54PM +0900, Minchan Kim wrote:
> On Thu, May 6, 2010 at 6:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Thu, May 06, 2010 at 06:47:12PM +0900, Minchan Kim wrote:
> >> On Thu, May 6, 2010 at 9:22 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> > On Wed, May 05, 2010 at 11:02:25AM -0700, Linus Torvalds wrote:
> >> >>
> >> >>
> >> >> On Wed, 5 May 2010, Mel Gorman wrote:
> >> >> >
> >> >> > If the same_vma list is properly ordered then maybe something like the
> >> >> > following is allowed?
> >> >>
> >> >> Heh. This is the same logic I just sent out. However:
> >> >>
> >> >> > + anon_vma = page_rmapping(page);
> >> >> > + if (!anon_vma)
> >> >> > + return NULL;
> >> >> > +
> >> >> > + spin_lock(&anon_vma->lock);
> >> >>
> >> >> RCU should guarantee that this spin_lock() is valid, but:
> >> >>
> >> >> > + /*
> >> >> > + * Get the oldest anon_vma on the list by depending on the ordering
> >> >> > + * of the same_vma list setup by __page_set_anon_rmap
> >> >> > + */
> >> >> > + avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> >> >>
> >> >> We're not guaranteed that the 'anon_vma->head' list is non-empty.
> >> >>
> >> >> Somebody could have freed the list and the anon_vma and we have a stale
> >> >> 'page->anon_vma' (that has just not been _released_ yet).
> >> >>
> >> >> And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?
> >> >>
> >> >> How did that line actually work for you? Or was it just a "it boots", but
> >> >> no actual testing of the rmap walk?
> >> >>
> >> >
> >> > This is what I just started testing on a 4-core machine. Lockdep didn't
> >> > complain but there are two potential sources of badness in anon_vma_lock_root
> >> > marked with XXX. The second is the most important because I can't see how the
> >> > local and root anon_vma locks can be safely swapped - i.e. release local and
> >> > get the root without the root disappearing. I haven't considered the other
> >> > possibilities yet such as always locking the root anon_vma. Going to
> >> > sleep on it.
> >> >
> >> > Any comments?
> >>
> >> <snip>
> >> > +/* Given an anon_vma, find the root of the chain, lock it and return the root */
> >> > +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
> >> > +{
> >> > + struct anon_vma *root_anon_vma;
> >> > + struct anon_vma_chain *avc, *root_avc;
> >> > + struct vm_area_struct *vma;
> >> > +
> >> > + /* Lock the same_anon_vma list and make sure we are on a chain */
> >> > + spin_lock(&anon_vma->lock);
> >> > + if (list_empty(&anon_vma->head)) {
> >> > + spin_unlock(&anon_vma->lock);
> >> > + return NULL;
> >> > + }
> >> > +
> >> > + /*
> >> > + * Get the root anon_vma on the list by depending on the ordering
> >> > + * of the same_vma list setup by __page_set_anon_rmap. Basically
> >> > + * we are doing
> >> > + *
> >> > + * local anon_vma -> local vma -> deepest vma -> anon_vma
> >> > + */
> >> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
> >>
> >> Dumb question.
> >>
> >> I can't understand why we should use list_first_entry.
> >>
> >> I looked over the code.
> >> anon_vma_chain_link uses list_add_tail so I think that's right.
> >> But anon_vma_prepare uses list_add. So it's not consistent.
> >> How do we make sure list_first_entry returns deepest vma?
> >>
> >
> > list_first_entry is not getting the root (what you called deepest but lets
> > pick a name and stick with it or this will be worse than it already is).
Of course, I have to clean out my own references to "deepest" :/
> That
> > list_first entry is what gets us from
> >
> > local anon_vma -> avc for the local anon_vma -> local vma
> >
>
> Yes. Sorry for confusing word. :)
> Let's have a question again. What I have a question is that why we
> have to use list_first_entry not list_entry for getting local_vma?
>
Nothing other than it's easier to read and a bit more self-documenting
than:
avc = list_entry(anon_vma->head.next, struct anon_vma_chain, same_anon_vma);
>
> >> Sorry if I am missing.
> >>
> >
> > Not at all. The more people that look at this the better.
>
> Thanks. Mel.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 9:47 ` Minchan Kim
2010-05-06 9:54 ` Mel Gorman
@ 2010-05-06 14:06 ` Linus Torvalds
2010-05-06 15:59 ` Minchan Kim
1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2010-05-06 14:06 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, 6 May 2010, Minchan Kim wrote:
> > + */
> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>
> Dumb question.
>
> I can't understand why we should use list_first_entry.
It's not that we "should" use list_first_entry. It's that we want to find
_any_ entry on the list, and the most natural one is the first one.
So we could take absolutely any 'avc' entry that is reachable from the
anon_vma, and use that to look up _any_ 'vma' that is associated with that
anon_vma. And then, from _any_ of those vma's, we know how to get to the
"root anon_vma" - the one that they are all associated with.
So no, there's absolutely nothing special about the first entry. It's
just a random easily found one.
Linus
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 14:06 ` Linus Torvalds
@ 2010-05-06 15:59 ` Minchan Kim
0 siblings, 0 replies; 72+ messages in thread
From: Minchan Kim @ 2010-05-06 15:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 6, 2010 at 11:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 6 May 2010, Minchan Kim wrote:
>> > + */
>> > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
>>
>> Dumb question.
>>
>> I can't understand why we should use list_first_entry.
>
> It's not that we "should" use list_first_entry. It's that we want to find
> _any_ entry on the list, and the most natural one is the first one.
>
> So we could take absolutely any 'avc' entry that is reachable from the
> anon_vma, and use that to look up _any_ 'vma' that is associated with that
> anon_vma. And then, from _any_ of those vma's, we know how to get to the
> "root anon_vma" - the one that they are all associated with.
>
> So no, there's absolutely nothing special about the first entry. It's
> just a random easily found one.
>
> Linus
>
Thanks, Linus and Mel.
You understood my question correctly. :)
My concern was the following case.
A child process mmaps a new VMA, but the anon_vma is reused from a nearby
VMA of the child's that was linked to the parent's VMA by fork.
In that case, anon_vma_prepare calls list_add, not list_add_tail,
ex) list_add(&avc->same_anon_vma, &anon_vma->head);
That would mean list_first_entry returns the new VMA, not the old VMA, and
the new VMA's root avc isn't linked to the parent's one, so we would end up
taking each other's locks. That's why I asked.
But I looked carefully at reusable_anon_vma and found the
list_is_singular check. I remember Linus changed it to keep the problem
simple. So in my scenario the new VMA can't share the old VMA's anon_vma,
and my story is broken.
If I am missing something, please correct me. :)
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-05 13:14 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-05 14:34 ` Linus Torvalds
@ 2010-05-06 7:38 ` KAMEZAWA Hiroyuki
2010-05-06 9:46 ` Mel Gorman
1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-06 7:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Linus Torvalds, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Wed, 5 May 2010 14:14:40 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> vma_adjust() is updating anon VMA information without locks being taken.
> In contrast, file-backed mappings use the i_mmap_lock and this lack of
> locking can result in races with users of rmap_walk such as page migration.
> vma_address() can return -EFAULT for an address that will soon be valid.
> For migration, this potentially leaves a dangling migration PTE behind
> which can later cause a BUG_ON to trigger when the page is faulted in.
>
> With the recent anon_vma changes, there can be more than one anon_vma->lock
> to take in an anon_vma_chain but a second lock cannot be spun upon in case
> of deadlock. The rmap walker tries to take locks of different anon_vma's
> but if the attempt fails, locks are released and the operation is restarted.
>
> For vma_adjust(), the locking behaviour prior to the anon_vma is restored
> so that rmap_walk() can be sure of the integrity of the VMA information and
> lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
> is taken if
>
> a) If there is any overlap with the next VMA due to the adjustment
> b) If there is a new VMA is being inserted into the address space
> c) If the start of the VMA is being changed so that the
> relationship between vm_start and vm_pgoff is preserved
> for vma_address()
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
I'm sorry I couldn't catch all the details, but can I ask a question?
Why is seq_counter bad in the end? I can't understand why we have
to lock the anon_vma, with its cost risks, when it is such a mysterious
struct now. Is adding a new field to mm_struct too bad?
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
When handling rmap, there is no guarantee that "rmap is always correct",
because vma->vm_start and vma->vm_pgoff are modified without any lock.
Usually it is not a problem to see an inconsistent rmap at
try_to_unmap() etc. But at migration, this temporary inconsistency
makes rmap_walk() take a wrong decision and leak a migration_pte.
This causes a BUG later.
This patch adds a seq_counter to mm_struct (not to the vma, because the
inconsistent information can cover multiple vmas). With this, rmap_walk()
can always see consistent [start, end, pgoff] information when checking
a page's pte in a vma.
In exec()'s failure case the rmap is left broken, but we do not have to
take care of it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/exec.c | 20 +++++++++++++++-----
include/linux/mm_types.h | 2 ++
mm/mmap.c | 3 +++
mm/rmap.c | 13 ++++++++++++-
4 files changed, 32 insertions(+), 6 deletions(-)
Index: linux-2.6.34-rc5-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc5-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc5-mm1/include/linux/mm_types.h
@@ -14,6 +14,7 @@
#include <linux/page-debug-flags.h>
#include <asm/page.h>
#include <asm/mmu.h>
+#include <linux/seqlock.h>
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
@@ -310,6 +311,7 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+ seqcount_t rmap_consistent;
};
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
Index: linux-2.6.34-rc5-mm1/mm/rmap.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/mm/rmap.c
+++ linux-2.6.34-rc5-mm1/mm/rmap.c
@@ -332,8 +332,19 @@ vma_address(struct page *page, struct vm
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
unsigned long address;
+ unsigned int seq;
+
+ /*
+ * Because we don't take mm->mmap_sem, we race with vma
+ * adjustment and can see a broken rmap. To avoid that,
+ * check the consistency of the rmap with the seqcounter.
+ */
+ do {
+ seq = read_seqcount_begin(&vma->vm_mm->rmap_consistent);
+ address = vma->vm_start
+ + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ } while (read_seqcount_retry(&vma->vm_mm->rmap_consistent, seq));
- address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
/* page should be within @vma mapping range */
return -EFAULT;
Index: linux-2.6.34-rc5-mm1/fs/exec.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/fs/exec.c
+++ linux-2.6.34-rc5-mm1/fs/exec.c
@@ -517,16 +517,25 @@ static int shift_arg_pages(struct vm_are
/*
* cover the whole range: [new_start, old_end)
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
- return -ENOMEM;
-
+ write_seqcount_begin(&mm->rmap_consistent);
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
+ /*
+ * vma->vm_start should always be updated so that pgds
+ * can be freed after failure.
+ */
+ vma->vm_start = new_start;
if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ vma, new_start, length)) {
+ /*
+ * The rmap is broken here, but we can unlock because no one
+ * will page-fault on ptes in this range any more.
+ */
+ write_seqcount_end(&mm->rmap_consistent);
return -ENOMEM;
+ }
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
@@ -551,7 +560,8 @@ static int shift_arg_pages(struct vm_are
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
+ write_seqcount_end(&mm->rmap_consistent);
return 0;
}
Index: linux-2.6.34-rc5-mm1/mm/mmap.c
===================================================================
--- linux-2.6.34-rc5-mm1.orig/mm/mmap.c
+++ linux-2.6.34-rc5-mm1/mm/mmap.c
@@ -585,6 +585,7 @@ again: remove_next = 1 + (end > next->
vma_prio_tree_remove(next, root);
}
+ write_seqcount_begin(&mm->rmap_consistent);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
@@ -620,6 +621,8 @@ again: remove_next = 1 + (end > next->
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ write_seqcount_end(&mm->rmap_consistent);
+
if (remove_next) {
if (file) {
fput(file);
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 7:38 ` KAMEZAWA Hiroyuki
@ 2010-05-06 9:46 ` Mel Gorman
2010-05-06 23:52 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2010-05-06 9:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Linux-MM, LKML, Linus Torvalds, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, May 06, 2010 at 04:38:37PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 5 May 2010 14:14:40 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > vma_adjust() is updating anon VMA information without locks being taken.
> > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > locking can result in races with users of rmap_walk such as page migration.
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > For migration, this potentially leaves a dangling migration PTE behind
> > which can later cause a BUG_ON to trigger when the page is faulted in.
> >
> > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > to take in an anon_vma_chain, but a second lock cannot be spun upon in case
> > of deadlock. The rmap walker tries to take the locks of different anon_vmas,
> > but if the attempt fails, the locks are released and the operation is restarted.
> >
> > For vma_adjust(), the locking behaviour prior to the anon_vma changes is
> > restored so that rmap_walk() can be sure of the integrity of the VMA
> > information and lists when the anon_vma lock is held. With this patch, the
> > vma->anon_vma->lock is taken if
> >
> > a) there is any overlap with the next VMA due to the adjustment
> > b) a new VMA is being inserted into the address space
> > c) the start of the VMA is being changed so that the
> > relationship between vm_start and vm_pgoff is preserved
> > for vma_address()
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> I'm sorry I couldn't catch all the details, but can I ask a question?
Of course.
> Why is seq_counter bad in the end? I can't understand why we have
> to lock the anon_vma, with its cost risks, when it is such a mysterious
> struct now.
>
> Is adding a new field to mm_struct too bad?
>
It's not the biggest problem. I'm not totally against this approach, but
some of the problems I had were:
1. It introduced new locking. anon_vmas would be covered by RCU,
spinlocks and seqlock - each of which is used in different
circumstances. The last patch I posted doesn't drastically
alter the locking. It just says that if you are taking multiple
locks, you must start from the "root" anon_vma.
2. I wasn't sure if it was usable by transparent hugepage support.
Andrea?
3. I had similar concerns about it livelocking like the
trylock-and-retry although it's not terrible.
4. I couldn't convince myself at the time that it wasn't possible for
someone to manipulate the list while it was being walked and a VMA would be
missed. For example, if fork() was called while rmap_walk was happening,
were we guaranteed to find the VMAs added to the list? I admit I didn't
fully investigate this question at the time as I was still getting to
grips with anon_vma. I can reinvestigate if you think the "lock the root
anon_vma first when taking multiple locks" has a bad cost that is
potentially resolved with a seqcounter.
5. It added a field to mm_struct. It's the smallest of concerns though.
Do you think it's a better approach and should be revisited?
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> When handling rmap, there is no guarantee that "rmap is always correct",
> because vma->vm_start and vma->vm_pgoff are modified without any lock.
>
> Usually it is not a problem to see an inconsistent rmap at
> try_to_unmap() etc. But at migration, this temporary inconsistency
> makes rmap_walk() take a wrong decision and leak a migration_pte.
> This causes a BUG later.
>
> This patch adds a seq_counter to mm_struct (not to the vma, because the
> inconsistent information can cover multiple vmas). With this, rmap_walk()
> can always see consistent [start, end, pgoff] information when checking
> a page's pte in a vma.
>
> In exec()'s failure case the rmap is left broken, but we do not have to
> take care of it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> fs/exec.c | 20 +++++++++++++++-----
> include/linux/mm_types.h | 2 ++
> mm/mmap.c | 3 +++
> mm/rmap.c | 13 ++++++++++++-
> 4 files changed, 32 insertions(+), 6 deletions(-)
>
> Index: linux-2.6.34-rc5-mm1/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/include/linux/mm_types.h
> +++ linux-2.6.34-rc5-mm1/include/linux/mm_types.h
> @@ -14,6 +14,7 @@
> #include <linux/page-debug-flags.h>
> #include <asm/page.h>
> #include <asm/mmu.h>
> +#include <linux/seqlock.h>
>
> #ifndef AT_VECTOR_SIZE_ARCH
> #define AT_VECTOR_SIZE_ARCH 0
> @@ -310,6 +311,7 @@ struct mm_struct {
> #ifdef CONFIG_MMU_NOTIFIER
> struct mmu_notifier_mm *mmu_notifier_mm;
> #endif
> + seqcount_t rmap_consistent;
> };
>
> /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> Index: linux-2.6.34-rc5-mm1/mm/rmap.c
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/mm/rmap.c
> +++ linux-2.6.34-rc5-mm1/mm/rmap.c
> @@ -332,8 +332,19 @@ vma_address(struct page *page, struct vm
> {
> pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> unsigned long address;
> + unsigned int seq;
> +
> + /*
> + * Because we don't take mm->mmap_sem, we race with vma
> + * adjustment and can see a broken rmap. To avoid that,
> + * check the consistency of the rmap with the seqcounter.
> + */
> + do {
> + seq = read_seqcount_begin(&vma->vm_mm->rmap_consistent);
> + address = vma->vm_start
> + + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> + } while (read_seqcount_retry(&vma->vm_mm->rmap_consistent, seq));
>
> - address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> /* page should be within @vma mapping range */
> return -EFAULT;
> Index: linux-2.6.34-rc5-mm1/fs/exec.c
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/fs/exec.c
> +++ linux-2.6.34-rc5-mm1/fs/exec.c
> @@ -517,16 +517,25 @@ static int shift_arg_pages(struct vm_are
> /*
> * cover the whole range: [new_start, old_end)
> */
> - if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> - return -ENOMEM;
> -
> + write_seqcount_begin(&mm->rmap_consistent);
> /*
> * move the page tables downwards, on failure we rely on
> * process cleanup to remove whatever mess we made.
> */
> + /*
> + * vma->vm_start should always be updated so that pgds
> + * can be freed after failure.
> + */
> + vma->vm_start = new_start;
> if (length != move_page_tables(vma, old_start,
> - vma, new_start, length))
> + vma, new_start, length)) {
> + /*
> + * The rmap is broken here, but we can unlock because no one
> + * will page-fault on ptes in this range any more.
> + */
> + write_seqcount_end(&mm->rmap_consistent);
> return -ENOMEM;
> + }
>
> lru_add_drain();
> tlb = tlb_gather_mmu(mm, 0);
> @@ -551,7 +560,8 @@ static int shift_arg_pages(struct vm_are
> /*
> * Shrink the vma to just the new range. Always succeeds.
> */
> - vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> + vma->vm_end = new_end;
> + write_seqcount_end(&mm->rmap_consistent);
>
> return 0;
> }
> Index: linux-2.6.34-rc5-mm1/mm/mmap.c
> ===================================================================
> --- linux-2.6.34-rc5-mm1.orig/mm/mmap.c
> +++ linux-2.6.34-rc5-mm1/mm/mmap.c
> @@ -585,6 +585,7 @@ again: remove_next = 1 + (end > next->
> vma_prio_tree_remove(next, root);
> }
>
> + write_seqcount_begin(&mm->rmap_consistent);
> vma->vm_start = start;
> vma->vm_end = end;
> vma->vm_pgoff = pgoff;
> @@ -620,6 +621,8 @@ again: remove_next = 1 + (end > next->
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> + write_seqcount_end(&mm->rmap_consistent);
> +
> if (remove_next) {
> if (file) {
> fput(file);
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 9:46 ` Mel Gorman
@ 2010-05-06 23:52 ` KAMEZAWA Hiroyuki
2010-05-07 5:49 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-06 23:52 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Linus Torvalds, Minchan Kim,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, 6 May 2010 10:46:21 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, May 06, 2010 at 04:38:37PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 5 May 2010 14:14:40 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > vma_adjust() is updating anon VMA information without locks being taken.
> > > In contrast, file-backed mappings use the i_mmap_lock and this lack of
> > > locking can result in races with users of rmap_walk such as page migration.
> > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > For migration, this potentially leaves a dangling migration PTE behind
> > > which can later cause a BUG_ON to trigger when the page is faulted in.
> > >
> > > With the recent anon_vma changes, there can be more than one anon_vma->lock
> > > to take in an anon_vma_chain, but a second lock cannot be spun upon in case
> > > of deadlock. The rmap walker tries to take the locks of different anon_vmas,
> > > but if the attempt fails, the locks are released and the operation is restarted.
> > >
> > > For vma_adjust(), the locking behaviour prior to the anon_vma changes is
> > > restored so that rmap_walk() can be sure of the integrity of the VMA
> > > information and lists when the anon_vma lock is held. With this patch, the
> > > vma->anon_vma->lock is taken if
> > >
> > > a) there is any overlap with the next VMA due to the adjustment
> > > b) a new VMA is being inserted into the address space
> > > c) the start of the VMA is being changed so that the
> > > relationship between vm_start and vm_pgoff is preserved
> > > for vma_address()
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > I'm sorry I couldn't catch all the details, but can I ask a question?
>
> Of course.
>
> > Why is seq_counter bad in the end? I can't understand why we have
> > to lock the anon_vma, with its cost risks, when it is such a mysterious
> > struct now.
> >
> > Is adding a new field to mm_struct too bad?
> >
>
> It's not the biggest problem. I'm not totally against this approach, but
> some of the problems I had were:
>
> 1. It introduced new locking. anon_vmas would be covered by RCU,
> spinlocks and seqlock - each of which is used in different
> circumstances. The last patch I posted doesn't drastically
> alter the locking. It just says that if you are taking multiple
> locks, you must start from the "root" anon_vma.
>
OK. I just thought that a locking scheme where we have to figure out "which
lock should I take" is not very good.
> 2. I wasn't sure if it was usable by transparent hugepage support.
> Andrea?
Hmm.
>
> 3. I had similar concerns about it livelocking like the
> trylock-and-retry although it's not terrible.
>
Agreed.
> 4. I couldn't convince myself at the time that it wasn't possible for
> someone to manipulate the list while it was being walked and a VMA would be
> missed. For example, if fork() was called while rmap_walk was happening,
> were we guaranteed to find the VMAs added to the list? I admit I didn't
> fully investigate this question at the time as I was still getting to
> grips with anon_vma. I can reinvestigate if you think the "lock the root
> anon_vma first when taking multiple locks" has a bad cost that is
> potentially resolved with a seqcounter.
>
If there are no regressions in measurements, I have no objections.
> 5. It added a field to mm_struct. It's the smallest of concerns though.
>
> Do you think it's a better approach and should be revisited?
>
>
If everyone thinks the seqlock is simpler, I think it should be used. But it
seems you are all going ahead with the anon_vma->lock approach.
(Basically, it's OK with me if it works. We may be able to make it better later.)
I'll check your V7.
Thank you for answering.
Regards,
-Kame
* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
2010-05-06 23:52 ` KAMEZAWA Hiroyuki
@ 2010-05-07 5:49 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-07 5:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Linus Torvalds,
Minchan Kim, Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Fri, 7 May 2010 08:52:19 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 6 May 2010 10:46:21 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> > 5. It added a field to mm_struct. It's the smallest of concerns though.
> >
> > Do you think it's a better approach and should be revisited?
> >
> >
>
> If everyone thinks the seqlock is simpler, I think it should be used. But it
> seems you are all going ahead with the anon_vma->lock approach.
> (Basically, it's OK with me if it works. We may be able to make it better later.)
>
> I'll check your V7.
>
> Thank you for answering.
Please forget about the seq_counter. We may have to add a "retry" path to
avoid deadlock. If so, using anon_vma->lock in the proper manner seems sane.
Thanks,
-Kame
end of thread, other threads:[~2010-05-21 0:28 UTC | newest]
Thread overview: 72+ messages
2010-05-06 23:20 [PATCH 0/2] Fix migration races in rmap_walk() V7 Mel Gorman
2010-05-06 23:20 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-07 0:56 ` KAMEZAWA Hiroyuki
2010-05-07 16:26 ` Mel Gorman
2010-05-08 15:39 ` Andrea Arcangeli
2010-05-08 17:02 ` Linus Torvalds
2010-05-08 18:04 ` Andrea Arcangeli
2010-05-08 19:51 ` Linus Torvalds
2010-05-09 19:23 ` Mel Gorman
2010-05-06 23:20 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
2010-05-07 1:40 ` Linus Torvalds
2010-05-07 1:57 ` KAMEZAWA Hiroyuki
2010-05-07 2:12 ` Linus Torvalds
2010-05-07 4:19 ` KAMEZAWA Hiroyuki
2010-05-07 14:18 ` Linus Torvalds
2010-05-09 19:21 ` Mel Gorman
2010-05-09 19:56 ` Linus Torvalds
2010-05-09 20:06 ` Linus Torvalds
2010-05-09 20:20 ` Linus Torvalds
2010-05-10 0:40 ` KAMEZAWA Hiroyuki
2010-05-10 1:30 ` Linus Torvalds
2010-05-10 1:32 ` Linus Torvalds
2010-05-10 1:40 ` KAMEZAWA Hiroyuki
2010-05-10 1:49 ` Linus Torvalds
2010-05-10 13:24 ` Mel Gorman
2010-05-10 23:55 ` KAMEZAWA Hiroyuki
2010-05-10 0:42 ` KAMEZAWA Hiroyuki
2010-05-10 14:02 ` Mel Gorman
2010-05-10 13:49 ` Mel Gorman
2010-05-10 0:32 ` KAMEZAWA Hiroyuki
2010-05-07 9:16 ` Mel Gorman
2010-05-07 8:13 ` [PATCH 0/2] Fix migration races in rmap_walk() V7 KAMEZAWA Hiroyuki
-- strict thread matches above, loose matches on Subject: below --
2010-05-06 15:33 [PATCH 0/2] Fix migration races in rmap_walk() V6 Mel Gorman
2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-06 15:44 ` Rik van Riel
2010-05-06 15:51 ` Mel Gorman
2010-05-06 15:59 ` Linus Torvalds
2010-05-06 17:07 ` Mel Gorman
2010-05-05 13:14 [PATCH 0/2] Fix migration races in rmap_walk() V5 Mel Gorman
2010-05-05 13:14 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-05-05 14:34 ` Linus Torvalds
2010-05-05 14:56 ` Mel Gorman
2010-05-05 15:31 ` Linus Torvalds
2010-05-05 15:54 ` Mel Gorman
2010-05-05 16:13 ` Andrea Arcangeli
2010-05-05 19:11 ` Peter Zijlstra
2010-05-05 19:57 ` Andrea Arcangeli
2010-05-21 0:27 ` Andrea Arcangeli
2010-05-06 10:37 ` Mel Gorman
2010-05-05 17:34 ` Linus Torvalds
2010-05-05 17:57 ` Linus Torvalds
2010-05-05 18:14 ` Mel Gorman
2010-05-05 18:34 ` Linus Torvalds
2010-05-06 11:03 ` Mel Gorman
2010-05-06 13:40 ` Rik van Riel
2010-05-06 13:45 ` Mel Gorman
2010-05-05 17:53 ` Mel Gorman
2010-05-05 18:02 ` Linus Torvalds
2010-05-05 18:17 ` Mel Gorman
2010-05-06 0:22 ` Mel Gorman
2010-05-06 0:42 ` Linus Torvalds
2010-05-06 10:02 ` Mel Gorman
2010-05-06 14:15 ` Linus Torvalds
2010-05-06 14:25 ` Mel Gorman
2010-05-06 9:47 ` Minchan Kim
2010-05-06 9:54 ` Mel Gorman
2010-05-06 10:01 ` Minchan Kim
2010-05-06 10:10 ` Mel Gorman
2010-05-06 14:06 ` Linus Torvalds
2010-05-06 15:59 ` Minchan Kim
2010-05-06 7:38 ` KAMEZAWA Hiroyuki
2010-05-06 9:46 ` Mel Gorman
2010-05-06 23:52 ` KAMEZAWA Hiroyuki
2010-05-07 5:49 ` KAMEZAWA Hiroyuki