* [PATCH 0/2] Fix migration races in rmap_walk() V3
@ 2010-04-29 8:32 Mel Gorman
2010-04-29 8:32 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Mel Gorman
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Mel Gorman @ 2010-04-29 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
The sloppy cleanup of migration PTEs in V2 was to no one's liking, the fork()
change was unnecessary, and Rik devised a locking scheme for anon_vma that is
more robust for transparent hugepage support than what was proposed in V2.
Andrew, patch one of this series is about the correctness of locking of
anon_vma with respect to migration. While I am not aware of any reproduction
cases, it is potentially racy. Rik will probably release another version
so I'm not expecting this one to be picked up but I'm including it for
completeness.
Patch two of this series addresses the reported swapops bug: a race
between migration due to compaction and execve where pages get migrated from
the temporary stack before it is moved. Technically, it would be best if the
anon_vma lock were held while the temporary stack is moved, but it would make
exec significantly more complex, particularly in move_page_tables to handle
a corner case in migration. I don't think adding complexity is justified. If
there are no objections, please pick it up and place it between the patches
mmmigration-allow-the-migration-of-pageswapcache-pages.patch
mm-allow-config_migration-to-be-set-without-config_numa-or-memory-hot-remove.patch
Unfortunately, I'll be offline for a few days but should be back online
Tuesday.
Changelog since V2
o Drop fork changes
o Avoid migrating pages in temporary stacks during exec instead of lazily
cleaning up migration PTEs
o Drop locking-related patch and replace with Rik's
Changelog since V1
o Handle the execve race
o Be sure that rmap_walk() releases the correct VMA lock
o Hold the anon_vma lock for the address lookup and the page remap
o Add reviewed-bys
There are a number of races between migration and other operations that mean a
migration PTE can be left behind. Broadly speaking, migration works by locking
a page, unmapping it, putting a migration PTE in place that looks like a swap
entry, copying the page, and then remapping the page while removing the old
migration PTE.
If a fault occurs, the faulting process waits until migration completes.
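For background, the fault side of this looks roughly like the sketch below. It
is paraphrased from the do_swap_page()/migration_entry_wait() path of kernels
from this era; the wrapper function itself is hypothetical and details may
differ.

/*
 * Hypothetical wrapper paraphrasing how a fault on a migration PTE is
 * handled: the PTE looks like a swap entry, and the faulting task waits
 * on the page lock held by migration before retrying the fault.
 */
static int fault_on_migration_entry(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long address, pte_t orig_pte)
{
	swp_entry_t entry = pte_to_swp_entry(orig_pte);

	if (!non_swap_entry(entry) || !is_migration_entry(entry))
		return 0;	/* not a migration entry */

	/*
	 * Migration holds the page lock until the new page is remapped
	 * and the migration PTE removed; wait for that, then the caller
	 * retries the fault against the new mapping.
	 */
	migration_entry_wait(mm, pmd, address);
	return 1;
}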
The problem is that there are some races that either allow migration PTEs to
be copied or a migration PTE to be left behind. Migration still completes and
the page is unlocked but later a fault will call migration_entry_to_page()
and BUG() because the page is not locked. This series aims to close some
of these races.
Patch 1 notes that with the anon_vma changes, taking one lock is not
necessarily enough to guard against changes in all VMAs on a list.
It introduces a new lock to allow taking the locks on all anon_vmas
to exclude migration from VMA changes.
Patch 2 notes that while a VMA is moved under the anon_vma lock, the page
tables are not similarly protected. To avoid migration PTEs being left
behind, pages within a temporary stack are simply not migrated.
The reproduction case was as follows:
1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. These stress
different aspects of the problem. The operations they undertake are:
a) Forks a hundred children, each of which faults the mapping
Purpose: stress tests migration pte removal
b) Forks a hundred children, each of which punches a hole in the mapping
and faults what remains
Purpose: stress test VMA manipulations during migration
c) Forks a hundred children, each of which execs and calls echo
Purpose: stress test the execve race
d) Size the mapping to be 1.5 times physical memory. Constantly
memset it
Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
is active all the time (a minimal helper is sketched below). In theory, you
could also force this using sys_move_pages or memory hot-remove, but it would
be nowhere near as easy to test.
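The compaction trigger in step 3 can be kept running with a trivial userspace
helper along these lines. This is only an illustrative sketch, not part of the
series, and assumes root privileges plus a kernel with compaction and the
compact_memory sysctl available:

/* Illustrative only: repeatedly trigger global memory compaction. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

		if (!f) {
			perror("compact_memory");
			return 1;
		}
		fputs("1", f);
		fclose(f);
		sleep(1);
	}
}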
include/linux/mm_types.h | 1 +
include/linux/rmap.h | 28 +++++++++++++++++++---------
kernel/fork.c | 1 +
mm/init-mm.c | 1 +
mm/mmap.c | 21 ++++++++++++---------
mm/rmap.c | 40 +++++++++++++++++++++++++++++++++++-----
6 files changed, 69 insertions(+), 23 deletions(-)
* [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-04-29 8:32 [PATCH 0/2] Fix migration races in rmap_walk() V3 Mel Gorman
@ 2010-04-29 8:32 ` Mel Gorman
2010-05-02 17:28 ` Minchan Kim
2010-04-29 8:32 ` [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Mel Gorman
2010-04-30 18:28 ` [PATCH 0/2] Fix migration races in rmap_walk() V3 Andrea Arcangeli
2 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-04-29 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
From: Rik van Riel <riel@redhat.com>
Take all the locks for all the anon_vmas in anon_vma_lock; this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...
Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
otherwise we have an unavoidable lock ordering conflict. This changes the
locking rules for the "same_vma" list to be either mm->mmap_sem for write,
or mm->mmap_sem for read plus the new mm->anon_vma_chain_lock. This limits
the places where the new lock is taken to two locations: anon_vma_prepare and
expand_downwards.
Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/mm_types.h | 1 +
include/linux/rmap.h | 28 +++++++++++++++++++---------
kernel/fork.c | 1 +
mm/init-mm.c | 1 +
mm/mmap.c | 21 ++++++++++++---------
mm/rmap.c | 10 ++++++----
6 files changed, 40 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
+ spinlock_t anon_vma_chain_lock; /* Protects vma->anon_vma_chain, with mmap_sem */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..e8d480d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -61,11 +61,15 @@ struct anon_vma {
* all the anon_vmas associated with this VMA.
* The "same_anon_vma" list contains the anon_vma_chains
* which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
*/
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
- struct list_head same_vma; /* locked by mmap_sem & page_table_lock */
+ struct list_head same_vma; /* see above */
struct list_head same_anon_vma; /* locked by anon_vma->lock */
};
@@ -99,18 +103,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
return page_rmapping(page);
}
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
- struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock) \
+({ \
+ struct anon_vma *anon_vma = vma->anon_vma; \
+ if (anon_vma) { \
+ struct anon_vma_chain *avc; \
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+ spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+ } \
+})
static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ if (anon_vma) {
+ struct anon_vma_chain *avc;
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ spin_unlock(&avc->anon_vma->lock);
+ }
}
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 7f4c6fe..98adcdf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,6 +481,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
mm->nr_ptes = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ spin_lock_init(&mm->anon_vma_chain_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
mm_init_aio(mm);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 57aba0d..3ce8a1f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@ struct mm_struct init_mm = {
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+ .anon_vma_chain_lock = __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.cpu_vm_mask = CPU_MASK_ALL,
};
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..4602358 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(&mapping->i_mmap_lock);
vma->vm_truncate_count = mapping->truncate_count;
}
- anon_vma_lock(vma);
+ anon_vma_lock(vma, &mm->mmap_sem);
__vma_link(mm, vma, prev, rb_link, rb_parent);
__vma_link_file(vma);
@@ -578,6 +578,7 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ anon_vma_lock(vma, &mm->mmap_sem);
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again: remove_next = 1 + (end > next->vm_end);
vma_prio_tree_insert(vma, root);
flush_dcache_mmap_unlock(mapping);
}
+ anon_vma_unlock(vma);
if (remove_next) {
/*
@@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
return -EFAULT;
/*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
+ * Unlike expand_downwards, we do not need to take the anon_vma lock,
+ * because we leave vma->vm_start and vma->pgoff untouched.
+ * This means rmap lookups of pages inside this VMA stay valid
+ * throughout the stack expansion.
*/
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
- anon_vma_lock(vma);
/*
* vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (address < PAGE_ALIGN(address+4))
address = PAGE_ALIGN(address+4);
else {
- anon_vma_unlock(vma);
return -ENOMEM;
}
error = 0;
@@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!error)
vma->vm_end = address;
}
- anon_vma_unlock(vma);
return error;
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
unsigned long address)
{
int error;
+ struct mm_struct *mm = vma->vm_mm;
/*
* We must make sure the anon_vma is allocated
@@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
if (error)
return error;
- anon_vma_lock(vma);
+ spin_lock(&mm->anon_vma_chain_lock);
+ anon_vma_lock(vma, &mm->anon_vma_chain_lock);
/*
* vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
}
}
anon_vma_unlock(vma);
+ spin_unlock(&mm->anon_vma_chain_lock);
+
return error;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..f95b66d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,6 +23,7 @@
* inode->i_mutex (while writing or truncating, not reading or faulting)
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
+ * mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
* page->flags PG_locked (lock_page)
* mapping->i_mmap_lock
* anon_vma->lock
@@ -133,10 +134,11 @@ int anon_vma_prepare(struct vm_area_struct *vma)
goto out_enomem_free_avc;
allocated = anon_vma;
}
+
+ /* anon_vma_chain_lock to protect against threads */
+ spin_lock(&mm->anon_vma_chain_lock);
spin_lock(&anon_vma->lock);
- /* page_table_lock to protect against threads */
- spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
avc->anon_vma = anon_vma;
@@ -145,9 +147,9 @@ int anon_vma_prepare(struct vm_area_struct *vma)
list_add(&avc->same_anon_vma, &anon_vma->head);
allocated = NULL;
}
- spin_unlock(&mm->page_table_lock);
-
spin_unlock(&anon_vma->lock);
+ spin_unlock(&mm->anon_vma_chain_lock);
+
if (unlikely(allocated)) {
anon_vma_free(allocated);
anon_vma_chain_free(avc);
--
1.6.5
* [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 8:32 [PATCH 0/2] Fix migration races in rmap_walk() V3 Mel Gorman
2010-04-29 8:32 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Mel Gorman
@ 2010-04-29 8:32 ` Mel Gorman
2010-04-29 16:21 ` Andrea Arcangeli
2010-05-02 1:56 ` Christoph Lameter
2010-04-30 18:28 ` [PATCH 0/2] Fix migration races in rmap_walk() V3 Andrea Arcangeli
2 siblings, 2 replies; 44+ messages in thread
From: Mel Gorman @ 2010-04-29 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki, Mel Gorman,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG() because the page is not locked, indicating the page was migrated
but the migration PTE was not cleaned up. For example:
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---
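The assertion at swapops.h:105 that fires above is the locked-page check in
migration_entry_to_page(). Paraphrased from include/linux/swapops.h of this
era (line numbers and details may differ):

static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
	struct page *p = pfn_to_page(swp_offset(entry));
	/*
	 * Any use of migration entries may only occur while the
	 * corresponding page is locked.  A migration PTE left behind
	 * after migration completes points at an unlocked page, so a
	 * later fault trips this BUG_ON.
	 */
	BUG_ON(!PageLocked(p));
	return p;
}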
There is a race between shift_arg_pages and migration that triggers this bug.
A temporary stack is set up during exec and later moved. If migration moves
a page in the temporary stack and the VMA is then removed before migration
completes, the migration PTE may not be found, leading to a BUG when the
stack is faulted.
Ideally, shift_arg_pages would run atomically with respect to rmap_walk
by holding the anon_vma lock, but this is problematic as pages must be
allocated for page tables, which cannot happen with a spinlock held. Instead,
this patch skips processes in exec by assuming that a VMA with the stack
flags set and a map_count == 1 belongs to an exec() that has not yet
finalised the temporary stack, and its pages are not migrated. Memory
hot-remove will try again, sys_move_pages() would not be operating during
exec(), and memory compaction will simply continue to another page without
concern.
[kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/rmap.c | 30 +++++++++++++++++++++++++++++-
1 files changed, 29 insertions(+), 1 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index f95b66d..3bb6c9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1143,6 +1143,20 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}
+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ /* If only the stack is mapped, assume exec is in progress */
+ if (vma->vm_mm->map_count == 1)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1171,7 +1185,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_temporary_stack(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
--
1.6.5
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 8:32 ` [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Mel Gorman
@ 2010-04-29 16:21 ` Andrea Arcangeli
2010-04-30 19:22 ` Andrea Arcangeli
` (2 more replies)
2010-05-02 1:56 ` Christoph Lameter
1 sibling, 3 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-04-29 16:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
Hi Mel,
did you see my proposed fix? I'm running with it applied, I'd be
interested if you can test it. Surely it will also work for new
anon-vma code in upstream, because at that point there's just 1
anon-vma and nothing else attached to the vma.
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
I think it's wrong to try to handle the race in rmap walk by making
magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
vma->vm_mm->map_count == 1, when we can fix it fully and simply in
exec.c by indexing two vmas in the same anon-vma with a different
vm_start so the pages will be found at all times by the rmap_walk.
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V3
2010-04-29 8:32 [PATCH 0/2] Fix migration races in rmap_walk() V3 Mel Gorman
2010-04-29 8:32 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Mel Gorman
2010-04-29 8:32 ` [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Mel Gorman
@ 2010-04-30 18:28 ` Andrea Arcangeli
2010-05-01 13:51 ` Johannes Weiner
2 siblings, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-04-30 18:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
Hi,
with mainline + my exec-migrate race fix + your patch1 in this series,
and the below patches, THP should work safely also on top of the new
anon-vma code. I'm keeping them incremental but I'll keep them applied
also in aa.git as these patches should work for both the new and old
anon-vma semantics (if there are rejects they're minor). aa.git will
stick longer with old anon-vma code to avoid testing too much stuff at
the same time.
I'm sending the changes below for review.
I think I'll also try to use quilt guards to maintain two trees one
with new anon-vma ready for merging and one with the old
anon-vma. aa.git origin/master will stick to the old anon-vma code.
Thanks,
Andrea
---
Subject: update to the new anon-vma semantics
From: Andrea Arcangeli <aarcange@redhat.com>
The new anon-vma code broke, for example, wait_split_huge_page because the
vma->anon_vma->lock isn't necessarily the one that split_huge_page holds.
split_huge_page holds the page->mapping/anon_vma->lock if "page" is a shared
readonly hugepage.
The code that works with the new anon-vma code also works with the old one, so
it's better to apply it unconditionally so it gets more testing also on top of
the old anon-vma code.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -74,24 +74,13 @@ extern int handle_pte_fault(struct mm_st
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
+extern void wait_split_huge_page(struct mm_struct *mm, pmd_t *pmd);
#define split_huge_page_pmd(__mm, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
__split_huge_page_pmd(__mm, ____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { \
- pmd_t *____pmd = (__pmd); \
- spin_unlock_wait(&(__anon_vma)->lock); \
- /* \
- * spin_unlock_wait() is just a loop in C and so the \
- * CPU can reorder anything around it. \
- */ \
- smp_mb(); \
- BUG_ON(pmd_trans_splitting(*____pmd) || \
- pmd_trans_huge(*____pmd)); \
- } while (0)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#if HPAGE_PMD_ORDER > MAX_ORDER
@@ -118,7 +107,7 @@ static inline int split_huge_page(struct
}
#define split_huge_page_pmd(__mm, __pmd) \
do { } while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__mm, __pmd) \
do { } while (0)
#define PageTransHuge(page) 0
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -162,6 +162,9 @@ int try_to_munlock(struct page *);
*/
struct anon_vma *page_lock_anon_vma(struct page *page);
void page_unlock_anon_vma(struct anon_vma *anon_vma);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void wait_page_lock_anon_vma(struct page *page);
+#endif
int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -314,7 +314,7 @@ int copy_huge_pmd(struct mm_struct *dst_
spin_unlock(&src_mm->page_table_lock);
spin_unlock(&dst_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(src_mm, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -551,8 +551,7 @@ int zap_huge_pmd(struct mmu_gather *tlb,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&tlb->mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma,
- pmd);
+ wait_split_huge_page(tlb->mm, pmd);
} else {
struct page *page;
pgtable_t pgtable;
@@ -879,3 +878,35 @@ void __split_huge_page_pmd(struct mm_str
put_page(page);
BUG_ON(pmd_trans_huge(*pmd));
}
+
+void wait_split_huge_page(struct mm_struct *mm, pmd_t *pmd)
+{
+ struct page *page;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ VM_BUG_ON(pmd_trans_splitting(*pmd));
+ return;
+ }
+ page = pmd_page(*pmd);
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ /*
+ * The vma->anon_vma->lock is the wrong lock if the page is shared,
+ * the anon_vma->lock pointed by page->mapping is the right one.
+ */
+ wait_page_lock_anon_vma(page);
+
+ put_page(page);
+
+ /*
+ * spin_unlock_wait() is just a loop in C and so the
+ * CPU can reorder anything around it.
+ */
+ smp_mb();
+
+ BUG_ON(pmd_trans_huge(*pmd));
+ VM_BUG_ON(pmd_trans_splitting(*pmd));
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -400,7 +400,7 @@ int __pte_alloc(struct mm_struct *mm, st
pmd_t *pmd, unsigned long address)
{
pgtable_t new = pte_alloc_one(mm, address);
- int wait_split_huge_page;
+ int need_wait_split_huge_page;
if (!new)
return -ENOMEM;
@@ -420,18 +420,18 @@ int __pte_alloc(struct mm_struct *mm, st
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
spin_lock(&mm->page_table_lock);
- wait_split_huge_page = 0;
+ need_wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
mm->nr_ptes++;
pmd_populate(mm, pmd, new);
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
- wait_split_huge_page = 1;
+ need_wait_split_huge_page = 1;
spin_unlock(&mm->page_table_lock);
if (new)
pte_free(mm, new);
- if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ if (need_wait_split_huge_page)
+ wait_split_huge_page(mm, pmd);
return 0;
}
@@ -1302,7 +1302,7 @@ struct page *follow_page(struct vm_area_
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(mm, pmd);
} else {
page = follow_trans_huge_pmd(mm, address,
pmd, flags);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -225,6 +225,30 @@ void page_unlock_anon_vma(struct anon_vm
rcu_read_unlock();
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Getting a lock on a stable anon_vma from a page off the LRU is
+ * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ */
+void wait_page_lock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ unsigned long anon_mapping;
+
+ rcu_read_lock();
+ anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+ if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
+ goto out;
+ if (!page_mapped(page))
+ goto out;
+
+ anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+ spin_unlock_wait(&anon_vma->lock);
+out:
+ rcu_read_unlock();
+}
+#endif
+
/*
* At what user virtual address is page expected in @vma?
* Returns virtual address or -EFAULT if page's index/offset is not
----
Subject: adapt mincore to anon_vma chain semantics
From: Andrea Arcangeli <aarcange@redhat.com>
wait_split_huge_page interface changed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -905,7 +905,7 @@ int mincore_huge_pmd(struct vm_area_stru
ret = !pmd_trans_splitting(*pmd);
spin_unlock(&vma->vm_mm->page_table_lock);
if (unlikely(!ret))
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma->vm_mm, pmd);
else {
/*
* All logical pages in the range are present
-----
Subject: adapt mprotect to anon_vma chain semantics
From: Andrea Arcangeli <aarcange@redhat.com>
wait_split_huge_page interface changed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -929,7 +929,7 @@ int change_huge_pmd(struct vm_area_struc
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(mm, pmd);
} else {
pmd_t entry;
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 16:21 ` Andrea Arcangeli
@ 2010-04-30 19:22 ` Andrea Arcangeli
2010-04-30 20:21 ` Rik van Riel
2010-05-01 9:39 ` Andrea Arcangeli
2010-05-02 17:40 ` Minchan Kim
2010-05-04 10:32 ` Mel Gorman
2 siblings, 2 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-04-30 19:22 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
I'm building a mergeable THP+memory compaction tree ready for mainline
merging based on new anon-vma code, so I'm integrating your patch1 and
this below should be the port of my alternate fix to your patch2 to
fix the longstanding crash in migrate (not a bug in new anon-vma code
but longstanding). patch1 is instead about the bugs introduced by the
new anon-vma code that might crash migrate (even without memory
compaction and/or THP) the same way as the bug fixed by the below.
==
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+ return -ENOMEM;
+ }
+
+ /*
* cover the whole range: [new_start, old_end)
+ *
+ * The vma is attached only to vma->anon_vma so one lock is
+ * enough. Even this lock might be removed and we could run it
+ * out of order but it's nicer to make atomic updates to
+ * vm_start and so we won't need a smp_wmb() before calling
+ * move_page_tables.
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
- return -ENOMEM;
+ spin_lock(&vma->anon_vma->lock);
+ vma->vm_start = new_start;
+ spin_unlock(&vma->anon_vma->lock);
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ /* rmap walk will already find all pages using the new_start */
+ unlink_anon_vmas(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;
lru_add_drain();
@@ -551,7 +583,7 @@ static int shift_arg_pages(struct vm_are
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
return 0;
}
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-30 19:22 ` Andrea Arcangeli
@ 2010-04-30 20:21 ` Rik van Riel
2010-05-01 9:39 ` Andrea Arcangeli
1 sibling, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2010-04-30 20:21 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter
On 04/30/2010 03:22 PM, Andrea Arcangeli wrote:
> I'm building a mergeable THP+memory compaction tree ready for mainline
> merging based on new anon-vma code, so I'm integrating your patch1 and
> this below should be the port of my alternate fix to your patch2 to
> fix the longstanding crash in migrate (not a bug in new anon-vma code
> but longstanding). patch1 is instead about the bugs introduced by the
> new anon-vma code that might crash migrate (even without memory
> compaction and/or THP) the same way as the bug fixed by the below.
>
> ==
> Subject: fix race between shift_arg_pages and rmap_walk
>
> From: Andrea Arcangeli<aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-30 19:22 ` Andrea Arcangeli
2010-04-30 20:21 ` Rik van Riel
@ 2010-05-01 9:39 ` Andrea Arcangeli
2010-05-01 13:02 ` Rik van Riel
1 sibling, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-01 9:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Fri, Apr 30, 2010 at 09:22:35PM +0200, Andrea Arcangeli wrote:
> I'm building a mergeable THP+memory compaction tree ready for mainline
> merging based on new anon-vma code, so I'm integrating your patch1 and
> this below should be the port of my alternate fix to your patch2 to
> fix the longstanding crash in migrate (not a bug in new anon-vma code
> but longstanding). patch1 is instead about the bugs introduced by the
> new anon-vma code that might crash migrate (even without memory
> compaction and/or THP) the same way as the bug fixed by the below.
I noticed I didn't qrefresh to sync the patch with the working dir, and I
had some changes in the working dir that were not picked up in the patch. An
INIT_LIST_HEAD was missing; this is the patch I actually tested.
===
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+ INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+ return -ENOMEM;
+ }
+
+ /*
* cover the whole range: [new_start, old_end)
+ *
+ * The vma is attached only to vma->anon_vma so one lock is
+ * enough. Even this lock might be removed and we could run it
+ * out of order but it's nicer to make atomic updates to
+ * vm_start and so we won't need a smp_wmb() before calling
+ * move_page_tables.
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
- return -ENOMEM;
+ spin_lock(&vma->anon_vma->lock);
+ vma->vm_start = new_start;
+ spin_unlock(&vma->anon_vma->lock);
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ /* rmap walk will already find all pages using the new_start */
+ unlink_anon_vmas(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;
lru_add_drain();
@@ -551,7 +583,7 @@ static int shift_arg_pages(struct vm_are
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
return 0;
}
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-01 9:39 ` Andrea Arcangeli
@ 2010-05-01 13:02 ` Rik van Riel
0 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2010-05-01 13:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter
On 05/01/2010 05:39 AM, Andrea Arcangeli wrote:
> ===
> Subject: fix race between shift_arg_pages and rmap_walk
>
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
>
> And split_huge_page() will have the same requirements as migrate.c
> already has.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V3
2010-04-30 18:28 ` [PATCH 0/2] Fix migration races in rmap_walk() V3 Andrea Arcangeli
@ 2010-05-01 13:51 ` Johannes Weiner
2010-05-03 15:33 ` Andrea Arcangeli
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2010-05-01 13:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Fri, Apr 30, 2010 at 08:28:53PM +0200, Andrea Arcangeli wrote:
> Subject: adapt mprotect to anon_vma chain semantics
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> wait_split_huge_page interface changed.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -929,7 +929,7 @@ int change_huge_pmd(struct vm_area_struc
> if (likely(pmd_trans_huge(*pmd))) {
> if (unlikely(pmd_trans_splitting(*pmd))) {
> spin_unlock(&mm->page_table_lock);
> - wait_split_huge_page(vma->anon_vma, pmd);
> + wait_split_huge_page(mm, pmd);
That makes mprotect-vma-arg obsolete, I guess.
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 8:32 ` [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Mel Gorman
2010-04-29 16:21 ` Andrea Arcangeli
@ 2010-05-02 1:56 ` Christoph Lameter
2010-05-04 9:45 ` Mel Gorman
1 sibling, 1 reply; 44+ messages in thread
From: Christoph Lameter @ 2010-05-02 1:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Thu, 29 Apr 2010, Mel Gorman wrote:
> There is a race between shift_arg_pages and migration that triggers this bug.
> A temporary stack is setup during exec and later moved. If migration moves
> a page in the temporary stack and the VMA is then removed before migration
> completes, the migration PTE may not be found leading to a BUG when the
> stack is faulted.
A simpler solution would be to not allow migration of the temporary stack?
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-04-29 8:32 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Mel Gorman
@ 2010-05-02 17:28 ` Minchan Kim
0 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2010-05-02 17:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Andrea Arcangeli, Rik van Riel
On Thu, Apr 29, 2010 at 5:32 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Take all the locks for all the anon_vmas in anon_vma_lock, this properly
> excludes migration and the transparent hugepage code from VMA changes done
> by mmap/munmap/mprotect/expand_stack/etc...
>
> Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
> otherwise we have an unavoidable lock ordering conflict. This changes the
> locking rules for the "same_vma" list to be either mm->mmap_sem for write,
> or mm->mmap_sem for read plus the new mm->anon_vma_chain lock. This limits
> the place where the new lock is taken to 2 locations - anon_vma_prepare and
> expand_downwards.
>
> Document the locking rules for the same_vma list in the anon_vma_chain and
> remove the anon_vma_lock call from expand_upwards, which does not need it.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
I like this one.
Although it takes the locks of all the anon_vmas attached to a VMA,
that number is small so the latency shouldn't be big. :)
It's a height problem of the tree, not a width problem. :)
Thanks, Rik.
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 16:21 ` Andrea Arcangeli
2010-04-30 19:22 ` Andrea Arcangeli
@ 2010-05-02 17:40 ` Minchan Kim
2010-05-02 18:20 ` Andrea Arcangeli
2010-05-04 10:32 ` Mel Gorman
2 siblings, 1 reply; 44+ messages in thread
From: Minchan Kim @ 2010-05-02 17:40 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Fri, Apr 30, 2010 at 1:21 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Mel,
>
> did you see my proposed fix? I'm running with it applied, I'd be
> interested if you can test it. Surely it will also work for new
> anon-vma code in upstream, because at that point there's just 1
> anon-vma and nothing else attached to the vma.
>
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
>
> I think it's wrong to try to handle the race in rmap walk by making
> magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
> vma->vm_mm->map_count == 1, when we can fix it fully and simply in
> exec.c by indexing two vmas in the same anon-vma with a different
> vm_start so the pages will be found at all times by the rmap_walk.
>
I like this approach more than excluding the temporary stack during migration.
From the viewpoint of performance, Mel and Kame's approach looks good and
simple. But from the viewpoint of correctness, Andrea's looks better.
I mean, Mel's approach is that the problem is here but we solve it over
there; it creates a dependency between here and there. And in the future, if
the temporary stack and rmap code become a problem again, we would also have
to solve it there. :)
So I support this one.
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-02 17:40 ` Minchan Kim
@ 2010-05-02 18:20 ` Andrea Arcangeli
0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-02 18:20 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Mon, May 03, 2010 at 02:40:44AM +0900, Minchan Kim wrote:
> On Fri, Apr 30, 2010 at 1:21 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > Hi Mel,
> >
> > did you see my proposed fix? I'm running with it applied, I'd be
> > interested if you can test it. Surely it will also work for new
> > anon-vma code in upstream, because at that point there's just 1
> > anon-vma and nothing else attached to the vma.
> >
> > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
> >
> > I think it's wrong to try to handle the race in rmap walk by making
> > magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
> > vma->vm_mm->map_count == 1, when we can fix it fully and simply in
> > exec.c by indexing two vmas in the same anon-vma with a different
> > vm_start so the pages will be found at all times by the rmap_walk.
> >
>
> I like this approach than exclude temporal stack while migration.
>
> If we look it through viewpoint of performance, Mel and Kame's one
> look good and simple. But If I look it through viewpoint of
> correctness, Andrea's one looks good.
> I mean Mel's approach is that problem is here but let us solve it with
> there. it makes dependency between here and there. And In future, if
> temporal stack and rmap code might be problem, we also should solve it
> in there. :)
That explains exactly why I wanted to solve it locally in exec.c and
by using the same method as mremap. And it fixes all users, not just
migrate (split_huge_page may also need it in the future if we ever
allow the user stack to be born huge instead of growing huge with
khugepaged if needed).
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V3
2010-05-01 13:51 ` Johannes Weiner
@ 2010-05-03 15:33 ` Andrea Arcangeli
2010-05-03 23:41 ` Johannes Weiner
0 siblings, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-03 15:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Sat, May 01, 2010 at 03:51:10PM +0200, Johannes Weiner wrote:
> On Fri, Apr 30, 2010 at 08:28:53PM +0200, Andrea Arcangeli wrote:
> > Subject: adapt mprotect to anon_vma chain semantics
> >
> > From: Andrea Arcangeli <aarcange@redhat.com>
> >
> > wait_split_huge_page interface changed.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -929,7 +929,7 @@ int change_huge_pmd(struct vm_area_struc
> > if (likely(pmd_trans_huge(*pmd))) {
> > if (unlikely(pmd_trans_splitting(*pmd))) {
> > spin_unlock(&mm->page_table_lock);
> > - wait_split_huge_page(vma->anon_vma, pmd);
> > + wait_split_huge_page(mm, pmd);
>
> That makes mprotect-vma-arg obsolete, I guess.
Well, it's needed for flush_tlb_range. Also, normally we could run a
single invlpg on x86 to invalidate huge pmd TLBs, but I read some
errata for some x86, and I didn't want to take risks; plus this is
common code, so I can't just run a common-code flush_tlb_page. In
mincore_huge_pmd we could probably pass vma->vm_mm instead of vma (as
there is no flush_tlb_range); I can change it if you prefer.
* [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:17 [PATCH 0/2] Fix migration races in rmap_walk() V4 Rik van Riel
@ 2010-05-03 16:18 ` Rik van Riel
2010-05-03 16:41 ` Linus Torvalds
2010-05-03 16:55 ` Peter Zijlstra
0 siblings, 2 replies; 44+ messages in thread
From: Rik van Riel @ 2010-05-03 16:18 UTC (permalink / raw)
To: akpm
Cc: torvalds, Mel Gorman, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Andrea Arcangeli, Christoph Lameter
From: Rik van Riel <riel@redhat.com>
Both the page migration code and the transparent hugepage patches expect
100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
races with mmap, munmap, expand_stack, etc.
Specifically, try_to_unmap indirectly calls vma_address, which uses the
difference between vma->vm_start and vma->vm_pgoff, which can race when a
stack is expanded downwards. VMA splitting and merging present similar
issues.
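For reference, the address calculation referred to is roughly the following,
paraphrased from vma_address() in mm/rmap.c of this era. If vm_start moves
while a walker is between this calculation and the page table lookup, the
computed address no longer matches where the PTE actually is and the rmap
walk misses it:

static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	unsigned long address;

	/* Racy if vm_start/vm_pgoff change underneath the walker */
	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
		return -EFAULT;	/* page should be within @vma mapping range */
	return address;
}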
With the new anon_vma code, one VMA can be attached to multiple anon_vmas;
however, mmap, munmap, expand_stack and others only took one anon_vma->lock.
This patch changes things so we take the anon_vma locks for all of the
anon_vmas attached to a VMA in the code that try_to_unmap would otherwise
race against: mmap, munmap, expand_stack, etc.
Unfortunately, this leads to a lock ordering conflict with the page_table_lock,
which protected the "same_vma" list in the anon_vma_chain. Replacing that
lock with a new lock (mm->anon_vma_chain_lock), which is taken higher up in
the mm locking hierarchy, solves that issue. This changes the locking rules
for the "same_vma" list to be either mm->mmap_sem for write, or mm->mmap_sem
for read plus the new mm->anon_vma_chain_lock. This limits the places where
the new lock is taken to two locations: anon_vma_prepare and expand_downwards.
Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
+ spinlock_t anon_vma_chain_lock; /* Protects vma->anon_vma_chain, with mmap_sem */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
* all the anon_vmas associated with this VMA.
* The "same_anon_vma" list contains the anon_vma_chains
* which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
*/
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
- struct list_head same_vma; /* locked by mmap_sem & page_table_lock */
+ struct list_head same_vma; /* see above */
struct list_head same_anon_vma; /* locked by anon_vma->lock */
};
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
return page_rmapping(page);
}
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
- struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock) \
+({ \
+ struct anon_vma *anon_vma = vma->anon_vma; \
+ if (anon_vma) { \
+ struct anon_vma_chain *avc; \
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+ spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+ } \
+})
static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ if (anon_vma) {
+ struct anon_vma_chain *avc;
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ spin_unlock(&avc->anon_vma->lock);
+ }
}
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..83b1ba2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -468,6 +468,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
mm->nr_ptes = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ spin_lock_init(&mm->anon_vma_chain_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
mm_init_aio(mm);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 57aba0d..3ce8a1f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@ struct mm_struct init_mm = {
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+ .anon_vma_chain_lock = __SPIN_LOCK_UNLOCKED(init_mm.anon_vma_chain_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.cpu_vm_mask = CPU_MASK_ALL,
};
diff --git a/mm/mmap.c b/mm/mmap.c
index 456ec6f..81850fc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -452,7 +452,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(&mapping->i_mmap_lock);
vma->vm_truncate_count = mapping->truncate_count;
}
- anon_vma_lock(vma);
+ anon_vma_lock(vma, &mm->mmap_sem);
__vma_link(mm, vma, prev, rb_link, rb_parent);
__vma_link_file(vma);
@@ -578,6 +578,7 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ anon_vma_lock(vma, &mm->mmap_sem);
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again: remove_next = 1 + (end > next->vm_end);
vma_prio_tree_insert(vma, root);
flush_dcache_mmap_unlock(mapping);
}
+ anon_vma_unlock(vma);
if (remove_next) {
/*
@@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
return -EFAULT;
/*
- * We must make sure the anon_vma is allocated
- * so that the anon_vma locking is not a noop.
+ * Unlike expand_downwards, we do not need to take the anon_vma lock,
+ * because we leave vma->vm_start and vma->pgoff untouched.
+ * This means rmap lookups of pages inside this VMA stay valid
+ * throughout the stack expansion.
*/
- if (unlikely(anon_vma_prepare(vma)))
- return -ENOMEM;
- anon_vma_lock(vma);
/*
* vma->vm_start/vm_end cannot change under us because the caller
@@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (address < PAGE_ALIGN(address+4))
address = PAGE_ALIGN(address+4);
else {
- anon_vma_unlock(vma);
return -ENOMEM;
}
error = 0;
@@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (!error)
vma->vm_end = address;
}
- anon_vma_unlock(vma);
return error;
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1749,6 +1748,7 @@ static int expand_downwards(struct vm_area_struct *vma,
unsigned long address)
{
int error;
+ struct mm_struct *mm = vma->vm_mm;
/*
* We must make sure the anon_vma is allocated
@@ -1762,7 +1762,8 @@ static int expand_downwards(struct vm_area_struct *vma,
if (error)
return error;
- anon_vma_lock(vma);
+ spin_lock(&mm->anon_vma_chain_lock);
+ anon_vma_lock(vma, &mm->anon_vma_chain_lock);
/*
* vma->vm_start/vm_end cannot change under us because the caller
@@ -1784,6 +1785,8 @@ static int expand_downwards(struct vm_area_struct *vma,
}
}
anon_vma_unlock(vma);
+ spin_unlock(&mm->anon_vma_chain_lock);
+
return error;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 07fc947..a2dc8d9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,6 +23,7 @@
* inode->i_mutex (while writing or truncating, not reading or faulting)
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
+ * mm->anon_vma_chain_lock (mmap_sem for read, protects vma->anon_vma_chain)
* page->flags PG_locked (lock_page)
* mapping->i_mmap_lock
* anon_vma->lock
@@ -134,9 +135,10 @@ int anon_vma_prepare(struct vm_area_struct *vma)
allocated = anon_vma;
}
+ /* anon_vma_chain_lock to protect against threads */
+ spin_lock(&mm->anon_vma_chain_lock);
spin_lock(&anon_vma->lock);
- /* page_table_lock to protect against threads */
- spin_lock(&mm->page_table_lock);
+
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
avc->anon_vma = anon_vma;
@@ -146,8 +148,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
allocated = NULL;
avc = NULL;
}
- spin_unlock(&mm->page_table_lock);
spin_unlock(&anon_vma->lock);
+ spin_unlock(&mm->anon_vma_chain_lock);
if (unlikely(allocated))
anon_vma_free(allocated);
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:18 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Rik van Riel
@ 2010-05-03 16:41 ` Linus Torvalds
2010-05-03 16:53 ` Rik van Riel
2010-05-03 16:55 ` Peter Zijlstra
1 sibling, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-05-03 16:41 UTC (permalink / raw)
To: Rik van Riel
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On Mon, 3 May 2010, Rik van Riel wrote:
>
> Both the page migration code and the transparent hugepage patches expect
> 100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
> races with mmap, munmap, expand_stack, etc.
Pretty much same comments as for the other one. Why are we pandering to
the case that is/should be unusual?
Linus
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:41 ` Linus Torvalds
@ 2010-05-03 16:53 ` Rik van Riel
2010-05-03 17:17 ` Linus Torvalds
0 siblings, 1 reply; 44+ messages in thread
From: Rik van Riel @ 2010-05-03 16:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On 05/03/2010 12:41 PM, Linus Torvalds wrote:
> On Mon, 3 May 2010, Rik van Riel wrote:
>>
>> Both the page migration code and the transparent hugepage patches expect
>> 100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
>> races with mmap, munmap, expand_stack, etc.
>
> Pretty much same comments as for the other one. Why are we pandering to
> the case that is/should be unusual?
In this case, because the fix from the migration side is
difficult and fragile, while fixing things from the mmap
side is straightforward.
I believe the overhead of patch 1/2 should be minimal
as well, because the locks we take are the _depth_ of
the process tree (truncated every exec), not the width.
As for patch 2/2, Mel has an alternative approach for that:
http://lkml.org/lkml/2010/4/30/198
Does Mel's patch seem more reasonable to you?
--
All rights reversed
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:18 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Rik van Riel
2010-05-03 16:41 ` Linus Torvalds
@ 2010-05-03 16:55 ` Peter Zijlstra
2010-05-03 17:02 ` Andrea Arcangeli
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-05-03 16:55 UTC (permalink / raw)
To: Rik van Riel
Cc: akpm, torvalds, Mel Gorman, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Andrea Arcangeli, Christoph Lameter
On Mon, 2010-05-03 at 12:18 -0400, Rik van Riel wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Both the page migration code and the transparent hugepage patches expect
> 100% reliable rmap lookups and use page_lock_anon_vma(page) to prevent
> races with mmap, munmap, expand_stack, etc.
>
> Specifically, try_to_unmap indirectly calls vma_address, which uses the
> difference between vma->vm_start and vma->vm_pgoff, which can race when a
> stack is expanded downwards. VMA splitting and merging present similar
> issues.
>
> With the new anon_vma code, one VMA can be attached to multiple anon_vmas,
> however mmap, munmap, expand_stack and others only took one anon_vma->lock.
> This patch changes things so we take the anon_vma locks for all of the
> anon_vmas attached to a VMA in the code that try_to_unmap would otherwise
> race against: mmap, munmap, expand_stack, etc.
>
> Unfortunately, this leads to a lock ordering conflict with the page_table_lock,
> which previously protected the "same_vma" list in the anon_vma_chain. Replacing
> that lock with a new lock (mm->anon_vma_chain_lock), which is taken higher up in
> the mm locking hierarchy, solves that issue. This changes the locking rules
> for the "same_vma" list to be either mm->mmap_sem held for write, or mm->mmap_sem
> held for read plus the new mm->anon_vma_chain_lock. This limits the places where
> the new lock is taken to two: anon_vma_prepare and expand_downwards.
>
> Document the locking rules for the same_vma list in the anon_vma_chain and
> remove the anon_vma_lock call from expand_upwards, which does not need it.
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 456ec6f..81850fc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1705,12 +1707,11 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> return -EFAULT;
>
> /*
> - * We must make sure the anon_vma is allocated
> - * so that the anon_vma locking is not a noop.
> + * Unlike expand_downwards, we do not need to take the anon_vma lock,
> + * because we leave vma->vm_start and vma->pgoff untouched.
> + * This means rmap lookups of pages inside this VMA stay valid
> + * throughout the stack expansion.
> */
> - if (unlikely(anon_vma_prepare(vma)))
> - return -ENOMEM;
> - anon_vma_lock(vma);
>
> /*
> * vma->vm_start/vm_end cannot change under us because the caller
> @@ -1721,7 +1722,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> if (address < PAGE_ALIGN(address+4))
> address = PAGE_ALIGN(address+4);
> else {
> - anon_vma_unlock(vma);
> return -ENOMEM;
> }
> error = 0;
> @@ -1737,7 +1737,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> if (!error)
> vma->vm_end = address;
> }
> - anon_vma_unlock(vma);
> return error;
> }
> #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
This does leave me worrying about concurrent faults poking at
vma->vm_end without synchronization.
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:55 ` Peter Zijlstra
@ 2010-05-03 17:02 ` Andrea Arcangeli
2010-05-03 17:11 ` Peter Zijlstra
0 siblings, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-03 17:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, akpm, torvalds, Mel Gorman, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter
On Mon, May 03, 2010 at 06:55:12PM +0200, Peter Zijlstra wrote:
> This does leave me worrying about concurrent faults poking at
> vma->vm_end without synchronization.
I didn't check this patch in detail yet. I agree it can be removed and
I think it can be safely replaced with the page_table_lock.
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:02 ` Andrea Arcangeli
@ 2010-05-03 17:11 ` Peter Zijlstra
2010-05-03 17:18 ` Andrea Arcangeli
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-05-03 17:11 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, akpm, torvalds, Mel Gorman, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter
On Mon, 2010-05-03 at 19:02 +0200, Andrea Arcangeli wrote:
> On Mon, May 03, 2010 at 06:55:12PM +0200, Peter Zijlstra wrote:
> > This does leave me worrying about concurrent faults poking at
> > vma->vm_end without synchronization.
>
> I didn't check this patch in detail yet. I agree it can be removed and
> I think it can be safely replaced with the page_table_lock.
Sure, it could probably be replaced with the ptl, but a single
anon_vma->lock would I think be better since there's more of them.
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 16:53 ` Rik van Riel
@ 2010-05-03 17:17 ` Linus Torvalds
2010-05-03 17:58 ` Rik van Riel
0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-05-03 17:17 UTC (permalink / raw)
To: Rik van Riel
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On Mon, 3 May 2010, Rik van Riel wrote:
> >
> > Pretty much same comments as for the other one. Why are we pandering to
> > the case that is/should be unusual?
>
> In this case, because the fix from the migration side is
> difficult and fragile, while fixing things from the mmap
> side is straightforward.
>
> I believe the overhead of patch 1/2 should be minimal
> as well, because the locks we take are the _depth_ of
> the process tree (truncated every exec), not the width.
Quite frankly, I think it's totally insane to walk a list of all
anon_vma's that are associated with one vma, just to lock them all.
Tell me why you just don't put the lock in the vma itself then? Walking a
list in order to lock multiple things is something we should _never_ do
under any normal circumstances.
I can see why you'd want to do this in theory (the "other side" of the
locker might have to lock just the _one_ anon_vma), but if your argument
is that the list is usually not very deep ("one"?), then there is no
advantage, because putting the lock in the anon_vma vs the vma will get
the same kind of contention.
And if the list _is_ deep, then walking the list to lock them all is a
crime against humanity.
So explain.
> As for patch 2/2, Mel has an alternative approach for that:
>
> http://lkml.org/lkml/2010/4/30/198
>
> Does Mel's patch seem more reasonable to you?
Well, I certainly think that seems to be a lot more targeted, and it does
not add new allocations in a path that I think is already one of the more
expensive ones. Yes, you can argue that execve() is already so expensive
that a few more allocations don't matter, and you might be right, but
that's how things get to be too expensive to begin with.
That said, I do still wonder why we shouldn't just say that the person who
wants the safety is the one that should do the extra work.
In particular, why don't we just make rmap_walk() be the one that locks
all the anon_vma's? Instead of locking just one? THAT is the function that
cares. THAT is the function that should do proper locking and not expect
others to do extra unnecessary locking.
So again, my gut feel is that if the lock just were in the vma itself,
then the "normal" users would have just one natural lock, while the
special case users (rmap_walk_anon) would have to lock each vma it
traverses. That would seem to be the more natural way to lock things.
I dunno. There may well be reasons why it doesn't work, like just the
allocation lifetime rules for vma's vs anon_vma's. I'm not claiming I've
thought this through. I just get a feeling of "that isn't right" when I look
at the original 2/2 patch, and while Mel's patch certainly looks better,
it seems to be a bit ad-hoc and hacky to me.
Btw, Mel's patch doesn't really match the description of 2/2. 2/2 says
that all pages must always be findable in rmap. Mel's patch seems to
explicitly say "we want to ignore that thing that is busy for execve". Are
we just avoiding a BUG_ON()? Is perhaps the BUG_ON() buggy?
Linus
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:11 ` Peter Zijlstra
@ 2010-05-03 17:18 ` Andrea Arcangeli
0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-03 17:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, akpm, torvalds, Mel Gorman, Linux-MM, LKML,
Minchan Kim, KAMEZAWA Hiroyuki, Christoph Lameter
On Mon, May 03, 2010 at 07:11:19PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-03 at 19:02 +0200, Andrea Arcangeli wrote:
> > On Mon, May 03, 2010 at 06:55:12PM +0200, Peter Zijlstra wrote:
> > > This does leave me worrying about concurrent faults poking at
> > > vma->vm_end without synchronization.
> >
> > I didn't check this patch in detail yet. I agree it can be removed and
> > I think it can be safely replaced with the page_table_lock.
>
> Sure, it could probably be replaced with the ptl, but a single
> anon_vma->lock would I think be better since there's more of them.
The ptl is not enough, since it'd break if the stack grows by more than
the size of one pmd at a time; the page_table_lock is enough instead.
Keeping the anon_vma lock is certainly fine with me ;). I was informally
asked if it was a must-have, and I couldn't foresee any problem in
_replacing_ it (not removing it) with the page_table_lock (which I hope I
mentioned in my answer ;). But I never had an interest in removing it, I
just couldn't find any good reason to keep it either, other than "paranoid
just in case", which is good enough justification for me ;) considering
these archs are uncommon and by definition get less testing.
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:17 ` Linus Torvalds
@ 2010-05-03 17:58 ` Rik van Riel
2010-05-03 18:13 ` Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Rik van Riel @ 2010-05-03 17:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On 05/03/2010 01:17 PM, Linus Torvalds wrote:
> Quite frankly, I think it's totally insane to walk a list of all
> anon_vma's that are associated with one vma, just to lock them all.
>
> Tell me why you just don't put the lock in the vma itself then? Walking a
> list in order to lock multiple things is something we should _never_ do
> under any normal circumstances.
One problem is that we cannot find the VMAs (multiple) from
the page, except by walking the anon_vma_chain.same_anon_vma
list. At the very least, that list requires locking, done
by the anon_vma.lock.
When you already have that lock, also taking per-VMA locks
would be superfluous. It could also be difficult, especially
since the rmap side code and the mmap side code approach the
data structures from opposite sides, potentially leading to
locking order conflicts.
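(For reference, the walk in question looks roughly like the sketch below,
a simplified, from-memory rendering of rmap_walk_anon() in mm/rmap.c of
this period rather than the exact code:)
static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
                struct vm_area_struct *, unsigned long, void *), void *arg)
{
        struct anon_vma *anon_vma = page_anon_vma(page);
        struct anon_vma_chain *avc;
        int ret = SWAP_AGAIN;

        if (!anon_vma)
                return ret;
        /* The single lock the walk relies on to keep the VMA list stable */
        spin_lock(&anon_vma->lock);
        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long address = vma_address(page, vma);

                if (address == -EFAULT)
                        continue;
                ret = rmap_one(page, vma, address, arg);
                if (ret != SWAP_AGAIN)
                        break;
        }
        spin_unlock(&anon_vma->lock);
        return ret;
}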
> I can see why you'd want to do this in theory (the "other side" of the
> locker might have to lock just the _one_ anon_vma), but if your argument
> is that the list is usually not very deep ("one"?), then there is no
> advantage, because putting the lock in the anon_vma vs the vma will get
> the same kind of contention.
In a freshly exec()d process, there will be one anon_vma.
In workloads with forking daemons, like apache, postgresql
or sendmail, the child process will end up with two anon_vmas
in VMAs inherited from the parent.
> And if the list _is_ deep, then walking the list to lock them all is a
> crime against humanity.
A forkbomb could definitely end up getting slowed down by
this patch. Is there any real workload out there that just
forks deeper and deeper from the parent process, without
calling exec() after a generation or two?
>> As for patch 2/2, Mel has an alternative approach for that:
>>
>> http://lkml.org/lkml/2010/4/30/198
>>
>> Does Mel's patch seem more reasonable to you?
>
> Well, I certainly think that seems to be a lot more targeted,
> In particular, why don't we just make rmap_walk() be the one that locks
> all the anon_vma's? Instead of locking just one? THAT is the function that
> cares. THAT is the function that should do proper locking and not expect
> others to do extra unnecessary locking.
That looks like it might work for rmap_walk.
> So again, my gut feel is that if the lock just were in the vma itself,
> then the "normal" users would have just one natural lock, while the
> special case users (rmap_walk_anon) would have to lock each vma it
> traverses. That would seem to be the more natural way to lock things.
However ... there's still the issue of page_lock_anon_vma
in try_to_unmap_anon.
I guess it protects against any of the VMAs going away,
because anon_vma_unlink also takes the anon_vma->lock.
Does it need to protect against anything else?
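(For context, page_lock_anon_vma() does roughly the following; this is a
from-memory sketch of the mm/rmap.c code of this period and details may
differ. It returns with anon_vma->lock held and the RCU read lock still
taken; the caller drops both via page_unlock_anon_vma().)
struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)page->mapping;
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        /* RCU keeps the anon_vma structure itself from being freed;
         * the spinlock keeps the same_anon_vma list stable. */
        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
        spin_lock(&anon_vma->lock);
        return anon_vma;
out:
        rcu_read_unlock();
        return NULL;
}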
> Btw, Mel's patch doesn't really match the description of 2/2. 2/2 says
> that all pages must always be findable in rmap. Mel's patch seems to
> explicitly say "we want to ignore that thing that is busy for execve". Are
> we just avoiding a BUG_ON()? Is perhaps the BUG_ON() buggy?
I have no good answer to this question.
Mel? Andrea?
--
All rights reversed
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:58 ` Rik van Riel
@ 2010-05-03 18:13 ` Andrea Arcangeli
2010-05-03 18:19 ` Linus Torvalds
2010-05-04 13:12 ` Mel Gorman
2 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-03 18:13 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter
> > Btw, Mel's patch doesn't really match the description of 2/2. 2/2 says
> > that all pages must always be findable in rmap. Mel's patch seems to
> > explicitly say "we want to ignore that thing that is busy for execve". Are
> > we just avoiding a BUG_ON()? Is perhaps the BUG_ON() buggy?
>
> I have no good answer to this question.
>
> Mel? Andrea?
If try_to_unmap is allowed to establish the migration pte, then that
pte has to remain reachable through rmap_walk at all times afterwards,
or migration_entry_wait will crash because it notices the page has
already been migrated (PG_locked not set) but there is still a migration
pte established. (remove_migration_pte, like split_huge_page, isn't
allowed to fail to find all migration ptes mapping a page during the
rmap walk.)
It's not a false-positive BUG_ON, if that's what you mean; removing the
BUG_ON would still lead to an infinite hang waiting on a migration pte
that shouldn't be there anymore.
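(The check that fires is the one in migration_entry_to_page(); roughly the
following, per include/linux/swapops.h of this era, quoted from memory:)
static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
        struct page *p = pfn_to_page(swp_offset(entry));
        /*
         * Any use of migration entries may only occur while the
         * corresponding page is locked
         */
        BUG_ON(!PageLocked(p));
        return p;
}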
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:58 ` Rik van Riel
2010-05-03 18:13 ` Andrea Arcangeli
@ 2010-05-03 18:19 ` Linus Torvalds
2010-05-03 18:38 ` Rik van Riel
2010-05-04 13:12 ` Mel Gorman
2 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-05-03 18:19 UTC (permalink / raw)
To: Rik van Riel
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On Mon, 3 May 2010, Rik van Riel wrote:
>
> One problem is that we cannot find the VMAs (multiple) from
> the page, except by walking the anon_vma_chain.same_anon_vma
> list. At the very least, that list requires locking, done
> by the anon_vma.lock.
But that's exactly what we do in rmap_walk() anyway.
But yes, I can well imagine that in other cases we only do the one
anon_vma. I didn't check who used the lock.
So if we do want to keep the lock in the anon_vma, I would just suggest
that instead of making "normal" users do lots of locking, make the
rmap_walk side one.
> A forkbomb could definitely end up getting slowed down by
> this patch. Is there any real workload out there that just
> forks deeper and deeper from the parent process, without
> calling exec() after a generation or two?
Heh. AIM7. Wasn't that why we merged the multiple anon_vma's in the first
place?
> > So again, my gut feel is that if the lock just were in the vma itself,
> > then the "normal" users would have just one natural lock, while the
> > special case users (rmap_walk_anon) would have to lock each vma it
> > traverses. That would seem to be the more natural way to lock things.
>
> However ... there's still the issue of page_lock_anon_vma
> in try_to_unmap_anon.
Do we care?
We've not locked them all there, and we've historically not cared about
the rmap list being "perfect", have we?
So I _think_ it's just the migration case (and apparently potentially the
hugepage case) that wants _exact_ information. Which is why I suggest the
onus of the extra locking should be on _them_, not on the regular code.
I dunno. Again, my objections to the patches are really based more on a
gut feel of "that can't be the right thing to do" than anything else.
We have _extremely_ few places that walk lists to lock things. And they
are never "normal" code. Things like that magic "mm_take_all_locks()", for
example. That is why I then react with "that can't be right" to patches
like this.
Linus
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 18:19 ` Linus Torvalds
@ 2010-05-03 18:38 ` Rik van Riel
0 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2010-05-03 18:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: akpm, Mel Gorman, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Christoph Lameter
On 05/03/2010 02:19 PM, Linus Torvalds wrote:
> On Mon, 3 May 2010, Rik van Riel wrote:
>>
>> One problem is that we cannot find the VMAs (multiple) from
>> the page, except by walking the anon_vma_chain.same_anon_vma
>> list. At the very least, that list requires locking, done
>> by the anon_vma.lock.
>
> But that's exactly what we do in rmap_walk() anyway.
Mel's original patch adds trylock & retry all code to rmap_walk
and a few other places:
http://lkml.org/lkml/2010/4/26/321
I submitted my patch 1/2 as an alternative, because these repeated
trylocks are pretty complex and easy to accidentally break when
changes to other VM code are made.
> A forkbomb could definitely end up getting slowed down by
>> this patch. Is there any real workload out there that just
>> forks deeper and deeper from the parent process, without
>> calling exec() after a generation or two?
>
> Heh. AIM7. Wasn't that why we merged the multiple anon_vma's in the first
> place?
AIM7, like sendmail, apache or postgresql, is only 2 deep.
>>> So again, my gut feel is that if the lock just were in the vma itself,
>>> then the "normal" users would have just one natural lock, while the
>>> special case users (rmap_walk_anon) would have to lock each vma it
>>> traverses. That would seem to be the more natural way to lock things.
>>
>> However ... there's still the issue of page_lock_anon_vma
>> in try_to_unmap_anon.
>
> Do we care?
>
> We've not locked them all there, and we've historically not cared about
> the rmap list being "perfect", have we?
Well, try_to_unmap_anon walks just one page, and has the anon_vma
for that page locked.
Having said that, for pageout we do indeed not care about getting
it perfect.
> So I _think_ it's just the migration case (and apparently potentially the
> hugepage case) that wants _exact_ information. Which is why I suggest the
> onus of the extra locking should be on _them_, not on the regular code.
It's a matter of cost vs complexity. IMHO the locking changes in
the lowest overhead patches (Mel's) are quite complex and could end
up being hard to maintain in the future. I wanted to introduce
something a little simpler, with hopefully minimal overhead.
But hey, that's just my opinion - what matters is that the bug gets
fixed somehow. If you prefer the more complex but slightly lower
overhead patches from Mel, that's fine too.
--
All rights reversed
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V3
2010-05-03 15:33 ` Andrea Arcangeli
@ 2010-05-03 23:41 ` Johannes Weiner
2010-05-04 17:35 ` Andrea Arcangeli
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2010-05-03 23:41 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Mon, May 03, 2010 at 05:33:01PM +0200, Andrea Arcangeli wrote:
> On Sat, May 01, 2010 at 03:51:10PM +0200, Johannes Weiner wrote:
> > On Fri, Apr 30, 2010 at 08:28:53PM +0200, Andrea Arcangeli wrote:
> > > Subject: adapt mprotect to anon_vma chain semantics
> > >
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > >
> > > wait_split_huge_page interface changed.
> > >
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > ---
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -929,7 +929,7 @@ int change_huge_pmd(struct vm_area_struc
> > > if (likely(pmd_trans_huge(*pmd))) {
> > > if (unlikely(pmd_trans_splitting(*pmd))) {
> > > spin_unlock(&mm->page_table_lock);
> > > - wait_split_huge_page(vma->anon_vma, pmd);
> > > + wait_split_huge_page(mm, pmd);
> >
> > That makes mprotect-vma-arg obsolete, I guess.
>
> Well it's needed for flush_tlb_range.
I must have been blind, sorry for the noise.
> In mincore_huge_pmd probably we could pass vma->vm_mm instead of
> vma (as there is not flush_tlb_range), I can change it if you prefer.
Although not strictly required, it's probably nicer to keep the
function signatures in this code alike. So everything fine with
me as it stands :)
Hannes
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-02 1:56 ` Christoph Lameter
@ 2010-05-04 9:45 ` Mel Gorman
2010-05-10 17:41 ` Christoph Lameter
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-04 9:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> On Thu, 29 Apr 2010, Mel Gorman wrote:
>
> > There is a race between shift_arg_pages and migration that triggers this bug.
> > A temporary stack is setup during exec and later moved. If migration moves
> > a page in the temporary stack and the VMA is then removed before migration
> > completes, the migration PTE may not be found leading to a BUG when the
> > stack is faulted.
>
> A simpler solution would be to not allow migration of the temporary stack?
>
The patch's intention is to not migrate pages within the temporary
stack. What are you suggesting that is different?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-04-29 16:21 ` Andrea Arcangeli
2010-04-30 19:22 ` Andrea Arcangeli
2010-05-02 17:40 ` Minchan Kim
@ 2010-05-04 10:32 ` Mel Gorman
2010-05-04 12:56 ` Andrea Arcangeli
2 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-04 10:32 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Thu, Apr 29, 2010 at 06:21:20PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> did you see my proposed fix?
I did when I got back - sorry for the delay. The patchset I sent out was what
I had fully tested and was confident worked. I picked up the version of the
patch that was sent to Linus by Rik for merging.
> I'm running with it applied, I'd be
> interested if you can test it.
Unfortunately, the same bug triggers after about 18 minutes. The objective of
your fix is very simple - have a VMA covering the new range so that rmap can
find it. However, no lock is held during move_page_tables() because of the
need to call the page allocator. Due to the lack of locking, is it possible
that something like the following is happening?
Exec Process                              Migration Process
begin move_page_tables
                                          begin rmap walk
                                          take anon_vma locks
                                          find new location of pte (do nothing)
copy migration pte to new location
#### Bad PTE now in place
                                          find old location of pte
                                          remove old migration pte
                                          release anon_vma locks
remove temporary VMA
some time later, bug on migration pte
Even with the care taken, a migration PTE got copied and then left behind. What
I haven't confirmed at this point is if the ordering of the walk in "migration
process" is correct in the above scenario. The order is important for
the race as described to happen.
If the above is wrong, there is still a race somewhere else.
> Surely it will also work for new
> anon-vma code in upstream, because at that point there's just 1
> anon-vma and nothing else attached to the vma.
>
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
>
> I think it's wrong to try to handle the race in rmap walk by making
> magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
> vma->vm_mm->map_count == 1,
How bad is that magic check really? Is there a scenario when it's
the wrong thing to do?
I agree that migration skipping specific pages of the temporary stack is
unfortunate and that having exec-aware information in migration is an odd
dependency at best. On the other hand, it's not as bad as skipping other
regions, as exec will finish and allow the pages to be moved again. The
impact on compaction or transparent hugepage support would appear to be
minimal.
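(The check being discussed amounts to something like the helper below; this
is a sketch built purely from the description above, and the helper name is
illustrative rather than the exact code in the patch:)
/*
 * During exec the about-to-be-moved stack is a growable VMA and, at
 * that point, the only mapping in the new mm, so this identifies it.
 */
static bool vma_is_temporary_stack(struct vm_area_struct *vma)
{
        return (vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP)) &&
                vma->vm_mm->map_count == 1;
}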
> when we can fix it fully and simply in
> exec.c by indexing two vmas in the same anon-vma with a different
> vm_start so the pages will be found at all times by the rmap_walk.
>
If it can be simply fixed in exec, then I'll agree. Your patch looked simple
but unfortunately it doesn't fix the problem and it does introduce another
call to kmalloc() in the exec path. It's probably something that would only
be noticed by microbenchmarks though so I'm less concerned about that aspect.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-04 10:32 ` Mel Gorman
@ 2010-05-04 12:56 ` Andrea Arcangeli
2010-05-04 14:33 ` Mel Gorman
0 siblings, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-04 12:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Tue, May 04, 2010 at 11:32:13AM +0100, Mel Gorman wrote:
> Unfortunately, the same bug triggers after about 18 minutes. The objective of
> your fix is very simple - have a VMA covering the new range so that rmap can
> find it. However, no lock is held during move_page_tables() because of the
> need to call the page allocator. Due to the lack of locking, is it possible
> that something like the following is happening?
>
> Exec Process                              Migration Process
> begin move_page_tables
>                                           begin rmap walk
>                                           take anon_vma locks
>                                           find new location of pte (do nothing)
> copy migration pte to new location
> #### Bad PTE now in place
>                                           find old location of pte
>                                           remove old migration pte
>                                           release anon_vma locks
> remove temporary VMA
> some time later, bug on migration pte
>
> Even with the care taken, a migration PTE got copied and then left behind. What
> I haven't confirmed at this point is if the ordering of the walk in "migration
> process" is correct in the above scenario. The order is important for
> the race as described to happen.
Ok so this seems the ordering dependency on the anon_vma list that
strikes again, I didn't realize the ordering would matter here, but it
does as shown above, great catch! The destination vma of the
move_page_tables has to be at the tail of the anon_vma list like the
child vma has to be at the end to avoid the equivalent race in
fork. This has to be a requirement for mremap too. We just want to
enforce the same invariants that mremap already enforces, to avoid
adding new special cases to the VM.
== for new anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+ INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+ /*
* cover the whole range: [new_start, old_end)
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+ tmp_vma->vm_start = new_start;
+ /*
+ * The tmp_vma destination of the copy (with the new vm_start)
+ * has to be at the end of the anon_vma list for the rmap_walk
+ * to find the moved pages at all times.
+ */
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
return -ENOMEM;
+ }
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ vma->vm_start = new_start;
+ /* rmap walk will already find all pages using the new_start */
+ unlink_anon_vmas(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;
lru_add_drain();
@@ -551,7 +580,7 @@ static int shift_arg_pages(struct vm_are
/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
return 0;
}
== for old anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -502,7 +503,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -514,16 +517,41 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+
+ /*
* cover the whole range: [new_start, old_end)
*/
- vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+ tmp_vma->vm_start = new_start;
+
+ /*
+ * The tmp_vma destination of the copy (with the new vm_start)
+ * has to be at the end of the anon_vma list for the rmap_walk
+ * to find the moved pages at all times.
+ */
+ anon_vma_link(tmp_vma);
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
- if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
+ moved_length = move_page_tables(vma, old_start,
+ vma, new_start, length);
+
+ vma->vm_start = new_start;
+ /* rmap walk will already find all pages using the new_start */
+ anon_vma_unlink(tmp_vma);
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+
+ if (length != moved_length)
return -ENOMEM;
lru_add_drain();
@@ -549,7 +577,7 @@ static int shift_arg_pages(struct vm_are
/*
* shrink the vma to just the new range.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ vma->vm_end = new_end;
return 0;
}
I'll release new THP-23 and THP-23-anon_vma_chain (prerelease is
already in aa.git but it misses the above bit) soon enough...
* Re: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock
2010-05-03 17:58 ` Rik van Riel
2010-05-03 18:13 ` Andrea Arcangeli
2010-05-03 18:19 ` Linus Torvalds
@ 2010-05-04 13:12 ` Mel Gorman
2 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2010-05-04 13:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, akpm, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Andrea Arcangeli, Christoph Lameter
On Mon, May 03, 2010 at 01:58:36PM -0400, Rik van Riel wrote:
>>
>> Btw, Mel's patch doesn't really match the description of 2/2. 2/2 says
>> that all pages must always be findable in rmap. Mel's patch seems to
>> explicitly say "we want to ignore that thing that is busy for execve". Are
>> we just avoiding a BUG_ON()? Is perhaps the BUG_ON() buggy?
>
> I have no good answer to this question.
>
> Mel? Andrea?
>
The wording could have been better.
The problem is that once a migration PTE is established, it is expected that
rmap can find it. In the specific case of exec, this can fail because of
how the temporary stack is moved. As migration colliding with exec is rare,
the approach taken by the patch was to not create migration PTEs that rmap
could not find. On the plus side, exec (the common case) is unaffected. On
the negative side, it's avoiding the exec vs migration problem instead of
fixing it.
The BUG_ON is not a buggy check. While migration is taking place, the page lock
is held and not released until all the migration PTEs have been removed. If
a migration entry exists and the page is unlocked, it means that rmap failed
to find all the entries. If the BUG_ON was not made, do_swap_page() would
either end up looking up a semi-random entry in swap cache and inserting it
(memory corruption), inserting a random page from swap (memory corruption)
or returning VM_FAULT_OOM to the fault handler (general carnage).
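(For context, the top of the fault path dispatches roughly as in the excerpt
below; this is a simplified, from-memory sketch of mm/memory.c:do_swap_page()
of this era, not verbatim code:)
        entry = pte_to_swp_entry(orig_pte);
        if (unlikely(non_swap_entry(entry))) {
                if (is_migration_entry(entry)) {
                        /* waits on the page lock; BUGs if the page is unlocked */
                        migration_entry_wait(mm, pmd, address);
                } else if (is_hwpoison_entry(entry)) {
                        ret = VM_FAULT_HWPOISON;
                } else {
                        print_bad_pte(vma, address, orig_pte, NULL);
                        ret = VM_FAULT_SIGBUS;
                }
                goto out;
        }
        /* otherwise the entry is treated as real swap: lookup_swap_cache() etc. */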
It was considered to lazily clean up the migration PTEs
(http://lkml.org/lkml/2010/4/27/458) but there is no guarantee that the page
the migration PTE pointed to is still the correct one. If it had been freed
and re-used, the results would probably be memory corruption.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-04 12:56 ` Andrea Arcangeli
@ 2010-05-04 14:33 ` Mel Gorman
2010-05-04 14:44 ` Andrea Arcangeli
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-04 14:33 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Tue, May 04, 2010 at 02:56:06PM +0200, Andrea Arcangeli wrote:
> On Tue, May 04, 2010 at 11:32:13AM +0100, Mel Gorman wrote:
> > Unfortunately, the same bug triggers after about 18 minutes. The objective of
> > your fix is very simple - have a VMA covering the new range so that rmap can
> > find it. However, no lock is held during move_page_tables() because of the
> > need to call the page allocator. Due to the lack of locking, is it possible
> > that something like the following is happening?
> >
> > Exec Process                              Migration Process
> > begin move_page_tables
> >                                           begin rmap walk
> >                                           take anon_vma locks
> >                                           find new location of pte (do nothing)
> > copy migration pte to new location
> > #### Bad PTE now in place
> >                                           find old location of pte
> >                                           remove old migration pte
> >                                           release anon_vma locks
> > remove temporary VMA
> > some time later, bug on migration pte
> >
> > Even with the care taken, a migration PTE got copied and then left behind. What
> > I haven't confirmed at this point is if the ordering of the walk in "migration
> > process" is correct in the above scenario. The order is important for
> > the race as described to happen.
>
> Ok so this seems the ordering dependency on the anon_vma list that
> strikes again, I didn't realize the ordering would matter here, but it
> does as shown above, great catch! The destination vma of the
> move_page_tables has to be at the tail of the anon_vma list like the
> child vma has to be at the end to avoid the equivalent race in
> fork. This has to be a requirement for mremap too. We just want to
> enforce the same invariants that mremap already enforces, to avoid
> adding new special cases to the VM.
>
Agreed. To be honest, I found the ordering requirements of the anon_vma list
a little confusing, but as long as they are consistent everywhere, it's
manageable. If this ever burns us in the future though, we might want a
DEBUG_VM option that somehow verifies the ordering of the anon_vma list.
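(Purely as a hypothetical sketch of such a check; the helper name, the
invariant and where it would be called from are made up for illustration and
are not existing code:)
#ifdef CONFIG_DEBUG_VM
/* Hypothetical: after linking a new vma (fork, mremap, exec's temporary
 * stack), check it sits at the tail of every same_anon_vma list, i.e.
 * it is the last VMA an rmap_walk will visit. */
static void verify_anon_vma_ordering(struct vm_area_struct *vma)
{
        struct anon_vma_chain *avc;

        list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
                VM_BUG_ON(!list_is_last(&avc->same_anon_vma,
                                        &avc->anon_vma->head));
}
#endif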
> == for new anon-vma code ==
> Subject: fix race between shift_arg_pages and rmap_walk
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
>
> And split_huge_page() will have the same requirements as migrate.c
> already has.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
I'm currently testing this and have seen no problems after an hour which
is typically good. To be absolutely sure, it needs 24 hours but so far so
good. The changelog is a tad on the light side so maybe you'd like to take
this one instead and edit it to your liking?
==== CUT HERE ===
mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example;
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---
This particular BUG_ON is caused by a race between shift_arg_pages and
migration. During exec, a temporary stack is created and later moved to its
final location. If migration selects a page within the temporary stack,
the page tables and migration PTE can be copied to the new location
before rmap_walk is able to find the copy. This leaves a dangling
migration PTE behind that later triggers the bug.
This patch fixes the problem by using two VMAs - one which covers the temporary
stack and the other which covers the new location. This guarantees that rmap
can always find the migration PTE even if it is copied while rmap_walk is
taking place.
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-04 14:33 ` Mel Gorman
@ 2010-05-04 14:44 ` Andrea Arcangeli
0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-04 14:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Christoph Lameter, Rik van Riel
On Tue, May 04, 2010 at 03:33:11PM +0100, Mel Gorman wrote:
> I'm currently testing this and have seen no problems after an hour which
> is typically good. To be absolutely sure, it needs 24 hours but so far so
> good. The changelog is a tad on the light side so maybe you'd like to take
> this one instead and edit it to your liking?
I'll take your changelog for aa.git, thanks! And the non-trivial stuff
was documented in the code too.
So now in aa.git I have two branches: master -> old-anon_vma,
anon_vma_chain -> new-anon_vma.
anon_vma_chain starts with Rik's patch 1/2 and then this
patch. old-anon_vma starts with backout-anon-vma and then this patch 2
backported to the old anon_vma code. After the removal of all
vma->anon_vma->lock usages from the THP code, and switching to a slower
get_page() + spin_unlock(page_table_lock) + page_lock_anon_vma(page)
model, the anon_vma_chain branch has a chance to be as solid as the
master branch. The anon_vma_chain branch can be pulled from mainline
branches too. The master branch is also no longer using any
vma->anon_vma->lock, even though it still could and it would be a bit
faster, in order to give more testing to the anon_vma_chain code.
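
For clarity, that slower ordering is roughly the following (illustrative
fragment only, not the actual THP code; the wrapper name is made up, the
helpers are the usual mm ones):

static void thp_lock_anon_vma_slow(struct mm_struct *mm, struct page *page)
{
        struct anon_vma *anon_vma;

        get_page(page);                         /* keep the page alive */
        spin_unlock(&mm->page_table_lock);      /* anon_vma lock nests outside it */
        anon_vma = page_lock_anon_vma(page);    /* excludes rmap walkers */
        if (anon_vma) {
                /* ... work that must not race with rmap_walk() ... */
                page_unlock_anon_vma(anon_vma);
        }
        put_page(page);
}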
You can see the difference with "git diff master anon_vma_chain".
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/anon_vma_chain
These should be the THP-23 and THP-23-anon_vma_chain tags; I'll do a proper
release soon.
* Re: [PATCH 0/2] Fix migration races in rmap_walk() V3
2010-05-03 23:41 ` Johannes Weiner
@ 2010-05-04 17:35 ` Andrea Arcangeli
0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-04 17:35 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Christoph Lameter, Rik van Riel
On Tue, May 04, 2010 at 01:41:32AM +0200, Johannes Weiner wrote:
> Although not strictly required, it's probably nicer to keep the
> function signatures in this code alike. So everything is fine with
> me as it stands :)
Too late, I already optimized mincore...
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-04 9:45 ` Mel Gorman
@ 2010-05-10 17:41 ` Christoph Lameter
2010-05-10 17:56 ` Mel Gorman
2010-05-10 19:05 ` Andrea Arcangeli
0 siblings, 2 replies; 44+ messages in thread
From: Christoph Lameter @ 2010-05-10 17:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Tue, 4 May 2010, Mel Gorman wrote:
> On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> > On Thu, 29 Apr 2010, Mel Gorman wrote:
> >
> > > There is a race between shift_arg_pages and migration that triggers this bug.
> > > A temporary stack is set up during exec and later moved. If migration moves
> > > a page in the temporary stack and the VMA is then removed before migration
> > > completes, the migration PTE may not be found leading to a BUG when the
> > > stack is faulted.
> >
> > A simpler solution would be to not allow migration of the temporary stack?
> >
>
> The patch's intention is to not migrate pages within the temporary
> stack. What are you suggesting that is different?
A simple way to disallow migration of pages is to increment the refcount
of a page.
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-10 17:41 ` Christoph Lameter
@ 2010-05-10 17:56 ` Mel Gorman
2010-05-11 13:59 ` Christoph Lameter
2010-05-10 19:05 ` Andrea Arcangeli
1 sibling, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-10 17:56 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Mon, May 10, 2010 at 12:41:07PM -0500, Christoph Lameter wrote:
> On Tue, 4 May 2010, Mel Gorman wrote:
>
> > On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> > > On Thu, 29 Apr 2010, Mel Gorman wrote:
> > >
> > > > There is a race between shift_arg_pages and migration that triggers this bug.
> > > > A temporary stack is set up during exec and later moved. If migration moves
> > > > a page in the temporary stack and the VMA is then removed before migration
> > > > completes, the migration PTE may not be found leading to a BUG when the
> > > > stack is faulted.
> > >
> > > A simpler solution would be to not allow migration of the temporary stack?
> > >
> >
> > The patch's intention is to not migrate pages within the temporary
> > stack. What are you suggesting that is different?
>
> A simple way to disallow migration of pages is to increment the refcount
> of a page.
>
I guess it could be done by walking the page tables in advance of the move
and elevating the page count of any pages faulted, and then finding those
pages afterwards. The fail path would be a bit of a pain if the page
tables are only partially moved, though. It's unnecessarily complicated when
the temporary stack can be easily avoided.
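
Roughly, the pinning side would look something like this (untested sketch;
the helper name is made up and the get_user_pages() arguments are from
memory, with the messy partial-move unwind deliberately omitted):

/*
 * Illustrative only.  Every pinned page must later be released with
 * put_page() after move_page_tables() - and on any failure part-way
 * through, which is exactly the part that gets ugly.
 */
static int pin_temporary_stack(struct mm_struct *mm, unsigned long start,
                               unsigned long end, struct page **pages)
{
        int nr_pages = (end - start) >> PAGE_SHIFT;

        /* Elevated refcounts stop migration from isolating these pages. */
        return get_user_pages(current, mm, start, nr_pages, 0, 0, pages, NULL);
}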
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-10 17:41 ` Christoph Lameter
2010-05-10 17:56 ` Mel Gorman
@ 2010-05-10 19:05 ` Andrea Arcangeli
2010-05-11 0:10 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-10 19:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Rik van Riel
On Mon, May 10, 2010 at 12:41:07PM -0500, Christoph Lameter wrote:
> A simple way to disallow migration of pages is to increment the refcount
> of a page.
Ok for migrate, but it won't prevent a crash in the split_huge_page rmap
walk, nor the PG_lock. Why have a migrate-specific fix for an rmap bug?
The fix that makes execve the only special place to handle in every
rmap walk is at least more maintainable than a fix that makes one of
the rmap walk users special and won't fix the others, as there will be
more than just one user that requires this. My fix didn't make execve
special and didn't require execve knowledge in every rmap walk user
like migrate (split_huge_page etc...), but as long as the kernel
doesn't crash I'm fine ;).
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-10 19:05 ` Andrea Arcangeli
@ 2010-05-11 0:10 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 44+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-11 0:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Linux-MM, LKML,
Minchan Kim, Rik van Riel
On Mon, 10 May 2010 21:05:59 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Mon, May 10, 2010 at 12:41:07PM -0500, Christoph Lameter wrote:
> > A simple way to disallow migration of pages is to increment the refcount
> > of a page.
>
> Ok for migrate, but it won't prevent a crash in the split_huge_page rmap
> walk, nor the PG_lock. Why have a migrate-specific fix for an rmap bug?
> The fix that makes execve the only special place to handle in every
> rmap walk is at least more maintainable than a fix that makes one of
> the rmap walk users special and won't fix the others, as there will be
> more than just one user that requires this. My fix didn't make execve
> special and didn't require execve knowledge in every rmap walk user
> like migrate (split_huge_page etc...), but as long as the kernel
> doesn't crash I'm fine ;).
>
First of all, I like the step-by-step approach even if it doubles our cost,
because it's easy to understand and makes chasing the changelog easy.
Ok, your split_huge_page() has some problems with the current rmap+migration.
But I don't like a patch whose changelog is about a bug that cannot happen yet.
I believe it can be fixed by the same approach as for exec. Renaming
#define VM_STACK_INCOMPLETE_SETUP
to
#define VM_TEMPORARY_INCONSISTENT_RMAP
in _your_ patch series and adding some check in rmap_walk() seems enough.
Of course, I may be misunderstanding your problem. Could you show the patch
which hits the problem with rmap+migration?
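
For illustration, something like this (the helper name is made up, and
whether the check lives in rmap_walk() itself or only in the migration
unmap path is a detail):

static inline int vma_rmap_is_temporary(struct vm_area_struct *vma)
{
        /* VM_STACK_INCOMPLETE_SETUP, or whatever it gets renamed to */
        return !!(vma->vm_flags & VM_STACK_INCOMPLETE_SETUP);
}

/* then, while rmap_walk_anon() iterates the anon_vma's VMA list: */
        if (vma_rmap_is_temporary(vma))
                continue;       /* this VMA's rmap is not stable yet, skip it */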
Thanks,
-Kame
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-10 17:56 ` Mel Gorman
@ 2010-05-11 13:59 ` Christoph Lameter
2010-05-11 15:11 ` Mel Gorman
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Lameter @ 2010-05-11 13:59 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Mon, 10 May 2010, Mel Gorman wrote:
> > A simple way to disallow migration of pages is to increment the refcount
> > of a page.
> I guess it could be done by walking the page tables in advance of the move
> and elevating the page count of any pages faulted, and then finding those
> pages afterwards. The fail path would be a bit of a pain if the page
> tables are only partially moved, though. It's unnecessarily complicated when
> the temporary stack can be easily avoided.
Faulting during exec? Don't we hold mmap_sem for write? A get_user_pages()
or so on the range will increment the refcount.
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-11 13:59 ` Christoph Lameter
@ 2010-05-11 15:11 ` Mel Gorman
2010-05-11 15:56 ` Christoph Lameter
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-11 15:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Tue, May 11, 2010 at 08:59:12AM -0500, Christoph Lameter wrote:
> On Mon, 10 May 2010, Mel Gorman wrote:
>
> > > A simple way to disallow migration of pages is to increment the refcount
> > > of a page.
> > I guess it could be done by walking the page tables in advance of the move
> > and elevating the page count of any pages faulted, and then finding those
> > pages afterwards. The fail path would be a bit of a pain if the page
> > tables are only partially moved, though. It's unnecessarily complicated when
> > the temporary stack can be easily avoided.
>
> Faulting during exec?
Copying in arguments and the like
> Don't we hold mmap_sem for write? A get_user_pages()
> or so on the range will increment the refcount.
>
Or just identify the temporary stack from the migration side instead of
adding to the cost of exec?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-11 15:11 ` Mel Gorman
@ 2010-05-11 15:56 ` Christoph Lameter
2010-05-11 16:15 ` Mel Gorman
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Lameter @ 2010-05-11 15:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Tue, 11 May 2010, Mel Gorman wrote:
> Or just identify the temporary stack from the migration side instead of
> adding to the cost of exec?
Adding one-off checks to a generic mechanism isn't really clean
programming. Using the provided means of disabling a generic mechanism is.
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-11 15:56 ` Christoph Lameter
@ 2010-05-11 16:15 ` Mel Gorman
2010-05-11 16:29 ` Andrea Arcangeli
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2010-05-11 16:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
Andrea Arcangeli, Rik van Riel
On Tue, May 11, 2010 at 10:56:00AM -0500, Christoph Lameter wrote:
> On Tue, 11 May 2010, Mel Gorman wrote:
>
> > Or just identify the temporary stack from the migration side instead of
> > adding to the cost of exec?
>
> Adding one-off checks to a generic mechanism isn't really clean
> programming. Using the provided means of disabling a generic mechanism is.
>
Andrea's solution is likely lighter than yours as it is one kmalloc and
an insertion into the VM as opposed to a page table walk with reference
counting. Better yet, it exists as a patch that has been tested and it
fits in with the generic mechanism by guaranteeing that rmap_walk finds
all the migration PTEs during the second walk.
The problem remains the same - that class of solution increases the cost of
a common operation (exec) to keep a much less common operation (migration) happy.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
2010-05-11 16:15 ` Mel Gorman
@ 2010-05-11 16:29 ` Andrea Arcangeli
0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2010-05-11 16:29 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Lameter, Andrew Morton, Linux-MM, LKML, Minchan Kim,
KAMEZAWA Hiroyuki, Rik van Riel
On Tue, May 11, 2010 at 05:15:16PM +0100, Mel Gorman wrote:
> On Tue, May 11, 2010 at 10:56:00AM -0500, Christoph Lameter wrote:
> > On Tue, 11 May 2010, Mel Gorman wrote:
> >
> > > Or just identify the temporary stack from the migration side instead of
> > > adding to the cost of exec?
> >
> > Adding one-off checks to a generic mechanism isn't really clean
> > programming. Using the provided means of disabling a generic mechanism is.
> >
>
> Andrea's solution is likely lighter than yours as it is one kmalloc and
> an insertion into the VM as opposed to a page table walk with reference
> counting. Better yet, it exists as a patch that has been tested and it
> fits in with the generic mechanism by guaranteeing that rmap_walk finds
> all the migration PTEs during the second walk.
>
> The problem remains the same - that class of solution increases the cost of
> a common operation (exec) to keep a much less common operation (migration) happy.
A page table walk that adds reference counting is still a one-off check:
the generic rmap_walk mechanism won't care about the reference counting,
and only migrate checks the page count... so it doesn't move the needle
in clean programming terms.
End of thread (newest: 2010-05-11 16:29 UTC).
Thread overview: 44+ messages
2010-04-29 8:32 [PATCH 0/2] Fix migration races in rmap_walk() V3 Mel Gorman
2010-04-29 8:32 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Mel Gorman
2010-05-02 17:28 ` Minchan Kim
2010-04-29 8:32 ` [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Mel Gorman
2010-04-29 16:21 ` Andrea Arcangeli
2010-04-30 19:22 ` Andrea Arcangeli
2010-04-30 20:21 ` Rik van Riel
2010-05-01 9:39 ` Andrea Arcangeli
2010-05-01 13:02 ` Rik van Riel
2010-05-02 17:40 ` Minchan Kim
2010-05-02 18:20 ` Andrea Arcangeli
2010-05-04 10:32 ` Mel Gorman
2010-05-04 12:56 ` Andrea Arcangeli
2010-05-04 14:33 ` Mel Gorman
2010-05-04 14:44 ` Andrea Arcangeli
2010-05-02 1:56 ` Christoph Lameter
2010-05-04 9:45 ` Mel Gorman
2010-05-10 17:41 ` Christoph Lameter
2010-05-10 17:56 ` Mel Gorman
2010-05-11 13:59 ` Christoph Lameter
2010-05-11 15:11 ` Mel Gorman
2010-05-11 15:56 ` Christoph Lameter
2010-05-11 16:15 ` Mel Gorman
2010-05-11 16:29 ` Andrea Arcangeli
2010-05-10 19:05 ` Andrea Arcangeli
2010-05-11 0:10 ` KAMEZAWA Hiroyuki
2010-04-30 18:28 ` [PATCH 0/2] Fix migration races in rmap_walk() V3 Andrea Arcangeli
2010-05-01 13:51 ` Johannes Weiner
2010-05-03 15:33 ` Andrea Arcangeli
2010-05-03 23:41 ` Johannes Weiner
2010-05-04 17:35 ` Andrea Arcangeli
-- strict thread matches above, loose matches on Subject: below --
2010-05-03 16:17 [PATCH 0/2] Fix migration races in rmap_walk() V4 Rik van Riel
2010-05-03 16:18 ` [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock Rik van Riel
2010-05-03 16:41 ` Linus Torvalds
2010-05-03 16:53 ` Rik van Riel
2010-05-03 17:17 ` Linus Torvalds
2010-05-03 17:58 ` Rik van Riel
2010-05-03 18:13 ` Andrea Arcangeli
2010-05-03 18:19 ` Linus Torvalds
2010-05-03 18:38 ` Rik van Riel
2010-05-04 13:12 ` Mel Gorman
2010-05-03 16:55 ` Peter Zijlstra
2010-05-03 17:02 ` Andrea Arcangeli
2010-05-03 17:11 ` Peter Zijlstra
2010-05-03 17:18 ` Andrea Arcangeli