* [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
@ 2010-01-18 3:21 Rik van Riel
From: Rik van Riel @ 2010-01-18 3:21 UTC
To: linux-mm
Cc: linux-kernel, Lee Schermerhorn, KOSAKI Motohiro, minchan.kim,
lwoodman, aarcange
Thanks to some hints from peterz, and a day of debugging this weekend,
this version boots all the way to udev, then appears to hang. I probably
made a silly mistake somewhere. I am mailing this patch out in the hopes
that my last silly mistakes will be seen by someone :)
----
The old anon_vma code can lead to scalability issues with heavily
forking workloads. Specifically, each anon_vma will be shared
between the parent process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just
one of the 1000 processes. However, the current rmap code needs to
walk them all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables
of 1000 processes in page_referenced_one, while all other CPUs are
stuck on the anon_vma lock. This leads to catastrophic failure for
a benchmark like AIM7, where the total number of processes can reach
into the tens of thousands. Real workloads are still a factor of 10
less process-intensive than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which
allows us to associate multiple anon_vmas with a VMA. At fork
time, each child process gets its own anon_vmas, in which its
COWed pages will be instantiated. The parent's anon_vma is also
linked to the VMA, because non-COWed pages could be present in
any of the children.
This reduces rmap scanning complexity to O(1) for the pages of
the 1000 child processes, with O(N) complexity for at most 1/N
pages in the system. This reduces the average scanning cost in
heavily forking workloads from O(N) to 2.
The only real complexity in this patch stems from the fact that
linking a VMA to anon_vmas now involves memory allocations. This
means vma_adjust can fail, if it needs to attach a VMA to anon_vma
structures. This in turn means error handling needs to be added
to the calling functions.
A second source of complexity is that, because there can be
multiple anon_vmas, the anon_vma linking in vma_adjust can
no longer be done under "the" anon_vma lock. To prevent the
rmap code from walking up an incomplete VMA, this patch
introduces the VM_LOCK_RMAP VMA flag. This bit flag uses
the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef
in mm.h to make sure it is impossible to compile a kernel
that needs both symbolic values for the same bitflag.
Signed-off-by: Rik van Riel <riel@redhat.com>
diff --git a/arch/ia64/ia32/binfmt_elf32.c b/arch/ia64/ia32/binfmt_elf32.c
index c69552b..f69e7ba 100644
--- a/arch/ia64/ia32/binfmt_elf32.c
+++ b/arch/ia64/ia32/binfmt_elf32.c
@@ -89,6 +89,7 @@ ia64_elf32_init (struct pt_regs *regs)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (vma) {
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = current->mm;
vma->vm_start = IA32_GDT_OFFSET;
vma->vm_end = vma->vm_start + PAGE_SIZE;
@@ -114,6 +115,7 @@ ia64_elf32_init (struct pt_regs *regs)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (vma) {
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = current->mm;
vma->vm_start = IA32_GATE_OFFSET;
vma->vm_end = vma->vm_start + PAGE_SIZE;
@@ -138,6 +140,7 @@ ia64_elf32_init (struct pt_regs *regs)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (vma) {
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = current->mm;
vma->vm_start = IA32_LDT_OFFSET;
vma->vm_end = vma->vm_start + PAGE_ALIGN(IA32_LDT_ENTRIES*IA32_LDT_ENTRY_SIZE);
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index a97f838..95069e9 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2314,6 +2314,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
DPRINT(("Cannot allocate vma\n"));
goto error_kmem;
}
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
/*
* partially initialize the vma for the sampling buffer
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 7c0d481..4157ea5 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -118,6 +118,7 @@ ia64_init_addr_space (void)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (vma) {
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = current->mm;
vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
vma->vm_end = vma->vm_start + PAGE_SIZE;
@@ -136,6 +137,7 @@ ia64_init_addr_space (void)
if (!(current->personality & MMAP_PAGE_ZERO)) {
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (vma) {
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = current->mm;
vma->vm_end = PAGE_SIZE;
vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
diff --git a/fs/exec.c b/fs/exec.c
index 271c557..21a5f94 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -246,6 +246,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_start = vma->vm_end - PAGE_SIZE;
vma->vm_flags = VM_STACK_FLAGS;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
err = insert_vm_struct(mm, vma);
if (err)
goto err;
@@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
/*
* cover the whole range: [new_start, old_end)
*/
- vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+ if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+ return -ENOMEM;
/*
* move the page tables downwards, on failure we rely on
diff --git a/include/linux/mm.h b/include/linux/mm.h
index be7f851..93bbb70 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -96,7 +96,11 @@ extern unsigned int kobjsize(const void *objp);
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
+#ifdef CONFIG_MMU
+#define VM_LOCK_RMAP 0x01000000 /* Do not follow this rmap (mmu mmap) */
+#else
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
+#endif
#define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */
#define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
@@ -1108,7 +1112,7 @@ static inline void vma_nonlinear_insert(struct vm_area_struct *vma,
/* mmap.c */
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
-extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 84a524a..44cfb13 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -167,7 +167,7 @@ struct vm_area_struct {
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
- struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
+ struct list_head anon_vma_chain; /* Serialized by anon_vma->lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b019ae6..0d1903a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -37,7 +37,27 @@ struct anon_vma {
* is serialized by a system wide lock only visible to
* mm_take_all_locks() (mm_all_locks_mutex).
*/
- struct list_head head; /* List of private "related" vmas */
+ struct list_head head; /* Chain of private "related" vmas */
+};
+
+/*
+ * The copy-on-write semantics of fork mean that an anon_vma
+ * can become associated with multiple processes. Furthermore,
+ * each child process will have its own anon_vma, where new
+ * pages for that process are instantiated.
+ *
+ * This structure allows us to find the anon_vmas associated
+ * with a VMA, or the VMAs associated with an anon_vma.
+ * The "same_vma" list contains the anon_vma_chains linking
+ * all the anon_vmas associated with this VMA.
+ * The "same_anon_vma" list contains the anon_vma_chains
+ * which link all the VMAs associated with this anon_vma.
+ */
+struct anon_vma_chain {
+ struct vm_area_struct *vma;
+ struct anon_vma *anon_vma;
+ struct list_head same_vma; /* locked by mmap_sem & friends */
+ struct list_head same_anon_vma; /* locked by anon_vma->lock */
};
#ifdef CONFIG_MMU
@@ -89,12 +109,19 @@ static inline void anon_vma_unlock(struct vm_area_struct *vma)
*/
void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
-void __anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
-void anon_vma_unlink(struct vm_area_struct *);
-void anon_vma_link(struct vm_area_struct *);
+void unlink_anon_vmas(struct vm_area_struct *);
+int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
void anon_vma_free(struct anon_vma *);
+static inline void anon_vma_merge(struct vm_area_struct *vma,
+ struct vm_area_struct *next)
+{
+ BUG_ON(vma->anon_vma != next->anon_vma);
+ unlink_anon_vmas(next);
+}
+
/*
* rmap interfaces called when adding or removing pte of page
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index 404e6ca..e58f905 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -328,15 +328,17 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
if (!tmp)
goto fail_nomem;
*tmp = *mpnt;
+ INIT_LIST_HEAD(&tmp->anon_vma_chain);
pol = mpol_dup(vma_policy(mpnt));
retval = PTR_ERR(pol);
if (IS_ERR(pol))
goto fail_nomem_policy;
vma_set_policy(tmp, pol);
+ if (anon_vma_fork(tmp, mpnt))
+ goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~VM_LOCKED;
tmp->vm_mm = mm;
tmp->vm_next = NULL;
- anon_vma_link(tmp);
file = tmp->vm_file;
if (file) {
struct inode *inode = file->f_path.dentry->d_inode;
@@ -391,6 +393,7 @@ out:
flush_tlb_mm(oldmm);
up_write(&oldmm->mmap_sem);
return retval;
+fail_nomem_anon_vma_fork:
fail_nomem_policy:
kmem_cache_free(vm_area_cachep, tmp);
fail_nomem:
diff --git a/mm/ksm.c b/mm/ksm.c
index 56a0da1..a93f1b7 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1563,10 +1563,12 @@ int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
spin_lock(&anon_vma->lock);
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+ vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
@@ -1614,10 +1616,12 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
spin_lock(&anon_vma->lock);
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+ vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
@@ -1664,10 +1668,12 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
spin_lock(&anon_vma->lock);
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+ vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2c08089..b6d77ec 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -363,6 +363,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
struct to_kill **tkc)
{
struct vm_area_struct *vma;
+ struct anon_vma_chain *vmac;
struct task_struct *tsk;
struct anon_vma *av;
@@ -373,7 +374,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
for_each_process (tsk) {
if (!task_early_kill(tsk))
continue;
- list_for_each_entry (vma, &av->head, anon_vma_node) {
+ list_for_each_entry (vmac, &av->head, same_anon_vma) {
+ vma = vmac->vma;
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == tsk->mm)
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..6a0f36e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -300,7 +300,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
- anon_vma_unlink(vma);
+ unlink_anon_vmas(vma);
unlink_file_vma(vma);
if (is_vm_hugetlb_page(vma)) {
@@ -314,7 +314,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
- anon_vma_unlink(vma);
+ unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
diff --git a/mm/mmap.c b/mm/mmap.c
index 64e0b04..d8efac7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -437,7 +437,6 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
{
__vma_link_list(mm, vma, prev, rb_parent);
__vma_link_rb(mm, vma, rb_link, rb_parent);
- __anon_vma_link(vma);
}
static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -499,7 +498,7 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
* are necessary. The "insert" vma (if any) is to be inserted
* before we drop the necessary locks.
*/
-void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
{
struct mm_struct *mm = vma->vm_mm;
@@ -542,6 +541,29 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ /*
+ * When changing only vma->vm_end, we don't really need
+ * anon_vma lock.
+ */
+ if (vma->anon_vma && (insert || importer || start != vma->vm_start))
+ anon_vma = vma->anon_vma;
+ if (anon_vma) {
+ /*
+ * Easily overlooked: when mprotect shifts the boundary,
+ * make sure the expanding vma has anon_vma set if the
+ * shrinking vma had, to cover any anon pages imported.
+ */
+ if (importer && !importer->anon_vma) {
+ /* Block reverse map lookups until things are set up. */
+ importer->vm_flags |= VM_LOCK_RMAP;
+ if (anon_vma_clone(importer, vma)) {
+ importer->vm_flags &= ~VM_LOCK_RMAP;
+ return -ENOMEM;
+ }
+ importer->anon_vma = anon_vma;
+ }
+ }
+
if (file) {
mapping = file->f_mapping;
if (!(vma->vm_flags & VM_NONLINEAR))
@@ -567,25 +589,6 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
- /*
- * When changing only vma->vm_end, we don't really need
- * anon_vma lock.
- */
- if (vma->anon_vma && (insert || importer || start != vma->vm_start))
- anon_vma = vma->anon_vma;
- if (anon_vma) {
- spin_lock(&anon_vma->lock);
- /*
- * Easily overlooked: when mprotect shifts the boundary,
- * make sure the expanding vma has anon_vma set if the
- * shrinking vma had, to cover any anon pages imported.
- */
- if (importer && !importer->anon_vma) {
- importer->anon_vma = anon_vma;
- __anon_vma_link(importer);
- }
- }
-
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -616,8 +619,11 @@ again: remove_next = 1 + (end > next->vm_end);
__vma_unlink(mm, next, vma);
if (file)
__remove_shared_vm_struct(next, file, mapping);
- if (next->anon_vma)
- __anon_vma_merge(vma, next);
+ /*
+ * This VMA is now dead, no need for rmap to follow it.
+ * Call anon_vma_merge below, outside of i_mmap_lock.
+ */
+ next->vm_flags |= VM_LOCK_RMAP;
} else if (insert) {
/*
* split_vma has split insert from vma, and needs
@@ -627,17 +633,25 @@ again: remove_next = 1 + (end > next->vm_end);
__insert_vm_struct(mm, insert);
}
- if (anon_vma)
- spin_unlock(&anon_vma->lock);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ /*
+ * The current VMA has been set up. It is now safe for the
+ * rmap code to get from the pages to the ptes.
+ */
+ if (anon_vma && importer)
+ importer->vm_flags &= ~VM_LOCK_RMAP;
+
if (remove_next) {
if (file) {
fput(file);
if (next->vm_flags & VM_EXECUTABLE)
removed_exe_file_vma(mm);
}
+ /* Protected by mmap_sem and VM_LOCK_RMAP. */
+ if (next->anon_vma)
+ anon_vma_merge(vma, next);
mm->map_count--;
mpol_put(vma_policy(next));
kmem_cache_free(vm_area_cachep, next);
@@ -653,6 +667,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
validate_mm(mm);
+
+ return 0;
}
/*
@@ -759,6 +775,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
+ int err;
/*
* We later require that vma->vm_flags == vm_flags,
@@ -792,11 +809,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma)) {
/* cases 1, 6 */
- vma_adjust(prev, prev->vm_start,
+ err = vma_adjust(prev, prev->vm_start,
next->vm_end, prev->vm_pgoff, NULL);
} else /* cases 2, 5, 7 */
- vma_adjust(prev, prev->vm_start,
+ err = vma_adjust(prev, prev->vm_start,
end, prev->vm_pgoff, NULL);
+ if (err)
+ return NULL;
return prev;
}
@@ -808,11 +827,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
can_vma_merge_before(next, vm_flags,
anon_vma, file, pgoff+pglen)) {
if (prev && addr < prev->vm_end) /* case 4 */
- vma_adjust(prev, prev->vm_start,
+ err = vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL);
else /* cases 3, 8 */
- vma_adjust(area, addr, next->vm_end,
+ err = vma_adjust(area, addr, next->vm_end,
next->vm_pgoff - pglen, NULL);
+ if (err)
+ return NULL;
return area;
}
@@ -1187,6 +1208,7 @@ munmap_back:
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
if (file) {
error = -EINVAL;
@@ -1847,6 +1869,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
{
struct mempolicy *pol;
struct vm_area_struct *new;
+ int err = -ENOMEM;
if (is_vm_hugetlb_page(vma) && (addr &
~(huge_page_mask(hstate_vma(vma)))))
@@ -1854,11 +1877,15 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!new)
- return -ENOMEM;
+ goto out_err;
/* most fields are the same, copy all, and then fixup */
*new = *vma;
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ if (anon_vma_clone(new, vma))
+ goto out_err;
+
if (new_below)
new->vm_end = addr;
else {
@@ -1868,8 +1895,8 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
pol = mpol_dup(vma_policy(vma));
if (IS_ERR(pol)) {
- kmem_cache_free(vm_area_cachep, new);
- return PTR_ERR(pol);
+ err = PTR_ERR(pol);
+ goto out_free_vma;
}
vma_set_policy(new, pol);
@@ -1883,12 +1910,27 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
new->vm_ops->open(new);
if (new_below)
- vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
+ err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
((addr - new->vm_start) >> PAGE_SHIFT), new);
else
- vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
+ err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
- return 0;
+ /* Success. */
+ if (!err)
+ return 0;
+
+ /* Clean everything up if vma_adjust failed. */
+ if (new->vm_ops && new->vm_ops->close) new->vm_ops->close(new);
+ if (new->vm_file) {
+ if (vma->vm_flags & VM_EXECUTABLE)
+ removed_exe_file_vma(mm);
+ fput(new->vm_file);
+ }
+ mpol_put(pol);
+ out_free_vma:
+ kmem_cache_free(vm_area_cachep, new);
+ out_err:
+ return err;
}
/*
@@ -2105,6 +2147,7 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
return -ENOMEM;
}
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
@@ -2241,10 +2284,11 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
if (new_vma) {
*new_vma = *vma;
pol = mpol_dup(vma_policy(vma));
- if (IS_ERR(pol)) {
- kmem_cache_free(vm_area_cachep, new_vma);
- return NULL;
- }
+ if (IS_ERR(pol))
+ goto out_free_vma;
+ INIT_LIST_HEAD(&new_vma->anon_vma_chain);
+ if (anon_vma_clone(new_vma, vma))
+ goto out_free_mempol;
vma_set_policy(new_vma, pol);
new_vma->vm_start = addr;
new_vma->vm_end = addr + len;
@@ -2260,6 +2304,12 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
}
}
return new_vma;
+
+ out_free_mempol:
+ mpol_put(pol);
+ out_free_vma:
+ kmem_cache_free(vm_area_cachep, new_vma);
+ return NULL;
}
/*
@@ -2337,6 +2387,7 @@ int install_special_mapping(struct mm_struct *mm,
if (unlikely(vma == NULL))
return -ENOMEM;
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
@@ -2437,6 +2488,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
int mm_take_all_locks(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ struct anon_vma_chain *avc;
int ret = -EINTR;
BUG_ON(down_read_trylock(&mm->mmap_sem));
@@ -2454,7 +2506,8 @@ int mm_take_all_locks(struct mm_struct *mm)
if (signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
- vm_lock_anon_vma(mm, vma->anon_vma);
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ vm_lock_anon_vma(mm, avc->anon_vma);
}
ret = 0;
@@ -2509,13 +2562,15 @@ static void vm_unlock_mapping(struct address_space *mapping)
void mm_drop_all_locks(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ struct anon_vma_chain *avc;
BUG_ON(down_read_trylock(&mm->mmap_sem));
BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma)
- vm_unlock_anon_vma(vma->anon_vma);
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ vm_unlock_anon_vma(avc->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
diff --git a/mm/mremap.c b/mm/mremap.c
index 8f88547..8dbc8a5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -389,8 +389,11 @@ unsigned long do_mremap(unsigned long addr,
if (max_addr - addr >= new_len) {
int pages = (new_len - old_len) >> PAGE_SHIFT;
- vma_adjust(vma, vma->vm_start,
- addr + new_len, vma->vm_pgoff, NULL);
+ if (vma_adjust(vma, vma->vm_start, addr + new_len,
+ vma->vm_pgoff, NULL)) {
+ ret = -ENOMEM;
+ goto out;
+ }
mm->total_vm += pages;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
diff --git a/mm/nommu.c b/mm/nommu.c
index 1544a65..9db704f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1209,7 +1209,7 @@ unsigned long do_mmap_pgoff(struct file *file,
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
- INIT_LIST_HEAD(&vma->anon_vma_node);
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_flags = vm_flags;
vma->vm_pgoff = pgoff;
diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..5b675e3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -62,6 +62,7 @@
#include "internal.h"
static struct kmem_cache *anon_vma_cachep;
+static struct kmem_cache *anon_vma_chain_cachep;
static inline struct anon_vma *anon_vma_alloc(void)
{
@@ -73,6 +74,16 @@ void anon_vma_free(struct anon_vma *anon_vma)
kmem_cache_free(anon_vma_cachep, anon_vma);
}
+static inline struct anon_vma_chain *anon_vma_chain_alloc(void)
+{
+ return kmem_cache_alloc(anon_vma_chain_cachep, GFP_KERNEL);
+}
+
+void anon_vma_chain_free(struct anon_vma_chain *anon_vma_chain)
+{
+ kmem_cache_free(anon_vma_chain_cachep, anon_vma_chain);
+}
+
/**
* anon_vma_prepare - attach an anon_vma to a memory region
* @vma: the memory region in question
@@ -103,18 +114,23 @@ void anon_vma_free(struct anon_vma *anon_vma)
int anon_vma_prepare(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
might_sleep();
if (unlikely(!anon_vma)) {
struct mm_struct *mm = vma->vm_mm;
struct anon_vma *allocated;
+ avc = anon_vma_chain_alloc();
+ if (!avc)
+ goto out_enomem;
+
anon_vma = find_mergeable_anon_vma(vma);
allocated = NULL;
if (!anon_vma) {
anon_vma = anon_vma_alloc();
if (unlikely(!anon_vma))
- return -ENOMEM;
+ goto out_enomem_free_avc;
allocated = anon_vma;
}
spin_lock(&anon_vma->lock);
@@ -123,53 +139,113 @@ int anon_vma_prepare(struct vm_area_struct *vma)
spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
- list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ avc->anon_vma = anon_vma;
+ avc->vma = vma;
+ list_add(&avc->same_vma, &vma->anon_vma_chain);
+ list_add(&avc->same_anon_vma, &anon_vma->head);
allocated = NULL;
}
spin_unlock(&mm->page_table_lock);
spin_unlock(&anon_vma->lock);
- if (unlikely(allocated))
+ if (unlikely(allocated)) {
anon_vma_free(allocated);
+ anon_vma_chain_free(avc);
+ }
}
return 0;
+
+ out_enomem_free_avc:
+ anon_vma_chain_free(avc);
+ out_enomem:
+ return -ENOMEM;
}
-void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
+static void anon_vma_chain_link(struct vm_area_struct *vma,
+ struct anon_vma_chain *avc,
+ struct anon_vma *anon_vma)
{
- BUG_ON(vma->anon_vma != next->anon_vma);
- list_del(&next->anon_vma_node);
+ avc->vma = vma;
+ avc->anon_vma = anon_vma;
+ list_add(&avc->same_vma, &vma->anon_vma_chain);
+
+ spin_lock(&anon_vma->lock);
+ list_add_tail(&avc->same_anon_vma, &anon_vma->head);
+ spin_unlock(&anon_vma->lock);
}
-void __anon_vma_link(struct vm_area_struct *vma)
+/*
+ * Attach the anon_vmas from src to dst.
+ * Returns 0 on success, -ENOMEM on failure.
+ */
+int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc, *pavc;
+
+ list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
+ avc = anon_vma_chain_alloc();
+ if (!avc)
+ goto enomem_failure;
+ anon_vma_chain_link(dst, avc, pavc->anon_vma);
+ }
+ return 0;
- if (anon_vma)
- list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ enomem_failure:
+ unlink_anon_vmas(dst);
+ return -ENOMEM;
}
-void anon_vma_link(struct vm_area_struct *vma)
+/*
+ * Attach vma to its own anon_vma, as well as to the anon_vmas that
+ * the corresponding VMA in the parent process is attached to.
+ * Returns 0 on success, non-zero on failure.
+ */
+int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
- if (anon_vma) {
- spin_lock(&anon_vma->lock);
- list_add_tail(&vma->anon_vma_node, &anon_vma->head);
- spin_unlock(&anon_vma->lock);
- }
+ /* Don't bother if the parent process has no anon_vma here. */
+ if (!pvma->anon_vma)
+ return 0;
+
+ /*
+ * First, attach the new VMA to the parent VMA's anon_vmas,
+ * so rmap can find non-COWed pages in child processes.
+ */
+ if (anon_vma_clone(vma, pvma))
+ return -ENOMEM;
+
+ /* Then add our own anon_vma. */
+ anon_vma = anon_vma_alloc();
+ if (!anon_vma)
+ goto out_error;
+ avc = anon_vma_chain_alloc();
+ if (!avc)
+ goto out_error_free_anon_vma;
+ anon_vma_chain_link(vma, avc, anon_vma);
+ /* Mark this anon_vma as the one where our new (COWed) pages go. */
+ vma->anon_vma = anon_vma;
+
+ return 0;
+
+ out_error_free_anon_vma:
+ anon_vma_free(anon_vma);
+ out_error:
+ return -ENOMEM;
}
-void anon_vma_unlink(struct vm_area_struct *vma)
+static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
int empty;
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
if (!anon_vma)
return;
spin_lock(&anon_vma->lock);
- list_del(&vma->anon_vma_node);
+ list_del_init(&anon_vma_chain->same_anon_vma);
/* We must garbage collect the anon_vma if it's empty */
empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
@@ -179,6 +255,18 @@ void anon_vma_unlink(struct vm_area_struct *vma)
anon_vma_free(anon_vma);
}
+void unlink_anon_vmas(struct vm_area_struct *vma)
+{
+ struct anon_vma_chain *avc, *next;
+
+ /* Unlink each anon_vma chained to the VMA. */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ anon_vma_unlink(avc);
+ list_del_init(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+}
+
static void anon_vma_ctor(void *data)
{
struct anon_vma *anon_vma = data;
@@ -188,10 +276,21 @@ static void anon_vma_ctor(void *data)
INIT_LIST_HEAD(&anon_vma->head);
}
+static void anon_vma_chain_ctor(void *data)
+{
+ struct anon_vma_chain *anon_vma_chain = data;
+
+ INIT_LIST_HEAD(&anon_vma_chain->same_vma);
+ INIT_LIST_HEAD(&anon_vma_chain->same_anon_vma);
+}
+
void __init anon_vma_init(void)
{
anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
+ anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
+ sizeof(struct anon_vma_chain), 0,
+ SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_chain_ctor);
}
/*
@@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
/* page should be within @vma mapping range */
return -EFAULT;
}
+ if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
+ /*
+ * This VMA is being unlinked or not yet linked into the
+ * VMA tree. Do not try to follow this rmap. This race
+ * condition can result in page_referenced ignoring a
+ * reference or try_to_unmap failing to unmap a page.
+ */
+ return -EFAULT;
return address;
}
@@ -396,7 +503,7 @@ static int page_referenced_anon(struct page *page,
{
unsigned int mapcount;
struct anon_vma *anon_vma;
- struct vm_area_struct *vma;
+ struct anon_vma_chain *avc;
int referenced = 0;
anon_vma = page_lock_anon_vma(page);
@@ -404,7 +511,8 @@ static int page_referenced_anon(struct page *page,
return referenced;
mapcount = page_mapcount(page);
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+ struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
continue;
@@ -1024,14 +1132,15 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
struct anon_vma *anon_vma;
- struct vm_area_struct *vma;
+ struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
return ret;
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+ struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
continue;
@@ -1222,7 +1331,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
struct anon_vma *anon_vma;
- struct vm_area_struct *vma;
+ struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;
/*
@@ -1237,7 +1346,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+ struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
continue;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-18 3:21 [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue Rik van Riel
@ 2010-01-21 5:05 ` KOSAKI Motohiro
2010-01-21 5:21 ` Rik van Riel
0 siblings, 1 reply; 9+ messages in thread
From: KOSAKI Motohiro @ 2010-01-21 5:05 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, linux-mm, linux-kernel, Lee Schermerhorn,
minchan.kim, lwoodman, aarcange
Hi
> Thanks to some hints from peterz, and a day of debugging this weekend,
> this version boots all the way to udev, then appears to hang. I probably
> made a silly mistake somewhere. I am mailing this patch out in the hopes
> that my last silly mistakes will be seen by someone :)
> ----
>
> The old anon_vma code can lead to scalability issues with heavily
> forking workloads. Specifically, each anon_vma will be shared
> between the parent process and all its child processes.
>
> In a workload with 1000 child processes and a VMA with 1000 anonymous
> pages per process that get COWed, this leads to a system with a million
> anonymous pages in the same anon_vma, each of which is mapped in just
> one of the 1000 processes. However, the current rmap code needs to
> walk them all, leading to O(N) scanning complexity for each page.
>
> This can result in systems where one CPU is walking the page tables
> of 1000 processes in page_referenced_one, while all other CPUs are
> stuck on the anon_vma lock. This leads to catastrophic failure for
> a benchmark like AIM7, where the total number of processes can reach
> in the tens of thousands. Real workloads are still a factor 10 less
> process intensive than AIM7, but they are catching up.
>
> This patch changes the way anon_vmas and VMAs are linked, which
> allows us to associate multiple anon_vmas with a VMA. At fork
> time, each child process gets its own anon_vmas, in which its
> COWed pages will be instantiated. The parents' anon_vma is also
> linked to the VMA, because non-COWed pages could be present in
> any of the children.
>
> This reduces rmap scanning complexity to O(1) for the pages of
> the 1000 child processes, with O(N) complexity for at most 1/N
> pages in the system. This reduces the average scanning cost in
> heavily forking workloads from O(N) to 2.
>
> The only real complexity in this patch stems from the fact that
> linking a VMA to anon_vmas now involves memory allocations. This
> means vma_adjust can fail, if it needs to attach a VMA to anon_vma
> structures. This in turn means error handling needs to be added
> to the calling functions.
>
> A second source of complexity is that, because there can be
> multiple anon_vmas, the anon_vma linking in vma_adjust can
> no longer be done under "the" anon_vma lock. To prevent the
> rmap code from walking up an incomplete VMA, this patch
> introduces the VM_LOCK_RMAP VMA flag. This bit flag uses
> the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef
> in mm.h to make sure it is impossible to compile a kernel
> that needs both symbolic values for the same bitflag.
I've only roughly reviewed this patch, so perhaps I missed something.
My first impression is that this is a fairly large patch whose benefit
only affects a corner case.
If I remember correctly, a few weeks ago you said you expected Nick's fair
rwlock plus Larry's rw-anon-lock to give good results. Why did you write an
alternative patch? Did that approach give bad results, or does this patch
have additional benefits?
This patch seems to increase fork overhead in exchange for decreasing
vmscan overhead. I'm not sure that is a good deal.
[re-read page_referenced_anon() ... ]
Hmm...
Why can't we convert the read-side anon_vma walk to RCU? That needs
RCU-aware VMA freeing, but anon_vma is already freed by RCU.
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-21 5:05 ` KOSAKI Motohiro
@ 2010-01-21 5:21 ` Rik van Riel
2010-01-21 15:29 ` Minchan Kim
2010-01-22 6:57 ` KOSAKI Motohiro
0 siblings, 2 replies; 9+ messages in thread
From: Rik van Riel @ 2010-01-21 5:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-mm, linux-kernel, Lee Schermerhorn, minchan.kim, lwoodman,
aarcange
On 01/21/2010 12:05 AM, KOSAKI Motohiro wrote:
>> In a workload with 1000 child processes and a VMA with 1000 anonymous
>> pages per process that get COWed, this leads to a system with a million
>> anonymous pages in the same anon_vma, each of which is mapped in just
>> one of the 1000 processes. However, the current rmap code needs to
>> walk them all, leading to O(N) scanning complexity for each page.
>> This reduces rmap scanning complexity to O(1) for the pages of
>> the 1000 child processes, with O(N) complexity for at most 1/N
>> pages in the system. This reduces the average scanning cost in
>> heavily forking workloads from O(N) to 2.
> I've only roughly reviewed this patch, so perhaps I missed something.
> My first impression is that this is a fairly large patch whose benefit
> only affects a corner case.
At the moment it mostly triggers with artificial workloads, but
having 1000 client connections to e.g. an Oracle database is not
unheard of.
The reason for wanting to fix the corner case is because it is
so incredibly bad.
> If I remember correctly, a few weeks ago you said you expected Nick's fair
> rwlock plus Larry's rw-anon-lock to give good results. Why did you write an
> alternative patch? Did that approach give bad results, or does this patch
> have additional benefits?
After looking at the complexity figures (above), I suspect that
a factor 5-10 speedup is not going to fix a factor 1000
increase in complexity.
> This patch seems to increase fork overhead in exchange for decreasing
> vmscan overhead. I'm not sure that is a good deal.
My hope is that the overhead of adding a few small objects per VMA
will be unnoticeable, compared to the overhead of refcounting pages,
handling page tables, etc.
It may look like the code could build up a lot of anon_vma_chains,
but in practice the depth is limited because exec() wipes them all out.
Most of the time we will have just 0, 1 or 2 anon_vmas attached to
a VMA - one for the current process and one for the parent.
> Hmm...
> Why can't we convert the read-side anon_vma walk to RCU? That needs
> RCU-aware VMA freeing, but anon_vma is already freed by RCU.
Changing the locking to RCU does not reduce the amount of work
that needs to be done in page_referenced_anon. If we have 1000
siblings with 1000 pages each, we still end up scanning all
1000 processes for each of those 1000 pages in the pageout code.
Adding parallelism to that with better locking may speed it up
by the number of CPUs at most, which really may not help much
in these workloads.
Today having 1000 client connections to a forking server is
considered a lot, but I suspect it could be more common in a
few years. I would like Linux to be ready for those kinds of
workloads.
--
All rights reversed.
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-21 5:21 ` Rik van Riel
@ 2010-01-21 15:29 ` Minchan Kim
2010-01-24 14:17 ` Minchan Kim
2010-01-22 6:57 ` KOSAKI Motohiro
1 sibling, 1 reply; 9+ messages in thread
From: Minchan Kim @ 2010-01-21 15:29 UTC (permalink / raw)
To: Rik van Riel
Cc: KOSAKI Motohiro, linux-mm, linux-kernel, Lee Schermerhorn,
lwoodman, aarcange
Hi, Rik.
Actually, I tested this patch a few days ago.
I hit the same problem you did: a hang with udev.
I will debug it when I have time. :)
On Thu, 2010-01-21 at 00:21 -0500, Rik van Riel wrote:
> Today having 1000 client connections to a forking server is
> considered a lot, but I suspect it could be more common in a
> few years. I would like Linux to be ready for those kinds of
> workloads.
>
BTW, last year I suggested removing the anon_vma facility itself on
machines with no swap (i.e. embedded machines).
Although your patch adds only a small cost, many small-memory machines
will not like it once they become aware of this patch.
So I want to make this patch configurable until we prove it has no
cost; afterwards we can remove the config option.
Thanks for this new attempt to improve the VM. :)
--
Kind regards,
Minchan Kim
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-21 5:21 ` Rik van Riel
2010-01-21 15:29 ` Minchan Kim
@ 2010-01-22 6:57 ` KOSAKI Motohiro
2010-01-22 16:25 ` Rik van Riel
2010-01-27 20:57 ` Rik van Riel
1 sibling, 2 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2010-01-22 6:57 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, linux-mm, linux-kernel, Lee Schermerhorn,
minchan.kim, lwoodman, aarcange
> On 01/21/2010 12:05 AM, KOSAKI Motohiro wrote:
>
> >> In a workload with 1000 child processes and a VMA with 1000 anonymous
> >> pages per process that get COWed, this leads to a system with a million
> >> anonymous pages in the same anon_vma, each of which is mapped in just
> >> one of the 1000 processes. However, the current rmap code needs to
> >> walk them all, leading to O(N) scanning complexity for each page.
>
> >> This reduces rmap scanning complexity to O(1) for the pages of
> >> the 1000 child processes, with O(N) complexity for at most 1/N
> >> pages in the system. This reduces the average scanning cost in
> >> heavily forking workloads from O(N) to 2.
>
> > I've only roughly reviewed this patch, so perhaps I missed something.
> > My first impression is that this is a fairly large patch whose benefit
> > only affects a corner case.
>
> At the moment it mostly triggers with artificial workloads, but
> having 1000 client connections to eg. an Oracle database is not
> unheard of.
>
> The reason for wanting to fix the corner case is because it is
> so incredibly bad.
>
> > If I remember correctly, a few weeks ago you said you expected Nick's fair
> > rwlock plus Larry's rw-anon-lock to give good results. Why did you write an
> > alternative patch? Did that approach give bad results, or does this patch
> > have additional benefits?
>
> After looking at the complexity figures (above), I suspect that
> making a factor 5-10 speedup is not going to fix a factor 1000
> increased complexity.
>
> > This patch seems to increase fork overhead in exchange for decreasing
> > vmscan overhead. I'm not sure that is a good deal.
>
> My hope is that the overhead of adding a few small objects per VMA
> will be unnoticable, compared to the overhead of refcounting pages,
> handling page tables, etc.
>
> The code looks like it could be a lot of anon_vma_chains, but in
> practice the depth is limited because exec() wipes them all out.
> Most of the time we will have just 0, 1 or 2 anon_vmas attached to
> a VMA - one for the current process and one for the parent.
>
> > Hmm...
> > Why can't we convert the read-side anon_vma walk to RCU? That needs
> > RCU-aware VMA freeing, but anon_vma is already freed by RCU.
>
> Changing the locking to RCU does not reduce the amount of work
> that needs to be done in page_referenced_anon. If we have 1000
> siblings with 1000 pages each, we still end up scanning all
> 1000 processes for each of those 1000 pages in the pageout code.
>
> Adding parallelism to that with better locking may speed it up
> by the number of CPUs at most, which really may not help much
> in these workloads.
>
> Today having 1000 client connections to a forking server is
> considered a lot, but I suspect it could be more common in a
> few years. I would like Linux to be ready for those kinds of
> workloads.
Thanks, I think I understand your intention now. This patch needs a
performance comparison with the simple rw-spinlock patch, but
I won't worry about that; maybe someone else will do it.
A rough review follows.
[ Generally, this patch has too few lock-related comments. I don't
think I have understood the correct locking rules of this patch yet. ]
> @@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> /*
> * cover the whole range: [new_start, old_end)
> */
> - vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
> + if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> + return -ENOMEM;
shift_arg_pages() has two vma_adjust() calls. Why don't we need to change both?
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 84a524a..44cfb13 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -167,7 +167,7 @@ struct vm_area_struct {
> * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
> * or brk vma (with NULL file) can only be in an anon_vma list.
> */
> - struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
> + struct list_head anon_vma_chain; /* Serialized by anon_vma->lock */
> struct anon_vma *anon_vma; /* Serialized by page_table_lock */
Is this comment really correct? For example, the following
vma->anon_vma_chain operation takes place outside the anon_vma lock.
static void anon_vma_chain_link(struct vm_area_struct *vma,
struct anon_vma_chain *avc,
struct anon_vma *anon_vma)
{
avc->vma = vma;
avc->anon_vma = anon_vma;
list_add(&avc->same_vma, &vma->anon_vma_chain);
spin_lock(&anon_vma->lock);
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
spin_unlock(&anon_vma->lock);
}
I guess you intended to write /* locked by mmap_sem & friends */.
Note: I don't think "& friends" makes a good comment, though ;-)
> /* Function pointers to deal with this struct. */
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b019ae6..0d1903a 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -37,7 +37,27 @@ struct anon_vma {
> * is serialized by a system wide lock only visible to
> * mm_take_all_locks() (mm_all_locks_mutex).
> */
> - struct list_head head; /* List of private "related" vmas */
> + struct list_head head; /* Chain of private "related" vmas */
> +};
Hmm...
This comment seems unclear. This list head does not link struct
vm_area_struct; instead, it links struct anon_vma_chain, so "related
vmas" is not a helpful comment.
> +
> +/*
> + * The copy-on-write semantics of fork mean that an anon_vma
> + * can become associated with multiple processes. Furthermore,
> + * each child process will have its own anon_vma, where new
> + * pages for that process are instantiated.
> + *
> + * This structure allows us to find the anon_vmas associated
> + * with a VMA, or the VMAs associated with an anon_vma.
> + * The "same_vma" list contains the anon_vma_chains linking
> + * all the anon_vmas associated with this VMA.
> + * The "same_anon_vma" list contains the anon_vma_chains
> + * which link all the VMAs associated with this anon_vma.
> + */
> +struct anon_vma_chain {
> + struct vm_area_struct *vma;
> + struct anon_vma *anon_vma;
> + struct list_head same_vma; /* locked by mmap_sem & friends */
> + struct list_head same_anon_vma; /* locked by anon_vma->lock */
> };
This place probably needs many more comments. struct anon_vma_chain
creates a very complex relationship graph; an example or good ASCII art
is needed. In particular, after fork the parent and child have different
anon_vma_chains, which seems tricky.
> +static inline void anon_vma_merge(struct vm_area_struct *vma,
> + struct vm_area_struct *next)
> +{
> + BUG_ON(vma->anon_vma != next->anon_vma);
> + unlink_anon_vmas(next);
> +}
> +
Perhaps VM_BUG_ON is enough?
> @@ -792,11 +809,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> is_mergeable_anon_vma(prev->anon_vma,
> next->anon_vma)) {
> /* cases 1, 6 */
> - vma_adjust(prev, prev->vm_start,
> + err = vma_adjust(prev, prev->vm_start,
> next->vm_end, prev->vm_pgoff, NULL);
> } else /* cases 2, 5, 7 */
> - vma_adjust(prev, prev->vm_start,
> + err = vma_adjust(prev, prev->vm_start,
> end, prev->vm_pgoff, NULL);
> + if (err)
> + return NULL;
> return prev;
> }
Currently, the callers of vma_merge() assume vma_merge() cannot fail.
IOW, they do not treat a NULL return as a failure.
We probably need to change all the callers too.
> @@ -2454,7 +2506,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> if (signal_pending(current))
> goto out_unlock;
> if (vma->anon_vma)
> - vm_lock_anon_vma(mm, vma->anon_vma);
> + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> + vm_lock_anon_vma(mm, avc->anon_vma);
> }
This function is not protected by mmap_sem, but iterating over
anon_vma_chain->same_vma needs mmap_sem if your comment is correct.
> @@ -2509,13 +2562,15 @@ static void vm_unlock_mapping(struct address_space *mapping)
> void mm_drop_all_locks(struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> + struct anon_vma_chain *avc;
>
> BUG_ON(down_read_trylock(&mm->mmap_sem));
> BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
>
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> if (vma->anon_vma)
> - vm_unlock_anon_vma(vma->anon_vma);
> + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> + vm_unlock_anon_vma(avc->anon_vma);
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> }
ditto.
> @@ -188,10 +276,21 @@ static void anon_vma_ctor(void *data)
> INIT_LIST_HEAD(&anon_vma->head);
> }
> void __init anon_vma_init(void)
> {
> anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
> 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
> + anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
> + sizeof(struct anon_vma_chain), 0,
> + SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_chain_ctor);
> }
Why do we need SLAB_DESTROY_BY_RCU?
The anon_vma one is required by page migration (oops, that should be
commented, I think), but which code requires it for anon_vma_chain?
> /*
> @@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> /* page should be within @vma mapping range */
> return -EFAULT;
> }
> + if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
> + /*
> + * This VMA is being unlinked or not yet linked into the
> + * VMA tree. Do not try to follow this rmap. This race
> + * condition can result in page_referenced ignoring a
> + * reference or try_to_unmap failing to unmap a page.
> + */
> + return -EFAULT;
> return address;
> }
At this point the task holds anon_vma->lock, but not mmap_sem.
However, the place that changes VM_LOCK_RMAP (i.e. vma_adjust()) is
protected by mmap_sem. IOW, "if (vma->vm_flags & VM_LOCK_RMAP)" reads
an unstable value. Why can we use an unstable value as a "lock"?
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-22 6:57 ` KOSAKI Motohiro
@ 2010-01-22 16:25 ` Rik van Riel
2010-01-25 8:37 ` KOSAKI Motohiro
2010-01-27 20:57 ` Rik van Riel
1 sibling, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2010-01-22 16:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-mm, linux-kernel, Lee Schermerhorn, minchan.kim, lwoodman,
aarcange
On 01/22/2010 01:57 AM, KOSAKI Motohiro wrote:
> [ Generally, this patch has too few lock-related comments. I don't
> think I have understood the correct locking rules of this patch yet. ]
I did not introduce any new locks with this patch: the linked list on
each "side" of the anon_vma link is protected by the lock on that
side - the anon_vma lock for the same_anon_vma list and the per-mm
locks for the vma side of the list.
>> @@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>> /*
>> * cover the whole range: [new_start, old_end)
>> */
>> - vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
>> + if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
>> + return -ENOMEM;
>
> shift_arg_pages() has two vma_adjust() calls. Why don't we need to change both?
Because shrinking a VMA cannot fail. Looking at it some
more, this call cannot fail either because we check that
there is enough space to grow this VMA downward.
>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 84a524a..44cfb13 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -167,7 +167,7 @@ struct vm_area_struct {
>> * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
>> * or brk vma (with NULL file) can only be in an anon_vma list.
>> */
>> - struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
>> + struct list_head anon_vma_chain; /* Serialized by anon_vma->lock */
>> struct anon_vma *anon_vma; /* Serialized by page_table_lock */
>
> Is this comment really correct? For example, the following
> vma->anon_vma_chain operation takes place outside the anon_vma lock.
> I guess you intended to write /* locked by mmap_sem & friends */.
You are right. I will fix that comment.
> Note: I don't think "& friends" makes a good comment, though ;-)
No kidding - however this is how mmap.c already serializes
all kinds of things :)
>
>> /* Function pointers to deal with this struct. */
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b019ae6..0d1903a 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -37,7 +37,27 @@ struct anon_vma {
>> * is serialized by a system wide lock only visible to
>> * mm_take_all_locks() (mm_all_locks_mutex).
>> */
>> - struct list_head head; /* List of private "related" vmas */
>> + struct list_head head; /* Chain of private "related" vmas */
>> +};
>
> Hmm...
> This comment seems unclear. This list head does not link struct
> vm_area_struct; instead, it links struct anon_vma_chain, so "related
> vmas" is not a helpful comment.
True, it is a chain that points to the related VMAs.
>
>> +
>> +/*
>> + * The copy-on-write semantics of fork mean that an anon_vma
>> + * can become associated with multiple processes. Furthermore,
>> + * each child process will have its own anon_vma, where new
>> + * pages for that process are instantiated.
>> + *
>> + * This structure allows us to find the anon_vmas associated
>> + * with a VMA, or the VMAs associated with an anon_vma.
>> + * The "same_vma" list contains the anon_vma_chains linking
>> + * all the anon_vmas associated with this VMA.
>> + * The "same_anon_vma" list contains the anon_vma_chains
>> + * which link all the VMAs associated with this anon_vma.
>> + */
>> +struct anon_vma_chain {
>> + struct vm_area_struct *vma;
>> + struct anon_vma *anon_vma;
>> + struct list_head same_vma; /* locked by mmap_sem& friends */
>> + struct list_head same_anon_vma; /* locked by anon_vma->lock */
>> };
>
> This place probably needs many more comments. struct anon_vma_chain
> creates a very complex relationship graph; an example or good ASCII art
> is needed. In particular, after fork the parent and child have different
> anon_vma_chains, which seems tricky.
I guess I'll have to draw up some ascii art...
>> +static inline void anon_vma_merge(struct vm_area_struct *vma,
>> + struct vm_area_struct *next)
>> +{
>> + BUG_ON(vma->anon_vma != next->anon_vma);
>> + unlink_anon_vmas(next);
>> +}
>> +
>
> Perhaps VM_BUG_ON is enough?
OK.
>> @@ -792,11 +809,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> is_mergeable_anon_vma(prev->anon_vma,
>> next->anon_vma)) {
>> /* cases 1, 6 */
>> - vma_adjust(prev, prev->vm_start,
>> + err = vma_adjust(prev, prev->vm_start,
>> next->vm_end, prev->vm_pgoff, NULL);
>> } else /* cases 2, 5, 7 */
>> - vma_adjust(prev, prev->vm_start,
>> + err = vma_adjust(prev, prev->vm_start,
>> end, prev->vm_pgoff, NULL);
>> + if (err)
>> + return NULL;
>> return prev;
>> }
>
> Currently, the callers of vma_merge() assume vma_merge() cannot fail.
> IOW, they do not treat a NULL return as a failure.
>
> We probably need to change all the callers too.
Are you sure? To me it looks like vma_merge returns NULL when it
fails to do a merge, leaving the VMAs alone as a result.
What am I missing?
>> @@ -2454,7 +2506,8 @@ int mm_take_all_locks(struct mm_struct *mm)
>> if (signal_pending(current))
>> goto out_unlock;
>> if (vma->anon_vma)
>> - vm_lock_anon_vma(mm, vma->anon_vma);
>> + list_for_each_entry(avc,&vma->anon_vma_chain, same_vma)
>> + vm_lock_anon_vma(mm, avc->anon_vma);
>> }
>
> This function is not protected by mmap_sem, but iterating over
> anon_vma_chain->same_vma needs mmap_sem if your comment is correct.
The comment above mm_take_all_locks says:
* The caller must take the mmap_sem in write mode before calling
* mm_take_all_locks(). The caller isn't allowed to release the
* mmap_sem until mm_drop_all_locks() returns.
>> @@ -188,10 +276,21 @@ static void anon_vma_ctor(void *data)
>> INIT_LIST_HEAD(&anon_vma->head);
>> }
>
>
>> void __init anon_vma_init(void)
>> {
>> anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
>> 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
>> + anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
>> + sizeof(struct anon_vma_chain), 0,
>> + SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_chain_ctor);
>> }
>
> Why do we need SLAB_DESTROY_BY_RCU?
> The anon_vma one is required by page migration (oops, that should be
> commented, I think), but which code requires it for anon_vma_chain?
I just copied that code over - I guess we don't need that flag
and I'll remove it.
>> /*
>> @@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>> /* page should be within @vma mapping range */
>> return -EFAULT;
>> }
>> + if (unlikely(vma->vm_flags& VM_LOCK_RMAP))
>> + /*
>> + * This VMA is being unlinked or not yet linked into the
>> + * VMA tree. Do not try to follow this rmap. This race
>> + * condition can result in page_referenced ignoring a
>> + * reference or try_to_unmap failing to unmap a page.
>> + */
>> + return -EFAULT;
>> return address;
>> }
>
> At this point the task holds anon_vma->lock, but not mmap_sem.
> However, the place that changes VM_LOCK_RMAP (i.e. vma_adjust()) is
> protected by mmap_sem.
>
> IOW, "if (vma->vm_flags & VM_LOCK_RMAP)" reads an unstable value. Why
> can we use an unstable value as a "lock"?
That's a good question. I will have to think about it some more.
--
All rights reversed.
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-21 15:29 ` Minchan Kim
@ 2010-01-24 14:17 ` Minchan Kim
0 siblings, 0 replies; 9+ messages in thread
From: Minchan Kim @ 2010-01-24 14:17 UTC (permalink / raw)
To: Rik van Riel
Cc: KOSAKI Motohiro, linux-mm, linux-kernel, Lee Schermerhorn,
lwoodman, aarcange
On Fri, 2010-01-22 at 00:29 +0900, Minchan Kim wrote:
> Hi, Rik.
>
> Actually, I tested this patch a few days ago.
> I hit the same problem you did: a hang with udev.
>
> I will debug it when I have time. :)
>
Today I tried to debug it, but did not get any useful clue.
I tried the debug patch below and got the following result.
It means anon_vma_chain has a wrong entry,
but I don't know why that happens.
I tried to catch the wrong entry at the point where we add an
anon_vma_chain to vma->anon_vma_chain, but couldn't find it.
I think it might happen when the entry is removed, or there is
a dangling pointer somewhere due to a locking problem.
Could this be caused by SLAB_DESTROY_BY_RCU
reusing a SLAB page?
== RESULT ==
* PulseAudio configured for per-user sessions
saned disabled; edit /etc/default/saned
* Starting System Tools Backends system-tools-backends          [ OK ]
* Starting anac(h)ronistic cron anacron                         [ OK ]
* Starting deferred execution scheduler atd                     [ OK ]
* Starting periodic command scheduler crond                     [ OK ]
* Enabling additional executable binary formats binfmt-support  [ OK ]
* Checking battery state...                                     [ OK ]
count 1 avc f63abce4 magic 21 vma f63876e0
------------[ cut here ]------------
kernel BUG at mm/rmap.c:282!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/type
Modules linked in:
Pid: 2920, comm: nautilus Not tainted 2.6.33-rc4-mm1 #39 /
EIP: 0060:[<c02142f8>] EFLAGS: 00010286 CPU: 0
EIP is at unlink_anon_vmas+0xd8/0x100
EAX: 00000031 EBX: f63abce4 ECX: f60bc8c0 EDX: 00000000
ESI: f63abce4 EDI: f63abcec EBP: f6233f10 ESP: f6233edc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process nautilus (pid: 2920, ti=f6232000 task=f60bc8c0 task.ti=f6232000)
Stack:
c082c5b8 00000001 f63abce4 00000015 f63876e0 f628ee10 f63876e0 f6387714
<0> 00000001 f62d51f0 00000000 09950000 b3e7d000 f6233f38 c020b1f8 c03a945e
<0> fffffeff f63876e0 fffffeff c1c03160 f644acb8 f6a7d440 f6387528 f6233f68
Call Trace:
[<c020b1f8>] ? free_pgtables+0x28/0xe0
[<c03a945e>] ? __percpu_counter_add+0x9e/0xd0
[<c0211b72>] ? unmap_region+0xd2/0x120
[<c0211d88>] ? do_munmap+0x1c8/0x2e0
[<c0211edd>] ? sys_munmap+0x3d/0x60
[<c012e3a3>] ? sysenter_do_call+0x12/0x38
Code: 00 00 74 8f 89 f3 8b 45 e4 89 44 24 10 8b 43 18 89 5c 24 08 c7 04 24 b8 c5 82 c0 89 44 24 0c 8b 45 ec 89 44 24 04 e8 66 36 49 00 <0f> 0b eb fe 8d 74 26 00 83 c4 28 5b 5e 5f 5d c3 89 d0 e8 a1 69
EIP: [<c02142f8>] unlink_anon_vmas+0xd8/0x100 SS:ESP 0068:f6233edc
---[ end trace 1536c613246c1ea7 ]---
== DEBUG PATCH ==
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0d1903a..fd77d90 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -58,6 +58,7 @@ struct anon_vma_chain {
struct anon_vma *anon_vma;
struct list_head same_vma; /* locked by mmap_sem & friends */
struct list_head same_anon_vma; /* locked by anon_vma->lock */
+ int magic; /* for debug */
};
#ifdef CONFIG_MMU
diff --git a/mm/rmap.c b/mm/rmap.c
index d9feb1d..eed2844 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -255,15 +255,36 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
anon_vma_free(anon_vma);
}
+/* sizeof(struct anon_vma_chain) is less than 0x20.
+ * So kmalloced addr might be aligned by 0x20.
+ * 1 means success, 0 means fail.
+ */
+int validate_anon_vma_chain(struct anon_vma_chain *avc)
+{
+ unsigned int addr = (unsigned int)avc;
+ addr %= 0x20;
+ if (addr)
+ return 0;
+ return 1;
+
+}
+
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ int count = 0;
/* Unlink each anon_vma chained to the VMA. */
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ if (!validate_anon_vma_chain(avc)) {
+ printk(KERN_ERR "count %d avc %p magic %d vma %p\n",
+ count, avc, avc->magic, vma);
+ BUG();
+ }
anon_vma_unlink(avc);
list_del_init(&avc->same_vma);
anon_vma_chain_free(avc);
+ count++;
}
}
@@ -282,6 +303,7 @@ static void anon_vma_chain_ctor(void *data)
INIT_LIST_HEAD(&anon_vma_chain->same_vma);
INIT_LIST_HEAD(&anon_vma_chain->same_anon_vma);
+ anon_vma_chain->magic = 0x83;
}
void __init anon_vma_init(void)
--
Kind regards,
Minchan Kim
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-22 16:25 ` Rik van Riel
@ 2010-01-25 8:37 ` KOSAKI Motohiro
0 siblings, 0 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2010-01-25 8:37 UTC (permalink / raw)
To: Rik van Riel
Cc: kosaki.motohiro, linux-mm, linux-kernel, Lee Schermerhorn,
minchan.kim, lwoodman, aarcange
> On 01/22/2010 01:57 AM, KOSAKI Motohiro wrote:
>
> > [ generally, this patch has too few lock-related comments. I don't think I
> > have understood the locking rules of this patch correctly. ]
>
> I do not introduce any new locks with this patch, locking the
> linked list on each "side" of the anon_vma_link with the lock
> on that side - the anon_vma lock for the same_anon_vma list
> and the per-mm locks on the vma side of the list.
>
> >> @@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> >> /*
> >> * cover the whole range: [new_start, old_end)
> >> */
> >> - vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
> >> + if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> >> + return -ENOMEM;
> >
> > shift_arg_pages() have two vma_adjust() call. why don't we need change both?
>
> Because shrinking a VMA cannot fail. Looking at it some
> more, this call cannot fail either because we check that
> there is enough space to grow this VMA downward.
I see. You are right.
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 84a524a..44cfb13 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -167,7 +167,7 @@ struct vm_area_struct {
> >> * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
> >> * or brk vma (with NULL file) can only be in an anon_vma list.
> >> */
> >> - struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
> >> + struct list_head anon_vma_chain; /* Serialized by anon_vma->lock */
> >> struct anon_vma *anon_vma; /* Serialized by page_table_lock */
> >
> > Is this comment really correct? For example, the following vma->anon_vma_chain
> > operation takes place outside the anon_vma lock.
>
> > I guess you intended to write /* locked by mmap_sem & friends */.
>
> You are right. I will fix that comment.
>
> > note: however, I don't think "& friends" is a good comment ;-)
>
> No kidding - however this is how mmap.c already serializes
> all kinds of things :)
ok.
> >> /* Function pointers to deal with this struct. */
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index b019ae6..0d1903a 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -37,7 +37,27 @@ struct anon_vma {
> >> * is serialized by a system wide lock only visible to
> >> * mm_take_all_locks() (mm_all_locks_mutex).
> >> */
> >> - struct list_head head; /* List of private "related" vmas */
> >> + struct list_head head; /* Chain of private "related" vmas */
> >> +};
> >
> > Hmm..
> > This comment seems unclear. This list head doesn't link struct vm_area_struct;
> > instead, it links struct anon_vma_chain, so "related vmas" isn't a helpful comment.
>
> True, it is a chain that points to the related VMAs.
thanks.
> >> +
> >> +/*
> >> + * The copy-on-write semantics of fork mean that an anon_vma
> >> + * can become associated with multiple processes. Furthermore,
> >> + * each child process will have its own anon_vma, where new
> >> + * pages for that process are instantiated.
> >> + *
> >> + * This structure allows us to find the anon_vmas associated
> >> + * with a VMA, or the VMAs associated with an anon_vma.
> >> + * The "same_vma" list contains the anon_vma_chains linking
> >> + * all the anon_vmas associated with this VMA.
> >> + * The "same_anon_vma" list contains the anon_vma_chains
> >> + * which link all the VMAs associated with this anon_vma.
> >> + */
> >> +struct anon_vma_chain {
> >> + struct vm_area_struct *vma;
> >> + struct anon_vma *anon_vma;
> >> + struct list_head same_vma; /* locked by mmap_sem & friends */
> >> + struct list_head same_anon_vma; /* locked by anon_vma->lock */
> >> };
> >
> > Probably this place needs a lot more comments. struct anon_vma_chain
> > makes for a very complex relationship graph; an example or good ascii art
> > is needed. Especially since, after fork, parent and child have different
> > anon_vma_chains, it seems tricky.
>
> I guess I'll have to draw up some ascii art...
>
> >> +static inline void anon_vma_merge(struct vm_area_struct *vma,
> >> + struct vm_area_struct *next)
> >> +{
> >> + BUG_ON(vma->anon_vma != next->anon_vma);
> >> + unlink_anon_vmas(next);
> >> +}
> >> +
> >
> > Probably VM_BUG_ON is enough?
>
> OK.
thanks.
> >> @@ -792,11 +809,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >> is_mergeable_anon_vma(prev->anon_vma,
> >> next->anon_vma)) {
> >> /* cases 1, 6 */
> >> - vma_adjust(prev, prev->vm_start,
> >> + err = vma_adjust(prev, prev->vm_start,
> >> next->vm_end, prev->vm_pgoff, NULL);
> >> } else /* cases 2, 5, 7 */
> >> - vma_adjust(prev, prev->vm_start,
> >> + err = vma_adjust(prev, prev->vm_start,
> >> end, prev->vm_pgoff, NULL);
> >> + if (err)
> >> + return NULL;
> >> return prev;
> >> }
> >
> > Currently, the callers of vma_merge() assume vma_merge() doesn't fail.
> > IOW, they don't treat a NULL return as failure.
> >
> > Probably we need to change all callers too.
>
> Are you sure? To me it looks like vma_merge returns NULL when it
> fails to do a merge, leaving the VMAs alone as a result.
>
> What am I missing?
I think it depends on what "failure" means.
1) Don't merge because a sibling vma has different attributes
2) Don't merge because there is no memory
Traditionally, only (1) could occur.
For example, mlock.c has the following code.
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma));
if (*prev) {
vma = *prev;
goto success;
}
if (start != vma->vm_start) {
ret = split_vma(mm, vma, start, 1);
if (ret)
goto out;
}
It doesn't assume NULL is a failure; it treats it merely as "didn't merge".
But if (2) occurs, the syscall should fail, I think.
> >> @@ -2454,7 +2506,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> >> if (signal_pending(current))
> >> goto out_unlock;
> >> if (vma->anon_vma)
> >> - vm_lock_anon_vma(mm, vma->anon_vma);
> >> + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> >> + vm_lock_anon_vma(mm, avc->anon_vma);
> >> }
> >
> > This function is not protected by mmap_sem, but the anon_vma_chain->same_vma
> > iteration needs mmap_sem if your comment is correct.
>
> The comment above mm_take_all_locks says:
>
> * The caller must take the mmap_sem in write mode before calling
> * mm_take_all_locks(). The caller isn't allowed to release the
> * mmap_sem until mm_drop_all_locks() returns.
You are right. Sorry, my fault ;-)
> >> @@ -188,10 +276,21 @@ static void anon_vma_ctor(void *data)
> >> INIT_LIST_HEAD(&anon_vma->head);
> >> }
> >
> >
> >> void __init anon_vma_init(void)
> >> {
> >> anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
> >> 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
> >> + anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
> >> + sizeof(struct anon_vma_chain), 0,
> >> + SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_chain_ctor);
> >> }
> >
> > Why do we need SLAB_DESTROY_BY_RCU?
> > anon_vma's is required by page migration (oops, that should be commented, I think),
> > but which code requires anon_vma_chain's?
>
> I just copied that code over - I guess we don't need that flag
> and I'll remove it.
ok.
> >> /*
> >> @@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> >> /* page should be within @vma mapping range */
> >> return -EFAULT;
> >> }
> >> + if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
> >> + /*
> >> + * This VMA is being unlinked or not yet linked into the
> >> + * VMA tree. Do not try to follow this rmap. This race
> >> + * condition can result in page_referenced ignoring a
> >> + * reference or try_to_unmap failing to unmap a page.
> >> + */
> >> + return -EFAULT;
> >> return address;
> >> }
> >
> > In this place, the task holds anon_vma->lock but doesn't hold mmap_sem.
> > However, the point where VM_LOCK_RMAP changes (i.e. vma_adjust()) is protected by mmap_sem.
> >
> > IOW, "if (vma->vm_flags & VM_LOCK_RMAP)" reads an unstable value. Why can we
> > use an unstable value as a "lock"?
>
> That's a good question. I will have to think about it some more.
thanks.
* Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
2010-01-22 6:57 ` KOSAKI Motohiro
2010-01-22 16:25 ` Rik van Riel
@ 2010-01-27 20:57 ` Rik van Riel
1 sibling, 0 replies; 9+ messages in thread
From: Rik van Riel @ 2010-01-27 20:57 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-mm, linux-kernel, Lee Schermerhorn, minchan.kim, lwoodman,
aarcange
On 01/22/2010 01:57 AM, KOSAKI Motohiro wrote:
>> @@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>> /* page should be within @vma mapping range */
>> return -EFAULT;
>> }
>> + if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
>> + /*
>> + * This VMA is being unlinked or not yet linked into the
>> + * VMA tree. Do not try to follow this rmap. This race
>> + * condition can result in page_referenced ignoring a
>> + * reference or try_to_unmap failing to unmap a page.
>> + */
>> + return -EFAULT;
>> return address;
>> }
>
> In this place, the task holds anon_vma->lock but doesn't hold mmap_sem.
> However, the point where VM_LOCK_RMAP changes (i.e. vma_adjust()) is protected by mmap_sem.
>
> IOW, "if (vma->vm_flags & VM_LOCK_RMAP)" reads an unstable value. Why can we
> use an unstable value as a "lock"?
I know the answer to this one. The VMA cannot be freed until the
anon_vmas have been unlinked.
That is serialized on the anon_vma->lock. Either the pageout
code has that lock, or the VMA teardown code in mmap.c has it.
Either way they're protected from each other.
--
All rights reversed.
Thread overview: 9+ messages
2010-01-18 3:21 [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue Rik van Riel
2010-01-21 5:05 ` KOSAKI Motohiro
2010-01-21 5:21 ` Rik van Riel
2010-01-21 15:29 ` Minchan Kim
2010-01-24 14:17 ` Minchan Kim
2010-01-22 6:57 ` KOSAKI Motohiro
2010-01-22 16:25 ` Rik van Riel
2010-01-25 8:37 ` KOSAKI Motohiro
2010-01-27 20:57 ` Rik van Riel