* [PATCH 00 of 12] mmu notifier #v13
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Hello,
This is the latest and greatest version of the mmu notifier patch #v13.
Changes are mainly in mm_lock, which now uses sort() as suggested by Christoph.
This reduces the complexity from O(N**2) to O(N*log(N)).
I folded the mm_lock functionality into the mmu-notifier-core patch (1/12)
to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP: the lack of mmu notifiers is holding off KVM development. We are going
to rework the way pages are mapped and unmapped to work with pure pfns for
PCI passthrough without page pinning, and we can't do that without mmu
notifiers. This is not just a performance matter.
KVM, GRU and AFAICT Quadrics are all covered by applying the single 1/12
patch, which should ship with 2.6.26. The risk of breakage from applying
1/12 is zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be
delayed further.
XPMEM support comes with the later patches 2-12; the risk for those patches
is >0, which is why the mmu-notifier-core is numbered 1/12 and not 12/12.
Some of them are simple and can go in immediately, but not all are so simple.
Patches 2-12 are posted as usual for review by the VM developers and so Robin
can keep testing them on XPMEM; they can be merged later without any downside
(they're mostly orthogonal to 1/12).
* [PATCH 01 of 12] Core of mmu notifiers
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208870142 -7200
# Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5
# Parent fb3bc9942fb78629d096bd07564f435d51d86e5f
Core of mmu notifiers.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,27 @@
unsigned long addr, unsigned long len,
unsigned long flags, struct page **pages);
+/*
+ * mm_lock takes mmap_sem for writing (to prevent all modifications
+ * and scanning of vmas) and then also takes the mapping locks for
+ * each of the vmas, to lock out any scans of the pagetables of this
+ * address space. This can be used to effectively hold off reclaim
+ * from the address space.
+ *
+ * mm_lock can fail if there is not enough memory to store a pointer
+ * array for all vmas.
+ *
+ * mm_lock and mm_unlock are expensive operations that may take a long time.
+ */
+struct mm_lock_data {
+ spinlock_t **i_mmap_locks;
+ spinlock_t **anon_vma_locks;
+ size_t nr_i_mmap_locks;
+ size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif
+#ifdef CONFIG_MMU_NOTIFIER
+ struct hlist_head mmu_notifier_list;
+#endif
};
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,229 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+ /*
+ * Called after all other threads have terminated, so the executing
+ * thread is the only remaining execution thread and there are no
+ * other users of the mm_struct.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * clear_flush_young is called after the VM test-and-clears
+ * the young/accessed bitflag in the pte. This way the VM
+ * provides proper aging for accesses to the page through the
+ * secondary MMUs and not only for the ones through the
+ * Linux pte.
+ */
+ int (*clear_flush_young)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * Before this is invoked, any secondary MMU is still allowed to
+ * read/write the page previously pointed to by the Linux pte,
+ * because the old page hasn't been freed yet. If required,
+ * set_page_dirty has to be called internally by this method.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_start() and invalidate_range_end() must be
+ * paired and are called only while the mmap_sem and/or the
+ * semaphores protecting the reverse maps are held. Both
+ * functions may sleep. The subsystem must guarantee that no
+ * additional references to the pages in the range are
+ * established between the call to invalidate_range_start()
+ * and the matching call to invalidate_range_end().
+ *
+ * Invalidation of multiple concurrent ranges may be permitted
+ * by the driver, or the driver may exclude other invalidations
+ * from proceeding by blocking new invalidate_range_start()
+ * callbacks that overlap invalidations already in progress.
+ * Either way, the establishment of sptes for the range can
+ * only be allowed once all pending invalidate_range_end()
+ * calls have completed.
+ *
+ * invalidate_range_start() is called when all pages in the
+ * range are still mapped and have at least a refcount of one.
+ *
+ * invalidate_range_end() is called when all pages in the
+ * range have been unmapped and the pages have been freed by
+ * the VM.
+ *
+ * The VM will remove the page table entries and potentially
+ * the page between invalidate_range_start() and
+ * invalidate_range_end(). If the page must not be freed
+ * because of pending I/O or other circumstances then the
+ * invalidate_range_start() callback (or the initial mapping
+ * by the driver) must make sure that the refcount is kept
+ * elevated.
+ *
+ * If the driver increases the refcount when the pages are
+ * initially mapped into an address space then either
+ * invalidate_range_start() or invalidate_range_end() may
+ * decrease the refcount. If the refcount is decreased on
+ * invalidate_range_start() then the VM can free pages as page
+ * table entries are removed. If the refcount is only
+ * dropped on invalidate_range_end() then the driver itself
+ * will drop the last refcount, but it must take care to flush
+ * any secondary TLB before doing the final free on the
+ * page. Pages will no longer be referenced by the Linux
+ * address space but may still be referenced by sptes until
+ * the last refcount is dropped.
+ */
+ void (*invalidate_range_start)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+/*
+ * The notifier chains are protected by the mmap_sem and/or the reverse
+ * map semaphores. Notifier chains are only changed while the mmap_sem
+ * and all the reverse map locks are held.
+ *
+ * Therefore notifier chains can only be traversed when either
+ *
+ * 1. mmap_sem is held.
+ * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem).
+ * 3. No other concurrent thread can access the list (release)
+ */
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+ return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
+extern int mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern int mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_flush_young(mm, address);
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_start(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+ INIT_HLIST_HEAD(&mm->mmu_notifier_list);
+}
+
+#define ptep_clear_flush_notify(__vma, __address, __ptep) \
+({ \
+ pte_t __pte; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __pte = ptep_clear_flush(___vma, ___address, __ptep); \
+ mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
+ __pte; \
+})
+
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
+ ___address); \
+ __young; \
+})
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+}
+
+#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -362,6 +363,7 @@
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_mm_init(mm);
return mm;
}
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,4 +33,5 @@
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,7 +194,7 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
@@ -214,7 +215,9 @@
spin_unlock(&mapping->i_mmap_lock);
}
+ mmu_notifier_invalidate_range_start(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
+ mmu_notifier_invalidate_range_end(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -799,6 +800,7 @@
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);
+ mmu_notifier_invalidate_range_start(mm, start, end);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -819,6 +821,7 @@
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -611,6 +612,9 @@
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
@@ -621,6 +625,11 @@
vma, addr, next))
return -ENOMEM;
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_end(src_mm,
+ vma->vm_start, end);
+
return 0;
}
@@ -825,7 +834,9 @@
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
int fullmm = (*tlbp)->fullmm;
+ struct mm_struct *mm = vma->vm_mm;
+ mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -876,6 +887,7 @@
}
}
out:
+ mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
}
@@ -1463,10 +1475,11 @@
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;
BUG_ON(addr >= end);
+ mmu_notifier_invalidate_range_start(mm, start, end);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1474,6 +1487,7 @@
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1675,7 +1689,7 @@
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush(vma, address, page_table);
+ ptep_clear_flush_notify(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,9 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/vmalloc.h>
+#include <linux/sort.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2038,6 +2041,7 @@
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ mmu_notifier_release(mm);
lru_add_drain();
flush_cache_mm(mm);
@@ -2242,3 +2246,143 @@
return 0;
}
+
+static int mm_lock_cmp(const void *a, const void *b)
+{
+ unsigned long la = (unsigned long)*(spinlock_t **)a;
+ unsigned long lb = (unsigned long)*(spinlock_t **)b;
+
+ cond_resched();
+ /* compare the lock pointers themselves, not the array slots */
+ if (la < lb)
+ return -1;
+ else if (la == lb)
+ return 0;
+ else
+ return 1;
+}
+
+static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+ int anon)
+{
+ struct vm_area_struct *vma;
+ size_t i = 0;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (anon) {
+ if (vma->anon_vma)
+ locks[i++] = &vma->anon_vma->lock;
+ } else {
+ if (vma->vm_file && vma->vm_file->f_mapping)
+ locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ }
+ }
+
+ if (!i)
+ goto out;
+
+ sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+
+out:
+ return i;
+}
+
+static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 1);
+}
+
+static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 0);
+}
+
+static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+{
+ spinlock_t *last = NULL;
+ size_t i;
+
+ for (i = 0; i < nr; i++)
+ /* Multiple vmas may use the same lock. */
+ if (locks[i] != last) {
+ BUG_ON((unsigned long) last > (unsigned long) locks[i]);
+ last = locks[i];
+ if (lock)
+ spin_lock(last);
+ else
+ spin_unlock(last);
+ }
+}
+
+static inline void __mm_lock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 1);
+}
+
+static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 0);
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ spinlock_t **anon_vma_locks, **i_mmap_locks;
+
+ down_write(&mm->mmap_sem);
+ if (mm->map_count) {
+ anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!anon_vma_locks)) {
+ up_write(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!i_mmap_locks)) {
+ up_write(&mm->mmap_sem);
+ vfree(anon_vma_locks);
+ return -ENOMEM;
+ }
+
+ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
+ data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+
+ if (data->nr_anon_vma_locks) {
+ __mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
+ data->anon_vma_locks = anon_vma_locks;
+ } else
+ vfree(anon_vma_locks);
+
+ if (data->nr_i_mmap_locks) {
+ __mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
+ data->i_mmap_locks = i_mmap_locks;
+ } else
+ vfree(i_mmap_locks);
+ }
+ return 0;
+}
+
+static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+{
+ __mm_unlock(locks, nr);
+ vfree(locks);
+}
+
+/* avoid memory allocations for mm_unlock to prevent deadlock */
+void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ if (mm->map_count) {
+ if (data->nr_anon_vma_locks)
+ mm_unlock_vfree(data->anon_vma_locks,
+ data->nr_anon_vma_locks);
+ if (data->nr_i_mmap_locks)
+ mm_unlock_vfree(data->i_mmap_locks,
+ data->nr_i_mmap_locks);
+ }
+ up_write(&mm->mmap_sem);
+}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,130 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+
+ while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
+ mn = hlist_entry(mm->mmu_notifier_list.first,
+ struct mmu_notifier,
+ hlist);
+ hlist_del(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->clear_flush_young can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+ if (mn->ops->clear_flush_young)
+ young |= mn->ops->clear_flush_young(mn, mm, address);
+ }
+
+ return young;
+}
+
+void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+
+ hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+ if (mn->ops->invalidate_page)
+ mn->ops->invalidate_page(mn, mm, address);
+ }
+}
+
+void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+
+ hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+ if (mn->ops->invalidate_range_start)
+ mn->ops->invalidate_range_start(mn, mm, start, end);
+ }
+}
+
+void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+
+ hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+ if (mn->ops->invalidate_range_end)
+ mn->ops->invalidate_range_end(mn, mm, start, end);
+ }
+}
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function.
+ */
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct mm_lock_data data;
+ int ret;
+
+ ret = mm_lock(mm, &data);
+ if (unlikely(ret))
+ goto out;
+ hlist_add_head(&mn->hlist, &mm->mmu_notifier_list);
+ mm_unlock(mm, &data);
+out:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * mm_users can't go down to zero while mmu_notifier_unregister()
+ * runs, or it could race with ->release. So an mm_users pin must
+ * be held by the caller (if mm can be different from current->mm).
+ */
+int mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct mm_lock_data data;
+ int ret;
+
+ BUG_ON(!atomic_read(&mm->mm_users));
+
+ ret = mm_lock(mm, &data);
+ if (unlikely(ret))
+ goto out;
+ hlist_del(&mn->hlist);
+ mm_unlock(mm, &data);
+out:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -198,10 +199,12 @@
dirty_accountable = 1;
}
+ mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -74,7 +75,11 @@
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ unsigned long old_start;
+ old_start = old_addr;
+ mmu_notifier_invalidate_range_start(vma->vm_mm,
+ old_start, old_end);
if (vma->vm_file) {
/*
* Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +121,7 @@
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>
@@ -287,7 +288,7 @@
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ } else if (ptep_clear_flush_young_notify(vma, address, pte))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -456,7 +457,7 @@
pte_t entry;
flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush(vma, address, pte);
+ entry = ptep_clear_flush_notify(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -717,14 +718,14 @@
* skipped over this mm) then we should reactivate it.
*/
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
+ (ptep_clear_flush_young_notify(vma, address, pte)))) {
ret = SWAP_FAIL;
goto out_unmap;
}
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -849,12 +850,12 @@
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
- if (ptep_clear_flush_young(vma, address, pte))
+ if (ptep_clear_flush_young_notify(vma, address, pte))
continue;
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
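To illustrate the interface this patch introduces: a secondary-MMU driver embeds a
struct mmu_notifier in its own state, points it at a struct mmu_notifier_ops and
registers it against the mm it shadows. A minimal sketch, assuming the patch above
is applied; the my_* names are hypothetical and the callback bodies are stubs:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_smmu {
	struct mmu_notifier mn;
	/* ... driver-private shadow pagetable state ... */
};

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct my_smmu *smmu = container_of(mn, struct my_smmu, mn);

	/* Zap all sptes covering [start, end) and flush the secondary TLB
	 * before returning, so no stale translations survive. */
	(void)smmu;
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* Last user of the mm is exiting: tear down all sptes. */
}

static const struct mmu_notifier_ops my_smmu_ops = {
	.release		= my_release,
	.invalidate_range_start	= my_invalidate_range_start,
	/* .invalidate_range_end, .invalidate_page and .clear_flush_young
	 * are optional and left NULL in this sketch. */
};

/* Must be called without mmap_sem or any other VM lock held, because
 * mmu_notifier_register() takes them all through mm_lock(). */
static int my_smmu_attach(struct my_smmu *smmu, struct mm_struct *mm)
{
	smmu->mn.ops = &my_smmu_ops;
	return mmu_notifier_register(&smmu->mn, mm); /* -ENOMEM if mm_lock fails */
}

mmu_notifier_unregister() follows the same locking rule, and if mm is not
current->mm the caller must additionally hold an mm_users reference so that
->release cannot run concurrently.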
* [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID 3c804dca25b15017b22008647783d6f5f3801fa9
# Parent ea87c15371b1bd49380c40c3f15f1c7ca4438af5
Fix ia64 compilation failure because of common code include bug.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/cpumask.h>
#include <asm/page.h>
#include <asm/mmu.h>
* [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent 3c804dca25b15017b22008647783d6f5f3801fa9
get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct, because that pointer is only cleared
after mmput().
If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
if (task->flags & PF_BORROWED_MM)
mm = NULL;
else
- atomic_inc(&mm->mm_users);
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ mm = NULL;
}
task_unlock(task);
return mm;
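The atomic_inc_not_zero() idiom used here applies to any refcounted object whose
pointer may still be reachable while the last reference is being dropped: only
take a new reference if the count is provably non-zero. A stand-alone sketch of
the pattern, with a hypothetical my_obj standing in for mm_struct/mm_users:

#include <linux/slab.h>
#include <asm/atomic.h>

struct my_obj {
	atomic_t users;		/* plays the role of mm->mm_users */
	/* ... payload ... */
};

/* Analogous to mmput(): drop a reference, free on the last one. */
static void my_obj_put(struct my_obj *obj)
{
	if (atomic_dec_and_test(&obj->users))
		kfree(obj);
}

/*
 * Analogous to the fixed get_task_mm(): succeed only while the object
 * is still alive, i.e. the count has not already reached zero on the
 * teardown path. A plain atomic_inc() here could "resurrect" an object
 * that is concurrently being freed.
 */
static struct my_obj *my_obj_get(struct my_obj *obj)
{
	if (!atomic_inc_not_zero(&obj->users))
		return NULL;
	return obj;
}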
* [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID ac9bb1fb3de2aa5d27210a28edf24f6577094076
# Parent a6672bdeead0d41b2ebd6846f731d43a611645b7
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -169,27 +169,6 @@
INIT_HLIST_HEAD(&mm->mmu_notifier_list);
}
-#define ptep_clear_flush_notify(__vma, __address, __ptep) \
-({ \
- pte_t __pte; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
- __pte; \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address); \
- __young; \
-})
-
#else /* CONFIG_MMU_NOTIFIER */
static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -221,9 +200,6 @@
{
}
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
#endif /* CONFIG_MMU_NOTIFIER */
#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,11 +194,13 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+ /* must invalidate_page _before_ freeing the page */
+ mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,9 +1627,10 @@
*/
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
- page_cache_release(old_page);
+ new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+ page_cache_release(old_page);
page_mkwrite = 1;
}
@@ -1645,6 +1646,7 @@
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+ old_page = new_page = NULL;
goto unlock;
}
@@ -1689,7 +1691,7 @@
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
+ ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1701,12 +1703,18 @@
} else
mem_cgroup_uncharge_page(new_page);
- if (new_page)
+unlock:
+ pte_unmap_unlock(page_table, ptl);
+
+ if (new_page) {
+ if (new_page == old_page)
+ /* cow happened, notify before releasing old_page */
+ mmu_notifier_invalidate_page(mm, address);
page_cache_release(new_page);
+ }
if (old_page)
page_cache_release(old_page);
-unlock:
- pte_unmap_unlock(page_table, ptl);
+
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -275,7 +275,7 @@
unsigned long address;
pte_t *pte;
spinlock_t *ptl;
- int referenced = 0;
+ int referenced = 0, clear_flush_young = 0;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -288,8 +288,11 @@
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young_notify(vma, address, pte))
- referenced++;
+ } else {
+ clear_flush_young = 1;
+ if (ptep_clear_flush_young(vma, address, pte))
+ referenced++;
+ }
/* Pretend the page is referenced if the task has the
swap token and is in the middle of a page fault. */
@@ -299,6 +302,10 @@
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
+
+ if (clear_flush_young)
+ referenced += mmu_notifier_clear_flush_young(mm, address);
+
out:
return referenced;
}
@@ -457,7 +464,7 @@
pte_t entry;
flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush_notify(vma, address, pte);
+ entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -465,6 +472,10 @@
}
pte_unmap_unlock(pte, ptl);
+
+ if (ret)
+ mmu_notifier_invalidate_page(mm, address);
+
out:
return ret;
}
@@ -717,15 +728,14 @@
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young_notify(vma, address, pte)))) {
+ if (!migration && (vma->vm_flags & VM_LOCKED)) {
ret = SWAP_FAIL;
goto out_unmap;
}
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -780,6 +790,8 @@
out_unmap:
pte_unmap_unlock(pte, ptl);
+ if (ret != SWAP_FAIL)
+ mmu_notifier_invalidate_page(mm, address);
out:
return ret;
}
@@ -818,7 +830,7 @@
spinlock_t *ptl;
struct page *page;
unsigned long address;
- unsigned long end;
+ unsigned long start, end;
address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -839,6 +851,8 @@
if (!pmd_present(*pmd))
return;
+ start = address;
+ mmu_notifier_invalidate_range_start(mm, start, end);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
/* Update high watermark before we lower rss */
@@ -850,12 +864,12 @@
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
- if (ptep_clear_flush_young_notify(vma, address, pte))
+ if (ptep_clear_flush_young(vma, address, pte))
continue;
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
@@ -871,6 +885,7 @@
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
+ mmu_notifier_invalidate_range_end(mm, start, end);
}
static int try_to_unmap_anon(struct page *page, int migration)
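Every hunk above follows the same shape: do the pte manipulation under the page
table spinlock, remember whether anything was flushed, and call the notifier only
after pte_unmap_unlock(), where sleeping is allowed. A condensed sketch of that
shape (the function is hypothetical; only the helpers it calls are real):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <asm/cacheflush.h>

static void zap_one_pte(struct vm_area_struct *vma, unsigned long address,
			pmd_t *pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pte_t *pte;
	int flushed = 0;

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (pte_present(*pte)) {
		flush_cache_page(vma, address, pte_pfn(*pte));
		ptep_clear_flush(vma, address, pte);	/* plain, no _notify */
		flushed = 1;
	}
	pte_unmap_unlock(pte, ptl);

	/* The atomic section is over: sleeping notifier methods are safe. */
	if (flushed)
		mmu_notifier_invalidate_page(mm, address);
}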
* [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.
Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.
The first pointer argument to free_pgtables() can then be dropped.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+ unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
}
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling)
{
+ struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
unlink_file_vma(vma);
if (is_vm_hugetlb_page(vma)) {
- hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
- free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+ tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1752,9 +1752,9 @@
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+ tlb_finish_mmu(tlb, start, end);
+ free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
- tlb_finish_mmu(tlb, start, end);
}
/*
@@ -2050,8 +2050,8 @@
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ free_pgtables(vma, FIRST_USER_ADDRESS, 0);
/*
* Walk the list again, actually closing and freeing it,
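Condensed, the reworked free_pgtables() runs one tlb_gather_mmu()/tlb_finish_mmu()
cycle per vma instead of reusing the caller's mmu_gather, so the vma unlinking
never runs with pages queued for a TLB flush. A sketch of that structure, with the
hugetlb case and the anon_vma_unlink()/unlink_file_vma() calls omitted:

#include <linux/mm.h>
#include <asm/tlb.h>

void free_pgtables_sketch(struct vm_area_struct *vma, unsigned long floor,
			  unsigned long ceiling)
{
	while (vma) {
		struct vm_area_struct *next = vma->vm_next;
		struct mmu_gather *tlb;

		/* In the real function anon_vma_unlink()/unlink_file_vma()
		 * run here, outside any mmu_gather, so they are free to
		 * take sleeping rmap locks. */
		tlb = tlb_gather_mmu(vma->vm_mm, 0);
		free_pgd_range(&tlb, vma->vm_start, vma->vm_end,
			       floor, next ? next->vm_start : ceiling);
		tlb_finish_mmu(tlb, vma->vm_start, vma->vm_end);
		vma = next;
	}
}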
* [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
# Parent ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
- struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,6 @@
/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
* @vma: the starting vma
* @start_addr: virtual address at which to start unmapping
* @end_addr: virtual address at which to end unmapping
@@ -816,20 +815,13 @@
* Unmap all pages in the vma list.
*
* We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
*
* Only addresses between `start' and `end' will be unmapped.
*
* The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns. So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
*/
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
- struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
{
@@ -838,9 +830,14 @@
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
- int fullmm = (*tlbp)->fullmm;
+ int fullmm;
+ struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
+ lru_add_drain();
+ tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
+ fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -867,7 +864,7 @@
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
- start = unmap_page_range(*tlbp, vma,
+ start = unmap_page_range(tlb, vma,
start, end, &zap_work, details);
if (zap_work > 0) {
@@ -875,22 +872,23 @@
break;
}
- tlb_finish_mmu(*tlbp, tlb_start, start);
+ tlb_finish_mmu(tlb, tlb_start, start);
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
- *tlbp = NULL;
+ tlb = NULL;
goto out;
}
cond_resched();
}
- *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+ tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+ tlb_finish_mmu(tlb, start_addr, end_addr);
out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
@@ -906,18 +904,10 @@
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
- tlb_finish_mmu(tlb, address, end);
- return end;
+ return unmap_vmas(vma, address, end, &nr_accounted, details);
}
/*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1744,15 +1744,10 @@
unsigned long start, unsigned long end)
{
struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
- struct mmu_gather *tlb;
unsigned long nr_accounted = 0;
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, start, end);
free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
}
@@ -2034,7 +2029,6 @@
/* Release all mmaps. */
void exit_mmap(struct mm_struct *mm)
{
- struct mmu_gather *tlb;
struct vm_area_struct *vma = mm->mmap;
unsigned long nr_accounted = 0;
unsigned long end;
@@ -2045,12 +2039,11 @@
lru_add_drain();
flush_cache_mm(mm);
- tlb = tlb_gather_mmu(mm, 1);
+
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, 0, end);
free_pgtables(vma, FIRST_USER_ADDRESS, 0);
/*
--
* [PATCH 07 of 12] Add a function to rw_semaphores to check if there are any processes
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID 8965539f4d174c79bd37e58e8b037d5db906e219
# Parent fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak() to sched.h; it works in
the same way as spin_needbreak().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -59,6 +59,8 @@
*/
extern void downgrade_write(struct rw_semaphore *sem);
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1984,6 +1984,15 @@
#endif
}
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+ return rwsem_is_contended(sem);
+#else
+ return 0;
+#endif
+}
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@
spin_unlock_irqrestore(&sem->wait_lock, flags);
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. A false positive may cause a useless dropping of
+ * locks; false negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(__init_rwsem);
EXPORT_SYMBOL(__down_read);
EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@
return sem;
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. A false positive may cause a useless dropping of
+ * locks; false negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(rwsem_down_read_failed);
EXPORT_SYMBOL(rwsem_down_write_failed);
EXPORT_SYMBOL(rwsem_wake);
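rwsem_needbreak() is meant to be used the way spin_needbreak() already is: a long
traversal that holds the semaphore voluntarily drops it when another task is
queued behind it. A hypothetical user of the new helper, not part of this patch:

#include <linux/rwsem.h>
#include <linux/sched.h>

/*
 * Walk a long structure protected by 'sem', backing off whenever another
 * task is blocked on the semaphore. 'visit' processes one element and
 * returns the next one (or NULL when done).
 */
static void walk_with_lock_breaks(struct rw_semaphore *sem,
				  void *(*visit)(void *), void *cursor)
{
	down_read(sem);
	while (cursor) {
		cursor = visit(cursor);

		if (cursor && (rwsem_needbreak(sem) || need_resched())) {
			up_read(sem);
			cond_resched();
			down_read(sem);
			/* The structure may have changed while unlocked;
			 * a real walker must revalidate 'cursor' here. */
		}
	}
	up_read(sem);
}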
* [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292
# Parent 8965539f4d174c79bd37e58e8b037d5db906e219
The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A read/write style lock also allows concurrent walking of the
reverse map, so that multiple processors can expire pages in the same memory
area of the same process. This increases the potential concurrency.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
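The locking rule after this conversion, as visible in the hunks below, is that
walkers of the i_mmap prio tree take the semaphore for reading (and may now sleep,
e.g. in notifier methods), while anything that inserts into or removes from the
tree or the nonlinear list takes it for writing. Schematically, assuming the
converted struct address_space from this patch (flush_dcache_mmap_lock/unlock
omitted):

#include <linux/fs.h>
#include <linux/mm.h>

/* Reader side: visit every vma mapping the file page at 'pgoff'. */
static void visit_mappers(struct address_space *mapping, pgoff_t pgoff)
{
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;

	down_read(&mapping->i_mmap_sem);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		/* Sleeping is legal here, unlike under the old spinlock. */
		(void)vma;
	}
	up_read(&mapping->i_mmap_sem);
}

/* Writer side: editing the tree or the nonlinear list is exclusive. */
static void make_vma_nonlinear(struct address_space *mapping,
			       struct vm_area_struct *vma)
{
	down_write(&mapping->i_mmap_sem);
	vma_prio_tree_remove(vma, &mapping->i_mmap);
	vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
	up_write(&mapping->i_mmap_sem);
}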
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
if (!vma_shareable(vma, addr))
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
}
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
- spin_lock_init(&inode->i_data.i_mmap_lock);
+ init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -503,7 +503,7 @@
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- spinlock_t i_mmap_lock; /* protect tree, count, list */
+ struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
+ struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
};
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,12 +274,12 @@
atomic_dec(&inode->i_writecount);
/* insert tmp into the share list, just after mpnt */
- spin_lock(&file->f_mapping->i_mmap_lock);
+ down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
- spin_unlock(&file->f_mapping->i_mmap_lock);
+ up_write(&file->f_mapping->i_mmap_sem);
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@
/*
* Lock ordering:
*
- * ->i_mmap_lock (vmtruncate)
+ * ->i_mmap_sem (vmtruncate)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
* ->mapping->tree_lock
*
* ->i_mutex
- * ->i_mmap_lock (truncate->unmap_mapping_range)
+ * ->i_mmap_sem (truncate->unmap_mapping_range)
*
* ->mmap_sem
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
@@ -87,7 +87,7 @@
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->anon_vma.lock (vma_adjust)
*
* ->anon_vma.lock
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -184,7 +184,7 @@
if (!page)
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -204,7 +204,7 @@
page_cache_release(page);
}
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -206,13 +206,13 @@
}
goto out;
}
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_prio_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
mmu_notifier_invalidate_range_start(mm, start, start + size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -790,7 +790,7 @@
struct page *page;
struct page *tmp;
/*
- * A page gathering list, protected by per file i_mmap_lock. The
+ * A page gathering list, protected by per file i_mmap_sem. The
* lock is used to avoid list corruption from multiple unmapping
* of the same page since we are using page->lru.
*/
@@ -840,9 +840,9 @@
* do nothing in this case.
*/
if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
__unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);
}
}
@@ -1085,7 +1085,7 @@
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
spin_lock(&mm->page_table_lock);
for (; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -1100,7 +1100,7 @@
}
}
spin_unlock(&mm->page_table_lock);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);
flush_tlb_range(vma, start, end);
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -829,7 +829,7 @@
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL;
int fullmm;
struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
@@ -875,8 +875,8 @@
tlb_finish_mmu(tlb, tlb_start, start);
if (need_resched() ||
- (i_mmap_lock && spin_needbreak(i_mmap_lock))) {
- if (i_mmap_lock) {
+ (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) {
+ if (i_mmap_sem) {
tlb = NULL;
goto out;
}
@@ -1742,7 +1742,7 @@
/*
* Helper functions for unmap_mapping_range().
*
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
*
* We have to restart searching the prio_tree whenever we drop the lock,
* since the iterator is only valid while the lock is held, and anyway
@@ -1761,7 +1761,7 @@
* can't efficiently keep all vmas in step with mapping->truncate_count:
* so instead reset them all whenever it wraps back to 0 (then go to 1).
* mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
*
* In order to make forward progress despite repeatedly restarting some
* large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1811,7 +1811,7 @@
restart_addr = zap_page_range(vma, start_addr,
end_addr - start_addr, details);
- need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+ need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem);
if (restart_addr >= end_addr) {
/* We have now completed this vma: mark it so */
@@ -1825,9 +1825,9 @@
goto again;
}
- spin_unlock(details->i_mmap_lock);
+ up_write(details->i_mmap_sem);
cond_resched();
- spin_lock(details->i_mmap_lock);
+ down_write(details->i_mmap_sem);
return -EINTR;
}
@@ -1921,9 +1921,9 @@
details.last_index = hba + hlen - 1;
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
- details.i_mmap_lock = &mapping->i_mmap_lock;
+ details.i_mmap_sem = &mapping->i_mmap_sem;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
/* Protect against endless unmapping loops */
mapping->truncate_count++;
@@ -1938,7 +1938,7 @@
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
EXPORT_SYMBOL(unmap_mapping_range);
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,12 +211,12 @@
if (!mapping)
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
remove_migration_pte(vma, old, new);
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,7 @@
}
/*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -217,9 +217,9 @@
if (file) {
struct address_space *mapping = file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
__remove_shared_vm_struct(vma, file, mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
}
@@ -442,7 +442,7 @@
mapping = vma->vm_file->f_mapping;
if (mapping) {
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
vma->vm_truncate_count = mapping->truncate_count;
}
anon_vma_lock(vma);
@@ -452,7 +452,7 @@
anon_vma_unlock(vma);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
mm->map_count++;
validate_mm(mm);
@@ -539,7 +539,7 @@
mapping = file->f_mapping;
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (importer &&
vma->vm_truncate_count != next->vm_truncate_count) {
/*
@@ -623,7 +623,7 @@
if (anon_vma)
spin_unlock(&anon_vma->lock);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
if (remove_next) {
if (file)
@@ -2058,7 +2058,7 @@
/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
*/
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -88,7 +88,7 @@
* and we propagate stale pages into the dst afterward.
*/
mapping = vma->vm_file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (new_vma->vm_truncate_count &&
new_vma->vm_truncate_count != vma->vm_truncate_count)
new_vma->vm_truncate_count = 0;
@@ -120,7 +120,7 @@
pte_unmap_nested(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
}
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
- * mapping->i_mmap_lock
+ * mapping->i_mmap_sem
* anon_vma->lock
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -373,14 +373,14 @@
* The page lock not only makes sure that page->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the
* structure at mapping cannot be freed and reused yet,
- * so we can safely take mapping->i_mmap_lock.
+ * so we can safely take mapping->i_mmap_sem.
*/
BUG_ON(!PageLocked(page));
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
/*
- * i_mmap_lock does not stabilize mapcount at all, but mapcount
+ * i_mmap_sem does not stabilize mapcount at all, but mapcount
* is more likely to be accurate if we note it after spinning.
*/
mapcount = page_mapcount(page);
@@ -403,7 +403,7 @@
break;
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return referenced;
}
@@ -489,12 +489,12 @@
BUG_ON(PageAnon(page));
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED)
ret += page_mkclean_one(page, vma);
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return ret;
}
@@ -930,7 +930,7 @@
unsigned long max_nl_size = 0;
unsigned int mapcount;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
@@ -967,7 +967,6 @@
mapcount = page_mapcount(page);
if (!mapcount)
goto out;
- cond_resched_lock(&mapping->i_mmap_lock);
max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -989,7 +988,6 @@
}
vma->vm_private_data = (void *) max_nl_cursor;
}
- cond_resched_lock(&mapping->i_mmap_lock);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);
@@ -1001,7 +999,7 @@
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
return ret;
}
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (7 preceding siblings ...)
2008-04-22 13:51 ` [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal Andrea Arcangeli
@ 2008-04-22 13:51 ` Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock Andrea Arcangeli
` (4 subsequent siblings)
13 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent 6e04df1f4284689b1c46e57a67559abe49ecf292
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows sleeping functions to be called from reverse map traversal, as
needed for the notifier callbacks, and it increases the possible concurrency.
RCU is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock, but we cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.
The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.
The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
However:
- Atomic overhead increases in situations where a new reference
to the anon_vma has to be established or removed. Overhead also increases
when a speculative reference is used (try_to_unmap,
page_mkclean, page migration).
- There is the potential for more frequent processor changes due to up_xxx
letting waiting tasks run first. This results in, f.e., the Aim9 brk
performance test going down by 10-15%.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
* pointing to this anon_vma once its vma list is empty.
*/
struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ atomic_t refcount; /* vmas on the list */
+ struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
};
@@ -43,18 +44,31 @@
kmem_cache_free(anon_vma_cachep, anon_vma);
}
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+ atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+ if (atomic_dec_and_test(&anon_vma->refcount))
+ anon_vma_free(anon_vma);
+}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
}
static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}
/*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@
return;
/*
- * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+ * We hold either the mmap_sem lock or a reference on the
+ * anon_vma. So no need to call page_lock_anon_vma.
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
+ down_read(&anon_vma->sem);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
- spin_unlock(&anon_vma->lock);
+ up_read(&anon_vma->sem);
}
/*
@@ -623,7 +624,7 @@
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
- int rcu_locked = 0;
+ struct anon_vma *anon_vma = NULL;
int charge = 0;
if (!newpage)
@@ -647,16 +648,14 @@
}
/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
- * we cannot notice that anon_vma is freed while we migrates a page.
+ * we cannot notice that anon_vma is freed while we migrate a page.
* This rcu_read_lock() delays freeing anon_vma pointer until the end
* of migration. File cache pages are no problem because of page_lock()
* File Caches may use write_page() or lock_page() in migration, then,
* just care Anon page here.
*/
- if (PageAnon(page)) {
- rcu_read_lock();
- rcu_locked = 1;
- }
+ if (PageAnon(page))
+ anon_vma = grab_anon_vma(page);
/*
* Corner case handling:
@@ -674,10 +673,7 @@
if (!PageAnon(page) && PagePrivate(page)) {
/*
* Go direct to try_to_free_buffers() here because
- * a) that's what try_to_release_page() would do anyway
- * b) we may be under rcu_read_lock() here, so we can't
- * use GFP_KERNEL which is what try_to_release_page()
- * needs to be effective.
+ * that's what try_to_release_page() would do anyway
*/
try_to_free_buffers(page);
}
@@ -698,8 +694,8 @@
} else if (charge)
mem_cgroup_end_migration(newpage);
rcu_unlock:
- if (rcu_locked)
- rcu_read_unlock();
+ if (anon_vma)
+ put_anon_vma(anon_vma);
unlock:
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -567,7 +567,7 @@
if (vma->anon_vma)
anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
@@ -621,7 +621,7 @@
}
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
if (mapping)
up_write(&mapping->i_mmap_sem);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -69,7 +69,7 @@
if (anon_vma) {
allocated = NULL;
locked = anon_vma;
- spin_lock(&locked->lock);
+ down_write(&locked->sem);
} else {
anon_vma = anon_vma_alloc();
if (unlikely(!anon_vma))
@@ -81,6 +81,7 @@
/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
+ get_anon_vma(anon_vma);
vma->anon_vma = anon_vma;
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
allocated = NULL;
@@ -88,7 +89,7 @@
spin_unlock(&mm->page_table_lock);
if (locked)
- spin_unlock(&locked->lock);
+ up_write(&locked->sem);
if (unlikely(allocated))
anon_vma_free(allocated);
}
@@ -99,14 +100,17 @@
{
BUG_ON(vma->anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
+ put_anon_vma(vma->anon_vma);
}
void __anon_vma_link(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
+ if (anon_vma) {
+ get_anon_vma(anon_vma);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ }
}
void anon_vma_link(struct vm_area_struct *vma)
@@ -114,36 +118,32 @@
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ get_anon_vma(anon_vma);
+ down_write(&anon_vma->sem);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}
}
void anon_vma_unlink(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- int empty;
if (!anon_vma)
return;
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
list_del(&vma->anon_vma_node);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- spin_unlock(&anon_vma->lock);
-
- if (empty)
- anon_vma_free(anon_vma);
+ up_write(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}
static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
{
struct anon_vma *anon_vma = data;
- spin_lock_init(&anon_vma->lock);
+ init_rwsem(&anon_vma->sem);
+ atomic_set(&anon_vma->refcount, 0);
INIT_LIST_HEAD(&anon_vma->head);
}
@@ -157,9 +157,9 @@
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma(struct page *page)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma = NULL;
unsigned long anon_mapping;
rcu_read_lock();
@@ -170,17 +170,26 @@
goto out;
anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
- return anon_vma;
+ if (!atomic_inc_not_zero(&anon_vma->refcount))
+ anon_vma = NULL;
out:
rcu_read_unlock();
- return NULL;
+ return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma = grab_anon_vma(page);
+
+ if (anon_vma)
+ down_read(&anon_vma->sem);
+ return anon_vma;
}
static void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
- spin_unlock(&anon_vma->lock);
- rcu_read_unlock();
+ up_read(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}
/*
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (8 preceding siblings ...)
2008-04-22 13:51 ` [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent Andrea Arcangeli
@ 2008-04-22 13:51 ` Andrea Arcangeli
2008-04-22 20:26 ` Christoph Lameter
2008-04-22 13:51 ` [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed() Andrea Arcangeli
` (3 subsequent siblings)
13 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
# Parent bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1062,10 +1062,10 @@
* mm_lock and mm_unlock are expensive operations that may take a long time.
*/
struct mm_lock_data {
- spinlock_t **i_mmap_locks;
- spinlock_t **anon_vma_locks;
- size_t nr_i_mmap_locks;
- size_t nr_anon_vma_locks;
+ struct rw_semaphore **i_mmap_sems;
+ struct rw_semaphore **anon_vma_sems;
+ size_t nr_i_mmap_sems;
+ size_t nr_anon_vma_sems;
};
extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,8 +2243,8 @@
static int mm_lock_cmp(const void *a, const void *b)
{
cond_resched();
- if ((unsigned long)*(spinlock_t **)a <
- (unsigned long)*(spinlock_t **)b)
+ if ((unsigned long)*(struct rw_semaphore **)a <
+ (unsigned long)*(struct rw_semaphore **)b)
return -1;
else if (a == b)
return 0;
@@ -2252,7 +2252,7 @@
return 1;
}
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
int anon)
{
struct vm_area_struct *vma;
@@ -2261,59 +2261,59 @@
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
- locks[i++] = &vma->anon_vma->lock;
+ sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
- locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
if (!i)
goto out;
- sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+ sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
out:
return i;
}
static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
{
- return mm_lock_sort(mm, locks, 1);
+ return mm_lock_sort(mm, sems, 1);
}
static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
{
- return mm_lock_sort(mm, locks, 0);
+ return mm_lock_sort(mm, sems, 0);
}
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
{
- spinlock_t *last = NULL;
+ struct rw_semaphore *last = NULL;
size_t i;
for (i = 0; i < nr; i++)
/* Multiple vmas may use the same lock. */
- if (locks[i] != last) {
- BUG_ON((unsigned long) last > (unsigned long) locks[i]);
- last = locks[i];
+ if (sems[i] != last) {
+ BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+ last = sems[i];
if (lock)
- spin_lock(last);
+ down_write(last);
else
- spin_unlock(last);
+ up_write(last);
}
}
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
{
- mm_lock_unlock(locks, nr, 1);
+ mm_lock_unlock(sems, nr, 1);
}
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
{
- mm_lock_unlock(locks, nr, 0);
+ mm_lock_unlock(sems, nr, 0);
}
/*
@@ -2325,57 +2325,57 @@
*/
int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
{
- spinlock_t **anon_vma_locks, **i_mmap_locks;
+ struct rw_semaphore **anon_vma_sems, **i_mmap_sems;
down_write(&mm->mmap_sem);
if (mm->map_count) {
- anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
- if (unlikely(!anon_vma_locks)) {
+ anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+ if (unlikely(!anon_vma_sems)) {
up_write(&mm->mmap_sem);
return -ENOMEM;
}
- i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
- if (unlikely(!i_mmap_locks)) {
+ i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+ if (unlikely(!i_mmap_sems)) {
up_write(&mm->mmap_sem);
- vfree(anon_vma_locks);
+ vfree(anon_vma_sems);
return -ENOMEM;
}
- data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
- data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+ data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems);
+ data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems);
- if (data->nr_anon_vma_locks) {
- __mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
- data->anon_vma_locks = anon_vma_locks;
+ if (data->nr_anon_vma_sems) {
+ __mm_lock(anon_vma_sems, data->nr_anon_vma_sems);
+ data->anon_vma_sems = anon_vma_sems;
} else
- vfree(anon_vma_locks);
+ vfree(anon_vma_sems);
- if (data->nr_i_mmap_locks) {
- __mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
- data->i_mmap_locks = i_mmap_locks;
+ if (data->nr_i_mmap_sems) {
+ __mm_lock(i_mmap_sems, data->nr_i_mmap_sems);
+ data->i_mmap_sems = i_mmap_sems;
} else
- vfree(i_mmap_locks);
+ vfree(i_mmap_sems);
}
return 0;
}
-static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr)
{
- __mm_unlock(locks, nr);
- vfree(locks);
+ __mm_unlock(sems, nr);
+ vfree(sems);
}
/* avoid memory allocations for mm_unlock to prevent deadlock */
void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
{
if (mm->map_count) {
- if (data->nr_anon_vma_locks)
- mm_unlock_vfree(data->anon_vma_locks,
- data->nr_anon_vma_locks);
- if (data->i_mmap_locks)
- mm_unlock_vfree(data->i_mmap_locks,
- data->nr_i_mmap_locks);
+ if (data->nr_anon_vma_sems)
+ mm_unlock_vfree(data->anon_vma_sems,
+ data->nr_anon_vma_sems);
+ if (data->i_mmap_sems)
+ mm_unlock_vfree(data->i_mmap_sems,
+ data->nr_i_mmap_sems);
}
up_write(&mm->mmap_sem);
}
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (9 preceding siblings ...)
2008-04-22 13:51 ` [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock Andrea Arcangeli
@ 2008-04-22 13:51 ` Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when Andrea Arcangeli
` (2 subsequent siblings)
13 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID 128d705f38c8a774ac11559db445787ce6e91c77
# Parent f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator. XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.
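A hypothetical caller sketch (not taken from XPMEM itself, just to show
the intended use of the export; the function and its range arguments are
made up for the example):

	/* hypothetical example, not XPMEM code */
	static void example_zap_pfnmap(struct mm_struct *mm,
				       unsigned long start, unsigned long end)
	{
		struct vm_area_struct *vma;

		down_read(&mm->mmap_sem);
		vma = find_vma(mm, start);
		if (vma && (vma->vm_flags & VM_PFNMAP))
			zap_page_range(vma, start, end - start, NULL);
		up_read(&mm->mmap_sem);
	}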
Signed-off-by: Dean Nelson <dcn@sgi.com>
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -909,6 +909,7 @@
return unmap_vmas(vma, address, end, &nr_accounted, details);
}
+EXPORT_SYMBOL_GPL(zap_page_range);
/*
* Do a quick page-table lookup for a single page.
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (10 preceding siblings ...)
2008-04-22 13:51 ` [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed() Andrea Arcangeli
@ 2008-04-22 13:51 ` Andrea Arcangeli
2008-04-22 18:22 ` [PATCH 00 of 12] mmu notifier #v13 Robin Holt
2008-04-23 0:31 ` Jack Steiner
13 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 13:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID e847039ee2e815088661933b7195584847dc7540
# Parent 128d705f38c8a774ac11559db445787ce6e91c77
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.
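For illustration, a hypothetical helper following the rule (a sketch only,
ignoring lockdep nesting annotations) could look like:

	static void lock_two_mms(struct mm_struct *mm1, struct mm_struct *mm2)
	{
		/* lock the lowest-addressed mmap_sem first */
		if ((unsigned long)mm1 > (unsigned long)mm2) {
			struct mm_struct *tmp = mm1;
			mm1 = mm2;
			mm2 = tmp;
		}
		down_write(&mm1->mmap_sem);
		if (mm2 != mm1)
			down_write(&mm2->mmap_sem);
	}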
Signed-off-by: Dean Nelson <dcn@sgi.com>
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
*
* ->i_mutex (generic_file_buffered_write)
* ->mmap_sem (fault_in_pages_readable->do_page_fault)
+ *
+ * When taking multiple mmap_sems, one should lock the lowest-addressed
+ * one first proceeding on up to the highest-addressed one.
*
* ->i_mutex
* ->i_alloc_sem (various)
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 13:51 ` [PATCH 01 of 12] Core of mmu notifiers Andrea Arcangeli
@ 2008-04-22 14:56 ` Eric Dumazet
2008-04-22 15:15 ` Andrea Arcangeli
2008-04-22 20:19 ` Christoph Lameter
2008-04-23 17:09 ` Jack Steiner
2 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2008-04-22 14:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
Andrea Arcangeli wrote:
> +
> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> + cond_resched();
> + if ((unsigned long)*(spinlock_t **)a <
> + (unsigned long)*(spinlock_t **)b)
> + return -1;
> + else if (a == b)
> + return 0;
> + else
> + return 1;
> +}
> +
This compare function looks unusual...
It should work, but sort() could be faster if the
if (a == b) test had a chance to be true eventually...
static int mm_lock_cmp(const void *a, const void *b)
{
unsigned long la = (unsigned long)*(spinlock_t **)a;
unsigned long lb = (unsigned long)*(spinlock_t **)b;
cond_resched();
if (la < lb)
return -1;
if (la > lb)
return 1;
return 0;
}
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 14:56 ` Eric Dumazet
@ 2008-04-22 15:15 ` Andrea Arcangeli
2008-04-22 15:24 ` Avi Kivity
2008-04-22 15:37 ` Eric Dumazet
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 15:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> Andrea Arcangeli wrote:
>> +
>> +static int mm_lock_cmp(const void *a, const void *b)
>> +{
>> + cond_resched();
>> + if ((unsigned long)*(spinlock_t **)a <
>> + (unsigned long)*(spinlock_t **)b)
>> + return -1;
>> + else if (a == b)
>> + return 0;
>> + else
>> + return 1;
>> +}
>> +
> This compare function looks unusual...
> It should work, but sort() could be faster if the
> if (a == b) test had a chance to be true eventually...
Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
> static int mm_lock_cmp(const void *a, const void *b)
> {
> unsigned long la = (unsigned long)*(spinlock_t **)a;
> unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
> cond_resched();
> if (la < lb)
> return -1;
> if (la > lb)
> return 1;
> return 0;
> }
If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la > lb) to signal that it
rarely returns zero; otherwise gcc is free to do whatever it wants with
the above. Overall that function is such a slow path that this is going
to be lost in the noise. My suggestion would be to defer
micro-optimizations like this until after 1/12 is applied to mainline.
Thanks!
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 15:15 ` Andrea Arcangeli
@ 2008-04-22 15:24 ` Avi Kivity
2008-04-22 15:37 ` Eric Dumazet
1 sibling, 0 replies; 86+ messages in thread
From: Avi Kivity @ 2008-04-22 15:24 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Eric Dumazet, Christoph Lameter, Nick Piggin, Jack Steiner,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, linux-mm, Robin Holt, general,
Hugh Dickins, akpm, Rusty Russell
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> + cond_resched();
>>> + if ((unsigned long)*(spinlock_t **)a <
>>> + (unsigned long)*(spinlock_t **)b)
>>> + return -1;
>>> + else if (a == b)
>>> + return 0;
>>> + else
>>> + return 1;
>>> +}
>>> +
>>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>>
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
>
>
You need to compare *a to *b (at least, that's what you're doing for the
< case).
--
error compiling committee.c: too many arguments to function
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 15:15 ` Andrea Arcangeli
2008-04-22 15:24 ` Avi Kivity
@ 2008-04-22 15:37 ` Eric Dumazet
2008-04-22 16:46 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2008-04-22 15:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> + cond_resched();
>>> + if ((unsigned long)*(spinlock_t **)a <
>>> + (unsigned long)*(spinlock_t **)b)
>>> + return -1;
>>> + else if (a == b)
>>> + return 0;
>>> + else
>>> + return 1;
>>> +}
>>> +
>>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>>
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
>
I am saying your intent was probably to test
else if ((unsigned long)*(spinlock_t **)a ==
(unsigned long)*(spinlock_t **)b)
return 0;
Because a and b are pointers to the data you want to compare. You need
to dereference them.
>> static int mm_lock_cmp(const void *a, const void *b)
>> {
>> unsigned long la = (unsigned long)*(spinlock_t **)a;
>> unsigned long lb = (unsigned long)*(spinlock_t **)b;
>>
>> cond_resched();
>> if (la < lb)
>> return -1;
>> if (la > lb)
>> return 1;
>> return 0;
>> }
>>
>
> If your intent is to use the assumption that there are going to be few
> equal entries, you should have used likely(la > lb) to signal that it
> rarely returns zero; otherwise gcc is free to do whatever it wants with
> the above. Overall that function is such a slow path that this is going
> to be lost in the noise. My suggestion would be to defer
> micro-optimizations like this until after 1/12 is applied to mainline.
>
> Thanks!
>
>
Hum, it's not a micro-optimization, but a bug fix. :)
Sorry if it was not clear
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 15:37 ` Eric Dumazet
@ 2008-04-22 16:46 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 16:46 UTC (permalink / raw)
To: Eric Dumazet
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
> I am saying your intent was probably to test
>
> else if ((unsigned long)*(spinlock_t **)a ==
> (unsigned long)*(spinlock_t **)b)
> return 0;
Indeed...
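So the corrected comparator, keeping the cond_resched() and the likely()
hint from my previous mail, would be something like this (untested
sketch):

	static int mm_lock_cmp(const void *a, const void *b)
	{
		unsigned long la = (unsigned long)*(spinlock_t **)a;
		unsigned long lb = (unsigned long)*(spinlock_t **)b;

		cond_resched();
		if (likely(la != lb))
			return la < lb ? -1 : 1;
		return 0;
	}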
> Hum, it's not a micro-optimization, but a bug fix. :)
The good thing is that even if this bug were to lead to a system crash,
it would still be zero risk for everybody who isn't actively using KVM/GRU
with mmu notifiers. The important thing is that this patch has zero risk
of introducing regressions into the kernel, both when enabled and when
disabled; it's like a new driver. I'll shortly resend 1/12 and likely
12/12 for theoretical correctness. For now you can go ahead testing with
this patch, as it'll work fine despite the bug (if that weren't the case
I would have noticed already ;).
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (11 preceding siblings ...)
2008-04-22 13:51 ` [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when Andrea Arcangeli
@ 2008-04-22 18:22 ` Robin Holt
2008-04-22 18:43 ` Andrea Arcangeli
2008-04-23 0:31 ` Jack Steiner
13 siblings, 1 reply; 86+ messages in thread
From: Robin Holt @ 2008-04-22 18:22 UTC (permalink / raw)
To: Andrea Arcangeli, Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
I believe the differences between your patch set and Christoph's need
to be understood and a compromise approach agreed upon.
Those differences, as I understand them, are:
1) invalidate_page: You retain an invalidate_page() callout. I believe
we have progressed that discussion to the point that it requires some
direction for Andrew, Linus, or somebody in authority. The basics
of the difference distill down to no expected significant performance
difference between the two. The invalidate_page() callout potentially
can simplify GRU code. It does provide a more complex API for the
users of mmu_notifier which, IIRC, Christoph had interpreted from one
of Andrew's earlier comments as being undesirable. I vaguely recall
that sentiment having been expressed.
2) Range callout names: Your range callouts are invalidate_range_start
and invalidate_range_end whereas Christoph's are start and end. I do not
believe this has been discussed in great detail. I know I have expressed
a preference for your names. I admit to having failed to follow up on
this issue. I certainly believe we could come to an agreement quickly
if pressed.
3) The structure of the patch set: Christoph's upcoming release orders
the patches so the prerequisite patches are separately reviewable
and each file is only touched by a single patch. Additionally, that
allows mmu_notifiers to be introduced as a single patch with sleeping
functionality from its inception and an API which remains unchanged.
Your patch set, however, introduces one API, then turns around and
changes that API. Again, the desire to make it an unchanging API was
expressed by, IIRC, Andrew. This does represent a risk to XPMEM as
the non-sleeping API may become entrenched and make acceptance of the
sleeping version less acceptable.
Can we agree upon this list of issues?
Thank you,
Robin Holt
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 18:22 ` [PATCH 00 of 12] mmu notifier #v13 Robin Holt
@ 2008-04-22 18:43 ` Andrea Arcangeli
2008-04-22 19:42 ` Robin Holt
2008-04-22 20:28 ` Christoph Lameter
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 18:43 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> 1) invalidate_page: You retain an invalidate_page() callout. I believe
> we have progressed that discussion to the point that it requires some
> direction for Andrew, Linus, or somebody in authority. The basics
> of the difference distill down to no expected significant performance
> difference between the two. The invalidate_page() callout potentially
> can simplify GRU code. It does provide a more complex API for the
> users of mmu_notifier which, IIRC, Christoph had interpreted from one
> of Andrew's earlier comments as being undesirable. I vaguely recall
> that sentiment having been expressed.
invalidate_page as demonstrated in the KVM pseudocode doesn't change the
locking requirements, and it has the benefit of reducing the window of
time the secondary page fault has to be masked, while at the same time it
_halves_ the number of _hooks_ in the VM every time the VM deals with
single pages (example: the do_wp_page hot path). As long as we can't fully
converge because of point 3, I'd rather keep invalidate_page, which I
consider better. But keeping it is by far not a priority.
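To make the comparison concrete, the do_wp_page hot path looks roughly
like this under the two models (pseudocode only, not the exact kernel
code):

	/* with invalidate_page: a single hook after the pte update */
	ptep_clear_flush(vma, address, page_table);
	set_pte_at(mm, address, page_table, entry);
	mmu_notifier_invalidate_page(mm, address);

	/* with only the range callouts: two hooks bracketing the update */
	mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);
	ptep_clear_flush(vma, address, page_table);
	set_pte_at(mm, address, page_table, entry);
	mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);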
> 2) Range callout names: Your range callouts are invalidate_range_start
> and invalidate_range_end whereas Christoph's are start and end. I do not
> believe this has been discussed in great detail. I know I have expressed
> a preference for your names. I admit to having failed to follow up on
> this issue. I certainly believe we could come to an agreement quickly
> if pressed.
I think using ->start/->end is a mistake; think of when we later add
mprotect_range_start/end. Here too I keep the better names only
because we can't converge on point 3 (the API will eventually change,
like every other kernel-internal API; even core things like __free_page
have been mostly obsoleted).
> 3) The structure of the patch set: Christoph's upcoming release orders
> the patches so the prerequisite patches are separately reviewable
> and each file is only touched by a single patch. Additionally, that
Each file touched by a single patch? I doubt it... The split is about the
same; the main difference is the merge ordering. I always had the zero-risk
part at the head, he moved it to the tail when he incorporated
#v12 into his patchset.
> allows mmu_notifiers to be introduced as a single patch with sleeping
> functionality from its inception and an API which remains unchanged.
> Your patch set, however, introduces one API, then turns around and
> changes that API. Again, the desire to make it an unchanging API was
> expressed by, IIRC, Andrew. This does represent a risk to XPMEM as
> the non-sleeping API may become entrenched and make acceptance of the
> sleeping version less acceptable.
>
> Can we agree upon this list of issues?
This is a kernel internal API, so it will definitely change over
time. It's nothing close to a syscall.
Also note: the API is obviously defined in mmu_notifier.h and none of
the 2-12 patches touches mmu_notifier.h. So the extension of the
method semantics is 100% backwards compatible.
My patch ordering and the backward-compatible API extension over the patchset
are done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
XPMEM as well. KVM/GRU won't notice any difference once the support
for XPMEM is added, but even if the API were to change completely in
2.6.27, that's still better than no functionality at all in 2.6.26.
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 18:43 ` Andrea Arcangeli
@ 2008-04-22 19:42 ` Robin Holt
2008-04-22 20:30 ` Christoph Lameter
2008-04-22 20:28 ` Christoph Lameter
1 sibling, 1 reply; 86+ messages in thread
From: Robin Holt @ 2008-04-22 19:42 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Jack Steiner,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> > 1) invalidate_page: You retain an invalidate_page() callout. I believe
> > we have progressed that discussion to the point that it requires some
> > direction for Andrew, Linus, or somebody in authority. The basics
> > of the difference distill down to no expected significant performance
> > difference between the two. The invalidate_page() callout potentially
> > can simplify GRU code. It does provide a more complex API for the
> > users of mmu_notifier which, IIRC, Christoph had interpreted from one
> > of Andrew's earlier comments as being undesirable. I vaguely recall
> > that sentiment having been expressed.
>
> invalidate_page as demonstrated in the KVM pseudocode doesn't change the
> locking requirements, and it has the benefit of reducing the window of
> time the secondary page fault has to be masked, while at the same time it
> _halves_ the number of _hooks_ in the VM every time the VM deals with
> single pages (example: the do_wp_page hot path). As long as we can't fully
> converge because of point 3, I'd rather keep invalidate_page, which I
> consider better. But keeping it is by far not a priority.
Christoph, Jack and I just discussed invalidate_page(). I don't think
the point Andrew was making is that compelling in this circumstance.
The code has changed fairly remarkably. Would you have any objection to
putting it back into your patch/agreeing to it remaining in Andrea's
patch? If not, I think we can put this issue aside until Andrew gets
out of the merge window and can decide it. Either way, the patches
become much more similar with this in.
Thanks,
Robin
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 13:51 ` [PATCH 01 of 12] Core of mmu notifiers Andrea Arcangeli
2008-04-22 14:56 ` Eric Dumazet
@ 2008-04-22 20:19 ` Christoph Lameter
2008-04-22 20:31 ` Robin Holt
2008-04-22 22:35 ` Andrea Arcangeli
2008-04-23 17:09 ` Jack Steiner
2 siblings, 2 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:19 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Thanks for adding most of my enhancements. But
1. There is no real need for invalidate_page(). Can be done with
invalidate_start/end. Needlessly complicates the API. One
of the objections by Andrew was that there were multiple
callbacks that perform similar functions.
2. The locks that are used are later changed to semaphores. This is
f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the
lock conversion is done first and then mm_lock is introduced. The
way the patches are structured means that reviewers cannot review the
final version of mm_lock etc etc. The lock conversion needs to come
first.
3. As noted by Eric, and also contained in a private post from yesterday by
me: the cmp function needs to retrieve the values before
doing comparisons, which is not done for the == of a and b.
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
2008-04-22 13:51 ` [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug Andrea Arcangeli
@ 2008-04-22 20:22 ` Christoph Lameter
2008-04-22 22:43 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:22 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Looks like this is not complete. There are numerous .h files missing which
means that various structs are undefined (fs.h and rmap.h are needed
f.e.) which leads to surprises when dereferencing fields of these structs.
It seems that mm_types.h is expected to be included only in certain
contexts. Could you make sure to include all necessary .h files? Or add
some docs to clarify the situation here.
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
2008-04-22 13:51 ` [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced Andrea Arcangeli
@ 2008-04-22 20:23 ` Christoph Lameter
2008-04-22 22:37 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Missing signoff by you.
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-22 13:51 ` [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last Andrea Arcangeli
@ 2008-04-22 20:24 ` Christoph Lameter
2008-04-22 22:40 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:24 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
--
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
2008-04-22 13:51 ` [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks Andrea Arcangeli
@ 2008-04-22 20:25 ` Christoph Lameter
0 siblings, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:25 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Why are the subjects all screwed up? They are the first line of the
description instead of the subject line of my patches.
* Re: [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
2008-04-22 13:51 ` [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock Andrea Arcangeli
@ 2008-04-22 20:26 ` Christoph Lameter
2008-04-22 22:54 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:26 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
Doing the right patch ordering would have avoided this patch and allow
better review.
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 18:43 ` Andrea Arcangeli
2008-04-22 19:42 ` Robin Holt
@ 2008-04-22 20:28 ` Christoph Lameter
1 sibling, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, 22 Apr 2008, Andrea Arcangeli wrote:
> My patch order and API backward compatible extension over the patchset
> is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
> XPMEM as well. KVM/GRU won't notice any difference once the support
> for XPMEM is added, but even if the API would completely change in
> 2.6.27, that's still better than no functionality at all in 2.6.26.
Please redo the patchset with the right order. To my knowledge there is no
chance of this getting merged for 2.6.26.
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 19:42 ` Robin Holt
@ 2008-04-22 20:30 ` Christoph Lameter
2008-04-23 13:33 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 20:30 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, 22 Apr 2008, Robin Holt wrote:
> putting it back into your patch/agreeing to it remaining in Andrea's
> patch? If not, I think we can put this issue aside until Andrew gets
> out of the merge window and can decide it. Either way, the patches
> become much more similar with this in.
One solution would be to separate the invalidate_page() callout into a
patch at the very end that can be omitted. AFAICT there is no compelling
reason to have this callback and it complicates the API for the device
driver writers. Not having this callback makes the way that mmu notifiers
are called from the VM uniform which is a desirable goal.
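As a minimal sketch of what that uniform convention looks like at a VM
call site (illustration only: it builds on the invalidate_range_start/end
helpers from 1/12, and the function name here is invented):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void example_invalidate_one_page(struct mm_struct *mm,
                                        unsigned long address)
{
        unsigned long start = address & PAGE_MASK;
        unsigned long end = start + PAGE_SIZE;

        /* instead of a dedicated invalidate_page() callback, bracket the
         * single-page teardown with a one-page range notification */
        mmu_notifier_invalidate_range_start(mm, start, end);
        /* ... clear the pte and flush the TLB for this page here ... */
        mmu_notifier_invalidate_range_end(mm, start, end);
}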
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 20:19 ` Christoph Lameter
@ 2008-04-22 20:31 ` Robin Holt
2008-04-22 22:35 ` Andrea Arcangeli
1 sibling, 0 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-22 20:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
> Thanks for adding most of my enhancements. But
>
> 1. There is no real need for invalidate_page(). Can be done with
> invalidate_start/end. Needlessly complicates the API. One
> of the objections by Andrew was that there were multiple
> callbacks that perform similar functions.
While I agree with that reading of Andrew's email about invalidate_page,
I think the GRU hardware makes a strong enough case to justify the two
separate callouts.
Due to the GRU hardware, we can assure that invalidate_page terminates all
pending GRU faults (that includes faults that are just beginning) and can
therefore be completed without needing any locking. The invalidate_page()
callout gets turned into a GRU flush instruction and we return.
Because the invalidate_range_start() leaves the page table information
available, we cannot use a single-page _start to mimic that
functionality. Therefore, there is a documented case justifying the
separate callouts.
I agree the case is fairly weak, but it does exist. Given Andrea's
unwillingness to move and Jack's documented case, it is my opinion the
most likely compromise is to leave in the invalidate_page() callout.
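As a rough sketch of the contrast between the two callouts (all drv_*
names are invented for illustration, this is not the real GRU code, and
the three hardware hooks are assumed to be provided by the driver):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

void drv_flush_secondary_tlb(struct mmu_notifier *mn,
                             unsigned long start, unsigned long end);
void drv_block_secondary_faults(struct mmu_notifier *mn,
                                unsigned long start, unsigned long end);
void drv_allow_secondary_faults(struct mmu_notifier *mn,
                                unsigned long start, unsigned long end);

static void drv_invalidate_page(struct mmu_notifier *mn,
                                struct mm_struct *mm, unsigned long address)
{
        /* one shot: the flush also terminates pending external faults for
         * this page, so nothing has to be blocked or locked afterwards */
        drv_flush_secondary_tlb(mn, address, address + PAGE_SIZE);
}

static void drv_invalidate_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
        /* the primary page tables are still populated here, so new external
         * faults have to be held off until the matching range_end */
        drv_block_secondary_faults(mn, start, end);
        drv_flush_secondary_tlb(mn, start, end);
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
{
        drv_allow_secondary_faults(mn, start, end);
}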
Thanks,
Robin
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 20:19 ` Christoph Lameter
2008-04-22 20:31 ` Robin Holt
@ 2008-04-22 22:35 ` Andrea Arcangeli
2008-04-22 23:07 ` Robin Holt
2008-04-22 23:20 ` Christoph Lameter
1 sibling, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 22:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
> 3. As noted by Eric and also contained in private post from yesterday by
> me: The cmp function needs to retrieve the value before
> doing comparisons which is not done for the == of a and b.
I retrieved the value, which is why mm_lock works perfectly on #v13 as
well as #v12. It's not mandatory to ever return 0, so it won't produce
any runtime error (there is a bugcheck for wrong sort ordering in my
patch just in case; it never triggered, or I would have noticed before
submission), which is why I didn't need to release any hotfix yet, and
I'm waiting for more
comments before sending an update to clean up that bit.
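For reference, the shape of the comparator under discussion, rewritten
here purely as an illustration (it is not the exact code from the patch):
sort() hands the comparator pointers to the array slots, so the slot
contents are loaded before comparing.

#include <linux/spinlock.h>

static int mm_lock_cmp_example(const void *a, const void *b)
{
        unsigned long va = (unsigned long)*(spinlock_t *const *)a;
        unsigned long vb = (unsigned long)*(spinlock_t *const *)b;

        if (va < vb)
                return -1;
        if (va > vb)
                return 1;
        /* the cosmetic slip was testing a == b (the slot addresses, which
         * always differ) instead of va == vb here, so 0 was never returned;
         * sort() still produces a usable ordering either way */
        return 0;
}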
Mentioning this as the third and last point, I guess, shows how strong
your arguments against merging my mmu-notifier-core now really are, so
in the end making that cosmetic error paid off somehow.
I'll send an update in any case to Andrew way before Saturday so
hopefully we'll finally get mmu-notifiers-core merged before next
week. Also I'm not updating my mmu-notifier-core patch anymore except
for strict bugfixes so don't worry about any more cosmetic bugs
being introduced while optimizing the code like it happened this time.
The only other change I did has been to move mmu_notifier_unregister
at the end of the patchset after getting more questions about its
reliability and I documented a bit the rmmod requirements for
->release. we'll think later if it makes sense to add it, nobody's
using it anyway.
* Re: [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
2008-04-22 20:23 ` Christoph Lameter
@ 2008-04-22 22:37 ` Andrea Arcangeli
2008-04-22 23:13 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 22:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote:
> Missing signoff by you.
I thought I had to signoff if I contributed with anything that could
resemble copyright? Given I only merged that patch, I can add an
Acked-by if you like, but merging this in my patchset was already an
implicit ack ;-).
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-22 20:24 ` Christoph Lameter
@ 2008-04-22 22:40 ` Andrea Arcangeli
2008-04-22 23:14 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 22:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote:
> Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the
mmu notifiers aren't registered by GRU or KVM. Keep in mind that the
whole point of my proposed patch ordering from day 0 is to keep, as
1/N, the absolute minimum change that fully satisfies GRU and KVM
requirements. 4/12 isn't required by GRU/KVM, so I keep it in a later
patch. I have now moved mmu_notifier_unregister into a later patch too, for
the same reason.
* Re: [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
2008-04-22 20:22 ` Christoph Lameter
@ 2008-04-22 22:43 ` Andrea Arcangeli
2008-04-22 23:07 ` Robin Holt
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 22:43 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> Looks like this is not complete. There are numerous .h files missing which
> means that various structs are undefined (fs.h and rmap.h are needed
> f.e.) which leads to surprises when dereferencing fields of these structs.
>
> It seems that mm_types.h is expected to be included only in certain
> contexts. Could you make sure to include all necessary .h files? Or add
> some docs to clarify the situation here.
Robin, what other changes did you need to compile? I only did that one
because I didn't hear any more feedback from you after I sent that
patch, so I assumed it was enough.
* Re: [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
2008-04-22 20:26 ` Christoph Lameter
@ 2008-04-22 22:54 ` Andrea Arcangeli
2008-04-22 23:19 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-22 22:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:26:13PM -0700, Christoph Lameter wrote:
> Doing the right patch ordering would have avoided this patch and allow
> better review.
I didn't actually write this patch myself; these substitutions did it
instead:
s/anon_vma_lock/anon_vma_sem/
s/i_mmap_lock/i_mmap_sem/
s/locks/sems/
s/spinlock_t/struct rw_semaphore/
so it didn't look like a big deal to redo it as many times as needed.
The right patch ordering isn't necessarily the one that reduces the
total number of lines in the patchsets. The mmu-notifier-core is
already converged and can go in. The rest isn't converged at
all... nearly nobody commented on the other part (the few comments so
far were negative), so there's no good reason to delay indefinitely
what is already converged, given it's already feature complete for
certain users of the code. My patch ordering looks more natural to
me. What is finished goes in, the rest is orthogonal anyway.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 22:35 ` Andrea Arcangeli
@ 2008-04-22 23:07 ` Robin Holt
2008-04-23 0:28 ` Jack Steiner
2008-04-23 13:36 ` Andrea Arcangeli
2008-04-22 23:20 ` Christoph Lameter
1 sibling, 2 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-22 23:07 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
> The only other change I did has been to move mmu_notifier_unregister
> at the end of the patchset after getting more questions about its
> reliability and I documented a bit the rmmod requirements for
> ->release. we'll think later if it makes sense to add it, nobody's
> using it anyway.
XPMEM is using it. GRU will be as well (probably already does).
* Re: [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
2008-04-22 22:43 ` Andrea Arcangeli
@ 2008-04-22 23:07 ` Robin Holt
0 siblings, 0 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-22 23:07 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
On Wed, Apr 23, 2008 at 12:43:52AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> > Looks like this is not complete. There are numerous .h files missing which
> > means that various structs are undefined (fs.h and rmap.h are needed
> > f.e.) which leads to surprises when dereferencing fields of these structs.
> >
> > It seems that mm_types.h is expected to be included only in certain
> > contexts. Could you make sure to include all necessary .h files? Or add
> > some docs to clarify the situation here.
>
> Robin, what other changes did you need to compile? I only did that one
> because I didn't hear any more feedback from you after I sent that
> patch, so I assumed it was enough.
It was perfect. Nothing else was needed.
Thanks,
Robin
* Re: [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
2008-04-22 22:37 ` Andrea Arcangeli
@ 2008-04-22 23:13 ` Christoph Lameter
0 siblings, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 23:13 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote:
> > Missing signoff by you.
>
> I thought I had to signoff if I contributed with anything that could
> resemble copyright? Given I only merged that patch, I can add an
> Acked-by if you like, but merging this in my patchset was already an
> implicit ack ;-).
No, you have to include a signoff if the patch goes through your custody
chain. This one did.
Also add a
From: Christoph Lameter <clameter@sgi.com>
somewhere if you want to signify that the patch came from me.
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-22 22:40 ` Andrea Arcangeli
@ 2008-04-22 23:14 ` Christoph Lameter
2008-04-23 13:44 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 23:14 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote:
> > Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
>
> To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the
> mmu notifiers aren't registered by GRU or KVM. Keep in mind that the
> whole point of my proposed patch ordering from day 0 is to keep, as
> 1/N, the absolute minimum change that fully satisfies GRU and KVM
> requirements. 4/12 isn't required by GRU/KVM, so I keep it in a later
> patch. I have now moved mmu_notifier_unregister into a later patch too, for
> the same reason.
We want a full solution and this kind of patching makes the patches
difficult to review because later patches revert earlier ones.
* Re: [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
2008-04-22 22:54 ` Andrea Arcangeli
@ 2008-04-22 23:19 ` Christoph Lameter
0 siblings, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 23:19 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> The right patch ordering isn't necessarily the one that reduces the
> total number of lines in the patchsets. The mmu-notifier-core is
> already converged and can go in. The rest isn't converged at
> all... nearly nobody commented on the other part (the few comments so
> far were negative), so there's no good reason to delay indefinitely
> what is already converged, given it's already feature complete for
> certain users of the code. My patch ordering looks more natural to
> me. What is finished goes in, the rest is orthogonal anyway.
I would not want to review code that is later reverted or essentially
changed in later patches. I only review your patches because we have a
high interest in the patch. I suspect that others will be more willing
to review this material if it were done the right way.
If you cannot produce an easily reviewable and properly formatted patchset
that follows conventions then I will have to do it because we really need
to get this merged.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 22:35 ` Andrea Arcangeli
2008-04-22 23:07 ` Robin Holt
@ 2008-04-22 23:20 ` Christoph Lameter
2008-04-23 16:26 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-22 23:20 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> I'll send an update in any case to Andrew way before Saturday so
> hopefully we'll finally get mmu-notifiers-core merged before next
> week. Also I'm not updating my mmu-notifier-core patch anymore except
> for strict bugfixes so don't worry about any more cosmetic bugs
> being introduced while optimizing the code like it happened this time.
I guess I have to prepare another patchset then?
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 23:07 ` Robin Holt
@ 2008-04-23 0:28 ` Jack Steiner
2008-04-23 16:37 ` Andrea Arcangeli
2008-04-23 13:36 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Jack Steiner @ 2008-04-23 0:28 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > The only other change I did has been to move mmu_notifier_unregister
> > at the end of the patchset after getting more questions about its
> > reliability and I documented a bit the rmmod requirements for
> > ->release. we'll think later if it makes sense to add it, nobody's
> > using it anyway.
>
> XPMEM is using it. GRU will be as well (probably already does).
Yeppp.
The GRU driver unregisters the notifier when all GRU mappings
are unmapped. I could make it work either way - either with or without
an unregister function. However, unregister is the most logical
action to take when all mappings have been destroyed.
--- jack
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
` (12 preceding siblings ...)
2008-04-22 18:22 ` [PATCH 00 of 12] mmu notifier #v13 Robin Holt
@ 2008-04-23 0:31 ` Jack Steiner
13 siblings, 0 replies; 86+ messages in thread
From: Jack Steiner @ 2008-04-23 0:31 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 03:51:16PM +0200, Andrea Arcangeli wrote:
> Hello,
>
> This is the latest and greatest version of the mmu notifier patch #v13.
>
FWIW, I have updated the GRU driver to use this patch (plus the fixups).
No problems. AFAICT, everything works.
--- jack
* Re: [PATCH 00 of 12] mmu notifier #v13
2008-04-22 20:30 ` Christoph Lameter
@ 2008-04-23 13:33 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 13:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 01:30:53PM -0700, Christoph Lameter wrote:
> One solution would be to separate the invalidate_page() callout into a
> patch at the very end that can be omitted. AFAICT there is no compelling
> reason to have this callback and it complicates the API for the device
> driver writers. Not having this callback makes the way that mmu notifiers
> are called from the VM uniform which is a desirable goal.
I agree that the invalidate_page optimization can be moved to a
separate patch. That will be a patch that definitely alters the API in a
non-backwards-compatible way (unlike 2-12 in my #v13, which are all
backwards compatible in terms of the mmu notifier API).
invalidate_page is beneficial to both mmu notifier users, and a bit
beneficial to the do_wp_page users too. So there's no point in removing
it from my mmu-notifier-core; as long as the mmu-notifier-core is 1/N in
my patchset and N/N in your patchset, the differences caused by that
ordering are a bigger change than whether invalidate_page exists or not.
As I expected, invalidate_page provided significant benefits (not just
to GRU but to KVM too) without altering the locking scheme at all; this
is because the page fault handler has to notice anyway whether a
begin->end pair ran after follow_page/get_user_pages. So it's a
no-brainer to keep, and my approach avoids a non-backwards-compatible
breakage of the API IMHO. Not a big deal, nobody should care if the API
changes, it will definitely change eventually since it's a
kernel-internal one, but given I already have invalidate_page in my
patch there's no reason to remove it as long as mmu-notifier-core remains
N/N on your patchset.
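To make the fault-path point concrete, here is a sketch of the kind of
check a secondary-MMU page fault handler needs (the structure and all
names are invented for illustration; this is not KVM's or GRU's actual
code):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct drv_mmu_state {
        spinlock_t lock;
        unsigned long invalidate_seq;   /* bumped by range_start/range_end */
        int range_active;               /* nonzero between _start and _end */
};

static int drv_secondary_fault(struct drv_mmu_state *s, struct mm_struct *mm,
                               unsigned long addr)
{
        struct page *page;
        unsigned long seq;
        int got;

again:
        spin_lock(&s->lock);
        seq = s->invalidate_seq;
        spin_unlock(&s->lock);

        down_read(&mm->mmap_sem);
        got = get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
        up_read(&mm->mmap_sem);
        if (got != 1)
                return -EFAULT;

        spin_lock(&s->lock);
        if (s->range_active || s->invalidate_seq != seq) {
                /* an invalidate ran (or is still running) since the lookup:
                 * drop the page and retry, otherwise a stale page could be
                 * mapped into the secondary MMU */
                spin_unlock(&s->lock);
                put_page(page);
                goto again;
        }
        /* ... install the secondary-MMU mapping for 'page' here ... */
        spin_unlock(&s->lock);
        return 0;
}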
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 23:07 ` Robin Holt
2008-04-23 0:28 ` Jack Steiner
@ 2008-04-23 13:36 ` Andrea Arcangeli
2008-04-23 14:47 ` Robin Holt
1 sibling, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 13:36 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > The only other change I did has been to move mmu_notifier_unregister
> > at the end of the patchset after getting more questions about its
> > reliability and I documented a bit the rmmod requirements for
> > ->release. we'll think later if it makes sense to add it, nobody's
> > using it anyway.
>
> XPMEM is using it. GRU will be as well (probably already does).
XPMEM requires more patches anyway. Note that in a previous email you
told me you weren't using it. I think GRU can work fine on 2.6.26
without mmu_notifier_unregister, like KVM too. You simply have to unpin
the module count in ->release. The most important bit is that you have
to do that anyway in case mmu_notifier_unregister fails (and it can
fail because of vmalloc space shortage, because somebody loaded some
framebuffer driver or whatever).
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-22 23:14 ` Christoph Lameter
@ 2008-04-23 13:44 ` Andrea Arcangeli
2008-04-23 15:45 ` Robin Holt
2008-04-23 18:02 ` Christoph Lameter
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 13:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 04:14:26PM -0700, Christoph Lameter wrote:
> We want a full solution and this kind of patching makes the patches
> difficult to review because later patches revert earlier ones.
I know you'd rather see KVM development stalled for more months
than get a partial solution now that already covers KVM and GRU
with the same API that XPMEM will also use later. It's very unfair on
your side to try to stall other people's development if what you
need has stronger requirements and can't be merged immediately. This
is especially true given it was publicly stated that XPMEM never
passed all regression tests anyway, so you can't possibly be in such
a hurry as we are; we can't progress without this. In fact we can,
but it would be a huge effort, it would run _slower_, and it would
all need to be deleted once mmu notifiers are in.
Note that the only patch you can avoid with your approach is
mm_lock-rwsem, and given that one was generated by search-and-replace
rather than written by hand, I don't see a big deal of wasted effort.
The main difference is the ordering. Most of the code is orthogonal so
there's not much to revert.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 13:36 ` Andrea Arcangeli
@ 2008-04-23 14:47 ` Robin Holt
2008-04-23 15:59 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Robin Holt @ 2008-04-23 14:47 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Jack Steiner,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 03:36:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > > The only other change I did has been to move mmu_notifier_unregister
> > > at the end of the patchset after getting more questions about its
> > > reliability and I documented a bit the rmmod requirements for
> > > ->release. we'll think later if it makes sense to add it, nobody's
> > > using it anyway.
> >
> > XPMEM is using it. GRU will be as well (probably already does).
>
> XPMEM requires more patches anyway. Note that in a previous email you
> told me you weren't using it. I think GRU can work fine on 2.6.26
I said I could test without it. It is needed for the final version.
It also makes the API consistent. What you are proposing is equivalent
to having a file you can open but never close.
This whole discussion seems ludicrous. You could refactor the code to get
the sorted list of locks, pass that list into mm_lock to do the locking,
do the register/unregister, then pass the same list into mm_unlock.
If the allocation fails, you could fall back to the older slower method
of repeatedly scanning the lists and acquiring locks in ascending order.
> without mmu_notifier_unregister, like KVM too. You've simply to unpin
> the module count in ->release. The most important bit is that you've
> to do that anyway in case mmu_notifier_unregister fails (and it can
If you are not going to provide the _unregister callout you need to change
the API so I can scan the list of notifiers to see if my structures are
already registered.
We register our notifier structure at device open time. If we receive a
_release callout, we mark our structure as unregistered. At device close
time, if we have not been unregistered, we call _unregister. If you
take away _unregister, I have an xpmem kernel structure in use _AFTER_
the device is closed with no indication that the process is using it.
In that case, I need to get an extra reference to the module in my device
open method and hold that reference until the _release callout.
Additionally, if the user's program reopens the device, I need to scan
the mmu_notifiers list to see if this task's notifier is already
registered.
I view _unregister as essential. Did I miss something?
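In pseudo-driver terms that lifecycle looks roughly like this (struct
drv_ctx and all drv_* names are invented, not the actual XPMEM code, and
the close-vs-release race is glossed over); without _unregister the
drv_close() path has nothing it can call:

#include <linux/fs.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct drv_ctx {
        struct mmu_notifier mn;
        struct mm_struct *mm;
        int registered;
};

static void drv_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        /* the mm is going away: remember the notifier is gone so close()
         * doesn't try to unregister it a second time */
        container_of(mn, struct drv_ctx, mn)->registered = 0;
}

static const struct mmu_notifier_ops drv_mmu_notifier_ops = {
        .release = drv_mn_release,
};

static int drv_open(struct inode *inode, struct file *file)
{
        struct drv_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

        if (!ctx)
                return -ENOMEM;
        ctx->mn.ops = &drv_mmu_notifier_ops;
        ctx->mm = current->mm;
        if (mmu_notifier_register(&ctx->mn, ctx->mm)) {
                kfree(ctx);
                return -EAGAIN;
        }
        ctx->registered = 1;
        file->private_data = ctx;
        return 0;
}

static int drv_close(struct inode *inode, struct file *file)
{
        struct drv_ctx *ctx = file->private_data;

        if (ctx->registered)
                mmu_notifier_unregister(&ctx->mn, ctx->mm);
        kfree(ctx);
        return 0;
}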
Thanks,
Robin
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 13:44 ` Andrea Arcangeli
@ 2008-04-23 15:45 ` Robin Holt
2008-04-23 16:15 ` Andrea Arcangeli
2008-04-23 21:05 ` Avi Kivity
2008-04-23 18:02 ` Christoph Lameter
1 sibling, 2 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-23 15:45 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, Robin Holt, general, Hugh Dickins, akpm,
Rusty Russell
On Wed, Apr 23, 2008 at 03:44:27PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:14:26PM -0700, Christoph Lameter wrote:
> > We want a full solution and this kind of patching makes the patches
> > difficult to review because later patches revert earlier ones.
>
> I know you'd rather see KVM development stalled for more months
> than get a partial solution now that already covers KVM and GRU
> with the same API that XPMEM will also use later. It's very unfair on
> your side to try to stall other people's development if what you
> need has stronger requirements and can't be merged immediately. This
> is especially true given it was publicly stated that XPMEM never
> passed all regression tests anyway, so you can't possibly be in such
XPMEM has passed all regression tests using your version 12 notifiers.
I have a bug in xpmem which shows up on our 8x oversubscription tests,
but that is clearly my bug to figure out. Unfortunately it only shows
up on a 128 processor machine so I have 1024 stack traces to sort
through each time it fails. Does take a bit of time and a lot of
concentration.
> an hurry like we are, we can't progress without this. Infact we can
SGI is under an equally strict timeline. We really needed the sleeping
version into 2.6.26. We may still be able to get this accepted by
vendor distros if we make 2.6.27.
Thanks,
Robin
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 14:47 ` Robin Holt
@ 2008-04-23 15:59 ` Andrea Arcangeli
2008-04-23 18:09 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 15:59 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 09:47:47AM -0500, Robin Holt wrote:
> It also makes the API consistent. What you are proposing is equivalent
> to having a file you can open but never close.
That's not entirely true: you can close the file just fine by killing
the tasks, leading to an mmput. From a user perspective, in KVM terms
it won't make a difference, as /dev/kvm will remain open and will pin
the module count until the kvm task is killed anyway; I assume it's
similar for GRU.
Until I had the idea of how to implement an mm_lock to ensure
mmu_notifier_register couldn't miss a running invalidate_range_begin,
it wasn't even possible to implement a mmu_notifier_unregister (see the
EMM patches), and it looked like you were ok with that API missing
_unregister...
> This whole discussion seems ludicrous. You could refactor the code to get
> the sorted list of locks, pass that list into mm_lock to do the locking,
> do the register/unregister, then pass the same list into mm_unlock.
Correct, but it would keep the vmalloc RAM pinned at runtime. There's
no reason to keep that RAM allocated per-VM while the VM runs. We only
need it during startup and teardown.
> If the allocation fails, you could fall back to the older slower method
> of repeatedly scanning the lists and acquiring locks in ascending order.
Correct, I already thought about that. This is exactly why I'm
deferring this for later! Otherwise perfectionism that isn't needed for
KVM/GRU will keep indefinitely delaying the part that is already
converged and that's enough for KVM and GRU (and for this specific
bit, actually enough for XPMEM as well).
We can later add a second version, mm_lock_slow, with N^2 complexity,
to be used in mmu_notifier_unregister if mm_lock fails, after the
mmu-notifier-core is merged into mainline.
> If you are not going to provide the _unregister callout you need to change
> the API so I can scan the list of notifiers to see if my structures are
> already registered.
As said, 1/N isn't enough for XPMEM anyway. 1/N has to include only the
absolute minimum, zero-risk stuff that is enough for both KVM and GRU.
> We register our notifier structure at device open time. If we receive a
> _release callout, we mark our structure as unregistered. At device close
> time, if we have not been unregistered, we call _unregister. If you
> take away _unregister, I have an xpmem kernel structure in use _AFTER_
> the device is closed with no indication that the process is using it.
> In that case, I need to get an extra reference to the module in my device
> open method and hold that reference until the _release callout.
Yes exactly, but you have to do that anyway if mmu_notifier_unregister
fails because some driver already allocated all the vmalloc space (even
x86-64 doesn't have an unlimited amount of vmalloc, because vmalloc sits
at the end of the address space), unless we have an N^2 fallback; but an
N^2 fallback would make the code more easily DoSable and unkillable, so
if I were an admin I'd prefer quickly killing a task with kill -9 in
O(N) over waiting for some syscall that runs in O(N^2) to complete
before the task quits. So a fallback to a slower algorithm isn't
necessarily what will really happen after 2.6.26 is released; we'll see.
Relying on ->release for the module unpin sounds preferable, and it's
certainly the only reliable way to unregister that we'll provide in
2.6.26.
> Additionally, if the users program reopens the device, I need to scan the
> mmu_notifiers list to see if this tasks notifier is already registered.
But you don't need to browse the list for this: keep a flag in your
structure next to the mmu_notifier struct, set the flag after
mmu_notifier_register returns, and clear it after ->release runs or
after mmu_notifier_unregister returns success. What's the big deal about
tracking whether you have to call mmu_notifier_register a second time or
not? Or you can create a new structure every time somebody asks to
reattach.
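Concretely, something along these lines (struct drv_mm and the drv_*
names are invented for the example):

#include <linux/mmu_notifier.h>
#include <linux/module.h>

struct drv_mm {
        struct mmu_notifier mn;
        int attached;           /* set after register, cleared by ->release */
};

static void drv_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct drv_mm *dm = container_of(mn, struct drv_mm, mn);

        /* ... tear down the secondary-MMU state for this mm ... */
        dm->attached = 0;
        module_put(THIS_MODULE);        /* drop the pin taken at attach time */
}

static const struct mmu_notifier_ops drv_mm_ops = {
        .release = drv_mm_release,
};

static int drv_mm_attach(struct drv_mm *dm, struct mm_struct *mm)
{
        int err;

        if (dm->attached)
                return 0;       /* a reopen: already registered, nothing to do */

        dm->mn.ops = &drv_mm_ops;
        __module_get(THIS_MODULE);
        err = mmu_notifier_register(&dm->mn, mm);
        if (err)
                module_put(THIS_MODULE);
        else
                dm->attached = 1;
        return err;
}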
> I view _unregister as essential. Did I miss something?
We can add it later, and we can keep discussing what the best model to
implement it is for as long as you want after 2.6.26 is released with
mmu-notifier-core, so that GRU/KVM are covered. It's unlikely KVM will use
mmu_notifier_unregister anyway as we need it attached for the whole
lifetime of the task, and only for the lifetime of the task.
This is the patch to add it; as you can see it's entirely orthogonal,
backwards compatible with the previous API, and it doesn't duplicate or
rewrite any code.
Don't worry, any kernel after 2.6.26 will have unregister, but we can't
focus on this for 2.6.26. We can also consider making
mmu_notifier_register safe against double calls on the same structure
but again that's not something we should be doing in 1/N and it can be
done later in a backwards-compatible way (plus we're perfectly fine
with the API having non-backwards-compatible changes as long as 2.6.26
works for us).
---------------------------------
Implement unregister but it's not reliable, only ->release is reliable.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -119,6 +119,8 @@
extern int mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm);
+extern int mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long address);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -106,3 +106,29 @@
return ret;
}
EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * mm_users can't go down to zero while mmu_notifier_unregister()
+ * runs or it can race with ->release. So a mm_users pin must
+ * be taken by the caller (if mm can be different from current->mm).
+ *
+ * This function can fail (for example during out of memory conditions
+ * or after vmalloc virtual range shortage), so the only reliable way
+ * to unregister is to wait release() to be called.
+ */
+int mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct mm_lock_data data;
+ int ret;
+
+ BUG_ON(!atomic_read(&mm->mm_users));
+
+ ret = mm_lock(mm, &data);
+ if (unlikely(ret))
+ goto out;
+ hlist_del(&mn->hlist);
+ mm_unlock(mm, &data);
+out:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
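A usage sketch to go with the above (struct my_ctx and its fields are
invented): the caller pins mm_users as the comment requires and treats a
failure as "wait for ->release to do the teardown instead":

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_ctx {
        struct mmu_notifier mn;
        struct mm_struct *mm;
        int defer_to_release;
};

static void my_detach(struct my_ctx *ctx)
{
        struct mm_struct *mm = ctx->mm;

        if (!atomic_inc_not_zero(&mm->mm_users))
                return;         /* mm already exiting: ->release will run */

        if (mmu_notifier_unregister(&ctx->mn, mm))
                ctx->defer_to_release = 1;      /* e.g. vmalloc shortage */
        mmput(mm);
}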
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 15:45 ` Robin Holt
@ 2008-04-23 16:15 ` Andrea Arcangeli
2008-04-23 19:55 ` Robin Holt
2008-04-23 21:05 ` Avi Kivity
1 sibling, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 16:15 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Nick Piggin, Jack Steiner, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 10:45:36AM -0500, Robin Holt wrote:
> XPMEM has passed all regression tests using your version 12 notifiers.
That's great news, thanks! I'd greatly appreciate it if you could test
#v13 too, as I posted it. It already passed the GRU and KVM regression
tests and it should work fine for XPMEM too. You can ignore the purely
cosmetic error I managed to introduce in mm_lock_cmp (I implemented a
BUG_ON that would have triggered if it weren't a purely cosmetic issue,
and it clearly doesn't trigger, so you can be sure it's only cosmetic ;).
Once I get confirmation that everyone is ok with #v13 I'll push a #v14
before Saturday with that cosmetic error cleaned up and
mmu_notifier_unregister moved to the end (XPMEM will have unregister,
don't worry). I expect 1/13 of #v14 to go into -mm and then 2.6.26.
> I have a bug in xpmem which shows up on our 8x oversubscription tests,
> but that is clearly my bug to figure out. Unfortunately it only shows
This is what I meant.
In contrast, we don't have any known bugs left in this area; in fact we
need mmu notifiers to _fix_ issues I identified that can't be fixed
efficiently without them, and we need the mmu notifiers to go into
production ASAP.
> up on a 128 processor machine so I have 1024 stack traces to sort
> through each time it fails. Does take a bit of time and a lot of
> concentration.
Sure, hope you find it soon!
> SGI is under an equally strict timeline. We really needed the sleeping
> version into 2.6.26. We may still be able to get this accepted by
> vendor distros if we make 2.6.27.
I don't think vendor distros are less likely to take patches 2-12 if
1/N (aka mmu-notifier-core) is merged in 2.6.26, especially in light of
kABI.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 23:20 ` Christoph Lameter
@ 2008-04-23 16:26 ` Andrea Arcangeli
2008-04-23 17:24 ` Andrea Arcangeli
2008-04-23 18:15 ` Christoph Lameter
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 16:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> I guess I have to prepare another patchset then?
If you want to embarrass yourself three times in a row, go ahead ;). I
thought two failed takeovers were enough.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 0:28 ` Jack Steiner
@ 2008-04-23 16:37 ` Andrea Arcangeli
2008-04-23 18:19 ` Christoph Lameter
2008-04-23 22:19 ` Andrea Arcangeli
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 16:37 UTC (permalink / raw)
To: Jack Steiner
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 22, 2008 at 07:28:49PM -0500, Jack Steiner wrote:
> The GRU driver unregisters the notifier when all GRU mappings
> are unmapped. I could make it work either way - either with or without
> an unregister function. However, unregister is the most logical
> action to take when all mappings have been destroyed.
This is true for KVM as well: unregister would be the most logical
action to take when the kvm device is closed and the vm destroyed.
However, we can't implement mm_lock in O(N*log(N)) without triggering
RAM allocations, and the size of those RAM allocations is unknown at the
time unregister runs (it also depends on the max_nr_vmas sysctl). So on
second thought not even passing the array from register to unregister
would solve it (unless we allocate max_nr_vmas entries and block the
sysctl from altering max_nr_vmas until all unregisters have run). That's
clearly unacceptable.
The only way to avoid failing because of vmalloc space shortage or oom
would be to provide an O(N*N) fallback, but one that can't be
interrupted by sigkill! Sigkill interruption was ok in #v12 because we
didn't rely on mmu_notifier_unregister to succeed, so it avoided any
DoS, but it still couldn't provide a reliable unregister.
So in the end, unregistering with kill -9 leading to ->release in O(1)
sounds like the safer solution for the long term. You can't loop if
unregister fails and pretend your module has no deadlocks.
Yes, waiting for ->release adds a bit of complexity, but I think it's
worth it, and there haven't yet been any brilliant ideas on how to avoid
both the O(N*N) complexity and the allocations in
mmu_notifier_unregister. Until such an idea materializes we'll stick
with ->release in O(1) as the only safe unregister, so we guarantee the
admin stays in control of his hardware in O(1) with kill -9, no matter
whether /dev/kvm and /dev/gru are owned by sillyuser.
I'm afraid that if you don't want to worst-case unregister with
->release, you need a better idea than my mm_lock, and personally I
can't see any other way than mm_lock to ensure we don't miss a
range_begin...
All the above is in a 2.6.27 context (for 2.6.26, ->release is the way,
even if such an idea were to materialize).
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-22 13:51 ` [PATCH 01 of 12] Core of mmu notifiers Andrea Arcangeli
2008-04-22 14:56 ` Eric Dumazet
2008-04-22 20:19 ` Christoph Lameter
@ 2008-04-23 17:09 ` Jack Steiner
2008-04-23 17:45 ` Andrea Arcangeli
2 siblings, 1 reply; 86+ messages in thread
From: Jack Steiner @ 2008-04-23 17:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
You may have spotted this already. If so, just ignore this.
It looks like there is a bug in copy_page_range() around line 667.
It's possible to do a mmu_notifier_invalidate_range_start(), then
return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().
--- jack
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 16:26 ` Andrea Arcangeli
@ 2008-04-23 17:24 ` Andrea Arcangeli
2008-04-23 18:21 ` Christoph Lameter
2008-04-23 18:15 ` Christoph Lameter
1 sibling, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 17:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 06:26:29PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> > I guess I have to prepare another patchset then?
Apologies for my previous not-too-polite comment in answer to the
above, but I thought this double-patchset situation was over now that
you converged on #v12 and obsoleted EMM, and after the last private
discussions. There's nothing personal here on my side, just a bit of
general frustration on this matter. I appreciate all your great
contributions, most recently your idea to use sort(), but I can't
really see any possible benefit or justification anymore in keeping
two patchsets floating around given we already converged on the
mmu-notifier-core, and given it's almost certain mmu-notifier-core
will go into -mm in time for 2.6.26. Let's put it this way: if I fail to
merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire
patchset and leave maintainership to you, so you can move 1/N to N/N and
remove the mm_lock-sem patch (everything else can remain the same as
it's all orthogonal, so changing the order is a matter of minutes).
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 17:09 ` Jack Steiner
@ 2008-04-23 17:45 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 17:45 UTC (permalink / raw)
To: Jack Steiner
Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 12:09:09PM -0500, Jack Steiner wrote:
>
> You may have spotted this already. If so, just ignore this.
>
> It looks like there is a bug in copy_page_range() around line 667.
> It's possible to do a mmu_notifier_invalidate_range_start(), then
> return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().
No, I hadn't spotted it yet, great catch!! ;) Thanks a lot. I think we
can follow Jack's example and use our energy to spot any bugs in the
mmu-notifier-core, like his auditing effort above (I'm quite certain
you didn't reproduce this with a real oom ;), so we get a rock-solid
mmu-notifier implementation in 2.6.26, so that XPMEM will also benefit
later in 2.6.27, and I hope the last XPMEM internal bugs will also be
fixed by that time.
(For those who aren't going to become mmu-notifier users there is
nothing to worry about: unless you used KVM or GRU actively with mmu
notifiers, this bug is entirely harmless, with both MMU_NOTIFIER=n and
=y, as previously guaranteed.)
Here is the still-untested fix, for review.
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -597,6 +597,7 @@
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;
/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -604,33 +605,39 @@
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
+ ret = 0;
if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
- return 0;
+ goto out;
}
- if (is_vm_hugetlb_page(vma))
- return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ goto out;
+ }
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
- vma->vm_start, end);
-
- return 0;
+ vma->vm_start, end);
+out:
+ return ret;
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 13:44 ` Andrea Arcangeli
2008-04-23 15:45 ` Robin Holt
@ 2008-04-23 18:02 ` Christoph Lameter
2008-04-23 18:16 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> I know you would rather see KVM development stalled for more months
> than get a partial solution now that already covers KVM and GRU
> with the same API that XPMEM will also use later. It's very unfair on
> your side to try to stall other people's development when what you
> need has stronger requirements and can't be merged immediately. This
> is especially true given it was publicly stated that XPMEM never
> passed all regression tests anyway, so you can't possibly be in as much
> of a hurry as we are; we can't progress without this. In fact we could,
> but it would be a huge effort, it would run _slower_, and it would
> all need to be deleted once mmu notifiers are in.
We did this workaround effort years ago and have been suffering the
ill effects of pinning ever since. We have had to deal with it again
and again, so I guess we do not matter? Certainly we have no
interest in stalling KVM development.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 15:59 ` Andrea Arcangeli
@ 2008-04-23 18:09 ` Christoph Lameter
2008-04-23 18:19 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> Implement unregister but it's not reliable, only ->release is reliable.
Why is there still the hlist stuff being used for the mmu notifier list?
And why is this still unsafe?
Are there cases in which you do not take the reverse map locks or mmap_sem
while traversing the notifier list?
This hope for inclusion without proper review (first for .25, now for .26)
seems to interfere with the patch cleanup work and causes delay after delay
in getting the patch ready. On what basis do you think that there is a
chance of any of these patches making it into 2.6.26, given that this
patchset has never been vetted in Andrew's tree?
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 16:26 ` Andrea Arcangeli
2008-04-23 17:24 ` Andrea Arcangeli
@ 2008-04-23 18:15 ` Christoph Lameter
1 sibling, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:15 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> > I guess I have to prepare another patchset then?
>
> If you want to embarrass yourself three times in a row go ahead ;). I
> thought two failed takeovers were enough.
Takeover? I'd be happy if I did not have to deal with this issue.
These patches were necessary because you were not listening to
feedback, plus there is the issue that your patchsets were not easy to
review or diff against. I had to merge several patches to get to a useful
patch. You have always picked up lots of stuff from my patchsets. Lots of
work that could have been avoided by proper patchsets in the first place.
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 18:02 ` Christoph Lameter
@ 2008-04-23 18:16 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 18:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 11:02:18AM -0700, Christoph Lameter wrote:
> We have had this workaround effort done years ago and have been
> suffering the ill effects of pinning for years. Had to deal with
Yes. In addition to the pinning, there's a lot of additional tlb
flushing work to do in kvm without mmu notifiers, as the swapcache
could be freed by the vm the instruction after put_page unpins the
page for whatever reason.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 16:37 ` Andrea Arcangeli
@ 2008-04-23 18:19 ` Christoph Lameter
2008-04-23 18:25 ` Andrea Arcangeli
2008-04-23 22:19 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:19 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Jack Steiner, Robin Holt, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> The only way to avoid failing because of vmalloc space shortage or
> oom would be to provide an O(N*N) fallback. But one that can't be
> interrupted by sigkill! Sigkill interruption was ok in #v12 because we
> didn't rely on mmu_notifier_unregister to succeed, so it avoided any
> DoS, but it still couldn't provide a reliable unregister.
If unregister fails then the driver should not detach from the address
space immediately but wait until ->release is called. That may be
a possible solution. It will be rare that the unregister fails.
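A minimal sketch of what that could look like on the driver side, assuming
the fallible int-returning unregister being discussed here (the names
my_driver, my_driver_free and defer_free_to_release are hypothetical, not
from the posted patches):

struct my_driver {
	struct mmu_notifier mn;
	struct mm_struct *mm;
	int defer_free_to_release;
};

static void my_driver_teardown(struct my_driver *drv)
{
	if (!mmu_notifier_unregister(&drv->mn, drv->mm)) {
		/* Unregistered cleanly: safe to free our notifier state now. */
		my_driver_free(drv);
	} else {
		/*
		 * Could not unregister (e.g. -ENOMEM in mm_lock): keep the
		 * notifier registered and let ->release, which will run at
		 * exit_mmap, perform the deferred my_driver_free().
		 */
		drv->defer_free_to_release = 1;
	}
}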
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:09 ` Christoph Lameter
@ 2008-04-23 18:19 ` Andrea Arcangeli
2008-04-23 18:27 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 18:19 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
> Why is there still the hlist stuff being used for the mmu notifier list?
> And why is this still unsafe?
What's the problem with hlist? It saves 8 bytes in each mm_struct;
you should be using it too instead of list.
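For reference, the 8 bytes are just the size difference between the two
head types on 64-bit; here is a tiny standalone sketch (plain userspace C
with simplified copies of the kernel definitions, purely illustrative):

#include <stdio.h>

/* Simplified copies of the kernel list types, for illustration only. */
struct list_head { struct list_head *next, *prev; };	/* two pointers */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };	/* one pointer */

int main(void)
{
	/* On 64-bit: list_head is 16 bytes, hlist_head is 8 bytes. */
	printf("list_head=%zu hlist_head=%zu\n",
	       sizeof(struct list_head), sizeof(struct hlist_head));
	return 0;
}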
> There are cases in which you do not take the reverse map locks or mmap_sem
> while traversing the notifier list?
There aren't.
> This hope for inclusion without proper review (first for .25 now for .26)
> seems to interfere with the patch cleanup work and cause delay after delay
> for getting the patch ready. On what basis do you think that there is a
> chance of any of these patches making it into 2.6.26 given that this
> patchset has never been vetted in Andrew's tree?
Let's say I try to be optimistic and hope the right thing will happen,
given this is like a new driver that can't hurt anybody but KVM and
GRU if there's any bug. But in my view what interferes with proper
review for .26 are the endless discussions we're doing ;).
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 17:24 ` Andrea Arcangeli
@ 2008-04-23 18:21 ` Christoph Lameter
2008-04-23 18:34 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:21 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> will go in -mm in time for 2.6.26. Let's put it this way, if I fail to
> merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire
> patchset and leave maintainership to you so you move 1/N to N/N and
> remove mm_lock-sem patch (everything else can remain the same as it's
> all orthogonal so changing the order is a matter of minutes).
No, I really want you to do this. I have no interest in a takeover in the
future and have done the EMM stuff only because I saw no other way
forward. I just want this to be done the right way for all parties, with
patches that are nice and mergeable.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:19 ` Christoph Lameter
@ 2008-04-23 18:25 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 18:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jack Steiner, Robin Holt, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 11:19:26AM -0700, Christoph Lameter wrote:
> If unregister fails then the driver should not detach from the address
> space immediately but wait until -->release is called. That may be
> a possible solution. It will be rare that the unregister fails.
This is the current idea, exactly. Unless we find a way to replace
mm_lock with something else, I don't see a way to make
mmu_notifier_unregister reliable without wasting ram.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:19 ` Andrea Arcangeli
@ 2008-04-23 18:27 ` Christoph Lameter
2008-04-23 18:37 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:27 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
> > Why is there still the hlist stuff being used for the mmu notifier list?
> > And why is this still unsafe?
>
> What's the problem with hlist, it saves 8 bytes for each mm_struct,
> you should be using it too instead of list.
list heads in mm_struct and in the mmu_notifier struct seemed to
be more consistent. We have no hash list after all.
>
> > There are cases in which you do not take the reverse map locks or mmap_sem
> > while traversing the notifier list?
>
> There aren't.
There is a potential issue in move_ptes where you call
invalidate_range_end after dropping i_mmap_sem, whereas my patches did the
opposite. Does mmap_sem save you there?
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:21 ` Christoph Lameter
@ 2008-04-23 18:34 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 18:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, Robin Holt, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 11:21:49AM -0700, Christoph Lameter wrote:
> No I really want you to do this. I have no interest in a takeover in the
Ok, if you want me to do this, I definitely prefer the core to go in
now. It's so much easier to concentrate on two problems at different
times than to attack both problems at the same time, given they're
mostly completely orthogonal problems. Since we already solved one
problem, I'd like to close it before concentrating on the second
problem. I already told you it is in my interest to support XPMEM
too. For example, it was me who noticed we couldn't possibly remove the
can_sleep parameter from invalidate_range without altering the locking,
as vmas were unstable outside of one of the three core vm locks. That
finding resulted in much bigger patches than we hoped (as Andrew
previously sort of predicted) and you did all the great work to develop
those. For my part, once the converged part is in, it'll be a lot
easier to fully concentrate on the rest. My main focus right now is to
produce an mmu-notifier-core that is entirely bug-free for .26.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:27 ` Christoph Lameter
@ 2008-04-23 18:37 ` Andrea Arcangeli
2008-04-23 18:46 ` Christoph Lameter
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 18:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 11:27:21AM -0700, Christoph Lameter wrote:
> There is a potential issue in move_ptes where you call
> invalidate_range_end after dropping i_mmap_sem whereas my patches did the
> opposite. Mmap_sem saves you there?
Yes, there's really no risk of races in this area after introducing
mm_lock; any place that mangles ptes and doesn't hold any of the
three locks is buggy anyway. I appreciate the audit work (I also did
it and couldn't find bugs, but the more eyes the better).
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 18:37 ` Andrea Arcangeli
@ 2008-04-23 18:46 ` Christoph Lameter
0 siblings, 0 replies; 86+ messages in thread
From: Christoph Lameter @ 2008-04-23 18:46 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Nick Piggin, Jack Steiner, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> Yes, there's really no risk of races in this area after introducing
> mm_lock, any place that mangles over ptes and doesn't hold any of the
> three locks is buggy anyway. I appreciate the audit work (I also did
> it and couldn't find bugs but the more eyes the better).
I guess I would need to merge some patches together somehow to be able
to review them properly like I did before <sigh>. I have not reviewed the
latest code completely.
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 16:15 ` Andrea Arcangeli
@ 2008-04-23 19:55 ` Robin Holt
0 siblings, 0 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-23 19:55 UTC (permalink / raw)
To: Andrea Arcangeli, Jack Steiner
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 06:15:45PM +0200, Andrea Arcangeli wrote:
> Once I get confirmation that everyone is ok with #v13 I'll push a #v14
> before Saturday with that cosmetical error cleaned up and
> mmu_notifier_unregister moved at the end (XPMEM will have unregister
> don't worry). I expect the 1/13 of #v14 to go in -mm and then 2.6.26.
I think GRU needs _unregister as well.
Thanks,
Robin
* Re: [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
2008-04-23 15:45 ` Robin Holt
2008-04-23 16:15 ` Andrea Arcangeli
@ 2008-04-23 21:05 ` Avi Kivity
1 sibling, 0 replies; 86+ messages in thread
From: Avi Kivity @ 2008-04-23 21:05 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Christoph Lameter, Nick Piggin, Jack Steiner,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, linux-mm, general, Hugh Dickins, akpm,
Rusty Russell
Robin Holt wrote:
>> an hurry like we are, we can't progress without this. Infact we can
>>
>
> SGI is under an equally strict timeline. We really needed the sleeping
> version in 2.6.26. We may still be able to get this accepted by
> vendor distros if we make 2.6.27.
>
The difference is that the non-sleeping variant can be shown not to
affect stability or performance, even if configured in, as long as it's
not used. The sleeping variant will raise performance and stability
concerns. I have zero objections to sleeping mmu notifiers; I only object
to tying the schedules of the two together.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 16:37 ` Andrea Arcangeli
2008-04-23 18:19 ` Christoph Lameter
@ 2008-04-23 22:19 ` Andrea Arcangeli
2008-04-24 6:49 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-23 22:19 UTC (permalink / raw)
To: Jack Steiner
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Wed, Apr 23, 2008 at 06:37:13PM +0200, Andrea Arcangeli wrote:
> I'm afraid if you don't want to worst-case unregister with ->release
> you need to have a better idea than my mm_lock and personally I can't
> see any other way than mm_lock to ensure not to miss range_begin...
But wait, mmu_notifier_register absolutely requires mm_lock to ensure
that when kvm->arch.mmu_notifier_invalidate_range_count is zero
(a long variable name, it'll get shorter, but this is to explain),
really no cpu is in the middle of a range_begin/end critical
section. That's why we have to take all the mm locks.
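To illustrate what that guarantee buys, here is a minimal sketch of the
driver-side counting pattern being referred to (the names my_mmu and
invalidate_count are hypothetical, not the actual kvm code): a zero count
can only be trusted to mean "no invalidation in flight" if registration
cannot race with a start whose matching end has not been observed, which
is exactly what taking all the mm locks provides.

/* Illustrative only: hypothetical driver-side in-flight invalidation count. */
struct my_mmu {
	struct mmu_notifier mn;
	spinlock_t lock;
	long invalidate_count;		/* in-flight range invalidations */
};

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);

	spin_lock(&m->lock);
	m->invalidate_count++;	/* secondary mappings in the range are stale */
	spin_unlock(&m->lock);
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);

	spin_lock(&m->lock);
	m->invalidate_count--;	/* safe to re-establish secondary mappings */
	spin_unlock(&m->lock);
}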
But we couldn't care less if we unregister in the middle; unregister
only needs to be sure that no cpu could possibly still be using the
notifier ram allocated by the driver before returning. So I'll
implement unregister in O(1), without ram allocations, using srcu,
and that'll fix all issues with unregister. It'll return "void" to
make it crystal clear it can't fail. It turns out unregister will make
life easier for kvm as well, mostly by simplifying the teardown of the
/dev/kvm closure. Given this can be considered a bugfix to
mmu_notifier_unregister, I'll apply it to 1/N and I'll release a new
mmu-notifier-core patch for you to review before I resend to Andrew
before Saturday. Thanks!
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-23 22:19 ` Andrea Arcangeli
@ 2008-04-24 6:49 ` Andrea Arcangeli
2008-04-24 9:51 ` Robin Holt
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-24 6:49 UTC (permalink / raw)
To: Jack Steiner
Cc: Robin Holt, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote:
> /dev/kvm closure. Given this can be considered a bugfix to
> mmu_notifier_unregister, I'll apply it to 1/N and I'll release a new
I'm not sure anymore that this can be considered a bugfix, given how
large a change to the locking and register/unregister/release behavior
it turned out to be.
Here is a full draft patch for review and testing. It works great with
KVM so far, at least...
- mmu_notifier_register has to run on current->mm or on
get_task_mm() (in the latter case it can mmput after
mmu_notifier_register returns)
- mmu_notifier_register in turn can't race against
mmu_notifier_release, as that runs in exit_mmap after the last mmput
- mmu_notifier_unregister can run at any time, even after exit_mmap
completed. No mm_count pin is required; it's taken automatically by
register and released by unregister
- mmu_notifier_unregister serializes against all mmu notifiers with
srcu, and it serializes especially against a concurrent
mmu_notifier_release with a mix of a spinlock and SRCU
- the spinlock lets us keep track of who ran first between
mmu_notifier_unregister and mmu_notifier_release; this makes life
much easier for the driver to handle, as the driver is then
guaranteed that ->release will run.
- The one that runs first executes the ->release method as well, after
dropping the spinlock but before releasing the srcu lock
- it was unsafe to unpin the module count from ->release, as release
itself has to run the 'ret' instruction to return back to the mmu
notifier code
- the ->release method is mandatory, as it has to run before the pages
are freed in order to zap all existing sptes
- the one that arrives second between mmu_notifier_unregister and
mmu_notifier_release waits for the first with srcu
As said, this is a much larger change than I hoped, but as usual it can
only affect KVM/GRU/XPMEM if something is wrong with it. I don't
exclude that we'll have to back off to the previous mm_users model. The
main issue with taking an mm_users pin is that filehandles associated
with vmas aren't closed by exit() if mm_users is pinned (that simply
leaks ram with kvm). It looks more correct to rely on mm_users
being > 0 only in mmu_notifier_register. The other big change
is that ->release is mandatory and always called by whichever of
mmu_notifier_unregister and mmu_notifier_release runs first. Both
mmu_notifier_unregister and mmu_notifier_release are slow paths, so
taking a spinlock there is no big deal.
The impact when the mmu notifiers are disarmed is unchanged.
The interesting part of the kvm patch to test this change is
below. After this last bit the KVM patch status is almost final, if this
new mmu notifier update is remotely ok; I have another patch that does
the locking change to remove the page pin.
+static void kvm_free_vcpus(struct kvm *kvm);
+/* This must zap all the sptes because all pages will be freed then */
+static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ BUG_ON(mm != kvm->mm);
+ kvm_free_pit(kvm);
+ kfree(kvm->arch.vpic);
+ kfree(kvm->arch.vioapic);
+ kvm_free_vcpus(kvm);
+ kvm_free_physmem(kvm);
+ if (kvm->arch.apic_access_page)
+ put_page(kvm->arch.apic_access_page);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+ .release = kvm_mmu_notifier_release,
+ .invalidate_page = kvm_mmu_notifier_invalidate_page,
+ .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+ .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+};
+
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
@@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm)
void kvm_arch_destroy_vm(struct kvm *kvm)
{
- kvm_free_pit(kvm);
- kfree(kvm->arch.vpic);
- kfree(kvm->arch.vioapic);
- kvm_free_vcpus(kvm);
- kvm_free_physmem(kvm);
- if (kvm->arch.apic_access_page)
- put_page(kvm->arch.apic_access_page);
+ /*
+ * kvm_mmu_notifier_release() will be called before
+ * mmu_notifier_unregister returns, if it didn't run
+ * already.
+ */
+ mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm);
kfree(kvm);
}
Let's call this mmu notifier #v14-test1.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,27 @@
unsigned long addr, unsigned long len,
unsigned long flags, struct page **pages);
+/*
+ * mm_lock will take mmap_sem writably (to prevent all modifications
+ * and scanning of vmas) and then also take the mapping locks for
+ * each of the vmas to lock out any scans of the pagetables of this
+ * address space. This can be used to effectively hold off reclaim
+ * from the address space.
+ *
+ * mm_lock can fail if there is not enough memory to store a pointer
+ * array to all vmas.
+ *
+ * mm_lock and mm_unlock are expensive operations that may take a long time.
+ */
+struct mm_lock_data {
+ spinlock_t **i_mmap_locks;
+ spinlock_t **anon_vma_locks;
+ size_t nr_i_mmap_locks;
+ size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
struct address_space;
+struct mmu_notifier_mm;
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
typedef atomic_long_t mm_counter_t;
@@ -225,6 +226,9 @@
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif
+#ifdef CONFIG_MMU_NOTIFIER
+ struct mmu_notifier_mm *mmu_notifier_mm;
+#endif
};
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,251 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+#include <linux/srcu.h>
+
+struct mmu_notifier_mm {
+ struct hlist_head list;
+ struct srcu_struct srcu;
+ /* to serialize mmu_notifier_unregister against mmu_notifier_release */
+ spinlock_t unregister_lock;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * Called after all other threads have terminated and the executing
+ * thread is the only remaining execution thread. There are no
+ * users of the mm_struct remaining.
+ *
+ * If the methods are implemented in a module, the module
+ * can't be unloaded until release() is called.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * clear_flush_young is called after the VM is
+ * test-and-clearing the young/accessed bitflag in the
+ * pte. This way the VM will provide proper aging to the
+ * accesses to the page through the secondary MMUs and not
+ * only to the ones through the Linux pte.
+ */
+ int (*clear_flush_young)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * Before this is invoked any secondary MMU is still ok to
+ * read/write to the page previously pointed by the Linux pte
+ * because the old page hasn't been freed yet. If required
+ * set_page_dirty has to be called internally to this method.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_start() and invalidate_range_end() must be
+ * paired and are called only when the mmap_sem is held and/or
+ * the semaphores protecting the reverse maps. Both functions
+ * may sleep. The subsystem must guarantee that no additional
+ * references to the pages in the range established between
+ * the call to invalidate_range_start() and the matching call
+ * to invalidate_range_end().
+ *
+ * Invalidation of multiple concurrent ranges may be permitted
+ * by the driver, or the driver may exclude other invalidations
+ * from proceeding by blocking on new invalidate_range_start()
+ * callbacks that overlap invalidates that are already in
+ * progress. Either way, the establishment of sptes to the
+ * range can only be allowed once all invalidate_range_end()
+ * functions have been called.
+ *
+ * invalidate_range_start() is called when all pages in the
+ * range are still mapped and have at least a refcount of one.
+ *
+ * invalidate_range_end() is called when all pages in the
+ * range have been unmapped and the pages have been freed by
+ * the VM.
+ *
+ * The VM will remove the page table entries and potentially
+ * the page between invalidate_range_start() and
+ * invalidate_range_end(). If the page must not be freed
+ * because of pending I/O or other circumstances then the
+ * invalidate_range_start() callback (or the initial mapping
+ * by the driver) must make sure that the refcount is kept
+ * elevated.
+ *
+ * If the driver increases the refcount when the pages are
+ * initially mapped into an address space then either
+ * invalidate_range_start() or invalidate_range_end() may
+ * decrease the refcount. If the refcount is decreased on
+ * invalidate_range_start() then the VM can free pages as page
+ * table entries are removed. If the refcount is only
+ * dropped on invalidate_range_end() then the driver itself
+ * will drop the last refcount but it must take care to flush
+ * any secondary tlb before doing the final free on the
+ * page. Pages will no longer be referenced by the linux
+ * address space but may still be referenced by sptes until
+ * the last refcount is dropped.
+ */
+ void (*invalidate_range_start)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+/*
+ * The notifier chains are protected by mmap_sem and/or the reverse map
+ * semaphores. Notifier chains are only changed when all reverse maps and
+ * the mmap_sem locks are taken.
+ *
+ * Therefore notifier chains can only be traversed when either
+ *
+ * 1. mmap_sem is held.
+ * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem).
+ * 3. No other concurrent thread can access the list (release)
+ */
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+ return unlikely(mm->mmu_notifier_mm);
+}
+
+extern int mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_flush_young(mm, address);
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_start(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+ mm->mmu_notifier_mm = NULL;
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_mm_destroy(mm);
+}
+
+#define ptep_clear_flush_notify(__vma, __address, __ptep) \
+({ \
+ pte_t __pte; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __pte = ptep_clear_flush(___vma, ___address, __ptep); \
+ mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
+ __pte; \
+})
+
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
+ ___address); \
+ __young; \
+})
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+}
+
+#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -362,6 +363,7 @@
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_mm_init(mm);
return mm;
}
@@ -395,6 +397,7 @@
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
+ mmu_notifier_mm_destroy(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,4 +33,5 @@
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,7 +194,7 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
@@ -214,7 +215,9 @@
spin_unlock(&mapping->i_mmap_lock);
}
+ mmu_notifier_invalidate_range_start(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
+ mmu_notifier_invalidate_range_end(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -799,6 +800,7 @@
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);
+ mmu_notifier_invalidate_range_start(mm, start, end);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -819,6 +821,7 @@
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -596,6 +597,7 @@
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;
/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -603,25 +605,39 @@
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
+ ret = 0;
if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
- return 0;
+ goto out;
}
- if (is_vm_hugetlb_page(vma))
- return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ goto out;
+ }
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
- return 0;
+
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier_invalidate_range_end(src_mm,
+ vma->vm_start, end);
+out:
+ return ret;
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -825,7 +841,9 @@
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
int fullmm = (*tlbp)->fullmm;
+ struct mm_struct *mm = vma->vm_mm;
+ mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -876,6 +894,7 @@
}
}
out:
+ mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
}
@@ -1463,10 +1482,11 @@
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;
BUG_ON(addr >= end);
+ mmu_notifier_invalidate_range_start(mm, start, end);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1474,6 +1494,7 @@
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1675,7 +1696,7 @@
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush(vma, address, page_table);
+ ptep_clear_flush_notify(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,9 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/vmalloc.h>
+#include <linux/sort.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2038,6 +2041,7 @@
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ mmu_notifier_release(mm);
lru_add_drain();
flush_cache_mm(mm);
@@ -2242,3 +2246,144 @@
return 0;
}
+
+static int mm_lock_cmp(const void *a, const void *b)
+{
+ unsigned long _a = (unsigned long)*(spinlock_t **)a;
+ unsigned long _b = (unsigned long)*(spinlock_t **)b;
+
+ cond_resched();
+ if (_a < _b)
+ return -1;
+ if (_a > _b)
+ return 1;
+ return 0;
+}
+
+static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+ int anon)
+{
+ struct vm_area_struct *vma;
+ size_t i = 0;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (anon) {
+ if (vma->anon_vma)
+ locks[i++] = &vma->anon_vma->lock;
+ } else {
+ if (vma->vm_file && vma->vm_file->f_mapping)
+ locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ }
+ }
+
+ if (!i)
+ goto out;
+
+ sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+
+out:
+ return i;
+}
+
+static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 1);
+}
+
+static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
+ spinlock_t **locks)
+{
+ return mm_lock_sort(mm, locks, 0);
+}
+
+static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+{
+ spinlock_t *last = NULL;
+ size_t i;
+
+ for (i = 0; i < nr; i++)
+ /* Multiple vmas may use the same lock. */
+ if (locks[i] != last) {
+ BUG_ON((unsigned long) last > (unsigned long) locks[i]);
+ last = locks[i];
+ if (lock)
+ spin_lock(last);
+ else
+ spin_unlock(last);
+ }
+}
+
+static inline void __mm_lock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 1);
+}
+
+static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+{
+ mm_lock_unlock(locks, nr, 0);
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ spinlock_t **anon_vma_locks, **i_mmap_locks;
+
+ down_write(&mm->mmap_sem);
+ if (mm->map_count) {
+ anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!anon_vma_locks)) {
+ up_write(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+ if (unlikely(!i_mmap_locks)) {
+ up_write(&mm->mmap_sem);
+ vfree(anon_vma_locks);
+ return -ENOMEM;
+ }
+
+ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
+ data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+
+ if (data->nr_anon_vma_locks) {
+ __mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
+ data->anon_vma_locks = anon_vma_locks;
+ } else
+ vfree(anon_vma_locks);
+
+ if (data->nr_i_mmap_locks) {
+ __mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
+ data->i_mmap_locks = i_mmap_locks;
+ } else
+ vfree(i_mmap_locks);
+ }
+ return 0;
+}
+
+static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+{
+ __mm_unlock(locks, nr);
+ vfree(locks);
+}
+
+/* avoid memory allocations for mm_unlock to prevent deadlock */
+void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+ if (mm->map_count) {
+ if (data->nr_anon_vma_locks)
+ mm_unlock_vfree(data->anon_vma_locks,
+ data->nr_anon_vma_locks);
+ if (data->i_mmap_locks)
+ mm_unlock_vfree(data->i_mmap_locks,
+ data->nr_i_mmap_locks);
+ }
+ up_write(&mm->mmap_sem);
+}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,241 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/srcu.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+/*
+ * This function can't run concurrently against mmu_notifier_register
+ * or any other mmu notifier method. mmu_notifier_register can only
+ * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
+ * zero). All other tasks of this mm already quit so they can't invoke
+ * mmu notifiers anymore. This can run concurrently only against
+ * mmu_notifier_unregister and it serializes against it with the
+ * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
+ * away from under us as the exit_mmap holds a mm_count pin itself.
+ *
+ * The ->release method can't allow the module to be unloaded, the
+ * module can only be unloaded after mmu_notifier_unregister run. This
+ * is because the release method has to run the ret instruction to
+ * return back here, and so it can't allow the ret instruction to be
+ * freed.
+ */
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
+ mn = hlist_entry(mm->mmu_notifier_mm->list.first,
+ struct mmu_notifier,
+ hlist);
+ /*
+ * We arrived before mmu_notifier_unregister, so
+ * mmu_notifier_unregister will do nothing other than
+ * wait for ->release to finish and then return.
+ */
+ hlist_del_init(&mn->hlist);
+ /*
+ * if ->release runs before mmu_notifier_unregister it
+ * must be handled as it's the only way for the driver
+ * to flush all existing sptes before the pages in the
+ * mm are freed.
+ */
+ spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
+ /* SRCU will block mmu_notifier_unregister */
+ mn->ops->release(mn, mm);
+ spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ }
+ spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+
+ /*
+ * Wait for ->release if mmu_notifier_unregister ran list_del_rcu.
+ * srcu can't go away from under us because one mm_count is
+ * held by exit_mmap.
+ */
+ synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->clear_flush_young can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0, srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->clear_flush_young)
+ young |= mn->ops->clear_flush_young(mn, mm, address);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+
+ return young;
+}
+
+void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_page)
+ mn->ops->invalidate_page(mn, mm, address);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_range_start)
+ mn->ops->invalidate_range_start(mn, mm, start, end);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int srcu;
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_range_end)
+ mn->ops->invalidate_range_end(mn, mm, start, end);
+ }
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function. Must also ensure mm_users can't go down
+ * to zero while this runs to avoid races with mmu_notifier_release,
+ * so mm has to be current->mm or the mm should be pinned safely like
+ * with get_task_mm(). mmput can be called after mmu_notifier_register
+ * returns. mmu_notifier_unregister must be always called to
+ * unregister the notifier. mm_count is automatically pinned to allow
+ * mmu_notifier_unregister to safely run at any time later, before or
+ * after exit_mmap. ->release will always be called before exit_mmap
+ * frees the pages.
+ */
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct mm_lock_data data;
+ int ret;
+
+ BUG_ON(atomic_read(&mm->mm_users) <= 0);
+
+ ret = mm_lock(mm, &data);
+ if (unlikely(ret))
+ goto out;
+
+ if (!mm_has_notifiers(mm)) {
+ mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm),
+ GFP_KERNEL);
+ ret = -ENOMEM;
+ if (unlikely(!mm_has_notifiers(mm)))
+ goto out_unlock;
+
+ ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu);
+ if (unlikely(ret)) {
+ kfree(mm->mmu_notifier_mm);
+ mmu_notifier_mm_init(mm);
+ goto out_unlock;
+ }
+ INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list);
+ spin_lock_init(&mm->mmu_notifier_mm->unregister_lock);
+ }
+ atomic_inc(&mm->mm_count);
+
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
+out_unlock:
+ mm_unlock(mm, &data);
+out:
+ BUG_ON(atomic_read(&mm->mm_users) <= 0);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/* this is called after the last mmu_notifier_unregister() returned */
+void __mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+ BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
+ cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu);
+ kfree(mm->mmu_notifier_mm);
+ mm->mmu_notifier_mm = LIST_POISON1; /* debug */
+}
+
+/*
+ * This releases the mm_count pin automatically and frees the mm
+ * structure if it was the last user of it. It serializes against
+ * running mmu notifiers with SRCU and against mmu_notifier_unregister
+ * with the unregister lock + SRCU. All sptes must be dropped before
+ * calling mmu_notifier_unregister. ->release or any other notifier
+ * method may be invoked concurrently with mmu_notifier_unregister,
+ * and only after mmu_notifier_unregister returned we're guaranteed
+ * that ->release or any other method can't run anymore.
+ */
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ int before_release = 0, srcu;
+
+ BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ if (!hlist_unhashed(&mn->hlist)) {
+ hlist_del_rcu(&mn->hlist);
+ before_release = 1;
+ }
+ spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
+ if (before_release)
+ /*
+ * exit_mmap will block in mmu_notifier_release to
+ * guarantee ->release is called before freeing the
+ * pages.
+ */
+ mn->ops->release(mn, mm);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+
+ /* wait any running method to finish, including ->release */
+ synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+
+ BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+ mmdrop(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -198,10 +199,12 @@
dirty_accountable = 1;
}
+ mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -74,7 +75,11 @@
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ unsigned long old_start;
+ old_start = old_addr;
+ mmu_notifier_invalidate_range_start(vma->vm_mm,
+ old_start, old_end);
if (vma->vm_file) {
/*
* Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +121,7 @@
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>
@@ -287,7 +288,7 @@
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ } else if (ptep_clear_flush_young_notify(vma, address, pte))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -456,7 +457,7 @@
pte_t entry;
flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush(vma, address, pte);
+ entry = ptep_clear_flush_notify(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -717,14 +718,14 @@
* skipped over this mm) then we should reactivate it.
*/
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
+ (ptep_clear_flush_young_notify(vma, address, pte)))) {
ret = SWAP_FAIL;
goto out_unmap;
}
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -849,12 +850,12 @@
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
- if (ptep_clear_flush_young(vma, address, pte))
+ if (ptep_clear_flush_young_notify(vma, address, pte))
continue;
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
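For reference, the _notify wrappers used in the rmap.c hunks above come from
include/linux/mmu_notifier.h (elsewhere in this patch, not shown here); they
expand to the plain pte helpers plus a notifier call, roughly:

#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
({									\
	pte_t __pte;							\
	struct vm_area_struct *___vma = __vma;				\
	unsigned long ___address = __address;				\
	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
	/* tell secondary MMUs the mapping at ___address is gone */	\
	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
	__pte;								\
})

#define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
({									\
	int __young;							\
	struct vm_area_struct *___vma = __vma;				\
	unsigned long ___address = __address;				\
	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
	/* also ask secondary MMUs whether they referenced the page */	\
	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
						  ___address);		\
	__young;							\
})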
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-24 6:49 ` Andrea Arcangeli
@ 2008-04-24 9:51 ` Robin Holt
2008-04-24 15:39 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Robin Holt @ 2008-04-24 9:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Jack Steiner, Robin Holt, Christoph Lameter, Nick Piggin,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
Hugh Dickins, akpm, Rusty Russell
I am not certain of this, but it seems like this patch leaves things in
a somewhat asymmetric state. At the very least, I think that asymmetry
should be documented in the comments of either mmu_notifier.h or .c.
Before I do the first mmu_notifier_register, all places that test for
mm_has_notifiers(mm) will return false and take the fast path.
After I do some mmu_notifier_register()s and their corresponding
mmu_notifier_unregister()s, the mm_has_notifiers(mm) test will return true
and the slow path will be taken. This, despite all registered notifiers
having been unregistered.
It seems to me the work done by mmu_notifier_mm_destroy should really
be done inside the mm_lock()/mm_unlock area of mmu_unregister and
mmu_notifier_release when we have removed the last entry. That would
give the user's job the same performance after they are done using the
special device that they had prior to its use.
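For context, the fast-path test being discussed is a one-line inline check;
reconstructed (not quoted from the patch), it looks roughly like this:

static inline int mm_has_notifiers(struct mm_struct *mm)
{
	/*
	 * mm->mmu_notifier_mm is allocated at the first register and only
	 * freed at the final mmdrop(), so after any registration this
	 * keeps returning true even once all notifiers are unregistered.
	 */
	return unlikely(mm->mmu_notifier_mm != NULL);
}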
On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
...
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
...
> @@ -603,25 +605,39 @@
> * readonly mappings. The tradeoff is that copy_page_range is more
> * efficient than faulting.
> */
> + ret = 0;
> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> if (!vma->anon_vma)
> - return 0;
> + goto out;
> }
>
> - if (is_vm_hugetlb_page(vma))
> - return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> + if (unlikely(is_vm_hugetlb_page(vma))) {
> + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
> + goto out;
> + }
>
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier_invalidate_range_start(src_mm, addr, end);
> +
> + ret = 0;
I don't think this is needed.
...
> +/* avoid memory allocations for mm_unlock to prevent deadlock */
> +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
> +{
> + if (mm->map_count) {
> + if (data->nr_anon_vma_locks)
> + mm_unlock_vfree(data->anon_vma_locks,
> + data->nr_anon_vma_locks);
> + if (data->i_mmap_locks)
I think you really want data->nr_i_mmap_locks.
...
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/mmu_notifier.c
...
> +/*
> + * This function can't run concurrently against mmu_notifier_register
> + * or any other mmu notifier method. mmu_notifier_register can only
> + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
> + * zero). All other tasks of this mm already quit so they can't invoke
> + * mmu notifiers anymore. This can run concurrently only against
> + * mmu_notifier_unregister and it serializes against it with the
> + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
> + * away from under us as the exit_mmap holds a mm_count pin itself.
> + *
> + * The ->release method can't allow the module to be unloaded, the
> + * module can only be unloaded after mmu_notifier_unregister run. This
> + * is because the release method has to run the ret instruction to
> + * return back here, and so it can't allow the ret instruction to be
> + * freed.
> + */
The second paragraph of this comment seems extraneous.
...
> + /*
> + * Wait ->release if mmu_notifier_unregister run list_del_rcu.
> + * srcu can't go away from under us because one mm_count is
> + * hold by exit_mmap.
> + */
These two sentences don't make any sense to me.
...
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + int before_release = 0, srcu;
> +
> + BUG_ON(atomic_read(&mm->mm_count) <= 0);
> +
> + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
> + spin_lock(&mm->mmu_notifier_mm->unregister_lock);
> + if (!hlist_unhashed(&mn->hlist)) {
> + hlist_del_rcu(&mn->hlist);
> + before_release = 1;
> + }
> + spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
> + if (before_release)
> + /*
> + * exit_mmap will block in mmu_notifier_release to
> + * guarantee ->release is called before freeing the
> + * pages.
> + */
> + mn->ops->release(mn, mm);
I am not certain about the need to do the release callout when the driver
has already told this subsystem it is done. For XPMEM, this callout
would immediately return. I would expect it to be the same for GRU.
Thanks,
Robin
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-24 9:51 ` Robin Holt
@ 2008-04-24 15:39 ` Andrea Arcangeli
2008-04-24 17:41 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-24 15:39 UTC (permalink / raw)
To: Robin Holt
Cc: Jack Steiner, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote:
> It seems to me the work done by mmu_notifier_mm_destroy should really
> be done inside the mm_lock()/mm_unlock area of mmu_unregister and
There's no mm_lock/unlock for mmu_unregister anymore. That's the whole
point of using srcu so it becomes reliable and quick.
> mmu_notifier_release when we have removed the last entry. That would
> give the user's job the same performance after they are done using the
> special device that they had prior to its use.
That's not feasible. Otherwise mmu_notifier_mm could go away at any
time out from under both _release (from exit_mmap) and _unregister.
exit_mmap holds an implicit mm_count pin, so freeing mmu_notifier_mm
after the last mmdrop makes it safe. mmu_notifier_unregister also
holds the mm_count because mm_count was pinned by
mmu_notifier_register. That solves the issue of mmu_notifier_mm
going away from under mmu_notifier_unregister and _release, and that's
why it can only be freed after mm_count == 0.
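A sketch of the teardown ordering being described (reconstructed in the form
this code later took, not quoted from the posted patch): mmu_notifier_mm is
only freed from __mmdrop, i.e. once mm_count has reached zero.

/* mm/mmu_notifier.c */
void __mmu_notifier_mm_destroy(struct mm_struct *mm)
{
	/* no registered notifiers can be left at this point */
	BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
	kfree(mm->mmu_notifier_mm);
	mm->mmu_notifier_mm = LIST_POISON1; /* debug aid */
}

/* kernel/fork.c */
void __mmdrop(struct mm_struct *mm)
{
	BUG_ON(mm == &init_mm);
	mm_free_pgd(mm);
	destroy_context(mm);
	/* runs only after the last mmdrop(), hence after _release and _unregister */
	mmu_notifier_mm_destroy(mm);
	free_mm(mm);
}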
There's at least one small issue I noticed so far: while _release
doesn't need to care about _register, _unregister definitely needs to
care about _register. I have to take the mmap_sem in addition to, or in
replacement of, the unregister_lock. The srcu_read_lock can also likely
be moved to just before releasing the unregister_lock, but that's just a
minor optimization to make the code more strict.
> On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
> ...
> > diff --git a/mm/memory.c b/mm/memory.c
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> ...
> > @@ -603,25 +605,39 @@
> > * readonly mappings. The tradeoff is that copy_page_range is more
> > * efficient than faulting.
> > */
> > + ret = 0;
> > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> > if (!vma->anon_vma)
> > - return 0;
> > + goto out;
> > }
> >
> > - if (is_vm_hugetlb_page(vma))
> > - return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> > + if (unlikely(is_vm_hugetlb_page(vma))) {
> > + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
> > + goto out;
> > + }
> >
> > + if (is_cow_mapping(vma->vm_flags))
> > + mmu_notifier_invalidate_range_start(src_mm, addr, end);
> > +
> > + ret = 0;
>
> I don't think this is needed.
You're right, it's not needed, but I thought it was cleaner if they all used
"ret" after I had to change the code at the end of the
function. Anyway I'll delete this to make the patch shorter and only
change the minimum, agreed.
> ...
> > +/* avoid memory allocations for mm_unlock to prevent deadlock */
> > +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
> > +{
> > + if (mm->map_count) {
> > + if (data->nr_anon_vma_locks)
> > + mm_unlock_vfree(data->anon_vma_locks,
> > + data->nr_anon_vma_locks);
> > + if (data->i_mmap_locks)
>
> I think you really want data->nr_i_mmap_locks.
Indeed. It never happens that there are zero vmas with file-backed
mappings, which is why this couldn't be triggered in practice. Thanks!
> The second paragraph of this comment seems extraneous.
ok removed.
> > + /*
> > + * Wait ->release if mmu_notifier_unregister run list_del_rcu.
> > + * srcu can't go away from under us because one mm_count is
> > + * hold by exit_mmap.
> > + */
>
> These two sentences don't make any sense to me.
Well that was a short explanation of why the mmu_notifier_mm structure
can only be freed after the last mmdrop, which is what you asked at
the top. I'll try to rephrase.
> > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > + int before_release = 0, srcu;
> > +
> > + BUG_ON(atomic_read(&mm->mm_count) <= 0);
> > +
> > + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
> > + spin_lock(&mm->mmu_notifier_mm->unregister_lock);
> > + if (!hlist_unhashed(&mn->hlist)) {
> > + hlist_del_rcu(&mn->hlist);
> > + before_release = 1;
> > + }
> > + spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
> > + if (before_release)
> > + /*
> > + * exit_mmap will block in mmu_notifier_release to
> > + * guarantee ->release is called before freeing the
> > + * pages.
> > + */
> > + mn->ops->release(mn, mm);
>
> I am not certain about the need to do the release callout when the driver
> has already told this subsystem it is done. For XPMEM, this callout
> would immediately return. I would expect it to be the same for GRU.
The point is that you don't want to run it twice. And without this you
would have to serialize against ->release yourself in the driver. It's
much more convenient if you know that ->release will be called just
once, and before mmu_notifier_unregister returns. It could be called
by _release even after you're already inside _unregister, because
_release may reach the spinlock before _unregister, and you won't notice
the difference. Solving this race in the driver looked too complex; I'd
rather solve it once inside the mmu notifier code to be sure. Missing
a release event is fatal because all sptes have to be dropped before
_release returns. The requirement is the same for _unregister: all
sptes have to be dropped before it returns. ->release should be able
to sleep as long as it wants, even with only 1/N applied. exit_mmap can
sleep too, no problem. You can't unregister inside ->release, first of
all because the 'ret' instruction must still be allocated to return to
the mmu notifier code.
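To make the contract concrete, a minimal sketch of a driver's ->release
(my_mmu and my_mmu_zap_all are hypothetical names):

static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);

	/*
	 * Tear down every spte/secondary mapping and stop creating new
	 * ones.  The notifier core guarantees this runs exactly once,
	 * either from exit_mmap or from mmu_notifier_unregister, and
	 * that nothing runs after mmu_notifier_unregister returns, so
	 * the driver doesn't have to serialize against a second call.
	 */
	my_mmu_zap_all(m);
}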
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-24 15:39 ` Andrea Arcangeli
@ 2008-04-24 17:41 ` Andrea Arcangeli
2008-04-26 13:17 ` Robin Holt
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-24 17:41 UTC (permalink / raw)
To: Robin Holt
Cc: Jack Steiner, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Thu, Apr 24, 2008 at 05:39:43PM +0200, Andrea Arcangeli wrote:
> There's at least one small issue I noticed so far: while _release
> doesn't need to care about _register, _unregister definitely needs to
> care about _register. I have to take the mmap_sem in addition to, or in
In the end the best approach is to use the spinlock around those
list_add/list_del calls: they all run in O(1) with the hlist and take only a
few asm insns. This also avoids taking the mmap_sem in exit_mmap; at
exit_mmap time nobody should need to use mmap_sem anymore, so while it
might work, this looks cleaner. The lock is dynamically allocated only
when the notifiers are registered, so the few bytes it takes aren't
relevant.
A full new update will soon become visible here:
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/
Please have a close look again. Your help is extremely appreciated and
very helpful as usual! Thanks a lot.
diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h
--- xxx/include/linux/mmu_notifier.h 2008-04-24 19:41:15.000000000 +0200
+++ xx/include/linux/mmu_notifier.h 2008-04-24 19:38:37.000000000 +0200
@@ -15,7 +15,7 @@
struct hlist_head list;
struct srcu_struct srcu;
/* to serialize mmu_notifier_unregister against mmu_notifier_release */
- spinlock_t unregister_lock;
+ spinlock_t lock;
};
struct mmu_notifier_ops {
diff -urN xxx/mm/memory.c xx/mm/memory.c
--- xxx/mm/memory.c 2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/memory.c 2008-04-24 19:38:37.000000000 +0200
@@ -605,16 +605,13 @@
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
- ret = 0;
if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
- goto out;
+ return 0;
}
- if (unlikely(is_vm_hugetlb_page(vma))) {
- ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
- goto out;
- }
+ if (is_vm_hugetlb_page(vma))
+ return copy_hugetlb_page_range(dst_mm, src_mm, vma);
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
@@ -636,7 +633,6 @@
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
vma->vm_start, end);
-out:
return ret;
}
diff -urN xxx/mm/mmap.c xx/mm/mmap.c
--- xxx/mm/mmap.c 2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/mmap.c 2008-04-24 19:38:37.000000000 +0200
@@ -2381,7 +2381,7 @@
if (data->nr_anon_vma_locks)
mm_unlock_vfree(data->anon_vma_locks,
data->nr_anon_vma_locks);
- if (data->i_mmap_locks)
+ if (data->nr_i_mmap_locks)
mm_unlock_vfree(data->i_mmap_locks,
data->nr_i_mmap_locks);
}
diff -urN xxx/mm/mmu_notifier.c xx/mm/mmu_notifier.c
--- xxx/mm/mmu_notifier.c 2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/mmu_notifier.c 2008-04-24 19:31:23.000000000 +0200
@@ -24,22 +24,16 @@
* zero). All other tasks of this mm already quit so they can't invoke
* mmu notifiers anymore. This can run concurrently only against
* mmu_notifier_unregister and it serializes against it with the
- * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
- * away from under us as the exit_mmap holds a mm_count pin itself.
- *
- * The ->release method can't allow the module to be unloaded, the
- * module can only be unloaded after mmu_notifier_unregister run. This
- * is because the release method has to run the ret instruction to
- * return back here, and so it can't allow the ret instruction to be
- * freed.
+ * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds a mm_count pin
+ * itself.
*/
void __mmu_notifier_release(struct mm_struct *mm)
{
struct mmu_notifier *mn;
int srcu;
- srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
- spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ spin_lock(&mm->mmu_notifier_mm->lock);
while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
mn = hlist_entry(mm->mmu_notifier_mm->list.first,
struct mmu_notifier,
@@ -52,23 +46,28 @@
*/
hlist_del_init(&mn->hlist);
/*
+ * SRCU here will block mmu_notifier_unregister until
+ * ->release returns.
+ */
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ /*
* if ->release runs before mmu_notifier_unregister it
* must be handled as it's the only way for the driver
- * to flush all existing sptes before the pages in the
- * mm are freed.
+ * to flush all existing sptes and stop the driver
+ * from establishing any more sptes before all the
+ * pages in the mm are freed.
*/
- spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
- /* SRCU will block mmu_notifier_unregister */
mn->ops->release(mn, mm);
- spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ spin_lock(&mm->mmu_notifier_mm->lock);
}
- spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
- srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
/*
- * Wait ->release if mmu_notifier_unregister run list_del_rcu.
- * srcu can't go away from under us because one mm_count is
- * hold by exit_mmap.
+ * Wait ->release if mmu_notifier_unregister is running it.
+ * The mmu_notifier_mm can't go away from under us because one
+ * mm_count is hold by exit_mmap.
*/
synchronize_srcu(&mm->mmu_notifier_mm->srcu);
}
@@ -177,11 +176,19 @@
goto out_unlock;
}
INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list);
- spin_lock_init(&mm->mmu_notifier_mm->unregister_lock);
+ spin_lock_init(&mm->mmu_notifier_mm->lock);
}
atomic_inc(&mm->mm_count);
+ /*
+ * Serialize the update against mmu_notifier_unregister. A
+ * side note: mmu_notifier_release can't run concurrently with
+ * us because we hold the mm_users pin (either implicitly as
+ * current->mm or explicitly with get_task_mm() or similar).
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
out_unlock:
mm_unlock(mm, &data);
out:
@@ -215,23 +222,32 @@
BUG_ON(atomic_read(&mm->mm_count) <= 0);
- srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
- spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+ spin_lock(&mm->mmu_notifier_mm->lock);
if (!hlist_unhashed(&mn->hlist)) {
hlist_del_rcu(&mn->hlist);
before_release = 1;
}
- spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
if (before_release)
/*
+ * SRCU here will force exit_mmap to wait ->release to finish
+ * before freeing the pages.
+ */
+ srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ if (before_release) {
+ /*
* exit_mmap will block in mmu_notifier_release to
* guarantee ->release is called before freeing the
* pages.
*/
mn->ops->release(mn, mm);
- srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+ }
- /* wait any running method to finish, including ->release */
+ /*
+ * Wait for any running method to finish, of course including
+ * ->release if it was run by mmu_notifier_release instead of us.
+ */
synchronize_srcu(&mm->mmu_notifier_mm->srcu);
BUG_ON(atomic_read(&mm->mm_count) <= 0);
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-24 17:41 ` Andrea Arcangeli
@ 2008-04-26 13:17 ` Robin Holt
2008-04-26 14:04 ` Andrea Arcangeli
2008-04-27 12:27 ` Andrea Arcangeli
0 siblings, 2 replies; 86+ messages in thread
From: Robin Holt @ 2008-04-26 13:17 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Jack Steiner, Christoph Lameter, Nick Piggin,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
Hugh Dickins, akpm, Rusty Russell
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote:
> A full new update will soon become visible here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/
I grabbed these and built them. The only change needed was another include.
After that, everything built fine and the xpmem regression tests ran through
the first four sets. The fifth is the oversubscription test which trips
my xpmem bug. This is as good as the v12 runs from before.
Since this include and the one for mm_types.h are both build breakages
for ia64, I think you need to apply your ia64_cpumask patch and the following
(possibly as a single patch) first, or fold them into your patch 1. Without that,
an ia64 git-bisect could hit a build failure.
Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===================================================================
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h 2008-04-26 06:41:54.000000000 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h 2008-04-26 07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H
+#include <linux/mutex.h>
+
struct srcu_struct_array {
int c[2];
};
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-26 13:17 ` Robin Holt
@ 2008-04-26 14:04 ` Andrea Arcangeli
2008-04-27 12:27 ` Andrea Arcangeli
1 sibling, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-26 14:04 UTC (permalink / raw)
To: Robin Holt
Cc: Jack Steiner, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> Since this include and the one for mm_types.h are both build breakages
> for ia64, I think you need to apply your ia64_cpumask patch and the following
> (possibly as a single patch) first, or fold them into your patch 1. Without that,
> an ia64 git-bisect could hit a build failure.
Agreed, so it doesn't risk breaking ia64 compilation. Thanks for the
great XPMEM feedback!
Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister;
the only guarantee is that ->release will be called once and only
once, and that no mmu notifier will ever run after _unregister
returns).
In short I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't
really make much difference after all.
I'll release #v14 with all this after a bit of kvm testing with it...
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
}
}
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+ if (!hlist_unhashed(n)) {
+ __hlist_del(n);
+ n->pprev = NULL;
+ }
+}
+
/**
* hlist_replace_rcu - replace old entry by new one
* @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
/*
* Called either by mmu_notifier_unregister or when the mm is
* being destroyed by exit_mmap, always before all pages are
- * freed. It's mandatory to implement this method.
+ * freed. It's mandatory to implement this method. This can
+ * run concurrently to other mmu notifier methods and it
+ * should teardown all secondary mmu mappings and freeze the
+ * secondary mmu.
*/
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
/*
* This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
* can't go away from under us as exit_mmap holds a mm_count pin
* itself.
*/
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
* to wait ->release to finish and
* mmu_notifier_unregister to return.
*/
- hlist_del_init(&mn->hlist);
+ hlist_del_init_rcu(&mn->hlist);
/*
* SRCU here will block mmu_notifier_unregister until
* ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
* side note: mmu_notifier_release can't run concurrently with
* us because we hold the mm_users pin (either implicitly as
* current->mm or explicitly with get_task_mm() or similar).
+ * We can't race against any other mmu notifiers either thanks
+ * to mm_lock().
*/
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-26 13:17 ` Robin Holt
2008-04-26 14:04 ` Andrea Arcangeli
@ 2008-04-27 12:27 ` Andrea Arcangeli
2008-04-28 20:34 ` Christoph Lameter
1 sibling, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-27 12:27 UTC (permalink / raw)
To: Robin Holt
Cc: Jack Steiner, Christoph Lameter, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> the first four sets. The fifth is the oversubscription test which trips
> my xpmem bug. This is as good as the v12 runs from before.
Now that mmu-notifier-core #v14 seems finished and hopefully will
appear in 2.6.26 ;), I started exercising the kvm-mmu-notifier
code more, with the full patchset applied and not only
mmu-notifier-core. I soon found the full patchset has a swap deadlock
bug. Then I tried without using kvm (so with the mmu notifier disarmed)
and I could still reproduce the crashes. After grabbing a few stack
traces I tracked it down to a bug in the i_mmap_lock->i_mmap_sem
conversion. If your oversubscription test means swapping, you should retest
with this applied on top of the #v14 i_mmap_sem patch, as without it the test
would eventually deadlock with all tasks allocating memory in D state. Now
the full patchset is as rock solid as with only mmu-notifier-core
applied. It's been swapping a 2G memhog on top of a 3G VM with 2G of ram for
the last few hours without a problem. Everything is working great with KVM
at least.
Talking about post 2.6.26: the refcount with rcu in the anon-vma
conversion seems unnecessary and may explain part of the AIM slowdown
too. The rest looks ok and probably we should switch the code to a
compile-time decision between rwlock and rwsem (so obsoleting the
current spinlock).
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
out:
- up_write(&mapping->i_mmap_sem);
+ up_read(&mapping->i_mmap_sem);
return ret;
}
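The hunk above fixes what looks like a classic unbalanced rwsem bug introduced
by the lock conversion: a path that acquired i_mmap_sem for read released it
for write, corrupting the semaphore state so that later waiters block forever,
which matches the tasks stuck in D state. In generic form (illustration only,
not the actual rmap code):

#include <linux/rwsem.h>

static void example_broken(struct rw_semaphore *sem)
{
	down_read(sem);
	/* ... work under the read lock ... */
	up_write(sem);	/* BUG: releases a write lock that was never taken */
}

static void example_fixed(struct rw_semaphore *sem)
{
	down_read(sem);
	/* ... work under the read lock ... */
	up_read(sem);	/* matches the down_read above */
}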
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-27 12:27 ` Andrea Arcangeli
@ 2008-04-28 20:34 ` Christoph Lameter
2008-04-29 0:10 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-28 20:34 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Jack Steiner, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
> Talking about post 2.6.26: the refcount with rcu in the anon-vma
> conversion seems unnecessary and may explain part of the AIM slowdown
> too. The rest looks ok and probably we should switch the code to a
> compile-time decision between rwlock and rwsem (so obsoleting the
> current spinlock).
You are going to take a semaphore in an rcu section? Guess you did not
activate all debugging options while testing? I was not aware that you can
take a sleeping lock from a non-preemptible context.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-28 20:34 ` Christoph Lameter
@ 2008-04-29 0:10 ` Andrea Arcangeli
2008-04-29 1:28 ` Christoph Lameter
2008-04-29 10:49 ` Hugh Dickins
0 siblings, 2 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-29 0:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Jack Steiner, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote:
> On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
>
> > Talking about post 2.6.26: the refcount with rcu in the anon-vma
> > conversion seems unnecessary and may explain part of the AIM slowdown
> > too. The rest looks ok and probably we should switch the code to a
> > compile-time decision between rwlock and rwsem (so obsoleting the
> > current spinlock).
>
> You are going to take a semaphore in an rcu section? Guess you did not
> activate all debugging options while testing? I was not aware that you can
> take a sleeping lock from a non-preemptible context.
I'd hoped to discuss this topic after mmu-notifier-core was already
merged, but let's do it anyway.
My point of view is that there was no rcu when I wrote that code, yet
there was no reference count and yet all locking looks still exactly
the same as I wrote it. There's even still the page_table_lock to
serialize threads taking the mmap_sem in read mode against the first
vma->anon_vma = anon_vma during the page fault.
Frankly I've absolutely no idea why rcu is needed in all rmap code
when walking the page->mapping. Definitely the PG_locked is taken so
there's no way page->mapping could possibly go away under the rmap
code, hence the anon_vma can't go away as it's queued in the vma, and
the vma has to go away before the page is zapped out of the pte.
So there are some possible scenarios:
1) my original anon_vma code was buggy not taking the rcu_read_lock()
and somebody fixed it (I tend to exclude it)
2) somebody has seen a race that doesn't exist and didn't bother to
document it other than with this obscure comment
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
I tend to exclude it too as VM folks are too smart for this to be the case.
3) somebody did some microoptimization using rcu, and we can surely
undo that microoptimization to get the code back to my original
version, which didn't need rcu even though it worked exactly the same,
and that is going to be cheaper to use with semaphores than doubling
the number of locked ops for every lock instruction.
Now the double atomic op may not be horrible when not contended, as it
works on the same cacheline, but with cacheline bouncing under
contention it sounds twice as bad as a single cacheline bounce,
and I don't see the point of it as you can't use rcu anyway, so you
can't possibly take advantage of whatever microoptimization was done over
the original locking.
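For reference, the RCU-guarded lookup the quoted comment refers to is
page_lock_anon_vma() in mm/rmap.c, which in this era reads roughly as follows
(reconstructed from the 2.6.25 sources, not from this patchset):

static struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	/* rcu keeps the anon_vma slab memory stable (SLAB_DESTROY_BY_RCU) */
	rcu_read_lock();
	anon_mapping = (unsigned long) page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))
		goto out;

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	spin_lock(&anon_vma->lock);
	/* returns with rcu_read_lock and anon_vma->lock held */
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}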
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 0:10 ` Andrea Arcangeli
@ 2008-04-29 1:28 ` Christoph Lameter
2008-04-29 15:30 ` Andrea Arcangeli
2008-04-29 10:49 ` Hugh Dickins
1 sibling, 1 reply; 86+ messages in thread
From: Christoph Lameter @ 2008-04-29 1:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Jack Steiner, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
> Frankly I've absolutely no idea why rcu is needed in all rmap code
> when walking the page->mapping. Definitely the PG_locked is taken so
> there's no way page->mapping could possibly go away under the rmap
> code, hence the anon_vma can't go away as it's queued in the vma, and
> the vma has to go away before the page is zapped out of the pte.
zap_pte_range can race with the rmap code and it does not take the page
lock. The page may not go away since a refcount was taken, but the mapping
can go away. Without RCU you have no guarantee that the anon_vma still
exists when you take the lock.
How long were you away from VM development?
> Now the double atomic op may not be horrible when not contended, as it
> works on the same cacheline, but with cacheline bouncing under
> contention it sounds twice as bad as a single cacheline bounce,
> and I don't see the point of it as you can't use rcu anyway, so you
> can't possibly take advantage of whatever microoptimization was done over
> the original locking.
Cachelines are acquired for exclusive use for a minimum duration.
Multiple atomic operations can be performed after a cacheline becomes
exclusive without danger of bouncing.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 0:10 ` Andrea Arcangeli
2008-04-29 1:28 ` Christoph Lameter
@ 2008-04-29 10:49 ` Hugh Dickins
2008-04-29 13:32 ` Andrea Arcangeli
1 sibling, 1 reply; 86+ messages in thread
From: Hugh Dickins @ 2008-04-29 10:49 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Robin Holt, Jack Steiner, Nick Piggin,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general, akpm,
Rusty Russell
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
>
> My point of view is that there was no rcu when I wrote that code, yet
> there was no reference count and yet all locking looks still exactly
> the same as I wrote it. There's even still the page_table_lock to
> serialize threads taking the mmap_sem in read mode against the first
> vma->anon_vma = anon_vma during the page fault.
>
> Frankly I've absolutely no idea why rcu is needed in all rmap code
> when walking the page->mapping. Definitely the PG_locked is taken so
> there's no way page->mapping could possibly go away under the rmap
> code, hence the anon_vma can't go away as it's queued in the vma, and
> the vma has to go away before the page is zapped out of the pte.
[I'm scarcely following the mmu notifiers to-and-fro, which seems
to be in good hands, amongst faster thinkers than me: who actually
need and can test this stuff. Don't let me slow you down; but I
can quickly clarify on this history.]
No, the locking was different as you had it, Andrea: there was an extra
bitspin lock, carried over from the pte_chains days (maybe we changed
the name, maybe we disagreed over the name, I forget), which mainly
guarded the page->mapcount. I thought that was one lock more than we
needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.
Here are the relevant extracts from ChangeLog-2.6.9:
[PATCH] rmaplock: PageAnon in mapping
First of a batch of five patches to eliminate rmap's page_map_lock, replace
its trylocking by spinlocking, and use anon_vma to speed up swapoff.
Patches updated from the originals against 2.6.7-mm7: nothing new so I won't
spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting
the unuse_process mmap_sem fix already in 2.6.8-rc3.
This patch:
Replace the PG_anon page->flags bit by setting the lower bit of the pointer in
page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit.
We're about to eliminate the locking which kept the flags and mapping in
synch: it's much easier to work on a local copy of page->mapping, than worry
about whether flags and mapping are in synch (though I imagine it could be
done, at greater cost, with some barriers).
[PATCH] rmaplock: kill page_map_lock
The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to
lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on
PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree
with mapping->i_mmap_lock, and the anon objrmap locks its vma list with
anon_vma->lock: so isn't the page_map_lock superfluous?
Pretty much, yes. The mapcount was protected by it, and needs to become an
atomic: starting at -1 like page _count, so nr_mapped can be tracked precisely
up and down. The last page_remove_rmap can't clear anon page mapping any
more, because of races with page_add_rmap; from which some BUG_ONs must go for
the same reason, but they've served their purpose.
vmscan decisions are naturally racy, little change there beyond removing
page_map_lock/unlock. But to stabilize the file-backed page->mapping against
truncation while acquiring i_mmap_lock, page_referenced_file now needs page
lock to be held even for refill_inactive_zone. There's a similar issue in
acquiring anon_vma->lock, where page lock doesn't help: which this patch
pretends to handle, but actually it needs the next.
Roughly 10% cut off lmbench fork numbers on my 2*HT*P4. Must confess my
testing failed to show the races even while they were knowingly exposed: would
benefit from testing on racier equipment.
[PATCH] rmaplock: SLAB_DESTROY_BY_RCU
With page_map_lock gone, how to stabilize page->mapping's anon_vma while
acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon?
The page cannot actually be freed (vmscan holds reference), but however much
we check page_mapped (which guarantees that anon_vma is in use - or would
guarantee that if we added suitable barriers), there's no locking against page
becoming unmapped the instant after, then anon_vma freed.
It's okay to take anon_vma->lock after it's freed, so long as it remains a
struct anon_vma (its list would become empty, or perhaps reused for an
unrelated anon_vma: but no problem since we always check that the page located
is the right one); but corruption if that memory gets reused for some other
purpose.
This is not unique: it's liable to be problem whenever the kernel tries to
approach a structure obliquely. It's generally solved with an atomic
reference count; but one advantage of anon_vma over anonmm is that it does not
have such a count, and it would be a backward step to add one.
Therefore... implement SLAB_DESTROY_BY_RCU flag, to guarantee that such a
kmem_cache_alloc'ed structure cannot get freed to other use while the
rcu_read_lock is held i.e. preempt disabled; and use that for anon_vma.
Fix concerns raised by Manfred: this flag is incompatible with poisoning and
destructor, and kmem_cache_destroy needs to synchronize_kernel.
I hope SLAB_DESTROY_BY_RCU may be useful elsewhere; but though it's safe for
little anon_vma, I'd be reluctant to use it on any caches whose immediate
shrinkage under pressure is important to the system.
[PATCH] rmaplock: mm lock ordering
With page_map_lock out of the way, there's no need for page_referenced and
try_to_unmap to use trylocks - provided we switch anon_vma->lock and
mm->page_table_lock around in anon_vma_prepare. Though I suppose it's
possible that we'll find that vmscan makes better progress with trylocks than
spinning - we're free to choose trylocks again if so.
Try to update the mm lock ordering documentation in filemap.c. But I still
find it confusing, and I've no idea of where to stop. So add an mm lock
ordering list I can understand to rmap.c.
[The fifth patch was about using anon_vma in swapoff, not relevant here.]
So, going back to what you wrote: holding the page lock there is
not enough to prevent the struct anon_vma going away beneath us.
Hugh
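For reference, the SLAB_DESTROY_BY_RCU arrangement the ChangeLog describes
ends up in mm/rmap.c roughly as below (reconstructed; the ctor signature
shown is the ~2.6.25 one and changed across kernel versions). The flag
guarantees the memory can only be reused for another struct anon_vma while
rcu_read_lock() is held.

static struct kmem_cache *anon_vma_cachep;

static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
{
	struct anon_vma *anon_vma = data;

	spin_lock_init(&anon_vma->lock);
	INIT_LIST_HEAD(&anon_vma->head);
}

void __init anon_vma_init(void)
{
	anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
			0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
}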
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 10:49 ` Hugh Dickins
@ 2008-04-29 13:32 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-29 13:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Christoph Lameter, Robin Holt, Jack Steiner, Nick Piggin,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general, akpm,
Rusty Russell
Hi Hugh!!
On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
> [I'm scarcely following the mmu notifiers to-and-fro, which seems
> to be in good hands, amongst faster thinkers than me: who actually
> need and can test this stuff. Don't let me slow you down; but I
> can quickly clarify on this history.]
Still, I think it'd be great if you could review mmu-notifier-core v14.
You and Nick are the core VM maintainers, so any feedback about it
would be very welcome. I think it's fairly easy to classify the patch as
obviously safe as long as mmu notifiers are disarmed. Here is a link for
your convenience.
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core
> No, the locking was different as you had it, Andrea: there was an extra
> bitspin lock, carried over from the pte_chains days (maybe we changed
> the name, maybe we disagreed over the name, I forget), which mainly
> guarded the page->mapcount. I thought that was one lock more than we
> needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.
Thanks a lot for the explanation!
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 1:28 ` Christoph Lameter
@ 2008-04-29 15:30 ` Andrea Arcangeli
2008-04-29 15:50 ` Robin Holt
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-29 15:30 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Jack Steiner, Nick Piggin, Peter Zijlstra, kvm-devel,
Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity,
linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Mon, Apr 28, 2008 at 06:28:06PM -0700, Christoph Lameter wrote:
> On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
>
> > Frankly I've absolutely no idea why rcu is needed in all rmap code
> > when walking the page->mapping. Definitely the PG_locked is taken so
> > there's no way page->mapping could possibly go away under the rmap
> > code, hence the anon_vma can't go away as it's queued in the vma, and
> > the vma has to go away before the page is zapped out of the pte.
>
> zap_pte_range can race with the rmap code and it does not take the page
> lock. The page may not go away since a refcount was taken, but the mapping
> can go away. Without RCU you have no guarantee that the anon_vma still
> exists when you take the lock.
There's some room for improvement, like using down_read_trylock: if
that succeeds we don't need to increase the refcount and we can keep
the rcu_read_lock held instead.
Secondly we don't need to increase the refcount in fork() when we
queue the vma-copy in the anon_vma. You should init the refcount to 1
when the anon_vma is allocated, remove the atomic_inc from all code
(except when down_read_trylock fails) and then change anon_vma_unlink
to:
up_write(&anon_vma->sem);
if (empty)
put_anon_vma(anon_vma);
While the down_read_trylock surely won't help in AIM, the second
change will reduce the overhead in the VM core fast paths a bit by
avoiding all refcounting changes, checking list_empty the same
way the current code does. I really like how I designed the garbage
collection through list_empty; it's efficient and I'd like to
keep it.
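A minimal sketch of the refcount-to-1 scheme described above
(get_anon_vma/put_anon_vma are hypothetical helpers; nothing like this was
posted in the thread):

struct anon_vma {
	struct rw_semaphore sem;	/* was spinlock_t lock */
	atomic_t refcount;		/* initialized to 1 at allocation */
	struct list_head head;
};

static inline void get_anon_vma(struct anon_vma *anon_vma)
{
	/* only needed on slow paths, e.g. when a trylock fails */
	atomic_inc(&anon_vma->refcount);
}

static inline void put_anon_vma(struct anon_vma *anon_vma)
{
	/* the put pairing the initial 1 frees the structure */
	if (atomic_dec_and_test(&anon_vma->refcount))
		kmem_cache_free(anon_vma_cachep, anon_vma);
}

/* anon_vma_unlink() would then end with:
 *	up_write(&anon_vma->sem);
 *	if (empty)
 *		put_anon_vma(anon_vma);
 */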
However, I doubt this will bring us back to the same performance as the
current spinlock version, as the real overhead should come from
overscheduling in down_write at anon_vma_link. Here an initially
spinning lock would help, but that's a gray area: it greatly depends on
timings, and on very large systems, where a cacheline wait with many
cpus forking at the same time takes longer than scheduling a semaphore,
it may not slow down performance that much. So I think the only way is a
configuration option to switch the locking at compile time; XPMEM will
then depend on that option being on. I don't see a big deal, and this
guarantees embedded isn't screwed up by totally unnecessary locks on UP.
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 15:30 ` Andrea Arcangeli
@ 2008-04-29 15:50 ` Robin Holt
2008-04-29 16:03 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Robin Holt @ 2008-04-29 15:50 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Robin Holt, Jack Steiner, Nick Piggin,
Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
Hugh Dickins, akpm, Rusty Russell
> However, I doubt this will bring us back to the same performance as the
> current spinlock version, as the real overhead should come from
> overscheduling in down_write at anon_vma_link. Here an initially
> spinning lock would help, but that's a gray area: it greatly depends on
> timings, and on very large systems, where a cacheline wait with many
> cpus forking at the same time takes longer than scheduling a semaphore,
> it may not slow down performance that much. So I think the only way is a
> configuration option to switch the locking at compile time; XPMEM will
> then depend on that option being on. I don't see a big deal, and this
> guarantees embedded isn't screwed up by totally unnecessary locks on UP.
You have said this continually about a CONFIG option. I am unsure how
that could be achieved. Could you provide a patch?
Thanks,
Robin
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 15:50 ` Robin Holt
@ 2008-04-29 16:03 ` Andrea Arcangeli
2008-05-07 15:00 ` Andrea Arcangeli
0 siblings, 1 reply; 86+ messages in thread
From: Andrea Arcangeli @ 2008-04-29 16:03 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Jack Steiner, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 29, 2008 at 10:50:30AM -0500, Robin Holt wrote:
> You have said this continually about a CONFIG option. I am unsure how
> that could be achieved. Could you provide a patch?
I'm busy with the reserved ram patch against 2.6.25 and latest kvm.git
that is moving from pages to pfn for pci passthrough (that change will
also remove the page pin with mmu notifiers).
Unfortunately reserved-ram bugs out again in blk-settings.c on
real hardware. The fix I pushed into .25 for it works when booting kvm
(that's how I tested it), but on real hardware the sata b_pfn happens to be
1 page less than the result of the min comparison, and I'll have to
figure out what happens (only the .24 code works on real hardware..., at
least my fix is surely better than the previous .25-pre code).
I have other people waiting on reserved-ram to work, so once
I've finished, I'll do the optimization to anon-vma (at least the
removal of the unnecessary atomic_inc from fork) and add the config
option.
Christoph, if you're interested in evolving anon-vma-sem and i_mmap_sem
yourself in this direction, you're very welcome to go ahead while I
finish sorting out reserved-ram. If you do, please let me know so we
don't duplicate effort, and it'd be absolutely great if the patches
could be incremental with #v14 so I can merge them trivially later and
upload a new patchset once you're finished (the only outstanding fix
you have to apply on top of #v14 that is already integrated in my
patchset is the i_mmap_sem deadlock fix I posted, which I'm sure
you've already applied on top of #v14 before doing any more
development on it).
Thanks!
* Re: [PATCH 01 of 12] Core of mmu notifiers
2008-04-29 16:03 ` Andrea Arcangeli
@ 2008-05-07 15:00 ` Andrea Arcangeli
0 siblings, 0 replies; 86+ messages in thread
From: Andrea Arcangeli @ 2008-05-07 15:00 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Jack Steiner, Nick Piggin, Peter Zijlstra,
kvm-devel, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel,
Avi Kivity, linux-mm, general, Hugh Dickins, akpm, Rusty Russell
On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote:
> Christoph, if you're interested in evolving anon-vma-sem and i_mmap_sem
> yourself in this direction, you're very welcome to go ahead while I
In case you didn't notice this already, for a further explanation of
why semaphores run slower for small critical sections and why the
conversion from spinlock to rwsem should happen under a config option,
see the "AIM7 40% regression with 2.6.26-rc1" thread.
Thread overview: 86+ messages
2008-04-22 13:51 [PATCH 00 of 12] mmu notifier #v13 Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 01 of 12] Core of mmu notifiers Andrea Arcangeli
2008-04-22 14:56 ` Eric Dumazet
2008-04-22 15:15 ` Andrea Arcangeli
2008-04-22 15:24 ` Avi Kivity
2008-04-22 15:37 ` Eric Dumazet
2008-04-22 16:46 ` Andrea Arcangeli
2008-04-22 20:19 ` Christoph Lameter
2008-04-22 20:31 ` Robin Holt
2008-04-22 22:35 ` Andrea Arcangeli
2008-04-22 23:07 ` Robin Holt
2008-04-23 0:28 ` Jack Steiner
2008-04-23 16:37 ` Andrea Arcangeli
2008-04-23 18:19 ` Christoph Lameter
2008-04-23 18:25 ` Andrea Arcangeli
2008-04-23 22:19 ` Andrea Arcangeli
2008-04-24 6:49 ` Andrea Arcangeli
2008-04-24 9:51 ` Robin Holt
2008-04-24 15:39 ` Andrea Arcangeli
2008-04-24 17:41 ` Andrea Arcangeli
2008-04-26 13:17 ` Robin Holt
2008-04-26 14:04 ` Andrea Arcangeli
2008-04-27 12:27 ` Andrea Arcangeli
2008-04-28 20:34 ` Christoph Lameter
2008-04-29 0:10 ` Andrea Arcangeli
2008-04-29 1:28 ` Christoph Lameter
2008-04-29 15:30 ` Andrea Arcangeli
2008-04-29 15:50 ` Robin Holt
2008-04-29 16:03 ` Andrea Arcangeli
2008-05-07 15:00 ` Andrea Arcangeli
2008-04-29 10:49 ` Hugh Dickins
2008-04-29 13:32 ` Andrea Arcangeli
2008-04-23 13:36 ` Andrea Arcangeli
2008-04-23 14:47 ` Robin Holt
2008-04-23 15:59 ` Andrea Arcangeli
2008-04-23 18:09 ` Christoph Lameter
2008-04-23 18:19 ` Andrea Arcangeli
2008-04-23 18:27 ` Christoph Lameter
2008-04-23 18:37 ` Andrea Arcangeli
2008-04-23 18:46 ` Christoph Lameter
2008-04-22 23:20 ` Christoph Lameter
2008-04-23 16:26 ` Andrea Arcangeli
2008-04-23 17:24 ` Andrea Arcangeli
2008-04-23 18:21 ` Christoph Lameter
2008-04-23 18:34 ` Andrea Arcangeli
2008-04-23 18:15 ` Christoph Lameter
2008-04-23 17:09 ` Jack Steiner
2008-04-23 17:45 ` Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug Andrea Arcangeli
2008-04-22 20:22 ` Christoph Lameter
2008-04-22 22:43 ` Andrea Arcangeli
2008-04-22 23:07 ` Robin Holt
2008-04-22 13:51 ` [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced Andrea Arcangeli
2008-04-22 20:23 ` Christoph Lameter
2008-04-22 22:37 ` Andrea Arcangeli
2008-04-22 23:13 ` Christoph Lameter
2008-04-22 13:51 ` [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last Andrea Arcangeli
2008-04-22 20:24 ` Christoph Lameter
2008-04-22 22:40 ` Andrea Arcangeli
2008-04-22 23:14 ` Christoph Lameter
2008-04-23 13:44 ` Andrea Arcangeli
2008-04-23 15:45 ` Robin Holt
2008-04-23 16:15 ` Andrea Arcangeli
2008-04-23 19:55 ` Robin Holt
2008-04-23 21:05 ` Avi Kivity
2008-04-23 18:02 ` Christoph Lameter
2008-04-23 18:16 ` Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks Andrea Arcangeli
2008-04-22 20:25 ` Christoph Lameter
2008-04-22 13:51 ` [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 07 of 12] Add a function to rw_semaphores to check if there are any processes Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock Andrea Arcangeli
2008-04-22 20:26 ` Christoph Lameter
2008-04-22 22:54 ` Andrea Arcangeli
2008-04-22 23:19 ` Christoph Lameter
2008-04-22 13:51 ` [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed() Andrea Arcangeli
2008-04-22 13:51 ` [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when Andrea Arcangeli
2008-04-22 18:22 ` [PATCH 00 of 12] mmu notifier #v13 Robin Holt
2008-04-22 18:43 ` Andrea Arcangeli
2008-04-22 19:42 ` Robin Holt
2008-04-22 20:30 ` Christoph Lameter
2008-04-23 13:33 ` Andrea Arcangeli
2008-04-22 20:28 ` Christoph Lameter
2008-04-23 0:31 ` Jack Steiner