* MMU notifiers review and some proposals
@ 2008-07-24 14:39 Nick Piggin
From: Nick Piggin @ 2008-07-24 14:39 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, andrea, steiner, cl
Hi,
I think everybody is hoping to have a workable mmu notifier scheme
merged in 2.6.27 (myself included). However I do have some concerns
about the implementation proposed (in -mm).
I apologise for this late review; before anybody gets too upset,
most of my concerns have been raised before, but I'd like to state
my case again and involve everyone.
So my concerns.
Firstly, I dislike mm_take_all_locks. Although it isn't too complex
and is quite well contained, it unfortunately somewhat ties our hands
when it comes to changing the fastpath locking. For example, it
might make it difficult to remove the locks from rmap walks now (not
that there aren't plenty of other difficulties, but as an example).
I also think we should be cautious about making this slowpath so
horribly slow. For KVM it's fine, but for GRU perhaps we'd actually
want to register quite a lot of notifiers, unregister them, and
perhaps even do per-vma registrations if we don't want to slow down
other memory management operations too much (eg. if it is otherwise
just some regular processes doing accelerated things with GRU).
I think it is possible to come up with a pretty good implementation
that replaces mm_take_all_locks with something faster that uses fewer
locks. I haven't quite got a detailed sketch yet, but if anyone is
interested, let me know.
However, I think the whole reason for take_all_locks is to support
the invalidate_range_start/invalidate_range_end callbacks. I think
these are slightly odd because they introduce a second TLB shootdown
scheme to the kernel: presently we downgrade pte permissions, then
shoot down TLBs, and then free any pages; the range callbacks first
prevent new TLBs being set up, then shoot down existing TLBs, then
downgrade ptes, then free pages.
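To make the ordering difference concrete, here is a rough sketch of
the two sequences (purely illustrative, not lifted from any one call
site; the function names are the ones used by this patch and by the
-mm code respectively):

	/* Existing scheme (kept by the patch below) */
	/* 1. downgrade/clear the ptes and notify the secondary MMU */
	pte = ptep_clear_flush_notify(vma, addr, ptep);
	/* 2. CPU TLBs are flushed (here, or batched via the tlb gather) */
	/* 3. only now may the pages be freed */

	/* Range-callback scheme (current -mm) */
	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... no new sptes allowed: clear ptes, flush CPU TLBs, free pages ... */
	mmu_notifier_invalidate_range_end(mm, start, end);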
That's not to say this style will not work, but I prefer to keep
to our existing scheme. Obviously it is better proven, and it is
also good to have fewer schemes if possible.
What are the pros of the new scheme? Well, performance, AFAIKS.
Large unmappings may not fit in the TLB gather API (and some sites
do not use TLB gather at all), but with
invalidate_range_start/invalidate_range_end we can still flush the
entire range in one call.
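For reference, the conversion below gets similar batching by
notifying at the points where the CPU flush already happens, so a
large unmap costs one notifier call per TLB-gather batch rather than
one per page; in unmap_vmas() that boils down to roughly:

	mmu_notifier_invalidate_range(mm, tlb_start, start);	/* secondary MMUs */
	tlb_finish_mmu(*tlbp, tlb_start, start);		/* CPU TLBs */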
I would counter that our existing CPU flushing has held up fairly
well, and it can also be pretty expensive if it has to do lots of
IPIs. I don't have a problem with trying to maximise performance,
but I really feel this is one place where I'd prefer to see
numbers first.
One functional con I see of the
invalidate_range_start/invalidate_range_end style is that the
implementations hold off new TLB insertions by incrementing a
counter, so poor parallelism or starvation could actually become an
issue with multiple threads.
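To illustrate what I mean, this is roughly what the -mm KVM
implementation does today (abridged from the code that the patch
below removes): every invalidation bumps a counter under mmu_lock,
and the KVM page fault refuses to establish any spte, anywhere in
the address space, while that counter is non-zero:

	/* ->invalidate_range_start() */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	/* page faults must retry while != 0 */
	/* ... zap sptes in [start, end) ... */
	spin_unlock(&kvm->mmu_lock);

	/* ->invalidate_range_end() */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);

	/* KVM page fault path (mmu_notifier_retry) */
	if (unlikely(vcpu->kvm->mmu_notifier_count))
		return 1;		/* an invalidate is in flight: retry */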
So anyway, I've attached a patch switching mmu notifiers to our
traditional style of TLB invalidation, and attempted to convert KVM
and GRU to use it. Also keep in mind that after this the
mm_take_all_locks patches should not be needed. Unfortunately I was
not able to actually test this (but testing typically would not find
the really subtle problems easily anyway).
---
drivers/misc/sgi-gru/grufault.c | 145 ++++---------------------------------
drivers/misc/sgi-gru/grutables.h | 1
drivers/misc/sgi-gru/grutlbpurge.c | 37 +--------
include/linux/kvm_host.h | 10 --
include/linux/mmu_notifier.h | 97 ++----------------------
mm/fremap.c | 4 -
mm/hugetlb.c | 3
mm/memory.c | 18 ++--
mm/mmap.c | 2
mm/mmu_notifier.c | 43 +---------
mm/mprotect.c | 3
mm/mremap.c | 4 -
virt/kvm/kvm_main.c | 65 +---------------
13 files changed, 61 insertions(+), 371 deletions(-)
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -65,60 +65,12 @@ struct mmu_notifier_ops {
* Before this is invoked any secondary MMU is still ok to
* read/write to the page previously pointed to by the Linux
* pte because the page hasn't been freed yet and it won't be
- * freed until this returns. If required set_page_dirty has to
- * be called internally to this method.
+ * freed until this returns. Also, importantly the address can't
+ * be repopulated with some other page, so all MMUs retain a
+ * coherent view of memory. If required set_page_dirty has to
+ * be called internally to this method. Not optional.
*/
- void (*invalidate_page)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address);
-
- /*
- * invalidate_range_start() and invalidate_range_end() must be
- * paired and are called only when the mmap_sem and/or the
- * locks protecting the reverse maps are held. The subsystem
- * must guarantee that no additional references are taken to
- * the pages in the range established between the call to
- * invalidate_range_start() and the matching call to
- * invalidate_range_end().
- *
- * Invalidation of multiple concurrent ranges may be
- * optionally permitted by the driver. Either way the
- * establishment of sptes is forbidden in the range passed to
- * invalidate_range_begin/end for the whole duration of the
- * invalidate_range_begin/end critical section.
- *
- * invalidate_range_start() is called when all pages in the
- * range are still mapped and have at least a refcount of one.
- *
- * invalidate_range_end() is called when all pages in the
- * range have been unmapped and the pages have been freed by
- * the VM.
- *
- * The VM will remove the page table entries and potentially
- * the page between invalidate_range_start() and
- * invalidate_range_end(). If the page must not be freed
- * because of pending I/O or other circumstances then the
- * invalidate_range_start() callback (or the initial mapping
- * by the driver) must make sure that the refcount is kept
- * elevated.
- *
- * If the driver increases the refcount when the pages are
- * initially mapped into an address space then either
- * invalidate_range_start() or invalidate_range_end() may
- * decrease the refcount. If the refcount is decreased on
- * invalidate_range_start() then the VM can free pages as page
- * table entries are removed. If the refcount is only
- * droppped on invalidate_range_end() then the driver itself
- * will drop the last refcount but it must take care to flush
- * any secondary tlb before doing the final free on the
- * page. Pages will no longer be referenced by the linux
- * address space but may still be referenced by sptes until
- * the last refcount is dropped.
- */
- void (*invalidate_range_start)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end);
- void (*invalidate_range_end)(struct mmu_notifier *mn,
+ void (*invalidate_range)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end);
};
@@ -154,11 +106,7 @@ extern void __mmu_notifier_mm_destroy(st
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long address);
-extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
-extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);
static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -175,25 +123,11 @@ static inline int mmu_notifier_clear_flu
return 0;
}
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
-}
-
-static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
-}
-
-static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range(mm, start, end);
}
static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -221,7 +155,8 @@ static inline void mmu_notifier_mm_destr
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
__pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
+ mmu_notifier_invalidate_range(___vma->vm_mm, ___address, \
+ ___address + PAGE_SIZE); \
__pte; \
})
@@ -248,17 +183,7 @@ static inline int mmu_notifier_clear_flu
return 0;
}
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
-}
-
-static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
-}
-
-static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -32,7 +32,7 @@ static void zap_pte(struct mm_struct *mm
struct page *page;
flush_cache_page(vma, addr, pte_pfn(pte));
- pte = ptep_clear_flush(vma, addr, ptep);
+ pte = ptep_clear_flush_notify(vma, addr, ptep);
page = vm_normal_page(vma, addr, pte);
if (page) {
if (pte_dirty(pte))
@@ -226,9 +226,7 @@ asmlinkage long sys_remap_file_pages(uns
vma->vm_flags = saved_flags;
}
- mmu_notifier_invalidate_range_start(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
- mmu_notifier_invalidate_range_end(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
/*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -1683,7 +1683,6 @@ void __unmap_hugepage_range(struct vm_ar
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));
- mmu_notifier_invalidate_range_start(mm, start, end);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -1725,7 +1724,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -677,9 +677,6 @@ int copy_page_range(struct mm_struct *ds
* parent mm. And a permission downgrade will only happen if
* is_cow_mapping() returns true.
*/
- if (is_cow_mapping(vma->vm_flags))
- mmu_notifier_invalidate_range_start(src_mm, addr, end);
-
ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
@@ -695,8 +692,8 @@ int copy_page_range(struct mm_struct *ds
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
if (is_cow_mapping(vma->vm_flags))
- mmu_notifier_invalidate_range_end(src_mm,
- vma->vm_start, end);
+ mmu_notifier_invalidate_range(src_mm, vma->vm_start, end);
+
return ret;
}
@@ -903,7 +900,6 @@ unsigned long unmap_vmas(struct mmu_gath
int fullmm = (*tlbp)->fullmm;
struct mm_struct *mm = vma->vm_mm;
- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -951,6 +947,7 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
+ mmu_notifier_invalidate_range(mm, tlb_start, start);
tlb_finish_mmu(*tlbp, tlb_start, start);
if (need_resched() ||
@@ -968,7 +965,6 @@ unsigned long unmap_vmas(struct mmu_gath
}
}
out:
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
}
@@ -991,8 +987,10 @@ unsigned long zap_page_range(struct vm_a
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
+ if (tlb) {
+ mmu_notifier_invalidate_range(mm, address, end);
tlb_finish_mmu(tlb, address, end);
+ }
return end;
}
EXPORT_SYMBOL_GPL(zap_page_range);
@@ -1668,7 +1666,6 @@ int apply_to_page_range(struct mm_struct
int err;
BUG_ON(addr >= end);
- mmu_notifier_invalidate_range_start(mm, start, end);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1676,7 +1673,8 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
+
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1794,6 +1794,7 @@ static void unmap_region(struct mm_struc
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
+ mmu_notifier_invalidate_range(mm, start, end);
tlb_finish_mmu(tlb, start, end);
}
@@ -2123,6 +2124,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
+ mmu_notifier_invalidate_range(mm, 0, end);
tlb_finish_mmu(tlb, 0, end);
/*
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c
+++ linux-2.6/mm/mmu_notifier.c
@@ -99,45 +99,15 @@ int __mmu_notifier_clear_flush_young(str
return young;
}
-void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- struct mmu_notifier *mn;
- struct hlist_node *n;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
- }
- rcu_read_unlock();
-}
-
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- struct mmu_notifier *mn;
- struct hlist_node *n;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
- }
- rcu_read_unlock();
-}
-
-void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
struct mmu_notifier *mn;
struct hlist_node *n;
rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
- }
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist)
+ mn->ops->invalidate_range(mn, mm, start, end);
rcu_read_unlock();
}
@@ -148,6 +118,8 @@ static int do_mmu_notifier_register(stru
struct mmu_notifier_mm *mmu_notifier_mm;
int ret;
+ BUG_ON(!mn->ops);
+ BUG_ON(!mn->ops->invalidate_range);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
ret = -ENOMEM;
@@ -157,9 +129,6 @@ static int do_mmu_notifier_register(stru
if (take_mmap_sem)
down_write(&mm->mmap_sem);
- ret = mm_take_all_locks(mm);
- if (unlikely(ret))
- goto out_cleanup;
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
@@ -181,8 +150,6 @@ static int do_mmu_notifier_register(stru
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
spin_unlock(&mm->mmu_notifier_mm->lock);
- mm_drop_all_locks(mm);
-out_cleanup:
if (take_mmap_sem)
up_write(&mm->mmap_sem);
/* kfree() does nothing if mmu_notifier_mm is NULL */
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -204,12 +204,11 @@ success:
dirty_accountable = 1;
}
- mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -81,8 +81,6 @@ static void move_ptes(struct vm_area_str
unsigned long old_start;
old_start = old_addr;
- mmu_notifier_invalidate_range_start(vma->vm_mm,
- old_start, old_end);
if (vma->vm_file) {
/*
* Subtle point from Rajesh Venkatasubramanian: before
@@ -124,7 +122,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
- mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
+ mmu_notifier_invalidate_range(vma->vm_mm, old_start, old_end);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: linux-2.6/virt/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/virt/kvm/kvm_main.c
+++ linux-2.6/virt/kvm/kvm_main.c
@@ -192,13 +192,15 @@ static inline struct kvm *mmu_notifier_t
return container_of(mn, struct kvm, mmu_notifier);
}
-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int need_tlb_flush;
+ int need_tlb_flush = 0;
+ spin_lock(&kvm->mmu_lock);
/*
* When ->invalidate_page runs, the linux pte has been zapped
* already but the page is still allocated until
@@ -217,32 +219,7 @@ static void kvm_mmu_notifier_invalidate_
* pte after kvm_unmap_hva returned, without noticing the page
* is going to be freed.
*/
- spin_lock(&kvm->mmu_lock);
kvm->mmu_notifier_seq++;
- need_tlb_flush = kvm_unmap_hva(kvm, address);
- spin_unlock(&kvm->mmu_lock);
-
- /* we've to flush the tlb before the pages can be freed */
- if (need_tlb_flush)
- kvm_flush_remote_tlbs(kvm);
-
-}
-
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int need_tlb_flush = 0;
-
- spin_lock(&kvm->mmu_lock);
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_notifier_count++;
for (; start < end; start += PAGE_SIZE)
need_tlb_flush |= kvm_unmap_hva(kvm, start);
spin_unlock(&kvm->mmu_lock);
@@ -252,32 +229,6 @@ static void kvm_mmu_notifier_invalidate_
kvm_flush_remote_tlbs(kvm);
}
-static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
-
- spin_lock(&kvm->mmu_lock);
- /*
- * This sequence increase will notify the kvm page fault that
- * the page that is going to be mapped in the spte could have
- * been freed.
- */
- kvm->mmu_notifier_seq++;
- /*
- * The above sequence increase must be visible before the
- * below count decrease but both values are read by the kvm
- * page fault under mmu_lock spinlock so we don't need to add
- * a smb_wmb() here in between the two.
- */
- kvm->mmu_notifier_count--;
- spin_unlock(&kvm->mmu_lock);
-
- BUG_ON(kvm->mmu_notifier_count < 0);
-}
-
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address)
@@ -296,9 +247,7 @@ static int kvm_mmu_notifier_clear_flush_
}
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
- .invalidate_page = kvm_mmu_notifier_invalidate_page,
- .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
- .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+ .invalidate_range = kvm_mmu_notifier_invalidate_range,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
};
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
Index: linux-2.6/drivers/misc/sgi-gru/grufault.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grufault.c
+++ linux-2.6/drivers/misc/sgi-gru/grufault.c
@@ -195,74 +195,29 @@ static void get_clear_fault_map(struct g
* < 0 - error code
* 1 - (atomic only) try again in non-atomic context
*/
-static int non_atomic_pte_lookup(struct vm_area_struct *vma,
- unsigned long vaddr, int write,
+static int non_atomic_pte_lookup(unsigned long vaddr, int write,
unsigned long *paddr, int *pageshift)
{
struct page *page;
+ int ret;
/* ZZZ Need to handle HUGE pages */
- if (is_vm_hugetlb_page(vma))
- return -EFAULT;
*pageshift = PAGE_SHIFT;
- if (get_user_pages
- (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+ ret = get_user_pages(current, current->mm, vaddr, 1, write, 0,
+ &page, NULL);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
return -EFAULT;
+ if (PageCompound(page)) { /* hugepage */
+ put_page(page);
+ return -EFAULT;
+ }
+
*paddr = page_to_phys(page);
put_page(page);
- return 0;
-}
-
-/*
- *
- * atomic_pte_lookup
- *
- * Convert a user virtual address to a physical address
- * Only supports Intel large pages (2MB only) on x86_64.
- * ZZZ - hugepage support is incomplete
- */
-static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
- int write, unsigned long *paddr, int *pageshift)
-{
- pgd_t *pgdp;
- pmd_t *pmdp;
- pud_t *pudp;
- pte_t pte;
-
- WARN_ON(irqs_disabled()); /* ZZZ debug */
-
- local_irq_disable();
- pgdp = pgd_offset(vma->vm_mm, vaddr);
- if (unlikely(pgd_none(*pgdp)))
- goto err;
-
- pudp = pud_offset(pgdp, vaddr);
- if (unlikely(pud_none(*pudp)))
- goto err;
-
- pmdp = pmd_offset(pudp, vaddr);
- if (unlikely(pmd_none(*pmdp)))
- goto err;
-#ifdef CONFIG_X86_64
- if (unlikely(pmd_large(*pmdp)))
- pte = *(pte_t *) pmdp;
- else
-#endif
- pte = *pte_offset_kernel(pmdp, vaddr);
-
- local_irq_enable();
-
- if (unlikely(!pte_present(pte) ||
- (write && (!pte_write(pte) || !pte_dirty(pte)))))
- return 1;
- *paddr = pte_pfn(pte) << PAGE_SHIFT;
- *pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT;
return 0;
-
-err:
- local_irq_enable();
- return 1;
}
/*
@@ -279,9 +234,7 @@ static int gru_try_dropin(struct gru_thr
struct gru_tlb_fault_handle *tfh,
unsigned long __user *cb)
{
- struct mm_struct *mm = gts->ts_mm;
- struct vm_area_struct *vma;
- int pageshift, asid, write, ret;
+ int pageshift, asid, write;
unsigned long paddr, gpa, vaddr;
/*
@@ -298,7 +251,7 @@ static int gru_try_dropin(struct gru_thr
*/
if (tfh->state == TFHSTATE_IDLE)
goto failidle;
- if (tfh->state == TFHSTATE_MISS_FMM && cb)
+ if (tfh->state == TFHSTATE_MISS_FMM)
goto failfmm;
write = (tfh->cause & TFHCAUSE_TLB_MOD) != 0;
@@ -307,31 +260,9 @@ static int gru_try_dropin(struct gru_thr
if (asid == 0)
goto failnoasid;
- rmb(); /* TFH must be cache resident before reading ms_range_active */
-
- /*
- * TFH is cache resident - at least briefly. Fail the dropin
- * if a range invalidate is active.
- */
- if (atomic_read(&gts->ts_gms->ms_range_active))
- goto failactive;
-
- vma = find_vma(mm, vaddr);
- if (!vma)
+ if (non_atomic_pte_lookup(vaddr, write, &paddr, &pageshift))
goto failinval;
- /*
- * Atomic lookup is faster & usually works even if called in non-atomic
- * context.
- */
- ret = atomic_pte_lookup(vma, vaddr, write, &paddr, &pageshift);
- if (ret) {
- if (!cb)
- goto failupm;
- if (non_atomic_pte_lookup(vma, vaddr, write, &paddr,
- &pageshift))
- goto failinval;
- }
if (is_gru_paddr(paddr))
goto failinval;
@@ -351,19 +282,9 @@ failnoasid:
/* No asid (delayed unload). */
STAT(tlb_dropin_fail_no_asid);
gru_dbg(grudev, "FAILED no_asid tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
- if (!cb)
- tfh_user_polling_mode(tfh);
- else
- gru_flush_cache(tfh);
+ gru_flush_cache(tfh);
return -EAGAIN;
-failupm:
- /* Atomic failure switch CBR to UPM */
- tfh_user_polling_mode(tfh);
- STAT(tlb_dropin_fail_upm);
- gru_dbg(grudev, "FAILED upm tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
- return 1;
-
failfmm:
/* FMM state on UPM call */
STAT(tlb_dropin_fail_fmm);
@@ -373,8 +294,7 @@ failfmm:
failidle:
/* TFH was idle - no miss pending */
gru_flush_cache(tfh);
- if (cb)
- gru_flush_cache(cb);
+ gru_flush_cache(cb);
STAT(tlb_dropin_fail_idle);
gru_dbg(grudev, "FAILED idle tfh: 0x%p, state %d\n", tfh, tfh->state);
return 0;
@@ -385,17 +305,6 @@ failinval:
STAT(tlb_dropin_fail_invalid);
gru_dbg(grudev, "FAILED inval tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
return -EFAULT;
-
-failactive:
- /* Range invalidate active. Switch to UPM iff atomic */
- if (!cb)
- tfh_user_polling_mode(tfh);
- else
- gru_flush_cache(tfh);
- STAT(tlb_dropin_fail_range_active);
- gru_dbg(grudev, "FAILED range active: tfh 0x%p, vaddr 0x%lx\n",
- tfh, vaddr);
- return 1;
}
/*
@@ -408,9 +317,8 @@ irqreturn_t gru_intr(int irq, void *dev_
{
struct gru_state *gru;
struct gru_tlb_fault_map map;
- struct gru_thread_state *gts;
struct gru_tlb_fault_handle *tfh = NULL;
- int cbrnum, ctxnum;
+ int cbrnum;
STAT(intr);
@@ -434,19 +342,7 @@ irqreturn_t gru_intr(int irq, void *dev_
* The gts cannot change until a TFH start/writestart command
* is issued.
*/
- ctxnum = tfh->ctxnum;
- gts = gru->gs_gts[ctxnum];
-
- /*
- * This is running in interrupt context. Trylock the mmap_sem.
- * If it fails, retry the fault in user context.
- */
- if (down_read_trylock(&gts->ts_mm->mmap_sem)) {
- gru_try_dropin(gts, tfh, NULL);
- up_read(&gts->ts_mm->mmap_sem);
- } else {
- tfh_user_polling_mode(tfh);
- }
+ tfh_user_polling_mode(tfh);
}
return IRQ_HANDLED;
}
@@ -456,12 +352,9 @@ static int gru_user_dropin(struct gru_th
struct gru_tlb_fault_handle *tfh,
unsigned long __user *cb)
{
- struct gru_mm_struct *gms = gts->ts_gms;
int ret;
while (1) {
- wait_event(gms->ms_wait_queue,
- atomic_read(&gms->ms_range_active) == 0);
prefetchw(tfh); /* Helps on hdw, required for emulator */
ret = gru_try_dropin(gts, tfh, cb);
if (ret <= 0)
Index: linux-2.6/drivers/misc/sgi-gru/grutables.h
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutables.h
+++ linux-2.6/drivers/misc/sgi-gru/grutables.h
@@ -244,7 +244,6 @@ struct gru_mm_struct {
struct mmu_notifier ms_notifier;
atomic_t ms_refcnt;
spinlock_t ms_asid_lock; /* protects ASID assignment */
- atomic_t ms_range_active;/* num range_invals active */
char ms_released;
wait_queue_head_t ms_wait_queue;
DECLARE_BITMAP(ms_asidmap, GRU_MAX_GRUS);
Index: linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutlbpurge.c
+++ linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,41 +221,16 @@ void gru_flush_all_tlb(struct gru_state
/*
* MMUOPS notifier callout functions
*/
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end)
+static void gru_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
- STAT(mmu_invalidate_range);
- atomic_inc(&gms->ms_range_active);
- gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
- start, end, atomic_read(&gms->ms_range_active));
- gru_flush_tlb_range(gms, start, end - start);
-}
-
-static void gru_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long start,
- unsigned long end)
-{
- struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
- ms_notifier);
- atomic_dec(&gms->ms_range_active);
- wake_up_all(&gms->ms_wait_queue);
+ STAT(mmu_invalidate_range);
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
-}
-
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
-{
- struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
- ms_notifier);
-
- STAT(mmu_invalidate_page);
- gru_flush_tlb_range(gms, address, PAGE_SIZE);
- gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
+ gru_flush_tlb_range(gms, start, end - start);
}
static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -269,9 +244,7 @@ static void gru_release(struct mmu_notif
static const struct mmu_notifier_ops gru_mmuops = {
- .invalidate_page = gru_invalidate_page,
- .invalidate_range_start = gru_invalidate_range_start,
- .invalidate_range_end = gru_invalidate_range_end,
+ .invalidate_range = gru_invalidate_range,
.release = gru_release,
};
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -125,7 +125,6 @@ struct kvm {
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
struct mmu_notifier mmu_notifier;
unsigned long mmu_notifier_seq;
- long mmu_notifier_count;
#endif
};
@@ -360,15 +359,6 @@ int kvm_trace_ioctl(unsigned int ioctl,
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
{
- if (unlikely(vcpu->kvm->mmu_notifier_count))
- return 1;
- /*
- * Both reads happen under the mmu_lock and both values are
- * modified under mmu_lock, so there's no need of smb_rmb()
- * here in between, otherwise mmu_notifier_count should be
- * read before mmu_notifier_seq, see
- * mmu_notifier_invalidate_range_end write side.
- */
if (vcpu->kvm->mmu_notifier_seq != mmu_seq)
return 1;
return 0;
^ permalink raw reply [flat|nested] 26+ messages in thread* MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin @ 2008-07-24 14:39 ` Nick Piggin 2008-07-25 21:45 ` Andrea Arcangeli ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-24 14:39 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, steiner, cl Hi, I think everybody is hoping to have a workable mmu notifier scheme merged in 2.6.27 (myself included). However I do have some concerns about the implementation proposed (in -mm). I apologise for this late review, before anybody gets too upset, most of my concerns have been raised before, but I'd like to state my case again and involving everyone. So my concerns. Firstly, mm_take_all_locks I dislike. Although it isn't too complex and quite well contained, it unfortunately somewhat ties our hands a bit when it comes to changing the fastpath locking. For exmple, it might make it difficult to remove the locks from rmap walks now (not that there aren't plenty of other difficulties, but as an example). I also think we should be cautious to make this slowpath so horribly slow. For KVM it's fine, but perhaps for GRU we'd actually want to register quite a lot of notifiers, unregister them, and perhaps even do per-vma registrations if we don't want to slow down other memory management operations too much (eg. if it is otherwise just some regular processes just doing accelerated things with GRU). I think it is possible to come up with a pretty good implementation that replaces mm_take_all_locks with something faster and using less locks, I haven't quite got a detailed sketch yet, but if anyone is interested, let me know. However, I think the whole reason for take_all_locks is to support the invalidate_range_begin/invalidate_range_end callbacks. I think these are slightly odd because they introduce a 2nd design of TLB shootdown scheme to the kernel: presently we downgrade pte permissions, and then shootdown TLBs, and then free any pages; the range callbacks first prevent new TLBs being set up, then shoot down existing TLBs, then downgrade ptes, then free pages. That's not to say this style will not work, but I prefer to keep to our existing scheme. Obviously it is better proven, and also it is good to have fewer if possible. What are the pros of the new scheme? Well performance AFAIKS. Large unmappings may not fit in the TLB gather API (and some sites do not use TLB gather at all), but with the invalidate_range_begin/invalidate_range_end we can still flush the entire range in 1 call. I would counter that our existing CPU flushing has held up fairly well, and it can also be pretty expensive if it has to do lots of IPIs. I don't have a problem with trying to maximise performance, but I really feel this is one place where I'd prefer to see numbers first. One functional con of the invalidate_range_begin/invalidate_range_end style I see is that the implementations hold off new TLB insertions by incrementing a counter to hold them off. So actually poor parallelism or starvation could become an issue with multiple threads. So anyway, I've attached a patch switching mmu notifiers to the our traditional style of TLB invalidation, and attempted to convert KVM and GRU to use it. Also keep in mind that after this the mm_take_all_locks patches should not be needed. 
Unfortunately I was not able to actually test this (but testing typically would not find the rally subtle problems easily anyway) --- drivers/misc/sgi-gru/grufault.c | 145 ++++--------------------------------- drivers/misc/sgi-gru/grutables.h | 1 drivers/misc/sgi-gru/grutlbpurge.c | 37 +-------- include/linux/kvm_host.h | 10 -- include/linux/mmu_notifier.h | 97 ++---------------------- mm/fremap.c | 4 - mm/hugetlb.c | 3 mm/memory.c | 18 ++-- mm/mmap.c | 2 mm/mmu_notifier.c | 43 +--------- mm/mprotect.c | 3 mm/mremap.c | 4 - virt/kvm/kvm_main.c | 65 +--------------- 13 files changed, 61 insertions(+), 371 deletions(-) Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- linux-2.6.orig/include/linux/mmu_notifier.h +++ linux-2.6/include/linux/mmu_notifier.h @@ -65,60 +65,12 @@ struct mmu_notifier_ops { * Before this is invoked any secondary MMU is still ok to * read/write to the page previously pointed to by the Linux * pte because the page hasn't been freed yet and it won't be - * freed until this returns. If required set_page_dirty has to - * be called internally to this method. + * freed until this returns. Also, importantly the address can't + * be repopulated with some other page, so all MMUs retain a + * coherent view of memory. If required set_page_dirty has to + * be called internally to this method. Not optional. */ - void (*invalidate_page)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long address); - - /* - * invalidate_range_start() and invalidate_range_end() must be - * paired and are called only when the mmap_sem and/or the - * locks protecting the reverse maps are held. The subsystem - * must guarantee that no additional references are taken to - * the pages in the range established between the call to - * invalidate_range_start() and the matching call to - * invalidate_range_end(). - * - * Invalidation of multiple concurrent ranges may be - * optionally permitted by the driver. Either way the - * establishment of sptes is forbidden in the range passed to - * invalidate_range_begin/end for the whole duration of the - * invalidate_range_begin/end critical section. - * - * invalidate_range_start() is called when all pages in the - * range are still mapped and have at least a refcount of one. - * - * invalidate_range_end() is called when all pages in the - * range have been unmapped and the pages have been freed by - * the VM. - * - * The VM will remove the page table entries and potentially - * the page between invalidate_range_start() and - * invalidate_range_end(). If the page must not be freed - * because of pending I/O or other circumstances then the - * invalidate_range_start() callback (or the initial mapping - * by the driver) must make sure that the refcount is kept - * elevated. - * - * If the driver increases the refcount when the pages are - * initially mapped into an address space then either - * invalidate_range_start() or invalidate_range_end() may - * decrease the refcount. If the refcount is decreased on - * invalidate_range_start() then the VM can free pages as page - * table entries are removed. If the refcount is only - * droppped on invalidate_range_end() then the driver itself - * will drop the last refcount but it must take care to flush - * any secondary tlb before doing the final free on the - * page. Pages will no longer be referenced by the linux - * address space but may still be referenced by sptes until - * the last refcount is dropped. 
- */ - void (*invalidate_range_start)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end); - void (*invalidate_range_end)(struct mmu_notifier *mn, + void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); }; @@ -154,11 +106,7 @@ extern void __mmu_notifier_mm_destroy(st extern void __mmu_notifier_release(struct mm_struct *mm); extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long address); -extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address); -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end); -extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, +extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); static inline void mmu_notifier_release(struct mm_struct *mm) @@ -175,25 +123,11 @@ static inline int mmu_notifier_clear_flu return 0; } -static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_page(mm, address); -} - -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, +static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end); -} - -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end); + __mmu_notifier_invalidate_range(mm, start, end); } static inline void mmu_notifier_mm_init(struct mm_struct *mm) @@ -221,7 +155,8 @@ static inline void mmu_notifier_mm_destr struct vm_area_struct *___vma = __vma; \ unsigned long ___address = __address; \ __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + mmu_notifier_invalidate_range(___vma->vm_mm, ___address, \ + ____address + PAGE_SIZE); \ __pte; \ }) @@ -248,17 +183,7 @@ static inline int mmu_notifier_clear_flu return 0; } -static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ -} - -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ -} - -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, +static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { } Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c +++ linux-2.6/mm/fremap.c @@ -32,7 +32,7 @@ static void zap_pte(struct mm_struct *mm struct page *page; flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); + pte = ptep_clear_flush_notify(vma, addr, ptep); page = vm_normal_page(vma, addr, pte); if (page) { if (pte_dirty(pte)) @@ -226,9 +226,7 @@ asmlinkage long sys_remap_file_pages(uns vma->vm_flags = saved_flags; } - mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); - mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (vma->vm_flags & VM_LOCKED) { /* Index: linux-2.6/mm/hugetlb.c 
=================================================================== --- linux-2.6.orig/mm/hugetlb.c +++ linux-2.6/mm/hugetlb.c @@ -1683,7 +1683,6 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~huge_page_mask(h)); BUG_ON(end & ~huge_page_mask(h)); - mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += sz) { ptep = huge_pte_offset(mm, address); @@ -1725,7 +1724,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -677,9 +677,6 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - if (is_cow_mapping(vma->vm_flags)) - mmu_notifier_invalidate_range_start(src_mm, addr, end); - ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); @@ -695,8 +692,8 @@ int copy_page_range(struct mm_struct *ds } while (dst_pgd++, src_pgd++, addr = next, addr != end); if (is_cow_mapping(vma->vm_flags)) - mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + mmu_notifier_invalidate_range(src_mm, vma->vm_start, end); + return ret; } @@ -903,7 +900,6 @@ unsigned long unmap_vmas(struct mmu_gath int fullmm = (*tlbp)->fullmm; struct mm_struct *mm = vma->vm_mm; - mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -951,6 +947,7 @@ unsigned long unmap_vmas(struct mmu_gath break; } + mmu_notifier_invalidate_range(mm, tlb_start, start); tlb_finish_mmu(*tlbp, tlb_start, start); if (need_resched() || @@ -968,7 +965,6 @@ unsigned long unmap_vmas(struct mmu_gath } } out: - mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -991,8 +987,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { + mmu_notifier_invalidate_range(mm, address, end); tlb_finish_mmu(tlb, address, end); + } return end; } EXPORT_SYMBOL_GPL(zap_page_range); @@ -1668,7 +1666,6 @@ int apply_to_page_range(struct mm_struct int err; BUG_ON(addr >= end); - mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1676,7 +1673,8 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); + return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c +++ linux-2.6/mm/mmap.c @@ -1794,6 +1794,7 @@ static void unmap_region(struct mm_struc vm_unacct_memory(nr_accounted); free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); + mmu_notifier_invalidate_range(mm, start, end); tlb_finish_mmu(tlb, start, end); } @@ -2123,6 +2124,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0); + mmu_notifier_invalidate_range(mm, 0, end); tlb_finish_mmu(tlb, 0, end); /* Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c +++ linux-2.6/mm/mmu_notifier.c @@ -99,45 +99,15 @@ int __mmu_notifier_clear_flush_young(str return young; } -void __mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ - struct mmu_notifier *mn; - struct hlist_node *n; - - rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_page) - mn->ops->invalidate_page(mn, mm, address); - } - rcu_read_unlock(); -} - -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ - struct mmu_notifier *mn; - struct hlist_node *n; - - rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_range_start) - mn->ops->invalidate_range_start(mn, mm, start, end); - } - rcu_read_unlock(); -} - -void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, +void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { struct mmu_notifier *mn; struct hlist_node *n; rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_range_end) - mn->ops->invalidate_range_end(mn, mm, start, end); - } + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) + mn->ops->invalidate_range(mn, mm, start, end); rcu_read_unlock(); } @@ -148,6 +118,8 @@ static int do_mmu_notifier_register(stru struct mmu_notifier_mm *mmu_notifier_mm; int ret; + BUG_ON(!mn->ops); + BUG_ON(!mn->ops->invalidate_range); BUG_ON(atomic_read(&mm->mm_users) <= 0); ret = -ENOMEM; @@ -157,9 +129,6 @@ static int do_mmu_notifier_register(stru if (take_mmap_sem) down_write(&mm->mmap_sem); - ret = mm_take_all_locks(mm); - if (unlikely(ret)) - goto out_cleanup; if (!mm_has_notifiers(mm)) { INIT_HLIST_HEAD(&mmu_notifier_mm->list); @@ -181,8 +150,6 @@ static int do_mmu_notifier_register(stru hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); spin_unlock(&mm->mmu_notifier_mm->lock); - mm_drop_all_locks(mm); -out_cleanup: if (take_mmap_sem) up_write(&mm->mmap_sem); /* kfree() does nothing if mmu_notifier_mm is NULL */ Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c +++ linux-2.6/mm/mprotect.c @@ -204,12 +204,11 @@ success: dirty_accountable = 1; } - mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c +++ linux-2.6/mm/mremap.c @@ -81,8 +81,6 @@ static void move_ptes(struct vm_area_str unsigned long old_start; old_start = 
old_addr; - mmu_notifier_invalidate_range_start(vma->vm_mm, - old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -124,7 +122,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); - mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); + mmu_notifier_invalidate_range(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/virt/kvm/kvm_main.c =================================================================== --- linux-2.6.orig/virt/kvm/kvm_main.c +++ linux-2.6/virt/kvm/kvm_main.c @@ -192,13 +192,15 @@ static inline struct kvm *mmu_notifier_t return container_of(mn, struct kvm, mmu_notifier); } -static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long address) +static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) { struct kvm *kvm = mmu_notifier_to_kvm(mn); - int need_tlb_flush; + int need_tlb_flush = 0; + spin_lock(&kvm->mmu_lock); /* * When ->invalidate_page runs, the linux pte has been zapped * already but the page is still allocated until @@ -217,32 +219,7 @@ static void kvm_mmu_notifier_invalidate_ * pte after kvm_unmap_hva returned, without noticing the page * is going to be freed. */ - spin_lock(&kvm->mmu_lock); kvm->mmu_notifier_seq++; - need_tlb_flush = kvm_unmap_hva(kvm, address); - spin_unlock(&kvm->mmu_lock); - - /* we've to flush the tlb before the pages can be freed */ - if (need_tlb_flush) - kvm_flush_remote_tlbs(kvm); - -} - -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - struct kvm *kvm = mmu_notifier_to_kvm(mn); - int need_tlb_flush = 0; - - spin_lock(&kvm->mmu_lock); - /* - * The count increase must become visible at unlock time as no - * spte can be established without taking the mmu_lock and - * count is also read inside the mmu_lock critical section. - */ - kvm->mmu_notifier_count++; for (; start < end; start += PAGE_SIZE) need_tlb_flush |= kvm_unmap_hva(kvm, start); spin_unlock(&kvm->mmu_lock); @@ -252,32 +229,6 @@ static void kvm_mmu_notifier_invalidate_ kvm_flush_remote_tlbs(kvm); } -static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - struct kvm *kvm = mmu_notifier_to_kvm(mn); - - spin_lock(&kvm->mmu_lock); - /* - * This sequence increase will notify the kvm page fault that - * the page that is going to be mapped in the spte could have - * been freed. - */ - kvm->mmu_notifier_seq++; - /* - * The above sequence increase must be visible before the - * below count decrease but both values are read by the kvm - * page fault under mmu_lock spinlock so we don't need to add - * a smb_wmb() here in between the two. 
- */ - kvm->mmu_notifier_count--; - spin_unlock(&kvm->mmu_lock); - - BUG_ON(kvm->mmu_notifier_count < 0); -} - static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long address) @@ -296,9 +247,7 @@ static int kvm_mmu_notifier_clear_flush_ } static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { - .invalidate_page = kvm_mmu_notifier_invalidate_page, - .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start, - .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .invalidate_range = kvm_mmu_notifier_invalidate_range, .clear_flush_young = kvm_mmu_notifier_clear_flush_young, }; #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ Index: linux-2.6/drivers/misc/sgi-gru/grufault.c =================================================================== --- linux-2.6.orig/drivers/misc/sgi-gru/grufault.c +++ linux-2.6/drivers/misc/sgi-gru/grufault.c @@ -195,74 +195,29 @@ static void get_clear_fault_map(struct g * < 0 - error code * 1 - (atomic only) try again in non-atomic context */ -static int non_atomic_pte_lookup(struct vm_area_struct *vma, - unsigned long vaddr, int write, +static int non_atomic_pte_lookup(unsigned long vaddr, int write, unsigned long *paddr, int *pageshift) { struct page *page; + int ret; /* ZZZ Need to handle HUGE pages */ - if (is_vm_hugetlb_page(vma)) - return -EFAULT; *pageshift = PAGE_SHIFT; - if (get_user_pages - (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0) + ret = get_user_pages(current, current->mm, vaddr, 1, write, 0, + &page, NULL); + if (ret < 0) + return ret; + if (ret == 0) return -EFAULT; + if (PageCompound(page)) { /* hugepage */ + put_page(page); + return -EFAULT; + } + *paddr = page_to_phys(page); put_page(page); - return 0; -} - -/* - * - * atomic_pte_lookup - * - * Convert a user virtual address to a physical address - * Only supports Intel large pages (2MB only) on x86_64. - * ZZZ - hugepage support is incomplete - */ -static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr, - int write, unsigned long *paddr, int *pageshift) -{ - pgd_t *pgdp; - pmd_t *pmdp; - pud_t *pudp; - pte_t pte; - - WARN_ON(irqs_disabled()); /* ZZZ debug */ - - local_irq_disable(); - pgdp = pgd_offset(vma->vm_mm, vaddr); - if (unlikely(pgd_none(*pgdp))) - goto err; - - pudp = pud_offset(pgdp, vaddr); - if (unlikely(pud_none(*pudp))) - goto err; - - pmdp = pmd_offset(pudp, vaddr); - if (unlikely(pmd_none(*pmdp))) - goto err; -#ifdef CONFIG_X86_64 - if (unlikely(pmd_large(*pmdp))) - pte = *(pte_t *) pmdp; - else -#endif - pte = *pte_offset_kernel(pmdp, vaddr); - - local_irq_enable(); - - if (unlikely(!pte_present(pte) || - (write && (!pte_write(pte) || !pte_dirty(pte))))) - return 1; - *paddr = pte_pfn(pte) << PAGE_SHIFT; - *pageshift = is_vm_hugetlb_page(vma) ? 
HPAGE_SHIFT : PAGE_SHIFT; return 0; - -err: - local_irq_enable(); - return 1; } /* @@ -279,9 +234,7 @@ static int gru_try_dropin(struct gru_thr struct gru_tlb_fault_handle *tfh, unsigned long __user *cb) { - struct mm_struct *mm = gts->ts_mm; - struct vm_area_struct *vma; - int pageshift, asid, write, ret; + int pageshift, asid, write; unsigned long paddr, gpa, vaddr; /* @@ -298,7 +251,7 @@ static int gru_try_dropin(struct gru_thr */ if (tfh->state == TFHSTATE_IDLE) goto failidle; - if (tfh->state == TFHSTATE_MISS_FMM && cb) + if (tfh->state == TFHSTATE_MISS_FMM) goto failfmm; write = (tfh->cause & TFHCAUSE_TLB_MOD) != 0; @@ -307,31 +260,9 @@ static int gru_try_dropin(struct gru_thr if (asid == 0) goto failnoasid; - rmb(); /* TFH must be cache resident before reading ms_range_active */ - - /* - * TFH is cache resident - at least briefly. Fail the dropin - * if a range invalidate is active. - */ - if (atomic_read(>s->ts_gms->ms_range_active)) - goto failactive; - - vma = find_vma(mm, vaddr); - if (!vma) + if (non_atomic_pte_lookup(vaddr, write, &paddr, &pageshift)) goto failinval; - /* - * Atomic lookup is faster & usually works even if called in non-atomic - * context. - */ - ret = atomic_pte_lookup(vma, vaddr, write, &paddr, &pageshift); - if (ret) { - if (!cb) - goto failupm; - if (non_atomic_pte_lookup(vma, vaddr, write, &paddr, - &pageshift)) - goto failinval; - } if (is_gru_paddr(paddr)) goto failinval; @@ -351,19 +282,9 @@ failnoasid: /* No asid (delayed unload). */ STAT(tlb_dropin_fail_no_asid); gru_dbg(grudev, "FAILED no_asid tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); - if (!cb) - tfh_user_polling_mode(tfh); - else - gru_flush_cache(tfh); + gru_flush_cache(tfh); return -EAGAIN; -failupm: - /* Atomic failure switch CBR to UPM */ - tfh_user_polling_mode(tfh); - STAT(tlb_dropin_fail_upm); - gru_dbg(grudev, "FAILED upm tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); - return 1; - failfmm: /* FMM state on UPM call */ STAT(tlb_dropin_fail_fmm); @@ -373,8 +294,7 @@ failfmm: failidle: /* TFH was idle - no miss pending */ gru_flush_cache(tfh); - if (cb) - gru_flush_cache(cb); + gru_flush_cache(cb); STAT(tlb_dropin_fail_idle); gru_dbg(grudev, "FAILED idle tfh: 0x%p, state %d\n", tfh, tfh->state); return 0; @@ -385,17 +305,6 @@ failinval: STAT(tlb_dropin_fail_invalid); gru_dbg(grudev, "FAILED inval tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); return -EFAULT; - -failactive: - /* Range invalidate active. Switch to UPM iff atomic */ - if (!cb) - tfh_user_polling_mode(tfh); - else - gru_flush_cache(tfh); - STAT(tlb_dropin_fail_range_active); - gru_dbg(grudev, "FAILED range active: tfh 0x%p, vaddr 0x%lx\n", - tfh, vaddr); - return 1; } /* @@ -408,9 +317,8 @@ irqreturn_t gru_intr(int irq, void *dev_ { struct gru_state *gru; struct gru_tlb_fault_map map; - struct gru_thread_state *gts; struct gru_tlb_fault_handle *tfh = NULL; - int cbrnum, ctxnum; + int cbrnum; STAT(intr); @@ -434,19 +342,7 @@ irqreturn_t gru_intr(int irq, void *dev_ * The gts cannot change until a TFH start/writestart command * is issued. */ - ctxnum = tfh->ctxnum; - gts = gru->gs_gts[ctxnum]; - - /* - * This is running in interrupt context. Trylock the mmap_sem. - * If it fails, retry the fault in user context. 
-	 */
-	if (down_read_trylock(&gts->ts_mm->mmap_sem)) {
-		gru_try_dropin(gts, tfh, NULL);
-		up_read(&gts->ts_mm->mmap_sem);
-	} else {
-		tfh_user_polling_mode(tfh);
-	}
+	tfh_user_polling_mode(tfh);
 	}
 	return IRQ_HANDLED;
 }
@@ -456,12 +352,9 @@ static int gru_user_dropin(struct gru_th
 				    struct gru_tlb_fault_handle *tfh,
 				    unsigned long __user *cb)
 {
-	struct gru_mm_struct *gms = gts->ts_gms;
 	int ret;
 	while (1) {
-		wait_event(gms->ms_wait_queue,
-			   atomic_read(&gms->ms_range_active) == 0);
 		prefetchw(tfh);	/* Helps on hdw, required for emulator */
 		ret = gru_try_dropin(gts, tfh, cb);
 		if (ret <= 0)
Index: linux-2.6/drivers/misc/sgi-gru/grutables.h
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutables.h
+++ linux-2.6/drivers/misc/sgi-gru/grutables.h
@@ -244,7 +244,6 @@ struct gru_mm_struct {
 	struct mmu_notifier	ms_notifier;
 	atomic_t		ms_refcnt;
 	spinlock_t		ms_asid_lock;	/* protects ASID assignment */
-	atomic_t		ms_range_active;/* num range_invals active */
 	char			ms_released;
 	wait_queue_head_t	ms_wait_queue;
 	DECLARE_BITMAP(ms_asidmap, GRU_MAX_GRUS);
Index: linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutlbpurge.c
+++ linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,41 +221,16 @@ void gru_flush_all_tlb(struct gru_state
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
-				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+static void gru_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
-	STAT(mmu_invalidate_range);
-	atomic_inc(&gms->ms_range_active);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
-}
-
-static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-	atomic_dec(&gms->ms_range_active);
-	wake_up_all(&gms->ms_wait_queue);
+	STAT(mmu_invalidate_range);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
-}
-
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-
-	STAT(mmu_invalidate_page);
-	gru_flush_tlb_range(gms, address, PAGE_SIZE);
-	gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
+	gru_flush_tlb_range(gms, start, end - start);
 }
 
 static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -269,9 +244,7 @@ static void gru_release(struct gru_notif
 static const struct mmu_notifier_ops gru_mmuops = {
-	.invalidate_page	= gru_invalidate_page,
-	.invalidate_range_start	= gru_invalidate_range_start,
-	.invalidate_range_end	= gru_invalidate_range_end,
+	.invalidate_range	= gru_invalidate_range,
 	.release		= gru_release,
 };
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -125,7 +125,6 @@ struct kvm {
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_notifier_seq;
-	long mmu_notifier_count;
 #endif
 };
 
@@ -360,15 +359,6 @@ int kvm_trace_ioctl(unsigned int ioctl,
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
 {
-	if (unlikely(vcpu->kvm->mmu_notifier_count))
-		return 1;
-	/*
-	 * Both reads happen under the mmu_lock and both values are
-	 * modified under mmu_lock, so there's no need of smb_rmb()
-	 * here in between, otherwise mmu_notifier_count should be
-	 * read before mmu_notifier_seq, see
-	 * mmu_notifier_invalidate_range_end write side.
-	 */
 	if (vcpu->kvm->mmu_notifier_seq != mmu_seq)
 		return 1;
 	return 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread
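For readers skimming the patch, this is the shape a consumer takes under the single-callback style used above. It is an illustrative sketch only: the foo_* names and the foo_flush_shadow_tlb() helper are invented, and registration is assumed to keep the same form as the existing mmu_notifier_register().

/*
 * Illustrative sketch only; foo_* names are made up, not a real driver.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct foo_ctx {
	struct mmu_notifier notifier;
	/* driver-private shadow TLB state would live here */
};

/* hypothetical driver helper, provided elsewhere */
static void foo_flush_shadow_tlb(struct foo_ctx *ctx, unsigned long start,
				 unsigned long end);

static void foo_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	struct foo_ctx *ctx = container_of(mn, struct foo_ctx, notifier);

	/*
	 * Called after the linux ptes for [start, end) have been cleared
	 * and alongside the CPU TLB shootdown, before the pages are freed:
	 * drop the secondary-MMU translations for the range.
	 */
	foo_flush_shadow_tlb(ctx, start, end);
}

static void foo_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	foo_flush_shadow_tlb(container_of(mn, struct foo_ctx, notifier),
			     0, TASK_SIZE);
}

static const struct mmu_notifier_ops foo_mmuops = {
	.invalidate_range	= foo_invalidate_range,
	.release		= foo_release,
};

static int foo_open(struct foo_ctx *ctx)
{
	ctx->notifier.ops = &foo_mmuops;
	return mmu_notifier_register(&ctx->notifier, current->mm);
}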
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin 2008-07-24 14:39 ` Nick Piggin @ 2008-07-25 21:45 ` Andrea Arcangeli 2008-07-26 3:08 ` Nick Piggin 2008-07-25 23:29 ` Jack Steiner 2008-07-27 9:45 ` Andrew Morton 3 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-25 21:45 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl Hi Nick, On Thu, Jul 24, 2008 at 04:39:49PM +0200, Nick Piggin wrote: > I think everybody is hoping to have a workable mmu notifier scheme > merged in 2.6.27 (myself included). However I do have some concerns Just to be clear, I'm waiting mmu notifiers to showup on Linus's tree before commenting this as it was all partly covered in the past discussions anyway, so there's nothing really urgent or new here (at least for me ;). It's a tradeoff, you pointed out the positive sides and negative point of both approaches, and depending which kind of the trade you're interested about, you'll prefer one or the other approach. Your preference is the exact opposite of what SGI liked and what we liked. But all works for us, and all works for GRU (though -mm is faster for the fast path), but only -mm can be easily later extended for XPMEM/IB if they ever decide to schedule in the mmu notifier methods in the future (which may never happen and it's unrelated to the current patches that don't contemplate sleeping at all and it's pure luck they can be trivially extended to provide for it). As your patch shown the changes are fairly small anyway if we later decide to change in 2.6.28-rc, in the meantime current code in -mm was heavily tested and all code including kvm and gru has been tested only with this, and this combined with the fact -mm guarantees the fastest fast path, I hope we leave any discussion to the 2.6.28-rc merge window, now it's time to go productive finally! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-25 21:45 ` Andrea Arcangeli @ 2008-07-26 3:08 ` Nick Piggin 2008-07-26 11:38 ` Andrea Arcangeli 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 3:08 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Fri, Jul 25, 2008 at 11:45:52PM +0200, Andrea Arcangeli wrote: > Hi Nick, > > On Thu, Jul 24, 2008 at 04:39:49PM +0200, Nick Piggin wrote: > > I think everybody is hoping to have a workable mmu notifier scheme > > merged in 2.6.27 (myself included). However I do have some concerns > > Just to be clear, I'm waiting mmu notifiers to showup on Linus's tree > before commenting this as it was all partly covered in the past > discussions anyway, so there's nothing really urgent or new here (at > least for me ;). Well I just was never completely satisfied with how that turned out. There was an assertion that invalidate range begin/end were the right way to go because performance would be too bad using the traditional mmu gather / range flushing. Now that I actually had the GRU and KVM code to look at, I must say I would be very surprised if performance is too bad. Given that, I would have thought the logical way to go would be to start with the "minimal" type of notifiers I proposed now, and then get some numbers together to try to support the start/end scheme. If there is some real performance numbers or something that I missed, please let me know. > It's a tradeoff, you pointed out the positive sides and negative point > of both approaches, and depending which kind of the trade you're > interested about, you'll prefer one or the other approach. Your > preference is the exact opposite of what SGI liked and what we > liked. But all works for us, and all works for GRU (though -mm is > faster for the fast path), but only -mm can be easily later extended > for XPMEM/IB if they ever decide to schedule in the mmu notifier > methods in the future (which may never happen and it's unrelated to > the current patches that don't contemplate sleeping at all and it's > pure luck they can be trivially extended to provide for it). > > As your patch shown the changes are fairly small anyway if we later > decide to change in 2.6.28-rc, in the meantime current code in -mm was OK. The significant thing from my POV is that it doesn't need to introduce the take-all-locks code. > heavily tested and all code including kvm and gru has been tested only > with this, and this combined with the fact -mm guarantees the fastest > fast path, I hope we leave any discussion to the 2.6.28-rc merge > window, now it's time to go productive finally! Well everybody knows I would prefer to merge the minimal notifiers now, and look at more complex things as we get results to see if they are required (I doubt anybody is going to help me get numbers to justify going the other way ;)). I think for the core VM, minimal notifiers are basically trivial, and their consumers are more peripheral code so I don't think it would go against merge standards. Anyway, I just voice my opinion and let Andrew and Linus decide. To be clear: I have not found any actual bugs in Andrea's -mm patchset, only some dislikes of the approach. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-26 3:08 ` Nick Piggin
@ 2008-07-26 11:38 ` Andrea Arcangeli
2008-07-26 12:28 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-26 11:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote:
> Well I just was never completely satisfied with how that turned out.
> There was an assertion that invalidate range begin/end were the right
> way to go because performance would be too bad using the traditional
> mmu gather / range flushing. Now that I actually had the GRU and KVM
> code to look at, I must say I would be very surprised if performance
> is too bad. Given that, I would have thought the logical way to go
> would be to start with the "minimal" type of notifiers I proposed now,
> and then get some numbers together to try to support the start/end
> scheme.

You know, back in Feb I started with the minimal type of notifiers
you're suggesting, but it was turned down.

The advantages of the current -mm approach are:

1) Absolute minimal intrusion into the kernel common code, and the
   absolute minimum number of branches added to the kernel fast
   paths. The kernel is faster than with your "minimal" type of
   notifiers when they're disarmed. In fact nobody could care less
   about the performance of the mmu notifiers themselves: both will
   work fast enough, but I want the approach that bloats the main
   kernel less, and -mm reaches this objective fine.

2) No need to rewrite the tlb-gather logic, which in any case would
   not allow scheduling inside the mmu notifiers later if a user
   requiring scheduling ever shows up (more flexible approach), and it
   would also need to become per-mm and not per-cpu.

3) Maximal reduction of IPI flood during vma teardown. You can't
   possibly hold off the CPU primary-mmu tlb miss handler, but we can
   hold off the secondary-mmu page-fault/tlb-miss handler, and doing
   so we can run a single IPI for an unlimited amount of address space
   teardown.

Disadvantages:

1) mm_take_all_locks is required to register

No other disadvantage.

There is no problem with holding off the secondary mmu page fault: a
few threads may spin, but signals are handled the whole time, and the
page fault doesn't loop, it returns and is retried. So you can shoot
yourself in the foot (with your own threads stepping on each other's
toes) and that's all; there's no risk of starvation or anything like
that.

> If there is some real performance numbers or something that I missed,
> please let me know.

Producing performance numbers for KVM isn't really possible, because
either one will work fine; kvm never mangles the address space, with
the exception of madvise for ballooning, which is going to perform
well either way. invalidate_page is run most of the time in KVM
runtime, never invalidate_range_start/end, so there would be no
difference.

> I think for the core VM, minimal notifiers are basically trivial, and

Your minimal notifiers are a whole lot more complex, as they require
rewriting the tlb-gather in all archs. Go ahead, it won't be ready for
2.6.27, be sure...

Furthermore, despite rewriting tlb-gather, they still have all the
disadvantages mentioned above.

What is possible is to have a minimal notifier that adds a branch for
every pte teardown; that's easily done, I did that back in Feb, and it
leaves the tlb-gather optimization for later, but I still think using
tlb-gather when we can do better, and we already do better, is wrong.

> Anyway, I just voice my opinion and let Andrew and Linus decide. To be
> clear: I have not found any actual bugs in Andrea's -mm patchset, only
> some dislikes of the approach.

Yes, like I said, I think this is a matter of taste, of which side of
the tradeoff you care about. There are disadvantages and advantages in
both, and if we wait forever to please everyone's taste, it'll never go
in.

My feeling is that what is in -mm is better and it will stay for long,
because it fully exploits the ability we have to hold off and replay
the secondary mmu page fault (the equivalent of the primary-mmu tlb
miss), something we can't do with the primary mmu tlb miss, which
forces us to implement something as complex (and IPI-flooding) as the
tlb-gather logic. And the result, besides being theoretically faster in
the fast path both when armed and disarmed, is also simpler than
plugging mmu notifier invalidate_pages inside tlb-gather. So I think
it's a better tradeoff.

^ permalink raw reply	[flat|nested] 26+ messages in thread
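For anyone without the -mm tree at hand, the hold-off Andrea describes has roughly this shape in the GRU driver. This is paraphrased from the code that the patch at the top of the thread removes (field names kept, functions abbreviated and renamed), not a verbatim quote:

/* Paraphrased from the -mm GRU code removed by the patch above. */
static void gru_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long start, unsigned long end)
{
	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
						 ms_notifier);

	atomic_inc(&gms->ms_range_active);	/* hold off new dropins */
	gru_flush_tlb_range(gms, start, end - start);
}

static void gru_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long start, unsigned long end)
{
	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
						 ms_notifier);

	atomic_dec(&gms->ms_range_active);
	wake_up_all(&gms->ms_wait_queue);	/* let the fault side retry */
}

/* ...and the secondary-MMU fault side waits for the window to close: */
static int gru_fault_dropin(struct gru_thread_state *gts,
			    struct gru_tlb_fault_handle *tfh,
			    unsigned long __user *cb)
{
	struct gru_mm_struct *gms = gts->ts_gms;

	wait_event(gms->ms_wait_queue,
		   atomic_read(&gms->ms_range_active) == 0);
	return gru_try_dropin(gts, tfh, cb);
}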
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli @ 2008-07-26 12:28 ` Nick Piggin 2008-07-26 13:02 ` Andrea Arcangeli 2008-07-26 12:33 ` Nick Piggin 2008-07-26 13:04 ` Nick Piggin 2 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 12:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote: > > Well I just was never completely satisfied with how that turned out. > > There was an assertion that invalidate range begin/end were the right > > way to go because performance would be too bad using the traditional > > mmu gather / range flushing. Now that I actually had the GRU and KVM > > code to look at, I must say I would be very surprised if performance > > is too bad. Given that, I would have thought the logical way to go > > would be to start with the "minimal" type of notifiers I proposed now, > > and then get some numbers together to try to support the start/end > > scheme. > > You know back in Feb I started with the minimal type of notifiers > you're suggesting but it was turned down. Yes, but I don't think it was turned down for good reasons. It was turned down solely because of performance AFAIKS, but there were not even any numbers to back this up and I think it most situations where KVM and GRU are used, it should not be measurable. > The advantages of the current -mm approach are: > > 1) absolute minimal intrusion into the kernel common code, and > absolute minimum number of branches added to the kernel fast > paths. Kernel is faster than your "minimal" type of notifiers when > they're disarmed. Nobody can care less about the performance of mmu > notifiers infact, both will work fast enough, but I want the > approach that bloats the main kernel less, and -mm reaches this > objective fine. I claim opposite. There is no mm_take_all_locks in my approach, and the TLB flushing design mirrors that which we use for CPU TLB flushing. Both these points make the -mm design much more intrusive to kernel common code. As for notifiers disabled speed, I can't see how -mm would be faster. Sometimes -mm will result in fewer notifier calls eg. due to huge unmaps. Othertimes my design will be fewer because it only makes a single range flush call rather than a start/end pair. How do you think they are slower? > 2) No need to rewrite the tlb-gather logic, which in any case would > not allow to schedule inside the mmu notifiers later if an user > requiring scheduling ever showup (more flexible approach), and it > would also need to become per-mm and not per-cpu. I sent a patch in my first email. I don't see this as rewriting the TLB gather logic at all. Or is there a hole in my design? Hmm, I missed putting the notification in tlb_remove_page when the gather fills up, but I don't think that classes as a redesign of the tlb-gather logic to put a callback in there. > 3) Maximal reduction of IPI flood during vma teardown. You can't > possibly hold off the CPU primary-mmu tlb miss handler, but we can You definitely can on many architectures except x86. None has seemed to require the need as yet. > hold off the secondary-mmu page-fautl/tlb-miss handler, and doing > so we can run a single IPI for an unlimited amount of address space > teardown. That is true, but 1) most unmaps will be either fairly small or whole-program. 
In both cases, minimal notifiers should be just as good. > Disavantages: > > 1) mm_take_all_locks is required to register > > No other disavantage. 2) new tlb flushing design 3) livelock/starvation problem with TLB holdoff > There is no problem with holding off the secondary mmu page fault, a > few threads may spin but signals are handled the whole time, the page > fault doesn't loop it returns and it is being retried. So you can > shoot yourself in the foot (with your own threads stepping on each > other toes) and that's all, there's no risk of starvation or anything > like that. That's not shooting yourself in the foot if you are forced into the design. Definitely there can be indefinite starvation because there is no notion of queueing on the range_active count. > > If there is some real performance numbers or something that I missed, > > please let me know. > > Producing performance numbers for KVM isn't really possible, because > either one will work fine, kvm never mangles the address space with > the exception of madvise with ballooning which is going to perform > well either way. invalidate_page is run most of the time in KVM > runtime, never invalidate_range_start/end so there would be no difference. So then the important question is GRU. In which case the driver isn't even finished, it is being run on a simulator, and probably very few or no real users of the driver... so that's not a good platform to be arguing for more complexity for the sake of performance IMO. > > I think for the core VM, minimal notifiers are basically trivial, and > > Your minimal notifiers are a whole lot more complex, as they require > to rewrite the tlb-gather in all archs. Go ahead it won't be ready for > 2.6.27 be sure... > > Furthermore despite rewriting tlb-gather they still have all the > disavantages mentioned above. > > What is possible is to have a minimal notifier that adds a branch for > every pte teardown, that's easy done that in Feb, that leaves the > tlb-gather optimization for later, but I still think using tlb-gather > when we can do better and we already do better is wrong. I'm doing range flushing withing tlb-gathers in the patch I posted which does not add a branch for every pte teardown. And I don't really consider range_start/end as better, given my reasons. > > Anyway, I just voice my opinion and let Andrew and Linus decide. To be > > clear: I have not found any actual bugs in Andrea's -mm patchset, only > > some dislikes of the approach. > > Yes, like I said I think this is a matter of taste of what you like of > the tradeoff. There are disadvantages and advantages in both and if we > wait forever to please everyone taste, it'll never go in. > > My feeling is that what is in -mm is better and it will stay for long, > because it fully exploits the ability we have to hold off and reply > the secondary mmu page fault (equivalent of the primary-mmu tlb miss) > something we can't do with the primary mmu tlb miss and that forces us > to implement something as complex (and IPI-flooding) as tlb-gather > logic. And the result besides being theoretically faster in the fast > path both when armed and disarmed, is also simpler than plugging mmu > notifier invalidate_pages inside tlb-gather. So I think it's a better > tradeoff. If I had seen even a single number to show the more complex scheme is better maybe I would be more receptive. I realise there might be some theoretical advantages but I also believe they wouldn't be measurable in real use cases. 
^ permalink raw reply [flat|nested] 26+ messages in thread
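To make the tlb-gather variant concrete, the single range call would sit in the generic mmu_gather flush, something like the sketch below. This is illustrative only, modeled on asm-generic/tlb.h, and is not the posted patch; it assumes an mm-side helper named mmu_notifier_invalidate_range(), in the spirit of the ->invalidate_range callback in the patch at the top of the thread.

/* Illustrative sketch only; not the posted patch. */
static inline void
sample_tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start,
		     unsigned long end)
{
	if (!tlb->need_flush)
		return;
	tlb->need_flush = 0;

	tlb_flush(tlb);				/* CPU TLB shootdown (IPIs) */

	/* one secondary-MMU flush for the whole gathered range, after the
	 * linux ptes are gone and before the pages below are freed */
	mmu_notifier_invalidate_range(tlb->mm, start, end);

	if (!tlb_fast_mode(tlb)) {
		free_pages_and_swap_cache(tlb->pages, tlb->nr);
		tlb->nr = 0;
	}
	/* as noted in the thread, the tlb_remove_page() path that frees
	 * pages when the gather fills up would need the same treatment */
}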
* Re: MMU notifiers review and some proposals
2008-07-26 12:28 ` Nick Piggin
@ 2008-07-26 13:02 ` Andrea Arcangeli
2008-07-26 13:10 ` Nick Piggin
2008-07-26 13:14 ` Nick Piggin
0 siblings, 2 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-26 13:02 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote:
> 3) livelock/starvation problem with TLB holdoff
>
> That's not shooting yourself in the foot if you are forced into the
> design. Definitely there can be indefinite starvation because there
> is no notion of queueing on the range_active count.

I don't see how; it's all per-mm, all other mms can't possibly be held
off. What can happen is that one cpu loops because the other cpu is
mprotecting in a flood forever. Both have to be threads belonging to
the same mm for this holdoff to be meaningful. It's the same app, with
multiple threads. The rest of the system can't even notice whatever
happens in that mm, and they all reschedule all the time, so it can't
starve other tasks. It's not a _global_ hold off! It's a per-mm
holdoff. The counter is in the kvm struct and the kvm struct is bound
to a single mm. Each different kvm struct has a different counter. All
that can happen is that an app shoots itself in the foot; it's actually
easier to run "for (;;)" if it wants to shoot itself in the foot... so
no big deal.

I really can't see any issue with your point 3, and in fact this looks
like a nice locking design to me.

> I'm doing range flushing withing tlb-gathers in the patch I posted which
> does not add a branch for every pte teardown. And I don't really
> consider range_start/end as better, given my reasons.

As you said, you missed tlb_remove_page, and that is the _only_
troublesome one! A bit too easy to skip that one and then claim your
patch is simpler...

Just post a patch that works for real and see if you can avoid
invalidating every pte, or if you have to hack inside
asm-generic/tlb.h... I can't see how you can avoid mangling
asm-generic/tlb.h if you don't want to send IPIs at every single pte
zapped with your patch...

> If I had seen even a single number to show the more complex scheme

Please post a patch that actually works, then we'll re-evaluate what is
the best tradeoff ;).

In the meantime please merge the -mm patches into Linus's tree; this is
taking forever, and if the changes are so small to go Nick's way and
his future "actually working" patch remains so small, it can be applied
incrementally without any problem IMHO, in fact it is presented as an
incremental patch in the first place.

^ permalink raw reply	[flat|nested] 26+ messages in thread
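For reference, the counter/sequence pair Andrea mentions is consumed from the KVM shadow fault path in roughly the following way. This is paraphrased with the error handling trimmed, not a verbatim quote of the kvm code, and the -EAGAIN return convention is invented for the sketch:

/* Paraphrased sketch of the KVM fault-side usage, not verbatim. */
#include <linux/kvm_host.h>

static int sample_kvm_fault(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long mmu_seq;
	pfn_t pfn;

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* may sleep, no locks held */

	spin_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry(vcpu, mmu_seq)) {
		/* an invalidate ran meanwhile: drop the pfn and refault */
		spin_unlock(&vcpu->kvm->mmu_lock);
		kvm_release_pfn_clean(pfn);
		return -EAGAIN;			/* caller retries */
	}
	/* ... install the shadow pte for pfn ... */
	spin_unlock(&vcpu->kvm->mmu_lock);
	return 0;
}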
* Re: MMU notifiers review and some proposals 2008-07-26 13:02 ` Andrea Arcangeli @ 2008-07-26 13:10 ` Nick Piggin 2008-07-26 13:35 ` Andrea Arcangeli 2008-07-26 13:14 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:02:02PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote: > > 3) livelock/starvation problem with TLB holdoff > > > > That's not shooting yourself in the foot if you are forced into the > > design. Definitely there can be indefinite starvation because there > > is no notion of queueing on the range_active count. > > I don't see how, it's all per-mm, all other -mm can't possibly be hold > off. What can happen is that once cpu loops because the other cpu is > mprotecting in a flood forever. Both have to be threads belonging to > the same -mm for this holdoff to be meaningful. It's the same app, > with multiple threads. The rest of the system can't even notice > whatever happens in that -mm and they all reschedule all the time, so > it can't starve other tasks. It's not a _global_ hold off! It's a > per-mm holdoff. The counter is in the kvm struct and the kvm struct is > bind to a single mm. Each different kvm struct has a different > counter. all it can happen that an app is shooting itself in the foot, > it's actually easier to run "for (;;)" if it wants to shoot itself in > the foot... so no big deal. > > I really can't see any issue with your point 3, and infact this looks a > nice locking design to me. I am talking about a number of threads starving another thread of the same process, but that isn't shooting themselves in the foot because they might be doing simple normal operations that don't expect the kernel to cause starvation. > > I'm doing range flushing withing tlb-gathers in the patch I posted which > > does not add a branch for every pte teardown. And I don't really > > consider range_start/end as better, given my reasons. > > As you said you missed tlb_remove_page, that is the _only_ troublesome > one! A bit too easy to skip that one and then claim your patch as > simpler... > > Just post a patch that works for real and see if you can avoid to > invalidate every pte, or if you've to hack inside > asm-generic/tlb.h... I can't see how you can avoid to mangle on the > asm-generic/tlb.h if you don't want to send IPIs at every single pte > zapped with your patch... What's the problem with patching asm-*/tlb.h? They're just other files, but they all form part os the same tlb flushing design. > > If I had seen even a single number to show the more complex scheme > > Please post a patch that actually works then we'll re-evaluate what is > the best tradeoff ;). > > In the meantime please merge -mm patches into Linus's tree, this is > taking forever and if the changes are so small to go Nick's way and > his future "actually working" patch remains so small, it can be > applied incrementally without any problem IMHO, infact it is presented > as an incremental patch in the first place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:10 ` Nick Piggin @ 2008-07-26 13:35 ` Andrea Arcangeli 2008-07-27 12:25 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:35 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:10:15PM +0200, Nick Piggin wrote: > I am talking about a number of threads starving another thread of the > same process, but that isn't shooting themselves in the foot because > they might be doing simple normal operations that don't expect the > kernel to cause starvation. I thought you worried about security issues sorry. Note that each user is free to implement the locks as it wants. Given what you describe is a feature and not a bug for KVM, we use readonly locks and that can't lead to security issues either as you agreed above. But nothing prevents you to use a spinlock or rwlock and take the read side of it from the page fault and the write side of it from the range_start/end. Careful with the rwlock though because that is a read-starved lock so it won't buy you anything, it was just to make an example ;). Surely you can build a fair rwlock if that is what your app requires. For KVM we want to go unfair and purely readonly in the kvm page fault because we know the address space is almost never mangled and that makes it a perfect fit as it'll never starve for all remotely useful cases, and it can't be exploited either. So you're actually commenting on the kvm implementation of the lowlevel methods, but we're not forcing that implementation to all users. You surely can use the mmu notifiers in -mm, to have the secondary mmu page fault block and takeover from the range_start/end and take range_start/end out of the game. So really this isn't a valid complaint... infact the same issue exists for invalidate_page or your invalidate_range. It's up to each user to decide if to make the page fault slower but 100% fair and higher priority than the munmap flood coming from the other threads of the same mm. > What's the problem with patching asm-*/tlb.h? They're just other files, > but they all form part os the same tlb flushing design. Nothing fundamental, but at least for the last half an year apparently moving mmu notifier invalidate calls into arch pte/tlb flushing code was considered a negative in layering terms (I did that initially with ptep_clear_flush). I suggest you make another patch that actually works so we've a better picture of how the changes to tlb-gather looks like. I think the idea of calling invalidate_page before every tlb_remove_page should be rejected. Hope also this shows how those discussions are endless and pointless if there's nothing we can deploy to the KVM users in the first place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-26 13:35 ` Andrea Arcangeli
@ 2008-07-27 12:25 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:25 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 03:35:25PM +0200, Andrea Arcangeli wrote:
> On Sat, Jul 26, 2008 at 03:10:15PM +0200, Nick Piggin wrote:
> > I am talking about a number of threads starving another thread of the
> > same process, but that isn't shooting themselves in the foot because
> > they might be doing simple normal operations that don't expect the
> > kernel to cause starvation.
>
> I thought you worried about security issues sorry.
>
> Note that each user is free to implement the locks as it wants. Given
> what you describe is a feature and not a bug for KVM, we use readonly
> locks and that can't lead to security issues either as you agreed
> above.

Well yes, KVM has a very controlled type of process which will have
notifiers registered. But KVM isn't particularly the interesting case.
GRU is going to have perhaps arbitrarily behaving tasks registering mmu
notifiers, and they probably won't even know about it because it will
probably be from a library call.

> But nothing prevents you to use a spinlock or rwlock and take the read
> side of it from the page fault and the write side of it from the
> range_start/end. Careful with the rwlock though because that is a
> read-starved lock so it won't buy you anything, it was just to make an
> example ;). Surely you can build a fair rwlock if that is what your
> app requires. For KVM we want to go unfair and purely readonly in the
> kvm page fault because we know the address space is almost never
> mangled and that makes it a perfect fit as it'll never starve for all
> remotely useful cases, and it can't be exploited either.
>
> So you're actually commenting on the kvm implementation of the
> lowlevel methods, but we're not forcing that implementation to all
> users. You surely can use the mmu notifiers in -mm, to have the
> secondary mmu page fault block and takeover from the range_start/end
> and take range_start/end out of the game.

I was commenting on GRU mainly. But no, you can't really do anything
else with the range callbacks, because range_end is called after the
pages are released. Or do you mean have the range invalidate actually
make the page fault handler spin-wait rather than retry? I don't really
like that either, although it may avoid the complete starvation.

> So really this isn't a valid complaint... infact the same issue exists
> for invalidate_page or your invalidate_range. It's up to each user to
> decide if to make the page fault slower but 100% fair and higher
> priority than the munmap flood coming from the other threads of the
> same mm.

Actually it won't have to. Because the invalidate callback runs after
the linux ptes have gone away, we can basically allow the page fault
just to either find the linux pte if it runs first (in which case the
invalidate callback just has to invalidate it); or it finds no linux
pte and segfaults.

> > What's the problem with patching asm-*/tlb.h? They're just other files,
> > but they all form part os the same tlb flushing design.
>
> Nothing fundamental, but at least for the last half an year apparently
> moving mmu notifier invalidate calls into arch pte/tlb flushing code
> was considered a negative in layering terms (I did that initially with
> ptep_clear_flush).
>
> I suggest you make another patch that actually works so we've a better
> picture of how the changes to tlb-gather looks like. I think the idea
> of calling invalidate_page before every tlb_remove_page should be
> rejected.
>
> Hope also this shows how those discussions are endless and pointless
> if there's nothing we can deploy to the KVM users in the first place.

That's true, you need to start with something as simple as possible
until you have some use cases to start with, in which case you can look
at more complexity for more performance...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:02 ` Andrea Arcangeli 2008-07-26 13:10 ` Nick Piggin @ 2008-07-26 13:14 ` Nick Piggin 2008-07-26 13:49 ` Andrea Arcangeli 2008-07-30 14:19 ` Christoph Lameter 1 sibling, 2 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:14 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:02:02PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote: > > > If I had seen even a single number to show the more complex scheme > > Please post a patch that actually works then we'll re-evaluate what is > the best tradeoff ;). > > In the meantime please merge -mm patches into Linus's tree, this is > taking forever and if the changes are so small to go Nick's way and > his future "actually working" patch remains so small, it can be > applied incrementally without any problem IMHO, infact it is presented > as an incremental patch in the first place. BTW. has anyone else actually looked at mmu notifiers or have an opinion on this? It might be helpful for me to get someone else's perspective. I hate to cause conflict but obviously I think I have legitimate concerns so I have to raise them... ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:14 ` Nick Piggin @ 2008-07-26 13:49 ` Andrea Arcangeli 2008-07-27 12:32 ` Nick Piggin 2008-07-30 14:19 ` Christoph Lameter 1 sibling, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:49 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:14:50PM +0200, Nick Piggin wrote: > BTW. has anyone else actually looked at mmu notifiers or have an > opinion on this? It might be helpful for me to get someone else's > perspective. My last submission was for -mm on 26 Jun, and all these developers and lists were in CC: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>, Christoph Lameter <clameter@sgi.com>, Jack Steiner <steiner@sgi.com>, Robin Holt <holt@sgi.com>, Nick Piggin <npiggin@suse.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>, kvm@vger.kernel.org, Kanoj Sarcar <kanojsarcar@yahoo.com>, Roland Dreier <rdreier@cisco.com>, Steve Wise <swise@opengridcomputing.com>, linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>, linux-mm@kvack.org, general@lists.openfabrics.org, Hugh Dickins <hugh@veritas.com>, Rusty Russell <rusty@rustcorp.com.au>, Anthony Liguori <aliguori@us.ibm.com>, Chris Wright <chrisw@redhat.com>, Marcelo Tosatti <marcelo@kvack.org>, Eric Dumazet <dada1@cosmosbay.com>, "Paul E. McKenney" <paulmck@us.ibm.com>, Izik Eidus <izike@qumranet.com>, Anthony Liguori <aliguori@us.ibm.com>, Rik van Riel <riel@redhat.com> The ones explicitly agreeing (about all or part depending on the areas of interest, and not just of the first patch adding the new list.h function which is mostly unrelated) were Linus, Christoph, Jack, Robin, Avi, Marcelo, Rik and last but not the least Paul. Everyone else in the list implicitly agrees I assume, hope they're not all waiting 1 month before commenting on it like you did ;). Avi, me, Jack and Robin are the main users of the feature (or at least the main users that are brave enough to be visible on lkml) so that surely speaks well for the happiness of the mmu notifier users about what is in -mm. Infact it is almost a sure thing that the users will always prefer the current patches compared to the minimal notifier. But I also wear a VM (as in virtual memory not virtual machine ;) hat not just a KVM hat, so I surely wouldn't have submitted something that I think is bad for the VM. Infact I opposed certain patches made specifically for XPMEM that could hurt the VM a micro-bit (mostly thinking at UP cellphones). Still I offered to support XPMEM but with a lower priority and done right. I don't happen to dislike mm_take_all_locks, as it's totally localized and _can_never_run_ unless you load one of those kvm or gru modules. I'd rather prefer mmu notifiers to be invisible to the tlb-gather logic, surely it'd be orders of magnitude simpler to delete mm_take_all_locks than to undo the changes to the tlb-gather logic. So if something we should go with -mm first, and then evaluate if the tlb-gather changes are better/worse. ^ permalink raw reply [flat|nested] 26+ messages in thread
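For anyone who has not read the -mm code, the registration-time use of mm_take_all_locks being debated has roughly this shape. This is a simplified sketch: the mmu_notifier_mm allocation, the notifier-list locking and the error paths are omitted, so it is not a verbatim copy of the -mm code.

/* Simplified sketch of the -mm registration path, not verbatim. */
int sample_mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
	int ret;

	down_write(&mm->mmap_sem);
	ret = mm_take_all_locks(mm);	/* all anon_vma and i_mmap locks */
	if (ret)
		goto out;

	/*
	 * With every lock an invalidate path would need now held, no
	 * invalidate can be in flight, so the new notifier cannot miss
	 * one while it is being hooked in.
	 */
	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);

	mm_drop_all_locks(mm);
out:
	up_write(&mm->mmap_sem);
	return ret;
}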
* Re: MMU notifiers review and some proposals 2008-07-26 13:49 ` Andrea Arcangeli @ 2008-07-27 12:32 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-27 12:32 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:49:15PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 03:14:50PM +0200, Nick Piggin wrote: > > But I also wear a VM (as in virtual memory not virtual machine ;) hat > not just a KVM hat, so I surely wouldn't have submitted something that > I think is bad for the VM. Infact I opposed certain patches made > specifically for XPMEM that could hurt the VM a micro-bit (mostly > thinking at UP cellphones). Still I offered to support XPMEM but with > a lower priority and done right. BTW. I don't like this approach especially for XPMEM the infinite starvation will probably be a much bigger issue if tasks can go to sleep there. Perhaps priority inversion problems too. Also making the locks into sleeping locks I don't know if it is such a good approach. We have gone the other way for some reasons in the past, so they have to be addressed. > I don't happen to dislike mm_take_all_locks, as it's totally localized > and _can_never_run_ unless you load one of those kvm or gru > modules. I'd rather prefer mmu notifiers to be invisible to the > tlb-gather logic, surely it'd be orders of magnitude simpler to delete > mm_take_all_locks than to undo the changes to the tlb-gather logic. So > if something we should go with -mm first, and then evaluate if the > tlb-gather changes are better/worse. The thing about mm_take_all_locks that I don't think you quite appreciate is that it isn't totally localized. It now adds a contract to the rest of the VM to say it isn't allowed to invalidate anything while it is being run. If it literally was self contained and zero impact, of course I wouldn't care about it from the core VM perspective, but I still think it might be a bad idea for some of the GRU (and XPMEM) use cases. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:14 ` Nick Piggin 2008-07-26 13:49 ` Andrea Arcangeli @ 2008-07-30 14:19 ` Christoph Lameter 2008-07-30 14:54 ` Andrea Arcangeli 1 sibling, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2008-07-30 14:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Nick Piggin wrote: > BTW. has anyone else actually looked at mmu notifiers or have an > opinion on this? It might be helpful for me to get someone else's > perspective. Yes we have had so much talk about this that I am a bit tired of talking about it. I vaguely remember bringing up the same point a couple of months ago. If you can make it work then great. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-30 14:19 ` Christoph Lameter
@ 2008-07-30 14:54 ` Andrea Arcangeli
2008-07-30 15:42 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-30 14:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, linux-arch, steiner

On Wed, Jul 30, 2008 at 09:19:44AM -0500, Christoph Lameter wrote:
> Yes we have had so much talk about this that I am a bit tired of
> talking about it. I vaguely remember bringing up the same point a
> couple of months ago. If you can make it work then great.

I think the current implementation is fine for the long run: it can
provide the fastest performance when armed, and each invalidate either
requires IPIs or it may need to access the southbridge, so when freeing
large areas of memory it's good to be able to do a single invalidate.

If I can add another comment: I think if a new user of
mm_take_all_locks shows up, that will further confirm that such a
method is useful and should stay. Of course it needs to be a legitimate
usage that allows improving the performance of the fast paths, like the
mmu-notifier usage does. And if Nick's right that mm_take_all_locks
will ever become a limitation, removing it is trivial, much much
simpler than undoing mmu-notifier changes to tlb-gather. So until Nick
goes ahead and removes the anon_vma->lock (and I don't think that's
feasible without screwing up other paths in ways much more troublesome
than mm_take_all_locks), I think this is fine to stay. If you have
trouble removing anon_vma->lock, it won't be because of
mm_take_all_locks, be sure ;). If you ever get there, we'll add
invalidate_page before tlb_remove_page and be done with it for the
benefit of the VM, no problem at all.

If we ever need to add a scheduling capability to mmu notifiers (like
for XPMEM or perhaps in the future infiniband), that's nearly trivially
feasible too, without having to alter the API at all (something not
feasible with other implementations).

Nevertheless I'm ok if we want to alter the implementation in the
future for whatever good/wrong reason: the only important thing to me
is that from now on all kernels will have this functionality one way or
another, because KVM already depends on it and it swaps much better
now! The current implementation is bug-free, well tested, looks great
to me and there's no urgency to alter it. It's surely what all the
mmu-notifier users prefer, and I want to thank everyone for the help in
getting here and all the good/bad feedback provided that helped improve
the code so much.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-30 14:54 ` Andrea Arcangeli @ 2008-07-30 15:42 ` Christoph Lameter 2008-07-31 6:14 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2008-07-30 15:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Andrea Arcangeli wrote: > I think the current implementation is fine for the long run, it can > provide the fastest performance when armed, and each invalidate either > requires IPIs or it may may need to access the southbridge, so when > freeing large areas of memory it's good being able to do a single > invalidate. Right. A couple of months ago we had this discussion and agreed that the begin / end was the way to go. I still support that decision. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-30 15:42 ` Christoph Lameter @ 2008-07-31 6:14 ` Nick Piggin 2008-07-31 14:19 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-31 6:14 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner On Wed, Jul 30, 2008 at 10:42:12AM -0500, Christoph Lameter wrote: > Andrea Arcangeli wrote: > > > I think the current implementation is fine for the long run, it can > > provide the fastest performance when armed, and each invalidate either > > requires IPIs or it may may need to access the southbridge, so when > > freeing large areas of memory it's good being able to do a single > > invalidate. > > Right. A couple of months ago we had this discussion and agreed that the begin / end was the way to go. I still support that decision. That's OK. We don't have to make decisions just by people supporting one way or the other, because I'll come up with some competing patches and if they turn out to be significantly simpler to the core VM without having a significant negative impact on performance then naturally everybody should be happy to merge them, so nobody has to argue with handwaving. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-31 6:14 ` Nick Piggin @ 2008-07-31 14:19 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2008-07-31 14:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Nick Piggin wrote: > That's OK. We don't have to make decisions just by people supporting one > way or the other, because I'll come up with some competing patches and > if they turn out to be significantly simpler to the core VM without having > a significant negative impact on performance then naturally everybody should > be happy to merge them, so nobody has to argue with handwaving. We make decisions based on technical issues. If you can come up with a solution that addresses the issues (please review the earlier discussion on the subject matter) then we will all be happy to see that merged. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli 2008-07-26 12:28 ` Nick Piggin @ 2008-07-26 12:33 ` Nick Piggin 2008-07-26 13:04 ` Nick Piggin 2 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 12:33 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote: > > Anyway, I just voice my opinion and let Andrew and Linus decide. To be > > clear: I have not found any actual bugs in Andrea's -mm patchset, only > > some dislikes of the approach. > > Yes, like I said I think this is a matter of taste of what you like of > the tradeoff. There are disadvantages and advantages in both and if we > wait forever to please everyone taste, it'll never go in. And for this item, I think there has been a bit too much emphasis on pleasing the taste of the drivers and not enough on the core VM. My concern about adding a new TLB flushing design to core VM was never taken seriously, for example. Nor was my request for performance numbers. And I did ask early in the year. I believe when you don't have any real numbers for justification, the only sane thing to do is go with the most minimal and simplest implementation first, and add complexity if/when it can be justified. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli 2008-07-26 12:28 ` Nick Piggin 2008-07-26 12:33 ` Nick Piggin @ 2008-07-26 13:04 ` Nick Piggin 2008-07-26 13:16 ` Andrea Arcangeli 2 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > > 1) absolute minimal intrusion into the kernel common code, and > absolute minimum number of branches added to the kernel fast > paths. Kernel is faster than your "minimal" type of notifiers when > they're disarmed. BTW. is this really significant? Having one branch per pte I don't think is necessarily slower than 2 branches per unmap. The 2 branches will use more icache and more branch history. One branch even once per pte in the unmapping loop is going to remain hot in icache and branch history isn't it? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:04 ` Nick Piggin @ 2008-07-26 13:16 ` Andrea Arcangeli 2008-07-27 12:08 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:16 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:04:06PM +0200, Nick Piggin wrote: > On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > > > > 1) absolute minimal intrusion into the kernel common code, and > > absolute minimum number of branches added to the kernel fast > > paths. Kernel is faster than your "minimal" type of notifiers when > > they're disarmed. > > BTW. is this really significant? Having one branch per pte > I don't think is necessarily slower than 2 branches per unmap. > > The 2 branches will use more icache and more branch history. One > branch even once per pte in the unmapping loop is going to remain > hot in icache and branch history isn't it? Even if branch-predicted and icached, it's still more executable to compute in a tight loop. Even if quick it'll accumulate cycles. Said that perhaps you're right that my point 1 wasn't that important or not a tangible positive, but surely doing a secondary mmu invalidate for each pte zapped isn't ideal... that's the whole point of the tlb-gather logic, nobody wants to do that not even for the primary tlb, and surely not for the secondary-mmu that may not even be as fast as the primary-tlb at invalidating. Hence the very simple patch is clearly inferior when they're armed (if only equivalent when they're disarmed)... I think we can argue once you've reduced the frequency of the secondary mmu invalidates of a factor of 500 by mangling over the tlb gather logic per-arch. ^ permalink raw reply [flat|nested] 26+ messages in thread
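Put concretely, the frequency argument is about the difference between these two shapes of teardown loop. This is schematic C only: the sample_* and *_stub helpers are invented, and mmu_notifier_invalidate_range() is assumed to be the single range hook discussed earlier in the thread.

/* Schematic only; sample_* and *_stub names are invented. */
static void sample_clear_one_pte(struct mm_struct *mm, unsigned long addr);
static void flush_tlb_range_stub(struct mm_struct *mm, unsigned long start,
				 unsigned long end);

/* (a) a hook per pte: one secondary-MMU call for every page torn down */
static void sample_zap_per_pte(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		sample_clear_one_pte(mm, addr);
		mmu_notifier_invalidate_page(mm, addr);	/* N calls */
	}
	flush_tlb_range_stub(mm, start, end);
}

/* (b) a hook per range: the secondary flush is batched like the CPU one */
static void sample_zap_per_range(struct mm_struct *mm, unsigned long start,
				 unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		sample_clear_one_pte(mm, addr);
	flush_tlb_range_stub(mm, start, end);
	mmu_notifier_invalidate_range(mm, start, end);	/* 1 call */
}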
* Re: MMU notifiers review and some proposals
2008-07-26 13:16 ` Andrea Arcangeli
@ 2008-07-27 12:08 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:08 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 03:16:51PM +0200, Andrea Arcangeli wrote:
> On Sat, Jul 26, 2008 at 03:04:06PM +0200, Nick Piggin wrote:
> > On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote:
> > >
> > > 1) absolute minimal intrusion into the kernel common code, and
> > > absolute minimum number of branches added to the kernel fast
> > > paths. Kernel is faster than your "minimal" type of notifiers when
> > > they're disarmed.
> >
> > BTW. is this really significant? Having one branch per pte
> > I don't think is necessarily slower than 2 branches per unmap.
> >
> > The 2 branches will use more icache and more branch history. One
> > branch even once per pte in the unmapping loop is going to remain
> > hot in icache and branch history isn't it?
>
> Even if branch-predicted and icached, it's still more executable to
> compute in a tight loop. Even if quick it'll accumulate cycles. Said

True, but having 2 branches and more icache is more likely to be a
branch mispredict or icache miss, which costs a *lot* more than cached,
predicted branches do. It's all speculation, but my point is that it is
not accurate to say my version would be slower, because in some cases
it would be the opposite.

> that perhaps you're right that my point 1 wasn't that important or not
> a tangible positive, but surely doing a secondary mmu invalidate for
> each pte zapped isn't ideal... that's the whole point of the
> tlb-gather logic, nobody wants to do that not even for the primary
> tlb, and surely not for the secondary-mmu that may not even be as fast
> as the primary-tlb at invalidating. Hence the very simple patch is
> clearly inferior when they're armed (if only equivalent when they're
> disarmed)...

See, the thing about that is I don't actually dispute that in some
cases the range start/end case will definitely be faster. However,
firstly KVM as you say doesn't really care, and secondly we don't have
numbers for GRU (I'm talking about start/end vs gather).

> I think we can argue once you've reduced the frequency of the
> secondary mmu invalidates of a factor of 500 by mangling over the tlb
> gather logic per-arch.

OK, we'll see...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin 2008-07-24 14:39 ` Nick Piggin 2008-07-25 21:45 ` Andrea Arcangeli @ 2008-07-25 23:29 ` Jack Steiner 2008-07-26 3:18 ` Nick Piggin 2008-07-27 9:45 ` Andrew Morton 3 siblings, 1 reply; 26+ messages in thread From: Jack Steiner @ 2008-07-25 23:29 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, cl > > I also think we should be cautious to make this slowpath so horribly > slow. For KVM it's fine, but perhaps for GRU we'd actually want to > register quite a lot of notifiers, unregister them, and perhaps even > do per-vma registrations if we don't want to slow down other memory > management operations too much (eg. if it is otherwise just some > regular processes just doing accelerated things with GRU). I can't speak for other possible users of the mmu notifier mechanism but the GRU will only register a single notifier for a process. The notifier is registered when the GRU is opened and (currently) unregistered when the GRU is closed. If a task opens multiple GRUs (quite possible for threaded apps), all shared the same notifier. Regardless of the number of threads/opens, there will be a single notifier. At least one version of mmu notifiers did not support unregistration. The notifier was left in the chain until the task exited. This also worked for the GRU. If the gru was subsequently reopened the notifier was reused. Providing the ability to unregister is the right thing to do if the cost is not excessive. From a practical standpoint, for the GRU usage, unregistration is not a big issue either way. --- jack ^ permalink raw reply [flat|nested] 26+ messages in thread
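The single-notifier-per-mm arrangement Jack describes would look roughly like this. It is a simplified sketch built around the gru_mm_struct fields visible in the patch earlier in the thread; the mm lookup helper is invented and the locking is omitted, so it is not the actual driver code.

/* Simplified sketch, not the actual GRU driver code. */
static struct gru_mm_struct *sample_find_notifier(struct mm_struct *mm);

static struct gru_mm_struct *sample_register_mmu_notifier(void)
{
	struct gru_mm_struct *gms;
	int ret;

	gms = sample_find_notifier(current->mm);	/* invented lookup */
	if (gms) {
		atomic_inc(&gms->ms_refcnt);	/* second open / thread */
		return gms;
	}

	gms = kzalloc(sizeof(*gms), GFP_KERNEL);
	if (!gms)
		return ERR_PTR(-ENOMEM);

	atomic_set(&gms->ms_refcnt, 1);
	init_waitqueue_head(&gms->ms_wait_queue);
	gms->ms_notifier.ops = &gru_mmuops;
	ret = mmu_notifier_register(&gms->ms_notifier, current->mm);
	if (ret) {
		kfree(gms);
		return ERR_PTR(ret);
	}
	return gms;
}

static void sample_unregister_mmu_notifier(struct gru_mm_struct *gms)
{
	if (atomic_dec_and_test(&gms->ms_refcnt)) {
		mmu_notifier_unregister(&gms->ms_notifier, current->mm);
		kfree(gms);
	}
}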
* Re: MMU notifiers review and some proposals 2008-07-25 23:29 ` Jack Steiner @ 2008-07-26 3:18 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 3:18 UTC (permalink / raw) To: Jack Steiner Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, cl On Fri, Jul 25, 2008 at 06:29:09PM -0500, Jack Steiner wrote: > > > > I also think we should be cautious to make this slowpath so horribly > > slow. For KVM it's fine, but perhaps for GRU we'd actually want to > > register quite a lot of notifiers, unregister them, and perhaps even > > do per-vma registrations if we don't want to slow down other memory > > management operations too much (eg. if it is otherwise just some > > regular processes just doing accelerated things with GRU). > > I can't speak for other possible users of the mmu notifier mechanism > but the GRU will only register a single notifier for a process. > The notifier is registered when the GRU is opened and (currently) > unregistered when the GRU is closed. > > If a task opens multiple GRUs (quite possible for threaded apps), all > shared the same notifier. Regardless of the number of threads/opens, > there will be a single notifier. OK, sure, but you might have a lot of processes opening the GRU, no? (I'm not sure how it is exactly going to be used, but if you have a number of GRU units per blade...) So you have a machine with 4096 CPUs, and maybe 4096 processes that each have to lock the mapping of exec, glibc, pthreads, grulib, etc etc. Or you might have a situation where many short lived processes do some operating on GRU. Anyway, I won't dispute your data point (thanks for that), but still noting that a very heavy registration is obviously less desirable than a faster one, all else being equal... ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin ` (2 preceding siblings ...) 2008-07-25 23:29 ` Jack Steiner @ 2008-07-27 9:45 ` Andrew Morton 2008-07-27 12:38 ` Nick Piggin 3 siblings, 1 reply; 26+ messages in thread From: Andrew Morton @ 2008-07-27 9:45 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Memory Management List, linux-arch, andrea, steiner, cl On Thu, 24 Jul 2008 16:39:49 +0200 Nick Piggin <npiggin@suse.de> wrote: > I think everybody is hoping to have a workable mmu notifier scheme > merged in 2.6.27 (myself included). However I do have some concerns > about the implementation proposed (in -mm). > > I apologise for this late review, before anybody gets too upset, > most of my concerns have been raised before, but I'd like to state > my case again and involving everyone. Nick, having read through this discussion and the code (yet again) I think I'll go ahead and send it all in to Linus. On the basis that - the code is fairly short and simple - has no known bugs - seems to be needed by some folks ;) - you already have a protopatch which partially addresses your concerns and afaik there's nothing blocking future improvements to this implementation? And a late-breaking review comment: given that about 0.000000000000001% of people will actually use mm_take_all_locks(), could we make its compilation conditional on something? Such as CONFIG_MMU_NOTIFIER? ^ permalink raw reply [flat|nested] 26+ messages in thread
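Something along these lines would do for the conditional compilation Andrew suggests. This is a sketch only; a real patch would also cover the declarations and keep the stubs out of sight of any other potential callers.

/* include/linux/mm.h -- sketch */
#ifdef CONFIG_MMU_NOTIFIER
extern int mm_take_all_locks(struct mm_struct *mm);
extern void mm_drop_all_locks(struct mm_struct *mm);
#endif

/* mm/mmap.c -- sketch: only build the helpers when a user can exist */
#ifdef CONFIG_MMU_NOTIFIER
int mm_take_all_locks(struct mm_struct *mm)
{
	/* ... existing implementation unchanged ... */
}

void mm_drop_all_locks(struct mm_struct *mm)
{
	/* ... existing implementation unchanged ... */
}
#endif /* CONFIG_MMU_NOTIFIER */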
* Re: MMU notifiers review and some proposals
2008-07-27 9:45 ` Andrew Morton
@ 2008-07-27 12:38 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Linux Memory Management List, linux-arch, andrea,
steiner, cl

On Sun, Jul 27, 2008 at 02:45:20AM -0700, Andrew Morton wrote:
> On Thu, 24 Jul 2008 16:39:49 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > I think everybody is hoping to have a workable mmu notifier scheme
> > merged in 2.6.27 (myself included). However I do have some concerns
> > about the implementation proposed (in -mm).
> >
> > I apologise for this late review, before anybody gets too upset,
> > most of my concerns have been raised before, but I'd like to state
> > my case again and involving everyone.
>
> Nick, having read through this discussion and the code (yet again) I
> think I'll go ahead and send it all in to Linus. On the basis that

Your call. As I said, I was OK with you using your judgement.

> - the code is fairly short and simple
>
> - has no known bugs
>
> - seems to be needed by some folks ;)
>
> - you already have a protopatch which partially addresses your
> concerns and afaik there's nothing blocking future improvements to
> this implementation?

Seems like I was able to get my head around KVM and GRU more easily
than I had feared. I don't expect help, but I will say that good
simplifications will always get merged unless there is a reasonable
performance case not to, so I will work on patches.

> And a late-breaking review comment: given that about 0.000000000000001%
> of people will actually use mm_take_all_locks(), could we make its
> compilation conditional on something? Such as CONFIG_MMU_NOTIFIER?

Well, I prefer that, because I don't want any new callers to pop up.

^ permalink raw reply	[flat|nested] 26+ messages in thread