* MMU notifiers review and some proposals
@ 2008-07-24 14:39 Nick Piggin
From: Nick Piggin @ 2008-07-24 14:39 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, andrea, steiner, cl
Hi,
I think everybody is hoping to have a workable mmu notifier scheme
merged in 2.6.27 (myself included). However I do have some concerns
about the implementation proposed (in -mm).
I apologise for this late review; before anybody gets too upset,
most of my concerns have been raised before, but I'd like to state
my case again and involve everyone.
So my concerns.
Firstly, I dislike mm_take_all_locks. Although it isn't too complex
and is quite well contained, it unfortunately somewhat ties our hands
when it comes to changing the fastpath locking. For example, it
might make it difficult to remove the locks from rmap walks now (not
that there aren't plenty of other difficulties, but as an example).
I also think we should be cautious about making this slowpath so
horribly slow. For KVM it's fine, but for GRU perhaps we'd actually
want to register quite a lot of notifiers, unregister them, and
perhaps even do per-vma registrations if we don't want to slow down
other memory management operations too much (eg. if it is otherwise
just some regular processes doing accelerated things with GRU).
I think it is possible to come up with a pretty good implementation
that replaces mm_take_all_locks with something faster that uses fewer
locks. I haven't quite got a detailed sketch yet, but if anyone is
interested, let me know.
However, I think the whole reason for take_all_locks is to support
the invalidate_range_start/invalidate_range_end callbacks. I think
these are slightly odd because they introduce a second TLB shootdown
scheme to the kernel: presently we downgrade pte permissions, then
shoot down TLBs, and then free any pages; the range callbacks first
prevent new TLBs being set up, then shoot down existing TLBs, then
downgrade ptes, then free pages.
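To make the ordering difference concrete, here is a rough sketch of
the two sequences (purely illustrative, not lifted from any one call
site; the function names are the ones used by this patch and by the
-mm code respectively):

	/* Existing scheme (kept by the patch below) */
	/* 1. downgrade/clear the ptes and notify the secondary MMU */
	pte = ptep_clear_flush_notify(vma, addr, ptep);
	/* 2. CPU TLBs are flushed (here, or batched via the tlb gather) */
	/* 3. only now may the pages be freed */

	/* Range-callback scheme (current -mm) */
	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... no new sptes allowed: clear ptes, flush CPU TLBs, free pages ... */
	mmu_notifier_invalidate_range_end(mm, start, end);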
That's not to say this style will not work, but I prefer to keep
to our existing scheme. Obviously it is better proven, and it is
also good to have fewer schemes if possible.
What are the pros of the new scheme? Well, performance, AFAIKS.
Large unmappings may not fit in the TLB gather API (and some sites
do not use TLB gather at all), but with
invalidate_range_start/invalidate_range_end we can still flush the
entire range in one call.
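For reference, the conversion below gets similar batching by
notifying at the points where the CPU flush already happens, so a
large unmap costs one notifier call per TLB-gather batch rather than
one per page; in unmap_vmas() that boils down to roughly:

	mmu_notifier_invalidate_range(mm, tlb_start, start);	/* secondary MMUs */
	tlb_finish_mmu(*tlbp, tlb_start, start);		/* CPU TLBs */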
I would counter that our existing CPU flushing has held up fairly
well, and it can also be pretty expensive if it has to do lots of
IPIs. I don't have a problem with trying to maximise performance,
but I really feel this is one place where I'd prefer to see
numbers first.
One functional con I see of the
invalidate_range_start/invalidate_range_end style is that the
implementations hold off new TLB insertions by incrementing a
counter, so poor parallelism or starvation could actually become an
issue with multiple threads.
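To illustrate what I mean, this is roughly what the -mm KVM
implementation does today (abridged from the code that the patch
below removes): every invalidation bumps a counter under mmu_lock,
and the KVM page fault refuses to establish any spte, anywhere in
the address space, while that counter is non-zero:

	/* ->invalidate_range_start() */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	/* page faults must retry while != 0 */
	/* ... zap sptes in [start, end) ... */
	spin_unlock(&kvm->mmu_lock);

	/* ->invalidate_range_end() */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);

	/* KVM page fault path (mmu_notifier_retry) */
	if (unlikely(vcpu->kvm->mmu_notifier_count))
		return 1;		/* an invalidate is in flight: retry */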
So anyway, I've attached a patch switching mmu notifiers to our
traditional style of TLB invalidation, and attempted to convert KVM
and GRU to use it. Also keep in mind that after this the
mm_take_all_locks patches should not be needed. Unfortunately I was
not able to actually test this (but testing typically would not find
the really subtle problems easily anyway).
---
drivers/misc/sgi-gru/grufault.c | 145 ++++---------------------------------
drivers/misc/sgi-gru/grutables.h | 1
drivers/misc/sgi-gru/grutlbpurge.c | 37 +--------
include/linux/kvm_host.h | 10 --
include/linux/mmu_notifier.h | 97 ++----------------------
mm/fremap.c | 4 -
mm/hugetlb.c | 3
mm/memory.c | 18 ++--
mm/mmap.c | 2
mm/mmu_notifier.c | 43 +---------
mm/mprotect.c | 3
mm/mremap.c | 4 -
virt/kvm/kvm_main.c | 65 +---------------
13 files changed, 61 insertions(+), 371 deletions(-)
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -65,60 +65,12 @@ struct mmu_notifier_ops {
* Before this is invoked any secondary MMU is still ok to
* read/write to the page previously pointed to by the Linux
* pte because the page hasn't been freed yet and it won't be
- * freed until this returns. If required set_page_dirty has to
- * be called internally to this method.
+ * freed until this returns. Also, importantly the address can't
+ * be repopulated with some other page, so all MMUs retain a
+ * coherent view of memory. If required set_page_dirty has to
+ * be called internally to this method. Not optional.
*/
- void (*invalidate_page)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address);
-
- /*
- * invalidate_range_start() and invalidate_range_end() must be
- * paired and are called only when the mmap_sem and/or the
- * locks protecting the reverse maps are held. The subsystem
- * must guarantee that no additional references are taken to
- * the pages in the range established between the call to
- * invalidate_range_start() and the matching call to
- * invalidate_range_end().
- *
- * Invalidation of multiple concurrent ranges may be
- * optionally permitted by the driver. Either way the
- * establishment of sptes is forbidden in the range passed to
- * invalidate_range_begin/end for the whole duration of the
- * invalidate_range_begin/end critical section.
- *
- * invalidate_range_start() is called when all pages in the
- * range are still mapped and have at least a refcount of one.
- *
- * invalidate_range_end() is called when all pages in the
- * range have been unmapped and the pages have been freed by
- * the VM.
- *
- * The VM will remove the page table entries and potentially
- * the page between invalidate_range_start() and
- * invalidate_range_end(). If the page must not be freed
- * because of pending I/O or other circumstances then the
- * invalidate_range_start() callback (or the initial mapping
- * by the driver) must make sure that the refcount is kept
- * elevated.
- *
- * If the driver increases the refcount when the pages are
- * initially mapped into an address space then either
- * invalidate_range_start() or invalidate_range_end() may
- * decrease the refcount. If the refcount is decreased on
- * invalidate_range_start() then the VM can free pages as page
- * table entries are removed. If the refcount is only
- * droppped on invalidate_range_end() then the driver itself
- * will drop the last refcount but it must take care to flush
- * any secondary tlb before doing the final free on the
- * page. Pages will no longer be referenced by the linux
- * address space but may still be referenced by sptes until
- * the last refcount is dropped.
- */
- void (*invalidate_range_start)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end);
- void (*invalidate_range_end)(struct mmu_notifier *mn,
+ void (*invalidate_range)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end);
};
@@ -154,11 +106,7 @@ extern void __mmu_notifier_mm_destroy(st
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long address);
-extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
-extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);
static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -175,25 +123,11 @@ static inline int mmu_notifier_clear_flu
return 0;
}
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
-}
-
-static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
-}
-
-static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range(mm, start, end);
}
static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -221,7 +155,8 @@ static inline void mmu_notifier_mm_destr
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
__pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
+ mmu_notifier_invalidate_range(___vma->vm_mm, ___address, \
+ ___address + PAGE_SIZE); \
__pte; \
})
@@ -248,17 +183,7 @@ static inline int mmu_notifier_clear_flu
return 0;
}
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
-}
-
-static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
-}
-
-static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
}
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -32,7 +32,7 @@ static void zap_pte(struct mm_struct *mm
struct page *page;
flush_cache_page(vma, addr, pte_pfn(pte));
- pte = ptep_clear_flush(vma, addr, ptep);
+ pte = ptep_clear_flush_notify(vma, addr, ptep);
page = vm_normal_page(vma, addr, pte);
if (page) {
if (pte_dirty(pte))
@@ -226,9 +226,7 @@ asmlinkage long sys_remap_file_pages(uns
vma->vm_flags = saved_flags;
}
- mmu_notifier_invalidate_range_start(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
- mmu_notifier_invalidate_range_end(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
/*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -1683,7 +1683,6 @@ void __unmap_hugepage_range(struct vm_ar
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));
- mmu_notifier_invalidate_range_start(mm, start, end);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -1725,7 +1724,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -677,9 +677,6 @@ int copy_page_range(struct mm_struct *ds
* parent mm. And a permission downgrade will only happen if
* is_cow_mapping() returns true.
*/
- if (is_cow_mapping(vma->vm_flags))
- mmu_notifier_invalidate_range_start(src_mm, addr, end);
-
ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
@@ -695,8 +692,8 @@ int copy_page_range(struct mm_struct *ds
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
if (is_cow_mapping(vma->vm_flags))
- mmu_notifier_invalidate_range_end(src_mm,
- vma->vm_start, end);
+ mmu_notifier_invalidate_range(src_mm, vma->vm_start, end);
+
return ret;
}
@@ -903,7 +900,6 @@ unsigned long unmap_vmas(struct mmu_gath
int fullmm = (*tlbp)->fullmm;
struct mm_struct *mm = vma->vm_mm;
- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -951,6 +947,7 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
+ mmu_notifier_invalidate_range(mm, tlb_start, start);
tlb_finish_mmu(*tlbp, tlb_start, start);
if (need_resched() ||
@@ -968,7 +965,6 @@ unsigned long unmap_vmas(struct mmu_gath
}
}
out:
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start; /* which is now the end (or restart) address */
}
@@ -991,8 +987,10 @@ unsigned long zap_page_range(struct vm_a
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
+ if (tlb) {
+ mmu_notifier_invalidate_range(mm, address, end);
tlb_finish_mmu(tlb, address, end);
+ }
return end;
}
EXPORT_SYMBOL_GPL(zap_page_range);
@@ -1668,7 +1666,6 @@ int apply_to_page_range(struct mm_struct
int err;
BUG_ON(addr >= end);
- mmu_notifier_invalidate_range_start(mm, start, end);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1676,7 +1673,8 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
+
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1794,6 +1794,7 @@ static void unmap_region(struct mm_struc
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
+ mmu_notifier_invalidate_range(mm, start, end);
tlb_finish_mmu(tlb, start, end);
}
@@ -2123,6 +2124,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
+ mmu_notifier_invalidate_range(mm, 0, end);
tlb_finish_mmu(tlb, 0, end);
/*
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c
+++ linux-2.6/mm/mmu_notifier.c
@@ -99,45 +99,15 @@ int __mmu_notifier_clear_flush_young(str
return young;
}
-void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- struct mmu_notifier *mn;
- struct hlist_node *n;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
- }
- rcu_read_unlock();
-}
-
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- struct mmu_notifier *mn;
- struct hlist_node *n;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
- }
- rcu_read_unlock();
-}
-
-void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
struct mmu_notifier *mn;
struct hlist_node *n;
rcu_read_lock();
- hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
- }
+ hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist)
+ mn->ops->invalidate_range(mn, mm, start, end);
rcu_read_unlock();
}
@@ -148,6 +118,8 @@ static int do_mmu_notifier_register(stru
struct mmu_notifier_mm *mmu_notifier_mm;
int ret;
+ BUG_ON(!mn->ops);
+ BUG_ON(!mn->ops->invalidate_range);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
ret = -ENOMEM;
@@ -157,9 +129,6 @@ static int do_mmu_notifier_register(stru
if (take_mmap_sem)
down_write(&mm->mmap_sem);
- ret = mm_take_all_locks(mm);
- if (unlikely(ret))
- goto out_cleanup;
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
@@ -181,8 +150,6 @@ static int do_mmu_notifier_register(stru
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
spin_unlock(&mm->mmu_notifier_mm->lock);
- mm_drop_all_locks(mm);
-out_cleanup:
if (take_mmap_sem)
up_write(&mm->mmap_sem);
/* kfree() does nothing if mmu_notifier_mm is NULL */
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -204,12 +204,11 @@ success:
dirty_accountable = 1;
}
- mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -81,8 +81,6 @@ static void move_ptes(struct vm_area_str
unsigned long old_start;
old_start = old_addr;
- mmu_notifier_invalidate_range_start(vma->vm_mm,
- old_start, old_end);
if (vma->vm_file) {
/*
* Subtle point from Rajesh Venkatasubramanian: before
@@ -124,7 +122,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
- mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
+ mmu_notifier_invalidate_range(vma->vm_mm, old_start, old_end);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: linux-2.6/virt/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/virt/kvm/kvm_main.c
+++ linux-2.6/virt/kvm/kvm_main.c
@@ -192,13 +192,15 @@ static inline struct kvm *mmu_notifier_t
return container_of(mn, struct kvm, mmu_notifier);
}
-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int need_tlb_flush;
+ int need_tlb_flush = 0;
+ spin_lock(&kvm->mmu_lock);
/*
* When ->invalidate_page runs, the linux pte has been zapped
* already but the page is still allocated until
@@ -217,32 +219,7 @@ static void kvm_mmu_notifier_invalidate_
* pte after kvm_unmap_hva returned, without noticing the page
* is going to be freed.
*/
- spin_lock(&kvm->mmu_lock);
kvm->mmu_notifier_seq++;
- need_tlb_flush = kvm_unmap_hva(kvm, address);
- spin_unlock(&kvm->mmu_lock);
-
- /* we've to flush the tlb before the pages can be freed */
- if (need_tlb_flush)
- kvm_flush_remote_tlbs(kvm);
-
-}
-
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int need_tlb_flush = 0;
-
- spin_lock(&kvm->mmu_lock);
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_notifier_count++;
for (; start < end; start += PAGE_SIZE)
need_tlb_flush |= kvm_unmap_hva(kvm, start);
spin_unlock(&kvm->mmu_lock);
@@ -252,32 +229,6 @@ static void kvm_mmu_notifier_invalidate_
kvm_flush_remote_tlbs(kvm);
}
-static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
-
- spin_lock(&kvm->mmu_lock);
- /*
- * This sequence increase will notify the kvm page fault that
- * the page that is going to be mapped in the spte could have
- * been freed.
- */
- kvm->mmu_notifier_seq++;
- /*
- * The above sequence increase must be visible before the
- * below count decrease but both values are read by the kvm
- * page fault under mmu_lock spinlock so we don't need to add
- * a smb_wmb() here in between the two.
- */
- kvm->mmu_notifier_count--;
- spin_unlock(&kvm->mmu_lock);
-
- BUG_ON(kvm->mmu_notifier_count < 0);
-}
-
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address)
@@ -296,9 +247,7 @@ static int kvm_mmu_notifier_clear_flush_
}
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
- .invalidate_page = kvm_mmu_notifier_invalidate_page,
- .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
- .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+ .invalidate_range = kvm_mmu_notifier_invalidate_range,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
};
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
Index: linux-2.6/drivers/misc/sgi-gru/grufault.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grufault.c
+++ linux-2.6/drivers/misc/sgi-gru/grufault.c
@@ -195,74 +195,29 @@ static void get_clear_fault_map(struct g
* < 0 - error code
* 1 - (atomic only) try again in non-atomic context
*/
-static int non_atomic_pte_lookup(struct vm_area_struct *vma,
- unsigned long vaddr, int write,
+static int non_atomic_pte_lookup(unsigned long vaddr, int write,
unsigned long *paddr, int *pageshift)
{
struct page *page;
+ int ret;
/* ZZZ Need to handle HUGE pages */
- if (is_vm_hugetlb_page(vma))
- return -EFAULT;
*pageshift = PAGE_SHIFT;
- if (get_user_pages
- (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+ ret = get_user_pages(current, current->mm, vaddr, 1, write, 0,
+ &page, NULL);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
return -EFAULT;
+ if (PageCompound(page)) { /* hugepage */
+ put_page(page);
+ return -EFAULT;
+ }
+
*paddr = page_to_phys(page);
put_page(page);
- return 0;
-}
-
-/*
- *
- * atomic_pte_lookup
- *
- * Convert a user virtual address to a physical address
- * Only supports Intel large pages (2MB only) on x86_64.
- * ZZZ - hugepage support is incomplete
- */
-static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
- int write, unsigned long *paddr, int *pageshift)
-{
- pgd_t *pgdp;
- pmd_t *pmdp;
- pud_t *pudp;
- pte_t pte;
-
- WARN_ON(irqs_disabled()); /* ZZZ debug */
-
- local_irq_disable();
- pgdp = pgd_offset(vma->vm_mm, vaddr);
- if (unlikely(pgd_none(*pgdp)))
- goto err;
-
- pudp = pud_offset(pgdp, vaddr);
- if (unlikely(pud_none(*pudp)))
- goto err;
-
- pmdp = pmd_offset(pudp, vaddr);
- if (unlikely(pmd_none(*pmdp)))
- goto err;
-#ifdef CONFIG_X86_64
- if (unlikely(pmd_large(*pmdp)))
- pte = *(pte_t *) pmdp;
- else
-#endif
- pte = *pte_offset_kernel(pmdp, vaddr);
-
- local_irq_enable();
-
- if (unlikely(!pte_present(pte) ||
- (write && (!pte_write(pte) || !pte_dirty(pte)))))
- return 1;
- *paddr = pte_pfn(pte) << PAGE_SHIFT;
- *pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT;
return 0;
-
-err:
- local_irq_enable();
- return 1;
}
/*
@@ -279,9 +234,7 @@ static int gru_try_dropin(struct gru_thr
struct gru_tlb_fault_handle *tfh,
unsigned long __user *cb)
{
- struct mm_struct *mm = gts->ts_mm;
- struct vm_area_struct *vma;
- int pageshift, asid, write, ret;
+ int pageshift, asid, write;
unsigned long paddr, gpa, vaddr;
/*
@@ -298,7 +251,7 @@ static int gru_try_dropin(struct gru_thr
*/
if (tfh->state == TFHSTATE_IDLE)
goto failidle;
- if (tfh->state == TFHSTATE_MISS_FMM && cb)
+ if (tfh->state == TFHSTATE_MISS_FMM)
goto failfmm;
write = (tfh->cause & TFHCAUSE_TLB_MOD) != 0;
@@ -307,31 +260,9 @@ static int gru_try_dropin(struct gru_thr
if (asid == 0)
goto failnoasid;
- rmb(); /* TFH must be cache resident before reading ms_range_active */
-
- /*
- * TFH is cache resident - at least briefly. Fail the dropin
- * if a range invalidate is active.
- */
- if (atomic_read(&gts->ts_gms->ms_range_active))
- goto failactive;
-
- vma = find_vma(mm, vaddr);
- if (!vma)
+ if (non_atomic_pte_lookup(vaddr, write, &paddr, &pageshift))
goto failinval;
- /*
- * Atomic lookup is faster & usually works even if called in non-atomic
- * context.
- */
- ret = atomic_pte_lookup(vma, vaddr, write, &paddr, &pageshift);
- if (ret) {
- if (!cb)
- goto failupm;
- if (non_atomic_pte_lookup(vma, vaddr, write, &paddr,
- &pageshift))
- goto failinval;
- }
if (is_gru_paddr(paddr))
goto failinval;
@@ -351,19 +282,9 @@ failnoasid:
/* No asid (delayed unload). */
STAT(tlb_dropin_fail_no_asid);
gru_dbg(grudev, "FAILED no_asid tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
- if (!cb)
- tfh_user_polling_mode(tfh);
- else
- gru_flush_cache(tfh);
+ gru_flush_cache(tfh);
return -EAGAIN;
-failupm:
- /* Atomic failure switch CBR to UPM */
- tfh_user_polling_mode(tfh);
- STAT(tlb_dropin_fail_upm);
- gru_dbg(grudev, "FAILED upm tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
- return 1;
-
failfmm:
/* FMM state on UPM call */
STAT(tlb_dropin_fail_fmm);
@@ -373,8 +294,7 @@ failfmm:
failidle:
/* TFH was idle - no miss pending */
gru_flush_cache(tfh);
- if (cb)
- gru_flush_cache(cb);
+ gru_flush_cache(cb);
STAT(tlb_dropin_fail_idle);
gru_dbg(grudev, "FAILED idle tfh: 0x%p, state %d\n", tfh, tfh->state);
return 0;
@@ -385,17 +305,6 @@ failinval:
STAT(tlb_dropin_fail_invalid);
gru_dbg(grudev, "FAILED inval tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr);
return -EFAULT;
-
-failactive:
- /* Range invalidate active. Switch to UPM iff atomic */
- if (!cb)
- tfh_user_polling_mode(tfh);
- else
- gru_flush_cache(tfh);
- STAT(tlb_dropin_fail_range_active);
- gru_dbg(grudev, "FAILED range active: tfh 0x%p, vaddr 0x%lx\n",
- tfh, vaddr);
- return 1;
}
/*
@@ -408,9 +317,8 @@ irqreturn_t gru_intr(int irq, void *dev_
{
struct gru_state *gru;
struct gru_tlb_fault_map map;
- struct gru_thread_state *gts;
struct gru_tlb_fault_handle *tfh = NULL;
- int cbrnum, ctxnum;
+ int cbrnum;
STAT(intr);
@@ -434,19 +342,7 @@ irqreturn_t gru_intr(int irq, void *dev_
* The gts cannot change until a TFH start/writestart command
* is issued.
*/
- ctxnum = tfh->ctxnum;
- gts = gru->gs_gts[ctxnum];
-
- /*
- * This is running in interrupt context. Trylock the mmap_sem.
- * If it fails, retry the fault in user context.
- */
- if (down_read_trylock(&gts->ts_mm->mmap_sem)) {
- gru_try_dropin(gts, tfh, NULL);
- up_read(&gts->ts_mm->mmap_sem);
- } else {
- tfh_user_polling_mode(tfh);
- }
+ tfh_user_polling_mode(tfh);
}
return IRQ_HANDLED;
}
@@ -456,12 +352,9 @@ static int gru_user_dropin(struct gru_th
struct gru_tlb_fault_handle *tfh,
unsigned long __user *cb)
{
- struct gru_mm_struct *gms = gts->ts_gms;
int ret;
while (1) {
- wait_event(gms->ms_wait_queue,
- atomic_read(&gms->ms_range_active) == 0);
prefetchw(tfh); /* Helps on hdw, required for emulator */
ret = gru_try_dropin(gts, tfh, cb);
if (ret <= 0)
Index: linux-2.6/drivers/misc/sgi-gru/grutables.h
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutables.h
+++ linux-2.6/drivers/misc/sgi-gru/grutables.h
@@ -244,7 +244,6 @@ struct gru_mm_struct {
struct mmu_notifier ms_notifier;
atomic_t ms_refcnt;
spinlock_t ms_asid_lock; /* protects ASID assignment */
- atomic_t ms_range_active;/* num range_invals active */
char ms_released;
wait_queue_head_t ms_wait_queue;
DECLARE_BITMAP(ms_asidmap, GRU_MAX_GRUS);
Index: linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutlbpurge.c
+++ linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,41 +221,16 @@ void gru_flush_all_tlb(struct gru_state
/*
* MMUOPS notifier callout functions
*/
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end)
+static void gru_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
- STAT(mmu_invalidate_range);
- atomic_inc(&gms->ms_range_active);
- gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
- start, end, atomic_read(&gms->ms_range_active));
- gru_flush_tlb_range(gms, start, end - start);
-}
-
-static void gru_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long start,
- unsigned long end)
-{
- struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
- ms_notifier);
- atomic_dec(&gms->ms_range_active);
- wake_up_all(&gms->ms_wait_queue);
+ STAT(mmu_invalidate_range);
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
-}
-
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
-{
- struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
- ms_notifier);
-
- STAT(mmu_invalidate_page);
- gru_flush_tlb_range(gms, address, PAGE_SIZE);
- gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
+ gru_flush_tlb_range(gms, start, end - start);
}
static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -269,9 +244,7 @@ static void gru_release(struct mmu_notif
static const struct mmu_notifier_ops gru_mmuops = {
- .invalidate_page = gru_invalidate_page,
- .invalidate_range_start = gru_invalidate_range_start,
- .invalidate_range_end = gru_invalidate_range_end,
+ .invalidate_range = gru_invalidate_range,
.release = gru_release,
};
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -125,7 +125,6 @@ struct kvm {
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
struct mmu_notifier mmu_notifier;
unsigned long mmu_notifier_seq;
- long mmu_notifier_count;
#endif
};
@@ -360,15 +359,6 @@ int kvm_trace_ioctl(unsigned int ioctl,
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
{
- if (unlikely(vcpu->kvm->mmu_notifier_count))
- return 1;
- /*
- * Both reads happen under the mmu_lock and both values are
- * modified under mmu_lock, so there's no need of smb_rmb()
- * here in between, otherwise mmu_notifier_count should be
- * read before mmu_notifier_seq, see
- * mmu_notifier_invalidate_range_end write side.
- */
if (vcpu->kvm->mmu_notifier_seq != mmu_seq)
return 1;
return 0;
^ permalink raw reply [flat|nested] 26+ messages in thread* MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin @ 2008-07-24 14:39 ` Nick Piggin 2008-07-25 21:45 ` Andrea Arcangeli ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-24 14:39 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, steiner, cl Hi, I think everybody is hoping to have a workable mmu notifier scheme merged in 2.6.27 (myself included). However I do have some concerns about the implementation proposed (in -mm). I apologise for this late review, before anybody gets too upset, most of my concerns have been raised before, but I'd like to state my case again and involving everyone. So my concerns. Firstly, mm_take_all_locks I dislike. Although it isn't too complex and quite well contained, it unfortunately somewhat ties our hands a bit when it comes to changing the fastpath locking. For exmple, it might make it difficult to remove the locks from rmap walks now (not that there aren't plenty of other difficulties, but as an example). I also think we should be cautious to make this slowpath so horribly slow. For KVM it's fine, but perhaps for GRU we'd actually want to register quite a lot of notifiers, unregister them, and perhaps even do per-vma registrations if we don't want to slow down other memory management operations too much (eg. if it is otherwise just some regular processes just doing accelerated things with GRU). I think it is possible to come up with a pretty good implementation that replaces mm_take_all_locks with something faster and using less locks, I haven't quite got a detailed sketch yet, but if anyone is interested, let me know. However, I think the whole reason for take_all_locks is to support the invalidate_range_begin/invalidate_range_end callbacks. I think these are slightly odd because they introduce a 2nd design of TLB shootdown scheme to the kernel: presently we downgrade pte permissions, and then shootdown TLBs, and then free any pages; the range callbacks first prevent new TLBs being set up, then shoot down existing TLBs, then downgrade ptes, then free pages. That's not to say this style will not work, but I prefer to keep to our existing scheme. Obviously it is better proven, and also it is good to have fewer if possible. What are the pros of the new scheme? Well performance AFAIKS. Large unmappings may not fit in the TLB gather API (and some sites do not use TLB gather at all), but with the invalidate_range_begin/invalidate_range_end we can still flush the entire range in 1 call. I would counter that our existing CPU flushing has held up fairly well, and it can also be pretty expensive if it has to do lots of IPIs. I don't have a problem with trying to maximise performance, but I really feel this is one place where I'd prefer to see numbers first. One functional con of the invalidate_range_begin/invalidate_range_end style I see is that the implementations hold off new TLB insertions by incrementing a counter to hold them off. So actually poor parallelism or starvation could become an issue with multiple threads. So anyway, I've attached a patch switching mmu notifiers to the our traditional style of TLB invalidation, and attempted to convert KVM and GRU to use it. Also keep in mind that after this the mm_take_all_locks patches should not be needed. 
Unfortunately I was not able to actually test this (but testing typically would not find the rally subtle problems easily anyway) --- drivers/misc/sgi-gru/grufault.c | 145 ++++--------------------------------- drivers/misc/sgi-gru/grutables.h | 1 drivers/misc/sgi-gru/grutlbpurge.c | 37 +-------- include/linux/kvm_host.h | 10 -- include/linux/mmu_notifier.h | 97 ++---------------------- mm/fremap.c | 4 - mm/hugetlb.c | 3 mm/memory.c | 18 ++-- mm/mmap.c | 2 mm/mmu_notifier.c | 43 +--------- mm/mprotect.c | 3 mm/mremap.c | 4 - virt/kvm/kvm_main.c | 65 +--------------- 13 files changed, 61 insertions(+), 371 deletions(-) Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- linux-2.6.orig/include/linux/mmu_notifier.h +++ linux-2.6/include/linux/mmu_notifier.h @@ -65,60 +65,12 @@ struct mmu_notifier_ops { * Before this is invoked any secondary MMU is still ok to * read/write to the page previously pointed to by the Linux * pte because the page hasn't been freed yet and it won't be - * freed until this returns. If required set_page_dirty has to - * be called internally to this method. + * freed until this returns. Also, importantly the address can't + * be repopulated with some other page, so all MMUs retain a + * coherent view of memory. If required set_page_dirty has to + * be called internally to this method. Not optional. */ - void (*invalidate_page)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long address); - - /* - * invalidate_range_start() and invalidate_range_end() must be - * paired and are called only when the mmap_sem and/or the - * locks protecting the reverse maps are held. The subsystem - * must guarantee that no additional references are taken to - * the pages in the range established between the call to - * invalidate_range_start() and the matching call to - * invalidate_range_end(). - * - * Invalidation of multiple concurrent ranges may be - * optionally permitted by the driver. Either way the - * establishment of sptes is forbidden in the range passed to - * invalidate_range_begin/end for the whole duration of the - * invalidate_range_begin/end critical section. - * - * invalidate_range_start() is called when all pages in the - * range are still mapped and have at least a refcount of one. - * - * invalidate_range_end() is called when all pages in the - * range have been unmapped and the pages have been freed by - * the VM. - * - * The VM will remove the page table entries and potentially - * the page between invalidate_range_start() and - * invalidate_range_end(). If the page must not be freed - * because of pending I/O or other circumstances then the - * invalidate_range_start() callback (or the initial mapping - * by the driver) must make sure that the refcount is kept - * elevated. - * - * If the driver increases the refcount when the pages are - * initially mapped into an address space then either - * invalidate_range_start() or invalidate_range_end() may - * decrease the refcount. If the refcount is decreased on - * invalidate_range_start() then the VM can free pages as page - * table entries are removed. If the refcount is only - * droppped on invalidate_range_end() then the driver itself - * will drop the last refcount but it must take care to flush - * any secondary tlb before doing the final free on the - * page. Pages will no longer be referenced by the linux - * address space but may still be referenced by sptes until - * the last refcount is dropped. 
- */ - void (*invalidate_range_start)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end); - void (*invalidate_range_end)(struct mmu_notifier *mn, + void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); }; @@ -154,11 +106,7 @@ extern void __mmu_notifier_mm_destroy(st extern void __mmu_notifier_release(struct mm_struct *mm); extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long address); -extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address); -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end); -extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, +extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); static inline void mmu_notifier_release(struct mm_struct *mm) @@ -175,25 +123,11 @@ static inline int mmu_notifier_clear_flu return 0; } -static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_page(mm, address); -} - -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, +static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end); -} - -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end); + __mmu_notifier_invalidate_range(mm, start, end); } static inline void mmu_notifier_mm_init(struct mm_struct *mm) @@ -221,7 +155,8 @@ static inline void mmu_notifier_mm_destr struct vm_area_struct *___vma = __vma; \ unsigned long ___address = __address; \ __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + mmu_notifier_invalidate_range(___vma->vm_mm, ___address, \ + ____address + PAGE_SIZE); \ __pte; \ }) @@ -248,17 +183,7 @@ static inline int mmu_notifier_clear_flu return 0; } -static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ -} - -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ -} - -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, +static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { } Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c +++ linux-2.6/mm/fremap.c @@ -32,7 +32,7 @@ static void zap_pte(struct mm_struct *mm struct page *page; flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); + pte = ptep_clear_flush_notify(vma, addr, ptep); page = vm_normal_page(vma, addr, pte); if (page) { if (pte_dirty(pte)) @@ -226,9 +226,7 @@ asmlinkage long sys_remap_file_pages(uns vma->vm_flags = saved_flags; } - mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); - mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (vma->vm_flags & VM_LOCKED) { /* Index: linux-2.6/mm/hugetlb.c 
=================================================================== --- linux-2.6.orig/mm/hugetlb.c +++ linux-2.6/mm/hugetlb.c @@ -1683,7 +1683,6 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~huge_page_mask(h)); BUG_ON(end & ~huge_page_mask(h)); - mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += sz) { ptep = huge_pte_offset(mm, address); @@ -1725,7 +1724,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -677,9 +677,6 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - if (is_cow_mapping(vma->vm_flags)) - mmu_notifier_invalidate_range_start(src_mm, addr, end); - ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); @@ -695,8 +692,8 @@ int copy_page_range(struct mm_struct *ds } while (dst_pgd++, src_pgd++, addr = next, addr != end); if (is_cow_mapping(vma->vm_flags)) - mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + mmu_notifier_invalidate_range(src_mm, vma->vm_start, end); + return ret; } @@ -903,7 +900,6 @@ unsigned long unmap_vmas(struct mmu_gath int fullmm = (*tlbp)->fullmm; struct mm_struct *mm = vma->vm_mm; - mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -951,6 +947,7 @@ unsigned long unmap_vmas(struct mmu_gath break; } + mmu_notifier_invalidate_range(mm, tlb_start, start); tlb_finish_mmu(*tlbp, tlb_start, start); if (need_resched() || @@ -968,7 +965,6 @@ unsigned long unmap_vmas(struct mmu_gath } } out: - mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -991,8 +987,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { + mmu_notifier_invalidate_range(mm, address, end); tlb_finish_mmu(tlb, address, end); + } return end; } EXPORT_SYMBOL_GPL(zap_page_range); @@ -1668,7 +1666,6 @@ int apply_to_page_range(struct mm_struct int err; BUG_ON(addr >= end); - mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1676,7 +1673,8 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); + return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c +++ linux-2.6/mm/mmap.c @@ -1794,6 +1794,7 @@ static void unmap_region(struct mm_struc vm_unacct_memory(nr_accounted); free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); + mmu_notifier_invalidate_range(mm, start, end); tlb_finish_mmu(tlb, start, end); } @@ -2123,6 +2124,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0); + mmu_notifier_invalidate_range(mm, 0, end); tlb_finish_mmu(tlb, 0, end); /* Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c +++ linux-2.6/mm/mmu_notifier.c @@ -99,45 +99,15 @@ int __mmu_notifier_clear_flush_young(str return young; } -void __mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ - struct mmu_notifier *mn; - struct hlist_node *n; - - rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_page) - mn->ops->invalidate_page(mn, mm, address); - } - rcu_read_unlock(); -} - -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ - struct mmu_notifier *mn; - struct hlist_node *n; - - rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_range_start) - mn->ops->invalidate_range_start(mn, mm, start, end); - } - rcu_read_unlock(); -} - -void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, +void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) { struct mmu_notifier *mn; struct hlist_node *n; rcu_read_lock(); - hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_range_end) - mn->ops->invalidate_range_end(mn, mm, start, end); - } + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) + mn->ops->invalidate_range(mn, mm, start, end); rcu_read_unlock(); } @@ -148,6 +118,8 @@ static int do_mmu_notifier_register(stru struct mmu_notifier_mm *mmu_notifier_mm; int ret; + BUG_ON(!mn->ops); + BUG_ON(!mn->ops->invalidate_range); BUG_ON(atomic_read(&mm->mm_users) <= 0); ret = -ENOMEM; @@ -157,9 +129,6 @@ static int do_mmu_notifier_register(stru if (take_mmap_sem) down_write(&mm->mmap_sem); - ret = mm_take_all_locks(mm); - if (unlikely(ret)) - goto out_cleanup; if (!mm_has_notifiers(mm)) { INIT_HLIST_HEAD(&mmu_notifier_mm->list); @@ -181,8 +150,6 @@ static int do_mmu_notifier_register(stru hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); spin_unlock(&mm->mmu_notifier_mm->lock); - mm_drop_all_locks(mm); -out_cleanup: if (take_mmap_sem) up_write(&mm->mmap_sem); /* kfree() does nothing if mmu_notifier_mm is NULL */ Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c +++ linux-2.6/mm/mprotect.c @@ -204,12 +204,11 @@ success: dirty_accountable = 1; } - mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); - mmu_notifier_invalidate_range_end(mm, start, end); + mmu_notifier_invalidate_range(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c +++ linux-2.6/mm/mremap.c @@ -81,8 +81,6 @@ static void move_ptes(struct vm_area_str unsigned long old_start; old_start = 
old_addr; - mmu_notifier_invalidate_range_start(vma->vm_mm, - old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -124,7 +122,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); - mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); + mmu_notifier_invalidate_range(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/virt/kvm/kvm_main.c =================================================================== --- linux-2.6.orig/virt/kvm/kvm_main.c +++ linux-2.6/virt/kvm/kvm_main.c @@ -192,13 +192,15 @@ static inline struct kvm *mmu_notifier_t return container_of(mn, struct kvm, mmu_notifier); } -static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long address) +static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) { struct kvm *kvm = mmu_notifier_to_kvm(mn); - int need_tlb_flush; + int need_tlb_flush = 0; + spin_lock(&kvm->mmu_lock); /* * When ->invalidate_page runs, the linux pte has been zapped * already but the page is still allocated until @@ -217,32 +219,7 @@ static void kvm_mmu_notifier_invalidate_ * pte after kvm_unmap_hva returned, without noticing the page * is going to be freed. */ - spin_lock(&kvm->mmu_lock); kvm->mmu_notifier_seq++; - need_tlb_flush = kvm_unmap_hva(kvm, address); - spin_unlock(&kvm->mmu_lock); - - /* we've to flush the tlb before the pages can be freed */ - if (need_tlb_flush) - kvm_flush_remote_tlbs(kvm); - -} - -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - struct kvm *kvm = mmu_notifier_to_kvm(mn); - int need_tlb_flush = 0; - - spin_lock(&kvm->mmu_lock); - /* - * The count increase must become visible at unlock time as no - * spte can be established without taking the mmu_lock and - * count is also read inside the mmu_lock critical section. - */ - kvm->mmu_notifier_count++; for (; start < end; start += PAGE_SIZE) need_tlb_flush |= kvm_unmap_hva(kvm, start); spin_unlock(&kvm->mmu_lock); @@ -252,32 +229,6 @@ static void kvm_mmu_notifier_invalidate_ kvm_flush_remote_tlbs(kvm); } -static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - struct kvm *kvm = mmu_notifier_to_kvm(mn); - - spin_lock(&kvm->mmu_lock); - /* - * This sequence increase will notify the kvm page fault that - * the page that is going to be mapped in the spte could have - * been freed. - */ - kvm->mmu_notifier_seq++; - /* - * The above sequence increase must be visible before the - * below count decrease but both values are read by the kvm - * page fault under mmu_lock spinlock so we don't need to add - * a smb_wmb() here in between the two. 
- */ - kvm->mmu_notifier_count--; - spin_unlock(&kvm->mmu_lock); - - BUG_ON(kvm->mmu_notifier_count < 0); -} - static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long address) @@ -296,9 +247,7 @@ static int kvm_mmu_notifier_clear_flush_ } static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { - .invalidate_page = kvm_mmu_notifier_invalidate_page, - .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start, - .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .invalidate_range = kvm_mmu_notifier_invalidate_range, .clear_flush_young = kvm_mmu_notifier_clear_flush_young, }; #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ Index: linux-2.6/drivers/misc/sgi-gru/grufault.c =================================================================== --- linux-2.6.orig/drivers/misc/sgi-gru/grufault.c +++ linux-2.6/drivers/misc/sgi-gru/grufault.c @@ -195,74 +195,29 @@ static void get_clear_fault_map(struct g * < 0 - error code * 1 - (atomic only) try again in non-atomic context */ -static int non_atomic_pte_lookup(struct vm_area_struct *vma, - unsigned long vaddr, int write, +static int non_atomic_pte_lookup(unsigned long vaddr, int write, unsigned long *paddr, int *pageshift) { struct page *page; + int ret; /* ZZZ Need to handle HUGE pages */ - if (is_vm_hugetlb_page(vma)) - return -EFAULT; *pageshift = PAGE_SHIFT; - if (get_user_pages - (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0) + ret = get_user_pages(current, current->mm, vaddr, 1, write, 0, + &page, NULL); + if (ret < 0) + return ret; + if (ret == 0) return -EFAULT; + if (PageCompound(page)) { /* hugepage */ + put_page(page); + return -EFAULT; + } + *paddr = page_to_phys(page); put_page(page); - return 0; -} - -/* - * - * atomic_pte_lookup - * - * Convert a user virtual address to a physical address - * Only supports Intel large pages (2MB only) on x86_64. - * ZZZ - hugepage support is incomplete - */ -static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr, - int write, unsigned long *paddr, int *pageshift) -{ - pgd_t *pgdp; - pmd_t *pmdp; - pud_t *pudp; - pte_t pte; - - WARN_ON(irqs_disabled()); /* ZZZ debug */ - - local_irq_disable(); - pgdp = pgd_offset(vma->vm_mm, vaddr); - if (unlikely(pgd_none(*pgdp))) - goto err; - - pudp = pud_offset(pgdp, vaddr); - if (unlikely(pud_none(*pudp))) - goto err; - - pmdp = pmd_offset(pudp, vaddr); - if (unlikely(pmd_none(*pmdp))) - goto err; -#ifdef CONFIG_X86_64 - if (unlikely(pmd_large(*pmdp))) - pte = *(pte_t *) pmdp; - else -#endif - pte = *pte_offset_kernel(pmdp, vaddr); - - local_irq_enable(); - - if (unlikely(!pte_present(pte) || - (write && (!pte_write(pte) || !pte_dirty(pte))))) - return 1; - *paddr = pte_pfn(pte) << PAGE_SHIFT; - *pageshift = is_vm_hugetlb_page(vma) ? 
HPAGE_SHIFT : PAGE_SHIFT; return 0; - -err: - local_irq_enable(); - return 1; } /* @@ -279,9 +234,7 @@ static int gru_try_dropin(struct gru_thr struct gru_tlb_fault_handle *tfh, unsigned long __user *cb) { - struct mm_struct *mm = gts->ts_mm; - struct vm_area_struct *vma; - int pageshift, asid, write, ret; + int pageshift, asid, write; unsigned long paddr, gpa, vaddr; /* @@ -298,7 +251,7 @@ static int gru_try_dropin(struct gru_thr */ if (tfh->state == TFHSTATE_IDLE) goto failidle; - if (tfh->state == TFHSTATE_MISS_FMM && cb) + if (tfh->state == TFHSTATE_MISS_FMM) goto failfmm; write = (tfh->cause & TFHCAUSE_TLB_MOD) != 0; @@ -307,31 +260,9 @@ static int gru_try_dropin(struct gru_thr if (asid == 0) goto failnoasid; - rmb(); /* TFH must be cache resident before reading ms_range_active */ - - /* - * TFH is cache resident - at least briefly. Fail the dropin - * if a range invalidate is active. - */ - if (atomic_read(>s->ts_gms->ms_range_active)) - goto failactive; - - vma = find_vma(mm, vaddr); - if (!vma) + if (non_atomic_pte_lookup(vaddr, write, &paddr, &pageshift)) goto failinval; - /* - * Atomic lookup is faster & usually works even if called in non-atomic - * context. - */ - ret = atomic_pte_lookup(vma, vaddr, write, &paddr, &pageshift); - if (ret) { - if (!cb) - goto failupm; - if (non_atomic_pte_lookup(vma, vaddr, write, &paddr, - &pageshift)) - goto failinval; - } if (is_gru_paddr(paddr)) goto failinval; @@ -351,19 +282,9 @@ failnoasid: /* No asid (delayed unload). */ STAT(tlb_dropin_fail_no_asid); gru_dbg(grudev, "FAILED no_asid tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); - if (!cb) - tfh_user_polling_mode(tfh); - else - gru_flush_cache(tfh); + gru_flush_cache(tfh); return -EAGAIN; -failupm: - /* Atomic failure switch CBR to UPM */ - tfh_user_polling_mode(tfh); - STAT(tlb_dropin_fail_upm); - gru_dbg(grudev, "FAILED upm tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); - return 1; - failfmm: /* FMM state on UPM call */ STAT(tlb_dropin_fail_fmm); @@ -373,8 +294,7 @@ failfmm: failidle: /* TFH was idle - no miss pending */ gru_flush_cache(tfh); - if (cb) - gru_flush_cache(cb); + gru_flush_cache(cb); STAT(tlb_dropin_fail_idle); gru_dbg(grudev, "FAILED idle tfh: 0x%p, state %d\n", tfh, tfh->state); return 0; @@ -385,17 +305,6 @@ failinval: STAT(tlb_dropin_fail_invalid); gru_dbg(grudev, "FAILED inval tfh: 0x%p, vaddr 0x%lx\n", tfh, vaddr); return -EFAULT; - -failactive: - /* Range invalidate active. Switch to UPM iff atomic */ - if (!cb) - tfh_user_polling_mode(tfh); - else - gru_flush_cache(tfh); - STAT(tlb_dropin_fail_range_active); - gru_dbg(grudev, "FAILED range active: tfh 0x%p, vaddr 0x%lx\n", - tfh, vaddr); - return 1; } /* @@ -408,9 +317,8 @@ irqreturn_t gru_intr(int irq, void *dev_ { struct gru_state *gru; struct gru_tlb_fault_map map; - struct gru_thread_state *gts; struct gru_tlb_fault_handle *tfh = NULL; - int cbrnum, ctxnum; + int cbrnum; STAT(intr); @@ -434,19 +342,7 @@ irqreturn_t gru_intr(int irq, void *dev_ * The gts cannot change until a TFH start/writestart command * is issued. */ - ctxnum = tfh->ctxnum; - gts = gru->gs_gts[ctxnum]; - - /* - * This is running in interrupt context. Trylock the mmap_sem. - * If it fails, retry the fault in user context. 
-	 */
-	if (down_read_trylock(&gts->ts_mm->mmap_sem)) {
-		gru_try_dropin(gts, tfh, NULL);
-		up_read(&gts->ts_mm->mmap_sem);
-	} else {
-		tfh_user_polling_mode(tfh);
-	}
+	tfh_user_polling_mode(tfh);
 	}
 	return IRQ_HANDLED;
 }
@@ -456,12 +352,9 @@ static int gru_user_dropin(struct gru_th
 				    struct gru_tlb_fault_handle *tfh,
 				    unsigned long __user *cb)
 {
-	struct gru_mm_struct *gms = gts->ts_gms;
 	int ret;
 	while (1) {
-		wait_event(gms->ms_wait_queue,
-			   atomic_read(&gms->ms_range_active) == 0);
 		prefetchw(tfh);	/* Helps on hdw, required for emulator */
 		ret = gru_try_dropin(gts, tfh, cb);
 		if (ret <= 0)
Index: linux-2.6/drivers/misc/sgi-gru/grutables.h
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutables.h
+++ linux-2.6/drivers/misc/sgi-gru/grutables.h
@@ -244,7 +244,6 @@ struct gru_mm_struct {
 	struct mmu_notifier	ms_notifier;
 	atomic_t		ms_refcnt;
 	spinlock_t		ms_asid_lock;	/* protects ASID assignment */
-	atomic_t		ms_range_active;/* num range_invals active */
 	char			ms_released;
 	wait_queue_head_t	ms_wait_queue;
 	DECLARE_BITMAP(ms_asidmap, GRU_MAX_GRUS);
Index: linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
===================================================================
--- linux-2.6.orig/drivers/misc/sgi-gru/grutlbpurge.c
+++ linux-2.6/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,41 +221,16 @@ void gru_flush_all_tlb(struct gru_state
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
-				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+static void gru_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
-	STAT(mmu_invalidate_range);
-	atomic_inc(&gms->ms_range_active);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
-}
-
-static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-	atomic_dec(&gms->ms_range_active);
-	wake_up_all(&gms->ms_wait_queue);
+	STAT(mmu_invalidate_range);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
-}
-
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-
-	STAT(mmu_invalidate_page);
-	gru_flush_tlb_range(gms, address, PAGE_SIZE);
-	gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
+	gru_flush_tlb_range(gms, start, end - start);
 }
 
 static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -269,9 +244,7 @@ static void gru_release(struct gru_notif
 static const struct mmu_notifier_ops gru_mmuops = {
-	.invalidate_page	= gru_invalidate_page,
-	.invalidate_range_start	= gru_invalidate_range_start,
-	.invalidate_range_end	= gru_invalidate_range_end,
+	.invalidate_range	= gru_invalidate_range,
 	.release		= gru_release,
 };
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -125,7 +125,6 @@ struct kvm {
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_notifier_seq;
-	long mmu_notifier_count;
 #endif
 };
 
@@ -360,15 +359,6 @@ int kvm_trace_ioctl(unsigned int ioctl,
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
 {
-	if (unlikely(vcpu->kvm->mmu_notifier_count))
-		return 1;
-	/*
-	 * Both reads happen under the mmu_lock and both values are
-	 * modified under mmu_lock, so there's no need of smb_rmb()
-	 * here in between, otherwise mmu_notifier_count should be
-	 * read before mmu_notifier_seq, see
-	 * mmu_notifier_invalidate_range_end write side.
-	 */
 	if (vcpu->kvm->mmu_notifier_seq != mmu_seq)
 		return 1;
 	return 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread
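For readers skimming the patch, this is the shape a consumer takes under the single-callback style used above. It is an illustrative sketch only: the foo_* names and the foo_flush_shadow_tlb() helper are invented, and registration is assumed to keep the same form as the existing mmu_notifier_register().

/*
 * Illustrative sketch only; foo_* names are made up, not a real driver.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct foo_ctx {
	struct mmu_notifier notifier;
	/* driver-private shadow TLB state would live here */
};

/* hypothetical driver helper, provided elsewhere */
static void foo_flush_shadow_tlb(struct foo_ctx *ctx, unsigned long start,
				 unsigned long end);

static void foo_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	struct foo_ctx *ctx = container_of(mn, struct foo_ctx, notifier);

	/*
	 * Called after the linux ptes for [start, end) have been cleared
	 * and alongside the CPU TLB shootdown, before the pages are freed:
	 * drop the secondary-MMU translations for the range.
	 */
	foo_flush_shadow_tlb(ctx, start, end);
}

static void foo_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	foo_flush_shadow_tlb(container_of(mn, struct foo_ctx, notifier),
			     0, TASK_SIZE);
}

static const struct mmu_notifier_ops foo_mmuops = {
	.invalidate_range	= foo_invalidate_range,
	.release		= foo_release,
};

static int foo_open(struct foo_ctx *ctx)
{
	ctx->notifier.ops = &foo_mmuops;
	return mmu_notifier_register(&ctx->notifier, current->mm);
}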
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin 2008-07-24 14:39 ` Nick Piggin @ 2008-07-25 21:45 ` Andrea Arcangeli 2008-07-26 3:08 ` Nick Piggin 2008-07-25 23:29 ` Jack Steiner 2008-07-27 9:45 ` Andrew Morton 3 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-25 21:45 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl Hi Nick, On Thu, Jul 24, 2008 at 04:39:49PM +0200, Nick Piggin wrote: > I think everybody is hoping to have a workable mmu notifier scheme > merged in 2.6.27 (myself included). However I do have some concerns Just to be clear, I'm waiting mmu notifiers to showup on Linus's tree before commenting this as it was all partly covered in the past discussions anyway, so there's nothing really urgent or new here (at least for me ;). It's a tradeoff, you pointed out the positive sides and negative point of both approaches, and depending which kind of the trade you're interested about, you'll prefer one or the other approach. Your preference is the exact opposite of what SGI liked and what we liked. But all works for us, and all works for GRU (though -mm is faster for the fast path), but only -mm can be easily later extended for XPMEM/IB if they ever decide to schedule in the mmu notifier methods in the future (which may never happen and it's unrelated to the current patches that don't contemplate sleeping at all and it's pure luck they can be trivially extended to provide for it). As your patch shown the changes are fairly small anyway if we later decide to change in 2.6.28-rc, in the meantime current code in -mm was heavily tested and all code including kvm and gru has been tested only with this, and this combined with the fact -mm guarantees the fastest fast path, I hope we leave any discussion to the 2.6.28-rc merge window, now it's time to go productive finally! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-25 21:45 ` Andrea Arcangeli @ 2008-07-26 3:08 ` Nick Piggin 2008-07-26 11:38 ` Andrea Arcangeli 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 3:08 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Fri, Jul 25, 2008 at 11:45:52PM +0200, Andrea Arcangeli wrote: > Hi Nick, > > On Thu, Jul 24, 2008 at 04:39:49PM +0200, Nick Piggin wrote: > > I think everybody is hoping to have a workable mmu notifier scheme > > merged in 2.6.27 (myself included). However I do have some concerns > > Just to be clear, I'm waiting mmu notifiers to showup on Linus's tree > before commenting this as it was all partly covered in the past > discussions anyway, so there's nothing really urgent or new here (at > least for me ;). Well I just was never completely satisfied with how that turned out. There was an assertion that invalidate range begin/end were the right way to go because performance would be too bad using the traditional mmu gather / range flushing. Now that I actually had the GRU and KVM code to look at, I must say I would be very surprised if performance is too bad. Given that, I would have thought the logical way to go would be to start with the "minimal" type of notifiers I proposed now, and then get some numbers together to try to support the start/end scheme. If there is some real performance numbers or something that I missed, please let me know. > It's a tradeoff, you pointed out the positive sides and negative point > of both approaches, and depending which kind of the trade you're > interested about, you'll prefer one or the other approach. Your > preference is the exact opposite of what SGI liked and what we > liked. But all works for us, and all works for GRU (though -mm is > faster for the fast path), but only -mm can be easily later extended > for XPMEM/IB if they ever decide to schedule in the mmu notifier > methods in the future (which may never happen and it's unrelated to > the current patches that don't contemplate sleeping at all and it's > pure luck they can be trivially extended to provide for it). > > As your patch shown the changes are fairly small anyway if we later > decide to change in 2.6.28-rc, in the meantime current code in -mm was OK. The significant thing from my POV is that it doesn't need to introduce the take-all-locks code. > heavily tested and all code including kvm and gru has been tested only > with this, and this combined with the fact -mm guarantees the fastest > fast path, I hope we leave any discussion to the 2.6.28-rc merge > window, now it's time to go productive finally! Well everybody knows I would prefer to merge the minimal notifiers now, and look at more complex things as we get results to see if they are required (I doubt anybody is going to help me get numbers to justify going the other way ;)). I think for the core VM, minimal notifiers are basically trivial, and their consumers are more peripheral code so I don't think it would go against merge standards. Anyway, I just voice my opinion and let Andrew and Linus decide. To be clear: I have not found any actual bugs in Andrea's -mm patchset, only some dislikes of the approach. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-26 3:08 ` Nick Piggin
@ 2008-07-26 11:38 ` Andrea Arcangeli
2008-07-26 12:28 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-26 11:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote:
> Well I just was never completely satisfied with how that turned out.
> There was an assertion that invalidate range begin/end were the right
> way to go because performance would be too bad using the traditional
> mmu gather / range flushing. Now that I actually had the GRU and KVM
> code to look at, I must say I would be very surprised if performance
> is too bad. Given that, I would have thought the logical way to go
> would be to start with the "minimal" type of notifiers I proposed now,
> and then get some numbers together to try to support the start/end
> scheme.

You know, back in Feb I started with the minimal type of notifiers
you're suggesting, but it was turned down.

The advantages of the current -mm approach are:

1) Absolute minimal intrusion into the kernel common code, and the
   absolute minimum number of branches added to the kernel fast
   paths. The kernel is faster than with your "minimal" type of
   notifiers when they're disarmed. In fact nobody could care less
   about the performance of the mmu notifiers themselves: both will
   work fast enough, but I want the approach that bloats the main
   kernel less, and -mm reaches this objective fine.

2) No need to rewrite the tlb-gather logic, which in any case would
   not allow scheduling inside the mmu notifiers later if a user
   requiring scheduling ever shows up (more flexible approach), and it
   would also need to become per-mm and not per-cpu.

3) Maximal reduction of IPI flood during vma teardown. You can't
   possibly hold off the CPU primary-mmu tlb miss handler, but we can
   hold off the secondary-mmu page-fault/tlb-miss handler, and doing
   so we can run a single IPI for an unlimited amount of address space
   teardown.

Disadvantages:

1) mm_take_all_locks is required to register

No other disadvantage.

There is no problem with holding off the secondary mmu page fault: a
few threads may spin, but signals are handled the whole time, and the
page fault doesn't loop, it returns and is retried. So you can shoot
yourself in the foot (with your own threads stepping on each other's
toes) and that's all; there's no risk of starvation or anything like
that.

> If there is some real performance numbers or something that I missed,
> please let me know.

Producing performance numbers for KVM isn't really possible, because
either one will work fine; kvm never mangles the address space, with
the exception of madvise for ballooning, which is going to perform
well either way. invalidate_page is run most of the time in KVM
runtime, never invalidate_range_start/end, so there would be no
difference.

> I think for the core VM, minimal notifiers are basically trivial, and

Your minimal notifiers are a whole lot more complex, as they require
rewriting the tlb-gather in all archs. Go ahead, it won't be ready for
2.6.27, be sure...

Furthermore, despite rewriting tlb-gather, they still have all the
disadvantages mentioned above.

What is possible is to have a minimal notifier that adds a branch for
every pte teardown; that's easily done, I did that back in Feb, and it
leaves the tlb-gather optimization for later, but I still think using
tlb-gather when we can do better, and we already do better, is wrong.

> Anyway, I just voice my opinion and let Andrew and Linus decide. To be
> clear: I have not found any actual bugs in Andrea's -mm patchset, only
> some dislikes of the approach.

Yes, like I said, I think this is a matter of taste, of which side of
the tradeoff you care about. There are disadvantages and advantages in
both, and if we wait forever to please everyone's taste, it'll never go
in.

My feeling is that what is in -mm is better and it will stay for long,
because it fully exploits the ability we have to hold off and replay
the secondary mmu page fault (the equivalent of the primary-mmu tlb
miss), something we can't do with the primary mmu tlb miss, which
forces us to implement something as complex (and IPI-flooding) as the
tlb-gather logic. And the result, besides being theoretically faster in
the fast path both when armed and disarmed, is also simpler than
plugging mmu notifier invalidate_pages inside tlb-gather. So I think
it's a better tradeoff.

^ permalink raw reply	[flat|nested] 26+ messages in thread
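For anyone without the -mm tree at hand, the hold-off Andrea describes has roughly this shape in the GRU driver. This is paraphrased from the code that the patch at the top of the thread removes (field names kept, functions abbreviated and renamed), not a verbatim quote:

/* Paraphrased from the -mm GRU code removed by the patch above. */
static void gru_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long start, unsigned long end)
{
	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
						 ms_notifier);

	atomic_inc(&gms->ms_range_active);	/* hold off new dropins */
	gru_flush_tlb_range(gms, start, end - start);
}

static void gru_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long start, unsigned long end)
{
	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
						 ms_notifier);

	atomic_dec(&gms->ms_range_active);
	wake_up_all(&gms->ms_wait_queue);	/* let the fault side retry */
}

/* ...and the secondary-MMU fault side waits for the window to close: */
static int gru_fault_dropin(struct gru_thread_state *gts,
			    struct gru_tlb_fault_handle *tfh,
			    unsigned long __user *cb)
{
	struct gru_mm_struct *gms = gts->ts_gms;

	wait_event(gms->ms_wait_queue,
		   atomic_read(&gms->ms_range_active) == 0);
	return gru_try_dropin(gts, tfh, cb);
}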
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli @ 2008-07-26 12:28 ` Nick Piggin 2008-07-26 13:02 ` Andrea Arcangeli 2008-07-26 12:33 ` Nick Piggin 2008-07-26 13:04 ` Nick Piggin 2 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 12:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote: > > Well I just was never completely satisfied with how that turned out. > > There was an assertion that invalidate range begin/end were the right > > way to go because performance would be too bad using the traditional > > mmu gather / range flushing. Now that I actually had the GRU and KVM > > code to look at, I must say I would be very surprised if performance > > is too bad. Given that, I would have thought the logical way to go > > would be to start with the "minimal" type of notifiers I proposed now, > > and then get some numbers together to try to support the start/end > > scheme. > > You know back in Feb I started with the minimal type of notifiers > you're suggesting but it was turned down. Yes, but I don't think it was turned down for good reasons. It was turned down solely because of performance AFAIKS, but there were not even any numbers to back this up and I think it most situations where KVM and GRU are used, it should not be measurable. > The advantages of the current -mm approach are: > > 1) absolute minimal intrusion into the kernel common code, and > absolute minimum number of branches added to the kernel fast > paths. Kernel is faster than your "minimal" type of notifiers when > they're disarmed. Nobody can care less about the performance of mmu > notifiers infact, both will work fast enough, but I want the > approach that bloats the main kernel less, and -mm reaches this > objective fine. I claim opposite. There is no mm_take_all_locks in my approach, and the TLB flushing design mirrors that which we use for CPU TLB flushing. Both these points make the -mm design much more intrusive to kernel common code. As for notifiers disabled speed, I can't see how -mm would be faster. Sometimes -mm will result in fewer notifier calls eg. due to huge unmaps. Othertimes my design will be fewer because it only makes a single range flush call rather than a start/end pair. How do you think they are slower? > 2) No need to rewrite the tlb-gather logic, which in any case would > not allow to schedule inside the mmu notifiers later if an user > requiring scheduling ever showup (more flexible approach), and it > would also need to become per-mm and not per-cpu. I sent a patch in my first email. I don't see this as rewriting the TLB gather logic at all. Or is there a hole in my design? Hmm, I missed putting the notification in tlb_remove_page when the gather fills up, but I don't think that classes as a redesign of the tlb-gather logic to put a callback in there. > 3) Maximal reduction of IPI flood during vma teardown. You can't > possibly hold off the CPU primary-mmu tlb miss handler, but we can You definitely can on many architectures except x86. None has seemed to require the need as yet. > hold off the secondary-mmu page-fautl/tlb-miss handler, and doing > so we can run a single IPI for an unlimited amount of address space > teardown. That is true, but 1) most unmaps will be either fairly small or whole-program. 
In both cases, minimal notifiers should be just as good. > Disavantages: > > 1) mm_take_all_locks is required to register > > No other disavantage. 2) new tlb flushing design 3) livelock/starvation problem with TLB holdoff > There is no problem with holding off the secondary mmu page fault, a > few threads may spin but signals are handled the whole time, the page > fault doesn't loop it returns and it is being retried. So you can > shoot yourself in the foot (with your own threads stepping on each > other toes) and that's all, there's no risk of starvation or anything > like that. That's not shooting yourself in the foot if you are forced into the design. Definitely there can be indefinite starvation because there is no notion of queueing on the range_active count. > > If there is some real performance numbers or something that I missed, > > please let me know. > > Producing performance numbers for KVM isn't really possible, because > either one will work fine, kvm never mangles the address space with > the exception of madvise with ballooning which is going to perform > well either way. invalidate_page is run most of the time in KVM > runtime, never invalidate_range_start/end so there would be no difference. So then the important question is GRU. In which case the driver isn't even finished, it is being run on a simulator, and probably very few or no real users of the driver... so that's not a good platform to be arguing for more complexity for the sake of performance IMO. > > I think for the core VM, minimal notifiers are basically trivial, and > > Your minimal notifiers are a whole lot more complex, as they require > to rewrite the tlb-gather in all archs. Go ahead it won't be ready for > 2.6.27 be sure... > > Furthermore despite rewriting tlb-gather they still have all the > disavantages mentioned above. > > What is possible is to have a minimal notifier that adds a branch for > every pte teardown, that's easy done that in Feb, that leaves the > tlb-gather optimization for later, but I still think using tlb-gather > when we can do better and we already do better is wrong. I'm doing range flushing withing tlb-gathers in the patch I posted which does not add a branch for every pte teardown. And I don't really consider range_start/end as better, given my reasons. > > Anyway, I just voice my opinion and let Andrew and Linus decide. To be > > clear: I have not found any actual bugs in Andrea's -mm patchset, only > > some dislikes of the approach. > > Yes, like I said I think this is a matter of taste of what you like of > the tradeoff. There are disadvantages and advantages in both and if we > wait forever to please everyone taste, it'll never go in. > > My feeling is that what is in -mm is better and it will stay for long, > because it fully exploits the ability we have to hold off and reply > the secondary mmu page fault (equivalent of the primary-mmu tlb miss) > something we can't do with the primary mmu tlb miss and that forces us > to implement something as complex (and IPI-flooding) as tlb-gather > logic. And the result besides being theoretically faster in the fast > path both when armed and disarmed, is also simpler than plugging mmu > notifier invalidate_pages inside tlb-gather. So I think it's a better > tradeoff. If I had seen even a single number to show the more complex scheme is better maybe I would be more receptive. I realise there might be some theoretical advantages but I also believe they wouldn't be measurable in real use cases. 
^ permalink raw reply [flat|nested] 26+ messages in thread
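To make the tlb-gather variant concrete, the single range call would sit in the generic mmu_gather flush, something like the sketch below. This is illustrative only, modeled on asm-generic/tlb.h, and is not the posted patch; it assumes an mm-side helper named mmu_notifier_invalidate_range(), in the spirit of the ->invalidate_range callback in the patch at the top of the thread.

/* Illustrative sketch only; not the posted patch. */
static inline void
sample_tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start,
		     unsigned long end)
{
	if (!tlb->need_flush)
		return;
	tlb->need_flush = 0;

	tlb_flush(tlb);				/* CPU TLB shootdown (IPIs) */

	/* one secondary-MMU flush for the whole gathered range, after the
	 * linux ptes are gone and before the pages below are freed */
	mmu_notifier_invalidate_range(tlb->mm, start, end);

	if (!tlb_fast_mode(tlb)) {
		free_pages_and_swap_cache(tlb->pages, tlb->nr);
		tlb->nr = 0;
	}
	/* as noted in the thread, the tlb_remove_page() path that frees
	 * pages when the gather fills up would need the same treatment */
}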
* Re: MMU notifiers review and some proposals
2008-07-26 12:28 ` Nick Piggin
@ 2008-07-26 13:02 ` Andrea Arcangeli
2008-07-26 13:10 ` Nick Piggin
2008-07-26 13:14 ` Nick Piggin
0 siblings, 2 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-26 13:02 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote:
> 3) livelock/starvation problem with TLB holdoff
>
> That's not shooting yourself in the foot if you are forced into the
> design. Definitely there can be indefinite starvation because there
> is no notion of queueing on the range_active count.

I don't see how; it's all per-mm, all other mms can't possibly be held
off. What can happen is that one cpu loops because the other cpu is
mprotecting in a flood forever. Both have to be threads belonging to
the same mm for this holdoff to be meaningful. It's the same app, with
multiple threads. The rest of the system can't even notice whatever
happens in that mm, and they all reschedule all the time, so it can't
starve other tasks. It's not a _global_ hold off! It's a per-mm
holdoff. The counter is in the kvm struct and the kvm struct is bound
to a single mm. Each different kvm struct has a different counter. All
that can happen is that an app shoots itself in the foot; it's actually
easier to run "for (;;)" if it wants to shoot itself in the foot... so
no big deal.

I really can't see any issue with your point 3, and in fact this looks
like a nice locking design to me.

> I'm doing range flushing withing tlb-gathers in the patch I posted which
> does not add a branch for every pte teardown. And I don't really
> consider range_start/end as better, given my reasons.

As you said, you missed tlb_remove_page, and that is the _only_
troublesome one! A bit too easy to skip that one and then claim your
patch is simpler...

Just post a patch that works for real and see if you can avoid
invalidating every pte, or if you have to hack inside
asm-generic/tlb.h... I can't see how you can avoid mangling
asm-generic/tlb.h if you don't want to send IPIs at every single pte
zapped with your patch...

> If I had seen even a single number to show the more complex scheme

Please post a patch that actually works, then we'll re-evaluate what is
the best tradeoff ;).

In the meantime please merge the -mm patches into Linus's tree; this is
taking forever, and if the changes are so small to go Nick's way and
his future "actually working" patch remains so small, it can be applied
incrementally without any problem IMHO, in fact it is presented as an
incremental patch in the first place.

^ permalink raw reply	[flat|nested] 26+ messages in thread
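For reference, the counter/sequence pair Andrea mentions is consumed from the KVM shadow fault path in roughly the following way. This is paraphrased with the error handling trimmed, not a verbatim quote of the kvm code, and the -EAGAIN return convention is invented for the sketch:

/* Paraphrased sketch of the KVM fault-side usage, not verbatim. */
#include <linux/kvm_host.h>

static int sample_kvm_fault(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long mmu_seq;
	pfn_t pfn;

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* may sleep, no locks held */

	spin_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry(vcpu, mmu_seq)) {
		/* an invalidate ran meanwhile: drop the pfn and refault */
		spin_unlock(&vcpu->kvm->mmu_lock);
		kvm_release_pfn_clean(pfn);
		return -EAGAIN;			/* caller retries */
	}
	/* ... install the shadow pte for pfn ... */
	spin_unlock(&vcpu->kvm->mmu_lock);
	return 0;
}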
* Re: MMU notifiers review and some proposals 2008-07-26 13:02 ` Andrea Arcangeli @ 2008-07-26 13:10 ` Nick Piggin 2008-07-26 13:35 ` Andrea Arcangeli 2008-07-26 13:14 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:02:02PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote: > > 3) livelock/starvation problem with TLB holdoff > > > > That's not shooting yourself in the foot if you are forced into the > > design. Definitely there can be indefinite starvation because there > > is no notion of queueing on the range_active count. > > I don't see how, it's all per-mm, all other -mm can't possibly be hold > off. What can happen is that once cpu loops because the other cpu is > mprotecting in a flood forever. Both have to be threads belonging to > the same -mm for this holdoff to be meaningful. It's the same app, > with multiple threads. The rest of the system can't even notice > whatever happens in that -mm and they all reschedule all the time, so > it can't starve other tasks. It's not a _global_ hold off! It's a > per-mm holdoff. The counter is in the kvm struct and the kvm struct is > bind to a single mm. Each different kvm struct has a different > counter. all it can happen that an app is shooting itself in the foot, > it's actually easier to run "for (;;)" if it wants to shoot itself in > the foot... so no big deal. > > I really can't see any issue with your point 3, and infact this looks a > nice locking design to me. I am talking about a number of threads starving another thread of the same process, but that isn't shooting themselves in the foot because they might be doing simple normal operations that don't expect the kernel to cause starvation. > > I'm doing range flushing withing tlb-gathers in the patch I posted which > > does not add a branch for every pte teardown. And I don't really > > consider range_start/end as better, given my reasons. > > As you said you missed tlb_remove_page, that is the _only_ troublesome > one! A bit too easy to skip that one and then claim your patch as > simpler... > > Just post a patch that works for real and see if you can avoid to > invalidate every pte, or if you've to hack inside > asm-generic/tlb.h... I can't see how you can avoid to mangle on the > asm-generic/tlb.h if you don't want to send IPIs at every single pte > zapped with your patch... What's the problem with patching asm-*/tlb.h? They're just other files, but they all form part os the same tlb flushing design. > > If I had seen even a single number to show the more complex scheme > > Please post a patch that actually works then we'll re-evaluate what is > the best tradeoff ;). > > In the meantime please merge -mm patches into Linus's tree, this is > taking forever and if the changes are so small to go Nick's way and > his future "actually working" patch remains so small, it can be > applied incrementally without any problem IMHO, infact it is presented > as an incremental patch in the first place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:10 ` Nick Piggin @ 2008-07-26 13:35 ` Andrea Arcangeli 2008-07-27 12:25 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:35 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:10:15PM +0200, Nick Piggin wrote: > I am talking about a number of threads starving another thread of the > same process, but that isn't shooting themselves in the foot because > they might be doing simple normal operations that don't expect the > kernel to cause starvation. I thought you worried about security issues sorry. Note that each user is free to implement the locks as it wants. Given what you describe is a feature and not a bug for KVM, we use readonly locks and that can't lead to security issues either as you agreed above. But nothing prevents you to use a spinlock or rwlock and take the read side of it from the page fault and the write side of it from the range_start/end. Careful with the rwlock though because that is a read-starved lock so it won't buy you anything, it was just to make an example ;). Surely you can build a fair rwlock if that is what your app requires. For KVM we want to go unfair and purely readonly in the kvm page fault because we know the address space is almost never mangled and that makes it a perfect fit as it'll never starve for all remotely useful cases, and it can't be exploited either. So you're actually commenting on the kvm implementation of the lowlevel methods, but we're not forcing that implementation to all users. You surely can use the mmu notifiers in -mm, to have the secondary mmu page fault block and takeover from the range_start/end and take range_start/end out of the game. So really this isn't a valid complaint... infact the same issue exists for invalidate_page or your invalidate_range. It's up to each user to decide if to make the page fault slower but 100% fair and higher priority than the munmap flood coming from the other threads of the same mm. > What's the problem with patching asm-*/tlb.h? They're just other files, > but they all form part os the same tlb flushing design. Nothing fundamental, but at least for the last half an year apparently moving mmu notifier invalidate calls into arch pte/tlb flushing code was considered a negative in layering terms (I did that initially with ptep_clear_flush). I suggest you make another patch that actually works so we've a better picture of how the changes to tlb-gather looks like. I think the idea of calling invalidate_page before every tlb_remove_page should be rejected. Hope also this shows how those discussions are endless and pointless if there's nothing we can deploy to the KVM users in the first place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-26 13:35 ` Andrea Arcangeli
@ 2008-07-27 12:25 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:25 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 03:35:25PM +0200, Andrea Arcangeli wrote:
> On Sat, Jul 26, 2008 at 03:10:15PM +0200, Nick Piggin wrote:
> > I am talking about a number of threads starving another thread of the
> > same process, but that isn't shooting themselves in the foot because
> > they might be doing simple normal operations that don't expect the
> > kernel to cause starvation.
>
> I thought you worried about security issues sorry.
>
> Note that each user is free to implement the locks as it wants. Given
> what you describe is a feature and not a bug for KVM, we use readonly
> locks and that can't lead to security issues either as you agreed
> above.

Well yes, KVM has a very controlled type of process which will have
notifiers registered. But KVM isn't particularly the interesting case.
GRU is going to have perhaps arbitrarily behaving tasks registering mmu
notifiers, and they probably won't even know about it because it will
probably be from a library call.

> But nothing prevents you to use a spinlock or rwlock and take the read
> side of it from the page fault and the write side of it from the
> range_start/end. Careful with the rwlock though because that is a
> read-starved lock so it won't buy you anything, it was just to make an
> example ;). Surely you can build a fair rwlock if that is what your
> app requires. For KVM we want to go unfair and purely readonly in the
> kvm page fault because we know the address space is almost never
> mangled and that makes it a perfect fit as it'll never starve for all
> remotely useful cases, and it can't be exploited either.
>
> So you're actually commenting on the kvm implementation of the
> lowlevel methods, but we're not forcing that implementation to all
> users. You surely can use the mmu notifiers in -mm, to have the
> secondary mmu page fault block and takeover from the range_start/end
> and take range_start/end out of the game.

I was commenting on GRU mainly. But no, you can't really do anything
else with the range callbacks, because range_end is called after the
pages are released. Or do you mean have the range invalidate actually
make the page fault handler spin-wait rather than retry? I don't really
like that either, although it may avoid the complete starvation.

> So really this isn't a valid complaint... infact the same issue exists
> for invalidate_page or your invalidate_range. It's up to each user to
> decide if to make the page fault slower but 100% fair and higher
> priority than the munmap flood coming from the other threads of the
> same mm.

Actually it won't have to. Because the invalidate callback runs after
the linux ptes have gone away, we can basically allow the page fault
just to either find the linux pte if it runs first (in which case the
invalidate callback just has to invalidate it); or it finds no linux
pte and segfaults.

> > What's the problem with patching asm-*/tlb.h? They're just other files,
> > but they all form part os the same tlb flushing design.
>
> Nothing fundamental, but at least for the last half an year apparently
> moving mmu notifier invalidate calls into arch pte/tlb flushing code
> was considered a negative in layering terms (I did that initially with
> ptep_clear_flush).
>
> I suggest you make another patch that actually works so we've a better
> picture of how the changes to tlb-gather looks like. I think the idea
> of calling invalidate_page before every tlb_remove_page should be
> rejected.
>
> Hope also this shows how those discussions are endless and pointless
> if there's nothing we can deploy to the KVM users in the first place.

That's true, you need to start with something as simple as possible
until you have some use cases to start with, in which case you can look
at more complexity for more performance...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:02 ` Andrea Arcangeli 2008-07-26 13:10 ` Nick Piggin @ 2008-07-26 13:14 ` Nick Piggin 2008-07-26 13:49 ` Andrea Arcangeli 2008-07-30 14:19 ` Christoph Lameter 1 sibling, 2 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:14 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:02:02PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 02:28:26PM +0200, Nick Piggin wrote: > > > If I had seen even a single number to show the more complex scheme > > Please post a patch that actually works then we'll re-evaluate what is > the best tradeoff ;). > > In the meantime please merge -mm patches into Linus's tree, this is > taking forever and if the changes are so small to go Nick's way and > his future "actually working" patch remains so small, it can be > applied incrementally without any problem IMHO, infact it is presented > as an incremental patch in the first place. BTW. has anyone else actually looked at mmu notifiers or have an opinion on this? It might be helpful for me to get someone else's perspective. I hate to cause conflict but obviously I think I have legitimate concerns so I have to raise them... ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:14 ` Nick Piggin @ 2008-07-26 13:49 ` Andrea Arcangeli 2008-07-27 12:32 ` Nick Piggin 2008-07-30 14:19 ` Christoph Lameter 1 sibling, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:49 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:14:50PM +0200, Nick Piggin wrote: > BTW. has anyone else actually looked at mmu notifiers or have an > opinion on this? It might be helpful for me to get someone else's > perspective. My last submission was for -mm on 26 Jun, and all these developers and lists were in CC: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>, Christoph Lameter <clameter@sgi.com>, Jack Steiner <steiner@sgi.com>, Robin Holt <holt@sgi.com>, Nick Piggin <npiggin@suse.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>, kvm@vger.kernel.org, Kanoj Sarcar <kanojsarcar@yahoo.com>, Roland Dreier <rdreier@cisco.com>, Steve Wise <swise@opengridcomputing.com>, linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>, linux-mm@kvack.org, general@lists.openfabrics.org, Hugh Dickins <hugh@veritas.com>, Rusty Russell <rusty@rustcorp.com.au>, Anthony Liguori <aliguori@us.ibm.com>, Chris Wright <chrisw@redhat.com>, Marcelo Tosatti <marcelo@kvack.org>, Eric Dumazet <dada1@cosmosbay.com>, "Paul E. McKenney" <paulmck@us.ibm.com>, Izik Eidus <izike@qumranet.com>, Anthony Liguori <aliguori@us.ibm.com>, Rik van Riel <riel@redhat.com> The ones explicitly agreeing (about all or part depending on the areas of interest, and not just of the first patch adding the new list.h function which is mostly unrelated) were Linus, Christoph, Jack, Robin, Avi, Marcelo, Rik and last but not the least Paul. Everyone else in the list implicitly agrees I assume, hope they're not all waiting 1 month before commenting on it like you did ;). Avi, me, Jack and Robin are the main users of the feature (or at least the main users that are brave enough to be visible on lkml) so that surely speaks well for the happiness of the mmu notifier users about what is in -mm. Infact it is almost a sure thing that the users will always prefer the current patches compared to the minimal notifier. But I also wear a VM (as in virtual memory not virtual machine ;) hat not just a KVM hat, so I surely wouldn't have submitted something that I think is bad for the VM. Infact I opposed certain patches made specifically for XPMEM that could hurt the VM a micro-bit (mostly thinking at UP cellphones). Still I offered to support XPMEM but with a lower priority and done right. I don't happen to dislike mm_take_all_locks, as it's totally localized and _can_never_run_ unless you load one of those kvm or gru modules. I'd rather prefer mmu notifiers to be invisible to the tlb-gather logic, surely it'd be orders of magnitude simpler to delete mm_take_all_locks than to undo the changes to the tlb-gather logic. So if something we should go with -mm first, and then evaluate if the tlb-gather changes are better/worse. ^ permalink raw reply [flat|nested] 26+ messages in thread
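For anyone who has not read the -mm code, the registration-time use of mm_take_all_locks being debated has roughly this shape. This is a simplified sketch: the mmu_notifier_mm allocation, the notifier-list locking and the error paths are omitted, so it is not a verbatim copy of the -mm code.

/* Simplified sketch of the -mm registration path, not verbatim. */
int sample_mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
	int ret;

	down_write(&mm->mmap_sem);
	ret = mm_take_all_locks(mm);	/* all anon_vma and i_mmap locks */
	if (ret)
		goto out;

	/*
	 * With every lock an invalidate path would need now held, no
	 * invalidate can be in flight, so the new notifier cannot miss
	 * one while it is being hooked in.
	 */
	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);

	mm_drop_all_locks(mm);
out:
	up_write(&mm->mmap_sem);
	return ret;
}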
* Re: MMU notifiers review and some proposals 2008-07-26 13:49 ` Andrea Arcangeli @ 2008-07-27 12:32 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-27 12:32 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:49:15PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 03:14:50PM +0200, Nick Piggin wrote: > > But I also wear a VM (as in virtual memory not virtual machine ;) hat > not just a KVM hat, so I surely wouldn't have submitted something that > I think is bad for the VM. Infact I opposed certain patches made > specifically for XPMEM that could hurt the VM a micro-bit (mostly > thinking at UP cellphones). Still I offered to support XPMEM but with > a lower priority and done right. BTW. I don't like this approach especially for XPMEM the infinite starvation will probably be a much bigger issue if tasks can go to sleep there. Perhaps priority inversion problems too. Also making the locks into sleeping locks I don't know if it is such a good approach. We have gone the other way for some reasons in the past, so they have to be addressed. > I don't happen to dislike mm_take_all_locks, as it's totally localized > and _can_never_run_ unless you load one of those kvm or gru > modules. I'd rather prefer mmu notifiers to be invisible to the > tlb-gather logic, surely it'd be orders of magnitude simpler to delete > mm_take_all_locks than to undo the changes to the tlb-gather logic. So > if something we should go with -mm first, and then evaluate if the > tlb-gather changes are better/worse. The thing about mm_take_all_locks that I don't think you quite appreciate is that it isn't totally localized. It now adds a contract to the rest of the VM to say it isn't allowed to invalidate anything while it is being run. If it literally was self contained and zero impact, of course I wouldn't care about it from the core VM perspective, but I still think it might be a bad idea for some of the GRU (and XPMEM) use cases. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:14 ` Nick Piggin 2008-07-26 13:49 ` Andrea Arcangeli @ 2008-07-30 14:19 ` Christoph Lameter 2008-07-30 14:54 ` Andrea Arcangeli 1 sibling, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2008-07-30 14:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Nick Piggin wrote: > BTW. has anyone else actually looked at mmu notifiers or have an > opinion on this? It might be helpful for me to get someone else's > perspective. Yes we have had so much talk about this that I am a bit tired of talking about it. I vaguely remember bringing up the same point a couple of months ago. If you can make it work then great. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals
2008-07-30 14:19 ` Christoph Lameter
@ 2008-07-30 14:54 ` Andrea Arcangeli
2008-07-30 15:42 ` Christoph Lameter
0 siblings, 1 reply; 26+ messages in thread
From: Andrea Arcangeli @ 2008-07-30 14:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, linux-arch, steiner

On Wed, Jul 30, 2008 at 09:19:44AM -0500, Christoph Lameter wrote:
> Yes we have had so much talk about this that I am a bit tired of
> talking about it. I vaguely remember bringing up the same point a
> couple of months ago. If you can make it work then great.

I think the current implementation is fine for the long run: it can
provide the fastest performance when armed, and each invalidate either
requires IPIs or it may need to access the southbridge, so when freeing
large areas of memory it's good to be able to do a single invalidate.

If I can add another comment: I think if a new user of
mm_take_all_locks shows up, that will further confirm that such a
method is useful and should stay. Of course it needs to be a legitimate
usage that allows improving the performance of the fast paths, like the
mmu-notifier usage does. And if Nick's right that mm_take_all_locks
will ever become a limitation, removing it is trivial, much much
simpler than undoing mmu-notifier changes to tlb-gather. So until Nick
goes ahead and removes the anon_vma->lock (and I don't think that's
feasible without screwing up other paths in ways much more troublesome
than mm_take_all_locks), I think this is fine to stay. If you have
trouble removing anon_vma->lock, it won't be because of
mm_take_all_locks, be sure ;). If you ever get there, we'll add
invalidate_page before tlb_remove_page and be done with it for the
benefit of the VM, no problem at all.

If we ever need to add a scheduling capability to mmu notifiers (like
for XPMEM or perhaps in the future infiniband), that's nearly trivially
feasible too, without having to alter the API at all (something not
feasible with other implementations).

Nevertheless I'm ok if we want to alter the implementation in the
future for whatever good/wrong reason: the only important thing to me
is that from now on all kernels will have this functionality one way or
another, because KVM already depends on it and it swaps much better
now! The current implementation is bug-free, well tested, looks great
to me and there's no urgency to alter it. It's surely what all the
mmu-notifier users prefer, and I want to thank everyone for the help in
getting here and all the good/bad feedback provided that helped improve
the code so much.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-30 14:54 ` Andrea Arcangeli @ 2008-07-30 15:42 ` Christoph Lameter 2008-07-31 6:14 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2008-07-30 15:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Andrea Arcangeli wrote: > I think the current implementation is fine for the long run, it can > provide the fastest performance when armed, and each invalidate either > requires IPIs or it may may need to access the southbridge, so when > freeing large areas of memory it's good being able to do a single > invalidate. Right. A couple of months ago we had this discussion and agreed that the begin / end was the way to go. I still support that decision. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-30 15:42 ` Christoph Lameter @ 2008-07-31 6:14 ` Nick Piggin 2008-07-31 14:19 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-31 6:14 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner On Wed, Jul 30, 2008 at 10:42:12AM -0500, Christoph Lameter wrote: > Andrea Arcangeli wrote: > > > I think the current implementation is fine for the long run, it can > > provide the fastest performance when armed, and each invalidate either > > requires IPIs or it may may need to access the southbridge, so when > > freeing large areas of memory it's good being able to do a single > > invalidate. > > Right. A couple of months ago we had this discussion and agreed that the begin / end was the way to go. I still support that decision. That's OK. We don't have to make decisions just by people supporting one way or the other, because I'll come up with some competing patches and if they turn out to be significantly simpler to the core VM without having a significant negative impact on performance then naturally everybody should be happy to merge them, so nobody has to argue with handwaving. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-31 6:14 ` Nick Piggin @ 2008-07-31 14:19 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2008-07-31 14:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner Nick Piggin wrote: > That's OK. We don't have to make decisions just by people supporting one > way or the other, because I'll come up with some competing patches and > if they turn out to be significantly simpler to the core VM without having > a significant negative impact on performance then naturally everybody should > be happy to merge them, so nobody has to argue with handwaving. We make decisions based on technical issues. If you can come up with a solution that addresses the issues (please review the earlier discussion on the subject matter) then we will all be happy to see that merged. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli 2008-07-26 12:28 ` Nick Piggin @ 2008-07-26 12:33 ` Nick Piggin 2008-07-26 13:04 ` Nick Piggin 2 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 12:33 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > On Sat, Jul 26, 2008 at 05:08:10AM +0200, Nick Piggin wrote: > > Anyway, I just voice my opinion and let Andrew and Linus decide. To be > > clear: I have not found any actual bugs in Andrea's -mm patchset, only > > some dislikes of the approach. > > Yes, like I said I think this is a matter of taste of what you like of > the tradeoff. There are disadvantages and advantages in both and if we > wait forever to please everyone taste, it'll never go in. And for this item, I think there has been a bit too much emphasis on pleasing the taste of the drivers and not enough on the core VM. My concern about adding a new TLB flushing design to core VM was never taken seriously, for example. Nor was my request for performance numbers. And I did ask early in the year. I believe when you don't have any real numbers for justification, the only sane thing to do is go with the most minimal and simplest implementation first, and add complexity if/when it can be justified. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 11:38 ` Andrea Arcangeli 2008-07-26 12:28 ` Nick Piggin 2008-07-26 12:33 ` Nick Piggin @ 2008-07-26 13:04 ` Nick Piggin 2008-07-26 13:16 ` Andrea Arcangeli 2 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2008-07-26 13:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > > 1) absolute minimal intrusion into the kernel common code, and > absolute minimum number of branches added to the kernel fast > paths. Kernel is faster than your "minimal" type of notifiers when > they're disarmed. BTW. is this really significant? Having one branch per pte I don't think is necessarily slower than 2 branches per unmap. The 2 branches will use more icache and more branch history. One branch even once per pte in the unmapping loop is going to remain hot in icache and branch history isn't it? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-26 13:04 ` Nick Piggin @ 2008-07-26 13:16 ` Andrea Arcangeli 2008-07-27 12:08 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2008-07-26 13:16 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, steiner, cl On Sat, Jul 26, 2008 at 03:04:06PM +0200, Nick Piggin wrote: > On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote: > > > > 1) absolute minimal intrusion into the kernel common code, and > > absolute minimum number of branches added to the kernel fast > > paths. Kernel is faster than your "minimal" type of notifiers when > > they're disarmed. > > BTW. is this really significant? Having one branch per pte > I don't think is necessarily slower than 2 branches per unmap. > > The 2 branches will use more icache and more branch history. One > branch even once per pte in the unmapping loop is going to remain > hot in icache and branch history isn't it? Even if branch-predicted and icached, it's still more executable to compute in a tight loop. Even if quick it'll accumulate cycles. Said that perhaps you're right that my point 1 wasn't that important or not a tangible positive, but surely doing a secondary mmu invalidate for each pte zapped isn't ideal... that's the whole point of the tlb-gather logic, nobody wants to do that not even for the primary tlb, and surely not for the secondary-mmu that may not even be as fast as the primary-tlb at invalidating. Hence the very simple patch is clearly inferior when they're armed (if only equivalent when they're disarmed)... I think we can argue once you've reduced the frequency of the secondary mmu invalidates of a factor of 500 by mangling over the tlb gather logic per-arch. ^ permalink raw reply [flat|nested] 26+ messages in thread
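Put concretely, the frequency argument is about the difference between these two shapes of teardown loop. This is schematic C only: the sample_* and *_stub helpers are invented, and mmu_notifier_invalidate_range() is assumed to be the single range hook discussed earlier in the thread.

/* Schematic only; sample_* and *_stub names are invented. */
static void sample_clear_one_pte(struct mm_struct *mm, unsigned long addr);
static void flush_tlb_range_stub(struct mm_struct *mm, unsigned long start,
				 unsigned long end);

/* (a) a hook per pte: one secondary-MMU call for every page torn down */
static void sample_zap_per_pte(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		sample_clear_one_pte(mm, addr);
		mmu_notifier_invalidate_page(mm, addr);	/* N calls */
	}
	flush_tlb_range_stub(mm, start, end);
}

/* (b) a hook per range: the secondary flush is batched like the CPU one */
static void sample_zap_per_range(struct mm_struct *mm, unsigned long start,
				 unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		sample_clear_one_pte(mm, addr);
	flush_tlb_range_stub(mm, start, end);
	mmu_notifier_invalidate_range(mm, start, end);	/* 1 call */
}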
* Re: MMU notifiers review and some proposals
2008-07-26 13:16 ` Andrea Arcangeli
@ 2008-07-27 12:08 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:08 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List,
linux-arch, steiner, cl

On Sat, Jul 26, 2008 at 03:16:51PM +0200, Andrea Arcangeli wrote:
> On Sat, Jul 26, 2008 at 03:04:06PM +0200, Nick Piggin wrote:
> > On Sat, Jul 26, 2008 at 01:38:13PM +0200, Andrea Arcangeli wrote:
> > >
> > > 1) absolute minimal intrusion into the kernel common code, and
> > > absolute minimum number of branches added to the kernel fast
> > > paths. Kernel is faster than your "minimal" type of notifiers when
> > > they're disarmed.
> >
> > BTW. is this really significant? Having one branch per pte
> > I don't think is necessarily slower than 2 branches per unmap.
> >
> > The 2 branches will use more icache and more branch history. One
> > branch even once per pte in the unmapping loop is going to remain
> > hot in icache and branch history isn't it?
>
> Even if branch-predicted and icached, it's still more executable to
> compute in a tight loop. Even if quick it'll accumulate cycles. Said

True, but having 2 branches and more icache is more likely to be a
branch mispredict or icache miss, which costs a *lot* more than cached,
predicted branches do. It's all speculation, but my point is that it is
not accurate to say my version would be slower, because in some cases
it would be the opposite.

> that perhaps you're right that my point 1 wasn't that important or not
> a tangible positive, but surely doing a secondary mmu invalidate for
> each pte zapped isn't ideal... that's the whole point of the
> tlb-gather logic, nobody wants to do that not even for the primary
> tlb, and surely not for the secondary-mmu that may not even be as fast
> as the primary-tlb at invalidating. Hence the very simple patch is
> clearly inferior when they're armed (if only equivalent when they're
> disarmed)...

See, the thing about that is I don't actually dispute that in some
cases the range start/end case will definitely be faster. However,
firstly KVM as you say doesn't really care, and secondly we don't have
numbers for GRU (I'm talking about start/end vs gather).

> I think we can argue once you've reduced the frequency of the
> secondary mmu invalidates of a factor of 500 by mangling over the tlb
> gather logic per-arch.

OK, we'll see...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin 2008-07-24 14:39 ` Nick Piggin 2008-07-25 21:45 ` Andrea Arcangeli @ 2008-07-25 23:29 ` Jack Steiner 2008-07-26 3:18 ` Nick Piggin 2008-07-27 9:45 ` Andrew Morton 3 siblings, 1 reply; 26+ messages in thread From: Jack Steiner @ 2008-07-25 23:29 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, cl > > I also think we should be cautious to make this slowpath so horribly > slow. For KVM it's fine, but perhaps for GRU we'd actually want to > register quite a lot of notifiers, unregister them, and perhaps even > do per-vma registrations if we don't want to slow down other memory > management operations too much (eg. if it is otherwise just some > regular processes just doing accelerated things with GRU). I can't speak for other possible users of the mmu notifier mechanism but the GRU will only register a single notifier for a process. The notifier is registered when the GRU is opened and (currently) unregistered when the GRU is closed. If a task opens multiple GRUs (quite possible for threaded apps), all shared the same notifier. Regardless of the number of threads/opens, there will be a single notifier. At least one version of mmu notifiers did not support unregistration. The notifier was left in the chain until the task exited. This also worked for the GRU. If the gru was subsequently reopened the notifier was reused. Providing the ability to unregister is the right thing to do if the cost is not excessive. From a practical standpoint, for the GRU usage, unregistration is not a big issue either way. --- jack ^ permalink raw reply [flat|nested] 26+ messages in thread
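The single-notifier-per-mm arrangement Jack describes would look roughly like this. It is a simplified sketch built around the gru_mm_struct fields visible in the patch earlier in the thread; the mm lookup helper is invented and the locking is omitted, so it is not the actual driver code.

/* Simplified sketch, not the actual GRU driver code. */
static struct gru_mm_struct *sample_find_notifier(struct mm_struct *mm);

static struct gru_mm_struct *sample_register_mmu_notifier(void)
{
	struct gru_mm_struct *gms;
	int ret;

	gms = sample_find_notifier(current->mm);	/* invented lookup */
	if (gms) {
		atomic_inc(&gms->ms_refcnt);	/* second open / thread */
		return gms;
	}

	gms = kzalloc(sizeof(*gms), GFP_KERNEL);
	if (!gms)
		return ERR_PTR(-ENOMEM);

	atomic_set(&gms->ms_refcnt, 1);
	init_waitqueue_head(&gms->ms_wait_queue);
	gms->ms_notifier.ops = &gru_mmuops;
	ret = mmu_notifier_register(&gms->ms_notifier, current->mm);
	if (ret) {
		kfree(gms);
		return ERR_PTR(ret);
	}
	return gms;
}

static void sample_unregister_mmu_notifier(struct gru_mm_struct *gms)
{
	if (atomic_dec_and_test(&gms->ms_refcnt)) {
		mmu_notifier_unregister(&gms->ms_notifier, current->mm);
		kfree(gms);
	}
}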
* Re: MMU notifiers review and some proposals 2008-07-25 23:29 ` Jack Steiner @ 2008-07-26 3:18 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2008-07-26 3:18 UTC (permalink / raw) To: Jack Steiner Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, linux-arch, andrea, cl On Fri, Jul 25, 2008 at 06:29:09PM -0500, Jack Steiner wrote: > > > > I also think we should be cautious to make this slowpath so horribly > > slow. For KVM it's fine, but perhaps for GRU we'd actually want to > > register quite a lot of notifiers, unregister them, and perhaps even > > do per-vma registrations if we don't want to slow down other memory > > management operations too much (eg. if it is otherwise just some > > regular processes just doing accelerated things with GRU). > > I can't speak for other possible users of the mmu notifier mechanism > but the GRU will only register a single notifier for a process. > The notifier is registered when the GRU is opened and (currently) > unregistered when the GRU is closed. > > If a task opens multiple GRUs (quite possible for threaded apps), all > shared the same notifier. Regardless of the number of threads/opens, > there will be a single notifier. OK, sure, but you might have a lot of processes opening the GRU, no? (I'm not sure how it is exactly going to be used, but if you have a number of GRU units per blade...) So you have a machine with 4096 CPUs, and maybe 4096 processes that each have to lock the mapping of exec, glibc, pthreads, grulib, etc etc. Or you might have a situation where many short lived processes do some operating on GRU. Anyway, I won't dispute your data point (thanks for that), but still noting that a very heavy registration is obviously less desirable than a faster one, all else being equal... ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: MMU notifiers review and some proposals 2008-07-24 14:39 MMU notifiers review and some proposals Nick Piggin ` (2 preceding siblings ...) 2008-07-25 23:29 ` Jack Steiner @ 2008-07-27 9:45 ` Andrew Morton 2008-07-27 12:38 ` Nick Piggin 3 siblings, 1 reply; 26+ messages in thread From: Andrew Morton @ 2008-07-27 9:45 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Linux Memory Management List, linux-arch, andrea, steiner, cl On Thu, 24 Jul 2008 16:39:49 +0200 Nick Piggin <npiggin@suse.de> wrote: > I think everybody is hoping to have a workable mmu notifier scheme > merged in 2.6.27 (myself included). However I do have some concerns > about the implementation proposed (in -mm). > > I apologise for this late review, before anybody gets too upset, > most of my concerns have been raised before, but I'd like to state > my case again and involving everyone. Nick, having read through this discussion and the code (yet again) I think I'll go ahead and send it all in to Linus. On the basis that - the code is fairly short and simple - has no known bugs - seems to be needed by some folks ;) - you already have a protopatch which partially addresses your concerns and afaik there's nothing blocking future improvements to this implementation? And a late-breaking review comment: given that about 0.000000000000001% of people will actually use mm_take_all_locks(), could we make its compilation conditional on something? Such as CONFIG_MMU_NOTIFIER? ^ permalink raw reply [flat|nested] 26+ messages in thread
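Something along these lines would do for the conditional compilation Andrew suggests. This is a sketch only; a real patch would also cover the declarations and keep the stubs out of sight of any other potential callers.

/* include/linux/mm.h -- sketch */
#ifdef CONFIG_MMU_NOTIFIER
extern int mm_take_all_locks(struct mm_struct *mm);
extern void mm_drop_all_locks(struct mm_struct *mm);
#endif

/* mm/mmap.c -- sketch: only build the helpers when a user can exist */
#ifdef CONFIG_MMU_NOTIFIER
int mm_take_all_locks(struct mm_struct *mm)
{
	/* ... existing implementation unchanged ... */
}

void mm_drop_all_locks(struct mm_struct *mm)
{
	/* ... existing implementation unchanged ... */
}
#endif /* CONFIG_MMU_NOTIFIER */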
* Re: MMU notifiers review and some proposals
2008-07-27 9:45 ` Andrew Morton
@ 2008-07-27 12:38 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-07-27 12:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Linux Memory Management List, linux-arch, andrea,
steiner, cl

On Sun, Jul 27, 2008 at 02:45:20AM -0700, Andrew Morton wrote:
> On Thu, 24 Jul 2008 16:39:49 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > I think everybody is hoping to have a workable mmu notifier scheme
> > merged in 2.6.27 (myself included). However I do have some concerns
> > about the implementation proposed (in -mm).
> >
> > I apologise for this late review, before anybody gets too upset,
> > most of my concerns have been raised before, but I'd like to state
> > my case again and involving everyone.
>
> Nick, having read through this discussion and the code (yet again) I
> think I'll go ahead and send it all in to Linus. On the basis that

Your call. As I said, I was OK with you using your judgement.

> - the code is fairly short and simple
>
> - has no known bugs
>
> - seems to be needed by some folks ;)
>
> - you already have a protopatch which partially addresses your
> concerns and afaik there's nothing blocking future improvements to
> this implementation?

Seems like I was able to get my head around KVM and GRU more easily
than I had feared. I don't expect help, but I will say that good
simplifications will always get merged unless there is a reasonable
performance case not to, so I will work on patches.

> And a late-breaking review comment: given that about 0.000000000000001%
> of people will actually use mm_take_all_locks(), could we make its
> compilation conditional on something? Such as CONFIG_MMU_NOTIFIER?

Well, I prefer that, because I don't want any new callers to pop up.

^ permalink raw reply	[flat|nested] 26+ messages in thread