* [patch 0/6] MMU Notifiers V6
@ 2008-02-08 22:06 Christoph Lameter
2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
` (7 more replies)
0 siblings, 8 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt
This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what the kernel's rmaps track (an external MMU). MMU
notifiers allow us to get rid of page pinning for RDMA and various
other purposes, and of the broken use of mlock for page pinning.
(mlock really does *not* pin pages....)
More information on the rationale and the technical details can be found in
the first patch and the README provided by that patch in
Documentation/mmu_notifiers.
The known immediate users are:
KVM
- Establishes a refcount to the page via get_user_pages().
- External references are called spte.
- Has page tables to track pages whose refcount was elevated but
no reverse maps.
GRU
- Simple additional hardware TLB (possibly covering multiple instances of
Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can
fall back to get_user_pages().
- No page reference possible since no page status is kept.
XPmem
- Allows use of a process's memory by remote instances of Linux.
- Provides its own reverse mappings to track remote ptes.
- Establishes refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being
cleared.
Andrea's mmu_notifier #4 -> RFC V1
- Merge the subsystem-rmap-based and the Linux-rmap-based approaches
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
called.
- Develop a patch sequence that separates out the different types of
hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.
V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range() to indicate whether a spinlock
is held.
- Add invalidate_all()
V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
and sys_remap_file_pages() after the pte clearing.
V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
we are the only executing thread. This is also true for invalidate_all() so
we could drop off the mmu_notifier there early. Use hlist_del_init instead
of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch
V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out
structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place
of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it
ourselves. In some contexts where we want to register/unregister we are
already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply
all patches for KVM and GRU.
V5->V6:
- Provide missing range callouts for mprotect
- Fix do_wp_page control path sequencing
- Clarify locking conventions
- GRU and XPmem confirmed to work with this patchset.
- Provide skeleton code for GRU/KVM type callback and for XPmem type.
- Rework documentation and put it into Documentation/mmu_notifier.
--
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
* [patch 1/6] mmu_notifier: Core code
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  ` (6 subsequent siblings)
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
      daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 18454 bytes --]

MMU notifiers are used for hardware and software that establish
external references to pages managed by the Linux kernel. These are
page table entries or tlb entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.

The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.

The notification scheme is much better than the current scheme of
avoiding the danger of the VM removing pages that are externally mapped.
We currently mlock pages used for RDMA, XPmem etc in memory.

Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of the role
POSIX specifies for mlock. Mlock does *not* pin pages in memory. Mlock
just means do not allow the page to be moved to swap.
Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).

The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.

The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning.

This patch: Core portion

Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org>

---
 Documentation/mmu_notifier/README |   99 +++++++++++++++++++++
 include/linux/mm_types.h          |    7 +
 include/linux/mmu_notifier.h      |  175 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                     |    2
 mm/Kconfig                        |    4
 mm/Makefile                       |    1
 mm/mmap.c                         |    2
 mm/mmu_notifier.c                 |   76 ++++++++++++++++
 8 files changed, 366 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README	2008-02-08 12:30:47.000000000 -0800
@@ -0,0 +1,99 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establish
+external references to pages managed by the Linux kernel. These are
+page table entries or tlb entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current scheme of
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory.
+
+Mlock causes problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. It is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allows
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows the device driver to implement its own rmap
+and allows the device driver to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageRmapExternal(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional Page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build an mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+	Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-02-08 12:30:47.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-08 12:35:14.000000000 -0800
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ *	These are callbacks registered with an mm_struct. If pages are
+ *	removed from an address space then callbacks are performed.
+ *
+ *	Spinlocks must be held in order to walk reverse maps. The
+ *	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ *	The invalidate_range_start/end callbacks can be performed in contexts
+ *	where sleeping is allowed or in atomic contexts. A flag is passed
+ *	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */

+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() has not been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start, unsigned long end,
+				       int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start, unsigned long end,
+				     int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) {	\
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+						 &(mm)->mmu_notifier.head, \
+						 hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+					 struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+					   struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+					unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */

Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-02-08 12:30:47.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"

Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-02-08 12:30:47.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-02-08 12:44:24.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *		Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					 &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-02-08 12:30:47.000000000 -0800
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-08 12:43:59.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
  ` (5 subsequent siblings)
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
      daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 11534 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range() to indicate that no sleeping is
possible. Locks are only held for truncate and huge pages.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while we update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org> Signed-off-by: Robin Holt <holt-sJ/iWh9BUns@public.gmane.org> Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org> --- mm/filemap_xip.c | 5 +++++ mm/fremap.c | 3 +++ mm/hugetlb.c | 3 +++ mm/memory.c | 35 +++++++++++++++++++++++++++++------ mm/mmap.c | 2 ++ mm/mprotect.c | 3 +++ mm/mremap.c | 7 ++++++- 7 files changed, 51 insertions(+), 7 deletions(-) Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-02-08 13:18:58.000000000 -0800 +++ linux-2.6/mm/fremap.c 2008-02-08 13:25:22.000000000 -0800 @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier(invalidate_range_end, mm, start, start + size, 0); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2008-02-08 13:22:14.000000000 -0800 +++ linux-2.6/mm/memory.c 2008-02-08 13:25:22.000000000 -0800 @@ -51,6 +51,7 @@ #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0); + dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { @@ -621,6 
+625,11 @@ int copy_page_range(struct mm_struct *ds vma, addr, next)) return -ENOMEM; } while (dst_pgd++, src_pgd++, addr = next, addr != end); + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_end, src_mm, + vma->vm_start, end, 0); + return 0; } @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; + int atomic = details ? (details->i_mmap_lock != 0) : 0; lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); + mmu_notifier(invalidate_range_begin, mm, address, end, atomic); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range_end, mm, address, end, atomic); return end; } @@ -1337,7 +1349,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1371,6 +1383,7 @@ int remap_pfn_range(struct vm_area_struc pfn -= addr >> PAGE_SHIFT; pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); do { next = pgd_addr_end(addr, end); err = remap_pud_range(mm, pgd, addr, next, @@ -1378,6 +1391,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range_end, mm, start, end, 0); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1461,10 +1475,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1472,6 +1487,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while 
(pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range_end, mm, start, end, 0); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1612,8 +1628,10 @@ static int do_wp_page(struct mm_struct * page_table = pte_offset_map_lock(mm, pmd, address, &ptl); page_cache_release(old_page); - if (!pte_same(*page_table, orig_pte)) - goto unlock; + if (!pte_same(*page_table, orig_pte)) { + pte_unmap_unlock(page_table, ptl); + goto check_dirty; + } page_mkwrite = 1; } @@ -1629,7 +1647,8 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; - goto unlock; + pte_unmap_unlock(page_table, ptl); + goto check_dirty; } /* @@ -1651,6 +1670,8 @@ gotten: if (mem_cgroup_charge(new_page, mm, GFP_KERNEL)) goto oom_free_new; + mmu_notifier(invalidate_range_begin, mm, address, + address + PAGE_SIZE, 0); /* * Re-check the pte - we dropped the lock */ @@ -1689,8 +1710,10 @@ gotten: page_cache_release(new_page); if (old_page) page_cache_release(old_page); -unlock: pte_unmap_unlock(page_table, ptl); + mmu_notifier(invalidate_range_end, mm, + address, address + PAGE_SIZE, 0); +check_dirty: if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-02-08 13:25:21.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-02-08 13:25:22.000000000 -0800 @@ -1748,11 +1748,13 @@ static void unmap_region(struct mm_struc lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mmu_notifier(invalidate_range_end, mm, start, end, 0); } /* Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-02-08 13:22:14.000000000 -0800 +++ linux-2.6/mm/hugetlb.c 2008-02-08 13:25:22.000000000 -0800 @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -753,6 +754,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier(invalidate_range_begin, mm, start, end, 1); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -773,6 +775,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier(invalidate_range_end, mm, start, end, 1); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-02-08 13:22:14.000000000 -0800 +++ linux-2.6/mm/filemap_xip.c 2008-02-08 13:25:22.000000000 -0800 @@ -13,6 +13,7 @@ #include <linux/module.h> #include <linux/uio.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <linux/sched.h> #include <asm/tlbflush.h> @@ -190,6 +191,8 @@ __xip_unmap (struct address_space * mapp address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); BUG_ON(address < vma->vm_start || address >= vma->vm_end); + mmu_notifier(invalidate_range_begin, mm, address, + address + PAGE_SIZE, 1); pte = page_check_address(page, mm, address, &ptl); if (pte) { /* Nuke the page table entry. 
*/ @@ -201,6 +204,8 @@ __xip_unmap (struct address_space * mapp pte_unmap_unlock(pte, ptl); page_cache_release(page); } + mmu_notifier(invalidate_range_end, mm, + address, address + PAGE_SIZE, 1); } spin_unlock(&mapping->i_mmap_lock); } Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-02-08 13:18:58.000000000 -0800 +++ linux-2.6/mm/mremap.c 2008-02-08 13:25:22.000000000 -0800 @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len) { - unsigned long extent, next, old_end; + unsigned long extent, next, old_start, old_end; pmd_t *old_pmd, *new_pmd; + old_start = old_addr; old_end = old_addr + len; flush_cache_range(vma, old_addr, old_end); + mmu_notifier(invalidate_range_begin, vma->vm_mm, + old_addr, old_end, 0); for (; old_addr < old_end; old_addr += extent, new_addr += extent) { cond_resched(); next = (old_addr + PMD_SIZE) & PMD_MASK; @@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr); } + mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0); return len + old_addr - old_end; /* how much done */ } Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-02-08 13:18:58.000000000 -0800 +++ linux-2.6/mm/mprotect.c 2008-02-08 13:25:22.000000000 -0800 @@ -21,6 +21,7 @@ #include <linux/syscalls.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + 
mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

--
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 3/6] mmu_notifier: invalidate_page callbacks
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
  ` (4 subsequent siblings)
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_invalidate_page --]
[-- Type: text/plain, Size: 3325 bytes --]

Two callbacks to remove individual pages as done in rmap code:

invalidate_page()
	Called from the inner loop of rmap walks to invalidate pages.

age_page()
	Called for the determination of the page referenced status.
	If we do not care about page referenced status then an age_page
	callback may be omitted.

The page lock and the pte lock are held when either of the functions
is called.
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
Signed-off-by: Robin Holt <holt-sJ/iWh9BUns@public.gmane.org>
Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

---
 mm/rmap.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-02-07 16:49:32.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-02-07 17:25:25.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>

 #include <asm/tlbflush.h>

@@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+	} else if (ptep_clear_flush_young(vma, address, pte) |
+			mmu_notifier_age_page(mm, address))
 		referenced++;

 	/* Pretend the page is referenced if the task has the
@@ -455,6 +457,7 @@ static int page_mkclean_one(struct page
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -712,7 +715,8 @@ static int try_to_unmap_one(struct page
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -720,6 +724,7 @@ static int try_to_unmap_one(struct page
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);

 	/* Move the dirty bit to the physical page now the pte is gone.
	 */
 	if (pte_dirty(pteval))
@@ -844,12 +849,14 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));

-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address))
 			continue;

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  ` (2 preceding siblings ...)
  2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
  ` (3 subsequent siblings)
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_skeleton --]
[-- Type: text/plain, Size: 7634 bytes --]

This is example code for a simple device driver interface to unmap
pages that were externally mapped.

Locking is simple through a single lock that is used to protect the
device driver's data structures as well as a counter that tracks the
active invalidates on a single address space. The invalidation of
external ptes must be possible with code that does not require
sleeping. The lock is taken for all driver operations on the mmu
that the driver manages.

Locking could be made more sophisticated but I think this is going
to be okay for most uses.
Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

---
 Documentation/mmu_notifier/skeleton.c |  239 ++++++++++++++++++++++++++++++
 1 file changed, 239 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/skeleton.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton.c	2008-02-08 13:14:16.000000000 -0800
@@ -0,0 +1,239 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier without rmap callbacks and no need to sleep
+ * during invalidate_page().
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ *		Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+	/* MMU notifier specific fields */
+	struct mmu_notifier notifier;
+	spinlock_t lock;	/* Protects counter and individual zaps */
+	int invalidates;	/* Number of active range_invalidates */
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+		unsigned long address, unsigned long pfn)
+{
+	/* Must be provided */
+	printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+		m, address, pfn);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver.
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+	unsigned long start, unsigned long end, int atomic)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+		m, start, end, atomic);
+}
+
+/*
+ * Zap an individual page.
+ *
+ * The page must be locked and a refcount on the page must
+ * be held when this function is called. The page lock is also
+ * acquired when new references are established and the
+ * page lock effectively takes on the role of synchronization.
+ *
+ * The m->lock is only taken to preserve the integrity of the
+ * driver's data structures since we may also race with
+ * invalidate_range() which will likely access the same mmu
+ * control structures.
+ * m->lock is therefore optional here.
+ */
+static void my_mmu_invalidate_page(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long address)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	spin_lock(&m->lock);
+	my_mmu_zap_page(m, address);
+	spin_unlock(&m->lock);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates++;
+	spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates--;
+	spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	inc_active(m);	/* Holds off new references */
+	my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	dec_active(m);	/* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+	struct vm_area_struct *vma,
+	unsigned long address, int atomic, int write)
+{
+	struct page *page = ERR_PTR(-EAGAIN);
+	int err;
+
+	/* No need to do anything if a range invalidate is running */
+	if (m->invalidates)
+		goto out;
+
+	if (atomic) {
+
+		if (!down_read_trylock(&vma->vm_mm->mmap_sem))
+			goto out;
+
+		/* No concurrent invalidates */
+		page = follow_page(vma, address, FOLL_GET +
+					(write ? FOLL_WRITE : 0));
+
+		up_read(&vma->vm_mm->mmap_sem);
+		if (!page || IS_ERR(page) || TestSetPageLocked(page))
+			goto out;
+
+	} else {
+
+		down_read(&vma->vm_mm->mmap_sem);
+		err = get_user_pages(current, vma->vm_mm, address, 1,
+						write, 1, &page, NULL);
+
+		up_read(&vma->vm_mm->mmap_sem);
+		if (err < 0) {
+			page = ERR_PTR(err);
+			goto out;
+		}
+		lock_page(page);
+
+	}
+
+	/*
+	 * The page is now locked and we are holding a refcount on it.
+	 * So things are tied down. Now we can check the page status.
+	 */
+	if (page_mapped(page)) {
+		/*
+		 * Must take the m->lock here to hold off concurrent
+		 * invalidate_range_b/e. Serialization with invalidate_page()
+		 * occurs because we are holding the page lock.
+		 */
+		spin_lock(&m->lock);
+		if (!m->invalidates)
+			my_mmu_insert_page(m, address, page_to_pfn(page));
+		spin_unlock(&m->lock);
+	}
+	unlock_page(page);
+	put_page(page);
+out:
+	return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+	kfree(m);
+	printk(KERN_INFO "MMU Notifier detaching\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+	my_mmu_release,
+	NULL,			/* No aging function */
+	my_mmu_invalidate_page,
+	my_mmu_invalidate_range_begin,
+	my_mmu_invalidate_range_end
+};
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+	struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+	if (!m)
+		return -ENOMEM;
+
+	m->notifier.ops = &my_mmu_ops;
+	spin_lock_init(&m->lock);
+
+	/*
+	 * mmap_sem handling can be omitted if it is guaranteed that
+	 * the context from which my_mmu_attach_to_process is called
+	 * is already holding a writelock on mmap_sem.
+	 */
+	down_write(&mm->mmap_sem);
+	mmu_notifier_register(&m->notifier, mm);
+	up_write(&mm->mmap_sem);
+
+	/*
+	 * RCU sync is expensive but necessary if we need to guarantee
+	 * that multiple threads running on other cpus have seen the
+	 * notifier changes.
+	 */
+	synchronize_rcu();
+	return 0;
+}

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  ` (3 preceding siblings ...)
  2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
  ` (2 subsequent siblings)
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_rmap_support --]
[-- Type: text/plain, Size: 8541 bytes --]

These special additional callbacks are required because XPmem (and
likely other mechanisms) use their own rmap (multiple processes on a
series of remote Linux instances may be accessing the memory of a
process). F.e. XPmem may have to send out notifications to remote
Linux instances and receive confirmation before a page can be freed.

So we handle this like an additional Linux reverse map that is walked
after the existing rmaps have been walked. We leave the walking to the
driver, which is then able to use something other than a spinlock to
walk its reverse maps. So we can actually call the driver without
holding spinlocks while we hold the page lock.

However, we cannot determine the mm_struct that a page belongs to at
that point. The mm_struct can only be determined from the rmaps by the
device driver.

We add another pageflag (PageExternalRmap) that is set if a page has
been remotely mapped (f.e. by a process from another Linux instance).
We can then only perform the callbacks for pages that are actually in
remote use.
Rmap notifiers need an extra page bit and are only available on 64 bit platforms. This functionality is not available on 32 bit! A notifier that uses the reverse maps callbacks does not need to provide the invalidate_page() method that is called when locks are held. Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org> --- include/linux/mmu_notifier.h | 65 +++++++++++++++++++++++++++++++++++++++++++ include/linux/page-flags.h | 11 +++++++ mm/mmu_notifier.c | 34 ++++++++++++++++++++++ mm/rmap.c | 9 +++++ 4 files changed, 119 insertions(+) Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-02-08 12:35:14.000000000 -0800 +++ linux-2.6/include/linux/page-flags.h 2008-02-08 12:44:33.000000000 -0800 @@ -105,6 +105,7 @@ * 64 bit | FIELDS | ?????? FLAGS | * 63 32 0 */ +#define PG_external_rmap 30 /* Page has external rmap */ #define PG_uncached 31 /* Page has been mapped as uncached */ #endif @@ -296,6 +297,16 @@ static inline void __ClearPageTail(struc #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT) +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags) +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags) +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \ + &(page)->flags) +#else +#define ClearPageExternalRmap(page) do {} while (0) +#define PageExternalRmap(page) 0 +#endif + struct page; /* forward declaration */ extern void cancel_dirty_page(struct page *page, unsigned int account_size); Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-02-08 12:35:14.000000000 -0800 +++ linux-2.6/include/linux/mmu_notifier.h 2008-02-08 
12:44:33.000000000 -0800 @@ -23,6 +23,18 @@ * where sleeping is allowed or in atomic contexts. A flag is passed * to indicate an atomic context. * + * + * 2. mmu_rmap_notifier + * + * Callbacks for subsystems that provide their own rmaps. These + * need to walk their own rmaps for a page. The invalidate_page + * callback is outside of locks so that we are not in a strictly + * atomic context (but we may be in a PF_MEMALLOC context if the + * notifier is called from reclaim code) and are able to sleep. + * + * Rmap notifiers need an extra page bit and are only available + * on 64 bit platforms. + * * Pages must be marked dirty if dirty bits are found to be set in * the external ptes. */ @@ -89,6 +101,23 @@ struct mmu_notifier_ops { int atomic); }; +struct mmu_rmap_notifier_ops; + +struct mmu_rmap_notifier { + struct hlist_node hlist; + const struct mmu_rmap_notifier_ops *ops; +}; + +struct mmu_rmap_notifier_ops { + /* + * Called with the page lock held after ptes are modified or removed + * so that a subsystem with its own rmap's can remove remote ptes + * mapping a page. + */ + void (*invalidate_page)(struct mmu_rmap_notifier *mrn, + struct page *page); +}; + #ifdef CONFIG_MMU_NOTIFIER /* @@ -139,6 +168,27 @@ static inline void mmu_notifier_head_ini } \ } while (0) +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); + +/* Must hold PageLock */ +extern void mmu_rmap_export_page(struct page *page); + +extern struct hlist_head mmu_rmap_notifier_list; + +#define mmu_rmap_notifier(function, args...) 
\ + do { \ + struct mmu_rmap_notifier *__mrn; \ + struct hlist_node *__n; \ + \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mrn, __n, \ + &mmu_rmap_notifier_list, hlist) \ + if (__mrn->ops->function) \ + __mrn->ops->function(__mrn, args); \ + rcu_read_unlock(); \ + } while (0); + #else /* CONFIG_MMU_NOTIFIER */ /* @@ -157,6 +207,16 @@ static inline void mmu_notifier_head_ini }; \ } while (0) +#define mmu_rmap_notifier(function, args...) \ + do { \ + if (0) { \ + struct mmu_rmap_notifier *__mrn; \ + \ + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \ + __mrn->ops->function(__mrn, args); \ + } \ + } while (0); + static inline void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) {} static inline void mmu_notifier_unregister(struct mmu_notifier *mn, @@ -170,6 +230,11 @@ static inline int mmu_notifier_age_page( static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) + {} +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) + {} + #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c 2008-02-08 12:44:24.000000000 -0800 +++ linux-2.6/mm/mmu_notifier.c 2008-02-08 12:44:33.000000000 -0800 @@ -74,3 +74,37 @@ void mmu_notifier_unregister(struct mmu_ } EXPORT_SYMBOL_GPL(mmu_notifier_unregister); +#ifdef CONFIG_64BIT +static DEFINE_SPINLOCK(mmu_notifier_list_lock); +HLIST_HEAD(mmu_rmap_notifier_list); + +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list); + spin_unlock(&mmu_notifier_list_lock); +} +EXPORT_SYMBOL(mmu_rmap_notifier_register); + +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + 
hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-02-08 12:44:30.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-02-08 12:44:33.000000000 -0800
@@ -497,6 +497,10 @@ int page_mkclean(struct page *page)
 	struct address_space *mapping = page_mapping(page);

 	if (mapping) {
 		ret = page_mkclean_file(mapping, page);
+		if (unlikely(PageExternalRmap(page))) {
+			mmu_rmap_notifier(invalidate_page, page);
+			ClearPageExternalRmap(page);
+		}
 		if (page_test_dirty(page)) {
 			page_clear_dirty(page);
 			ret = 1;
@@ -1013,6 +1017,11 @@ int try_to_unmap(struct page *page, int
 	else
 		ret = try_to_unmap_file(page, migration);

+	if (unlikely(PageExternalRmap(page))) {
+		mmu_rmap_notifier(invalidate_page, page);
+		ClearPageExternalRmap(page);
+	}
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  ` (4 preceding siblings ...)
  2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:23 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Andrew Morton
  2008-02-13 14:31 ` Jack Steiner
  7 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt

[-- Attachment #1: mmu_rmap_skeleton --]
[-- Type: text/plain, Size: 8104 bytes --]

The skeleton for the rmap notifier leaves the invalidate_page method of
the mmu_notifier empty and hooks a new invalidate_page callback into the
global chain for mmu_rmap_notifiers.

There are several simplifications in here to avoid making this too
complex; the reverse maps would f.e. need to consist of references
to vmas.

Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

---
 Documentation/mmu_notifier/skeleton_rmap.c |  265 +++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c	2008-02-08 13:25:28.000000000 -0800
@@ -0,0 +1,265 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier with rmap callbacks and sleeping during
+ * invalidate_page.
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ *		Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+	/* MMU notifier specific fields */
+	struct mmu_notifier notifier;
+	spinlock_t lock;	/* Protects counter and individual zaps */
+	int invalidates;	/* Number of active range_invalidate */
+
+	/* Rmap support */
+	struct list_head list;	/* rmap list of my_mmu structs */
+	unsigned long base;
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+		unsigned long address, unsigned long pfn)
+{
+	/* Must be provided */
+	printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+		m, address, pfn);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+	unsigned long start, unsigned long end, int atomic)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+		m, start, end, atomic);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver.
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates++;
+	spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates--;
+	spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	inc_active(m);	/* Holds off new references */
+	my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	dec_active(m);	/* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+	struct vm_area_struct *vma,
+	unsigned long address, int write)
+{
+	struct page *page = ERR_PTR(-EAGAIN);
+	int err;
+
+	/*
+	 * No need to do anything if a range invalidate is running
+	 * Could use a wait queue here to avoid returning -EAGAIN.
+	 */
+	if (m->invalidates)
+		goto out;
+
+	down_read(&vma->vm_mm->mmap_sem);
+	err = get_user_pages(current, vma->vm_mm, address, 1,
+					write, 1, &page, NULL);
+
+	up_read(&vma->vm_mm->mmap_sem);
+	if (err < 0) {
+		page = ERR_PTR(err);
+		goto out;
+	}
+	lock_page(page);
+
+	/*
+	 * The page is now locked and we are holding a refcount on it.
+	 * So things are tied down. Now we can check the page status.
+	 */
+	if (page_mapped(page)) {
+		/* Could do some preprocessing here. Can sleep */
+		spin_lock(&m->lock);
+		if (!m->invalidates)
+			my_mmu_insert_page(m, address, page_to_pfn(page));
+		spin_unlock(&m->lock);
+		/* Could do some postprocessing here. Can sleep */
+	}
+	unlock_page(page);
+	put_page(page);
+out:
+	return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+	/* No concurrent processes thus no worries about RCU */
+	list_del(&m->list);
+	kfree(m);
+	printk(KERN_INFO "MMU Notifier terminating\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+	my_mmu_release,
+	NULL,			/* No aging function */
+	NULL,			/* No atomic invalidate_page function */
+	my_mmu_invalidate_range_begin,
+	my_mmu_invalidate_range_end
+};
+
+/* Rmap specific fields */
+static LIST_HEAD(my_mmu_list);
+static DECLARE_RWSEM(listlock);
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+	struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+	if (!m)
+		return -ENOMEM;
+
+	m->notifier.ops = &my_mmu_ops;
+	spin_lock_init(&m->lock);
+
+	/*
+	 * mmap_sem handling can be omitted if it is guaranteed that
+	 * the context from which my_mmu_attach_to_process is called
+	 * is already holding a writelock on mmap_sem.
+	 */
+	down_write(&mm->mmap_sem);
+	mmu_notifier_register(&m->notifier, mm);
+	up_write(&mm->mmap_sem);
+	down_write(&listlock);
+	list_add(&m->list, &my_mmu_list);
+	up_write(&listlock);
+
+	/*
+	 * RCU sync is expensive but necessary if we need to guarantee
+	 * that multiple threads running on other cpus have seen the
+	 * notifier changes.
+	 */
+	synchronize_rcu();
+	return 0;
+}
+
+
+static void my_sleeping_invalidate_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	spin_lock(&m->lock);	/* Only taken to ensure mmu data integrity */
+	my_mmu_zap_page(m, address);
+	spin_unlock(&m->lock);
+	printk(KERN_INFO "Sleeping invalidate_page %p address=%lx\n",
+		m, address);
+}
+
+static unsigned long my_mmu_find_addr(struct my_mmu *m, struct page *page)
+{
+	/* Determine the address of a page in a mmu segment */
+	return -EFAULT;
+}
+
+/*
+ * A reference must be held on the page passed and the page passed
+ * must be locked. No spinlocks are held. invalidate_page() is held
+ * off by us holding the page lock.
+ */
+static void my_mmu_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
+	struct page *page)
+{
+	struct my_mmu *m;
+
+	BUG_ON(!PageLocked(page));
+	down_read(&listlock);
+	list_for_each_entry(m, &my_mmu_list, list) {
+		unsigned long address = my_mmu_find_addr(m, page);
+
+		if (address != -EFAULT)
+			my_sleeping_invalidate_page(m, address);
+	}
+	up_read(&listlock);
+}
+
+static struct mmu_rmap_notifier_ops my_mmu_rmap_ops = {
+	.invalidate_page = my_mmu_rmap_invalidate_page
+};
+
+static struct mmu_rmap_notifier my_mmu_rmap_notifier = {
+	.ops = &my_mmu_rmap_ops
+};
+
+static int __init my_mmu_init(void)
+{
+	mmu_rmap_notifier_register(&my_mmu_rmap_notifier);
+	return 0;
+}
+
+late_initcall(my_mmu_init);

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Andrew Morton @ 2008-02-08 22:23 UTC
To: Christoph Lameter
Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi,
    kvm-devel, daniel.blueman, holt, general

On Fri, 08 Feb 2008 14:06:16 -0800 Christoph Lameter <clameter@sgi.com> wrote:

> This is a patchset implementing MMU notifier callbacks based on Andrea's
> earlier work. These are needed if Linux pages are referenced by something
> other than what is tracked by the kernel's rmaps (an external MMU). MMU
> notifiers allow us to get rid of page pinning for RDMA and various
> other purposes. They also get rid of the broken use of mlock for page
> pinning. (mlock really does *not* pin pages....)
>
> More information on the rationale and the technical details can be found in
> the first patch and the README provided by that patch in
> Documentation/mmu_notifiers.
>
> The known immediate users are:
>
> KVM
> - Establishes a refcount on the page via get_user_pages().
> - External references are called sptes.
> - Has page tables to track pages whose refcount was elevated, but
>   no reverse maps.
>
> GRU
> - Simple additional hardware TLB (possibly covering multiple instances
>   of Linux).
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines the page address via follow_page() (from interrupt context)
>   but can fall back to get_user_pages().
> - No page reference is possible since no page status is kept.
>
> XPmem
> - Allows use of a process's memory by remote instances of Linux.
> - Provides its own reverse mappings to track remote ptes.
> - Establishes refcounts on the exported pages.
> - Must sleep in order to wait for remote acks of ptes that are being
>   cleared.

What about ib_umem_get()?
* Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-08 23:32 UTC
To: Andrew Morton
Cc: andrea-atKUWr5tajBWk0Htik3J/w, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg, steiner-sJ/iWh9BUns,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, avi-atKUWr5tajBWk0Htik3J/w,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
    daniel.blueman-xqY44rlHlBpWk0Htik3J/w, holt-sJ/iWh9BUns,
    general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5

On Fri, 8 Feb 2008, Andrew Morton wrote:

> What about ib_umem_get()?

Ok. It pins using an elevated refcount, same as XPmem right now. With that
we effectively pin a page (page migration will fail) but we will
continually be reclaiming the page and may repeatedly try to move it. We
have issues with XPmem causing too many pages to be pinned and thus the
OOM handling getting into weird behavior modes (OOM or stopped lru
scanning due to all_unreclaimable being set).

An elevated refcount will also not be noticed by any of the schemes under
consideration to improve LRU scanning performance.
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Robin Holt @ 2008-02-08 23:36 UTC
To: Christoph Lameter
Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi,
    kvm-devel, daniel.blueman, holt, general, Andrew Morton

On Fri, Feb 08, 2008 at 03:32:19PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Andrew Morton wrote:
>
> > What about ib_umem_get()?
>
> Ok. It pins using an elevated refcount, same as XPmem right now. With that
> we effectively pin a page (page migration will fail) but we will
> continually be reclaiming the page and may repeatedly try to move it. We
> have issues with XPmem causing too many pages to be pinned and thus the
> OOM handling getting into weird behavior modes (OOM or stopped lru
> scanning due to all_unreclaimable being set).
>
> An elevated refcount will also not be noticed by any of the schemes under
> consideration to improve LRU scanning performance.

Christoph, I am not sure what you are saying here. With v4 and later,
I thought we were able to use the rmap invalidation to remove the ref
count that XPMEM was holding and therefore be able to swap out. Did I miss
something? I agree the existing XPMEM does pin. I hope we are not saying
that XPMEM based upon these patches will not be able to swap/migrate.

Thanks, Robin
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-08 23:41 UTC
To: Robin Holt
Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi,
    kvm-devel, daniel.blueman, general, Andrew Morton

On Fri, 8 Feb 2008, Robin Holt wrote:

> > > What about ib_umem_get()?
> >
> > Ok. It pins using an elevated refcount, same as XPmem right now. With that
> > we effectively pin a page (page migration will fail) but we will
> > continually be reclaiming the page and may repeatedly try to move it. We
> > have issues with XPmem causing too many pages to be pinned and thus the
> > OOM handling getting into weird behavior modes (OOM or stopped lru
> > scanning due to all_unreclaimable being set).
> >
> > An elevated refcount will also not be noticed by any of the schemes under
> > consideration to improve LRU scanning performance.
>
> Christoph, I am not sure what you are saying here. With v4 and later,
> I thought we were able to use the rmap invalidation to remove the ref
> count that XPMEM was holding and therefore be able to swap out. Did I miss
> something? I agree the existing XPMEM does pin. I hope we are not saying
> that XPMEM based upon these patches will not be able to swap/migrate.

Correct.

You missed the turn of the conversation to how ib_umem_get() works.
Currently it seems to pin the same way that the SLES10 XPmem works.
* Re: [patch 0/6] MMU Notifiers V6
From: Robin Holt @ 2008-02-08 23:43 UTC
To: Christoph Lameter
Cc: andrea-atKUWr5tajBWk0Htik3J/w, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg, steiner-sJ/iWh9BUns,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, avi-atKUWr5tajBWk0Htik3J/w,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
    daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt,
    general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Andrew Morton

On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Robin Holt wrote:
>
> > > > What about ib_umem_get()?
>
> Correct.
>
> You missed the turn of the conversation to how ib_umem_get() works.
> Currently it seems to pin the same way that the SLES10 XPmem works.

Ah. I took Andrew's question as more of a probe about whether we had
worked with the IB folks to ensure this fits the ib_umem_get() needs
as well.

Thanks, Robin
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Andrew Morton @ 2008-02-08 23:56 UTC
To: Robin Holt
Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi,
    kvm-devel, daniel.blueman, general, Christoph Lameter

On Fri, 8 Feb 2008 17:43:02 -0600 Robin Holt <holt@sgi.com> wrote:

> On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> > On Fri, 8 Feb 2008, Robin Holt wrote:
> >
> > > > > What about ib_umem_get()?
> >
> > Correct.
> >
> > You missed the turn of the conversation to how ib_umem_get() works.
> > Currently it seems to pin the same way that the SLES10 XPmem works.
>
> Ah. I took Andrew's question as more of a probe about whether we had
> worked with the IB folks to ensure this fits the ib_umem_get() needs
> as well.

You took it correctly, and I didn't understand the answer ;)
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 0:05 UTC
To: Andrew Morton
Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi,
    kvm-devel, daniel.blueman, Robin Holt, general

On Fri, 8 Feb 2008, Andrew Morton wrote:

> You took it correctly, and I didn't understand the answer ;)

We have done several rounds of discussion on linux-kernel about this so
far and the IB folks have not shown up to join in. I have tried to make
this as general as possible.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Roland Dreier @ 2008-02-09 0:12 UTC
To: Christoph Lameter
Cc: andrea, a.p.zijlstra, izike, steiner, linux-kernel, avi, linux-mm,
    daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

> We have done several rounds of discussion on linux-kernel about this so
> far and the IB folks have not shown up to join in. I have tried to make
> this as general as possible.

Sorry, this has been on my "things to look at" list for a while, but I
haven't gotten a chance to really understand where things are yet.

In general, this MMU notifier stuff will only be useful to a subset of
InfiniBand/RDMA hardware. Some adapters are smart enough to handle
changing the IO virtual -> bus/physical mapping on the fly, but some
aren't. For the dumb adapters, I think the current ib_umem_get() is
pretty close to as good as we can get: we have to keep the physical
pages pinned for as long as the adapter is allowed to DMA into the
memory region.

For the smart adapters, we just need a chance to change the adapter's
page table when the kernel/CPU's mapping changes, and naively, this
stuff looks like it would work.

Andrew, does that help?

 - R.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 0:16 UTC
To: Roland Dreier
Cc: andrea, a.p.zijlstra, izike, steiner, linux-kernel, avi, linux-mm,
    daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Fri, 8 Feb 2008, Roland Dreier wrote:

> In general, this MMU notifier stuff will only be useful to a subset of
> InfiniBand/RDMA hardware. Some adapters are smart enough to handle
> changing the IO virtual -> bus/physical mapping on the fly, but some
> aren't. For the dumb adapters, I think the current ib_umem_get() is
> pretty close to as good as we can get: we have to keep the physical
> pages pinned for as long as the adapter is allowed to DMA into the
> memory region.

I thought the adapter can always remove the mapping by renegotiating
with the remote side? Even if it's dumb, a callback could notify the
driver that it may be required to tear down the mapping. We then hold
the pages until we get the okay from the driver that the mapping has
been removed.

We could also let the unmapping fail if the driver indicates that the
mapping must stay.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Roland Dreier @ 2008-02-09 0:22 UTC
To: Christoph Lameter
Cc: andrea, a.p.zijlstra, izike, steiner, linux-kernel, avi, linux-mm,
    daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

> I thought the adapter can always remove the mapping by renegotiating
> with the remote side? Even if it's dumb, a callback could notify the
> driver that it may be required to tear down the mapping. We then hold
> the pages until we get the okay from the driver that the mapping has
> been removed.

Of course we can always destroy the memory region, but that would break
the semantics that applications expect. Basically an application can
register some chunk of its memory and get a key that it can pass to a
remote peer to let the remote peer operate on its memory via RDMA. And
that memory region/key is expected to stay valid until there is an
application-level operation to destroy it (or until the app crashes or
gets killed, etc).

> We could also let the unmapping fail if the driver indicates that the
> mapping must stay.

That would of course work -- dumb adapters would just always fail,
which might be inefficient.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 0:36 UTC
To: Roland Dreier
Cc: andrea-atKUWr5tajBWk0Htik3J/w, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw,
    steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    avi-atKUWr5tajBWk0Htik3J/w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt,
    general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Andrew Morton,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 8 Feb 2008, Roland Dreier wrote:

> That would of course work -- dumb adapters would just always fail,
> which might be inefficient.

Hmmmm... that means we need something that actually pins pages for good
so that the VM can avoid reclaiming them and so that page migration can
avoid trying to migrate them. Something like yet another page flag.

Ccing Rik.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Andrea Arcangeli @ 2008-02-09 1:24 UTC
To: Christoph Lameter
Cc: Rik van Riel, a.p.zijlstra, izike, Roland Dreier, steiner,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

On Fri, Feb 08, 2008 at 04:36:16PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Roland Dreier wrote:
>
> > That would of course work -- dumb adapters would just always fail,
> > which might be inefficient.
>
> Hmmmm... that means we need something that actually pins pages for good
> so that the VM can avoid reclaiming them and so that page migration can
> avoid trying to migrate them. Something like yet another page flag.

What's wrong with pinning with the page count like now? Dumb adapters
would simply not register themselves in the mmu notifier list, no?

> Ccing Rik.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 1:27 UTC
To: Andrea Arcangeli
Cc: Rik van Riel, a.p.zijlstra, izike, Roland Dreier, steiner,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> > Hmmmm... that means we need something that actually pins pages for good
> > so that the VM can avoid reclaiming them and so that page migration can
> > avoid trying to migrate them. Something like yet another page flag.
>
> What's wrong with pinning with the page count like now? Dumb adapters
> would simply not register themselves in the mmu notifier list, no?

Pages will still be on the LRU and cycle through rmap again and again.
If page migration is used on those pages then the code may make repeated
attempts to migrate the page, thinking that the page count must at some
point drop.

I do not think that the page count was intended to be used to pin pages
permanently. If we had a marker on such pages then we could take them off
the LRU and not try to migrate them.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Andrea Arcangeli @ 2008-02-09 1:56 UTC
To: Christoph Lameter
Cc: a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Roland Dreier,
    steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    avi-atKUWr5tajBWk0Htik3J/w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt,
    general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Andrew Morton,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, Feb 08, 2008 at 05:27:03PM -0800, Christoph Lameter wrote:
> Pages will still be on the LRU and cycle through rmap again and again.
> If page migration is used on those pages then the code may make repeated
> attempts to migrate the page, thinking that the page count must at some
> point drop.
>
> I do not think that the page count was intended to be used to pin pages
> permanently. If we had a marker on such pages then we could take them off
> the LRU and not try to migrate them.

The VM shouldn't break if try_to_unmap doesn't actually make the page
freeable for whatever reason. Permanent pins shouldn't happen anyway, so
defining an ad-hoc API for that doesn't sound too appealing. Not sure if
old hardware deserves those special lru-size-reduction optimizations,
but it's not my call (certainly swapoff/mlock would get higher priority
in that lru-size-reduction area).
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 2:16 UTC
To: Andrea Arcangeli
Cc: Rik van Riel, a.p.zijlstra, izike, Roland Dreier, steiner,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> The VM shouldn't break if try_to_unmap doesn't actually make the page
> freeable for whatever reason. Permanent pins shouldn't happen anyway,

The VM is livelocking if too many pages are pinned that way right now.
The higher the number of processors per node, the higher the risk of
livelock, because more processors are in the process of cycling through
pages that have an elevated refcount.

> so defining an ad-hoc API for that doesn't sound too appealing. Not
> sure if old hardware deserves those special lru-size-reduction
> optimizations, but it's not my call (certainly swapoff/mlock would get
> higher priority in that lru-size-reduction area).

Rik has a patchset under development that addresses issues like this.
The elevated refcount pin problem is not really relevant to the patchset
we are discussing here.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Rik van Riel @ 2008-02-09 12:55 UTC
To: Christoph Lameter
Cc: Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

On Fri, 8 Feb 2008 18:16:16 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote:

> On Sat, 9 Feb 2008, Andrea Arcangeli wrote:
>
> > The VM shouldn't break if try_to_unmap doesn't actually make the page
> > freeable for whatever reason. Permanent pins shouldn't happen anyway,
>
> The VM is livelocking if too many pages are pinned that way right now.
> Rik has a patchset under development that addresses issues like this

PG_mlock is on the way and can easily be reused for this, too.

-- 
All rights reversed.
* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
From: Christoph Lameter @ 2008-02-09 21:46 UTC
To: Rik van Riel
Cc: Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

On Sat, 9 Feb 2008, Rik van Riel wrote:

> PG_mlock is on the way and can easily be reused for this, too.

Note that a pinned page is different from an mlocked page. An mlocked
page can be moved through page migration and/or memory hotplug. A pinned
page must make both fail.
* [ofa-general] Demand paging for memory regions (was Re: MMU Notifiers V6)
From: Roland Dreier @ 2008-02-11 22:40 UTC
To: general
Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, steiner,
    linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt,
    general, Andrew Morton, Christoph Lameter

[Adding general@lists.openfabrics.org to get the IB/RDMA people involved]

This thread has patches that add support for notifying drivers when a
process's memory map changes. The hope is that this is useful for
letting RDMA devices handle registered memory without pinning the
underlying pages, by updating the RDMA device's translation tables
whenever the host kernel's tables change.

Is anyone interested in working on using this for drivers/infiniband?
I am interested in participating, but I don't think I have enough time
to do this by myself.

Also, at least naively it seems that this is only useful for hardware
that has support for this type of demand paging, and can handle
not-present pages, generating interrupts for page faults, etc. I know
that Mellanox HCAs should have this support; are there any other
devices that can do this?

The beginning of this thread is at <http://lkml.org/lkml/2008/2/8/458>.

 - R.
* [ofa-general] Re: Demand paging for memory regions (was Re: MMU Notifiers V6)
From: Steve Wise @ 2008-02-12 22:01 UTC
To: Roland Dreier
Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, steiner,
    linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt,
    general, Andrew Morton, Christoph Lameter

Roland Dreier wrote:
> [Adding general@lists.openfabrics.org to get the IB/RDMA people involved]
>
> This thread has patches that add support for notifying drivers when a
> process's memory map changes. The hope is that this is useful for
> letting RDMA devices handle registered memory without pinning the
> underlying pages, by updating the RDMA device's translation tables
> whenever the host kernel's tables change.
>
> Is anyone interested in working on using this for drivers/infiniband?
> I am interested in participating, but I don't think I have enough time
> to do this by myself.

I don't have time, although it would be interesting work!

> Also, at least naively it seems that this is only useful for hardware
> that has support for this type of demand paging, and can handle
> not-present pages, generating interrupts for page faults, etc. I know
> that Mellanox HCAs should have this support; are there any other
> devices that can do this?

Chelsio's T3 HW doesn't support this.

Steve.
* [ofa-general] Re: Demand paging for memory regions (was Re: MMU Notifiers V6)
From: Christoph Lameter @ 2008-02-12 22:10 UTC
To: Steve Wise
Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier,
    steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt,
    general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Steve Wise wrote:

> Chelsio's T3 HW doesn't support this.

Not so far I guess, but it could be equipped with these features, right?

Having the VM manage the memory area for Infiniband allows more reliable
system operation and enables the sharing of large memory areas via
Infiniband without the risk of livelocks or OOMs.
* Re: [ofa-general] Re: Demand paging for memory regions
From: Roland Dreier @ 2008-02-12 22:41 UTC
To: Christoph Lameter
Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike,
    linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
    Andrew Morton, kvm-devel

> > Chelsio's T3 HW doesn't support this.

> Not so far I guess, but it could be equipped with these features, right?

I don't know anything about the T3 internals, but it's not clear that
you could do this without a new chip design in general. Lots of RDMA
devices were designed expecting that when a packet arrives, the HW can
look up the bus address for a given memory region/offset and place the
packet immediately. It seems like a major change to be able to
generate a "page fault" interrupt when a page isn't present, or even
just wait to scatter some data until the host finishes updating page
tables when the HW needs the translation.

 - R.
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 22:41 ` [ofa-general] Re: Demand paging for memory regions Roland Dreier @ 2008-02-12 23:14 ` Felix Marti 2008-02-13 0:57 ` Christoph Lameter 2008-02-14 15:09 ` Steve Wise 2008-02-12 23:23 ` Jason Gunthorpe ` (2 subsequent siblings) 3 siblings, 2 replies; 150+ messages in thread From: Felix Marti @ 2008-02-12 23:14 UTC (permalink / raw) To: Roland Dreier, Christoph Lameter Cc: Andrea Arcangeli, a.p.zijlstra, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel > -----Original Message----- > From: general-bounces@lists.openfabrics.org [mailto:general- > bounces@lists.openfabrics.org] On Behalf Of Roland Dreier > Sent: Tuesday, February 12, 2008 2:42 PM > To: Christoph Lameter > Cc: Rik van Riel; steiner@sgi.com; Andrea Arcangeli; > a.p.zijlstra@chello.nl; izike@qumranet.com; linux- > kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org; > daniel.blueman@quadrics.com; Robin Holt; general@lists.openfabrics.org; > Andrew Morton; kvm-devel@lists.sourceforge.net > Subject: Re: [ofa-general] Re: Demand paging for memory regions > > > > Chelsio's T3 HW doesn't support this. > > > Not so far I guess but it could be equipped with these features > right? > > I don't know anything about the T3 internals, but it's not clear that > you could do this without a new chip design in general. Lot's of RDMA > devices were designed expecting that when a packet arrives, the HW can > look up the bus address for a given memory region/offset and place the > packet immediately. It seems like a major change to be able to > generate a "page fault" interrupt when a page isn't present, or even > just wait to scatter some data until the host finishes updating page > tables when the HW needs the translation. That is correct, not a change we can make for T3. We could, in theory, deal with changing mappings though. 
The change would need to be synchronized though: the VM would need to tell us which mappings were about to change and the driver would then need to disable DMA to/from it, do the change and resume DMA. > > - R. > > _______________________________________________ > general mailing list > general@lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- general ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-12 23:14 ` Felix Marti @ 2008-02-13 0:57 ` Christoph Lameter 2008-02-14 15:09 ` Steve Wise 1 sibling, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 0:57 UTC (permalink / raw) To: Felix Marti Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Felix Marti wrote: > > I don't know anything about the T3 internals, but it's not clear that > > you could do this without a new chip design in general. Lot's of RDMA > > devices were designed expecting that when a packet arrives, the HW can > > look up the bus address for a given memory region/offset and place the > > packet immediately. It seems like a major change to be able to > > generate a "page fault" interrupt when a page isn't present, or even > > just wait to scatter some data until the host finishes updating page > > tables when the HW needs the translation. > > That is correct, not a change we can make for T3. We could, in theory, > deal with changing mappings though. The change would need to be > synchronized though: the VM would need to tell us which mapping were > about to change and the driver would then need to disable DMA to/from > it, do the change and resume DMA. Right. That is the intent of the patchset. ^ permalink raw reply [flat|nested] 150+ messages in thread
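The disable-DMA / change-mapping / resume handshake that Felix describes, and that the patchset's notifier callbacks are meant to drive, can be sketched in miniature. This is a simplified userspace model only, not the actual driver or mmu_notifier API; every name in it is invented for illustration:

```c
#include <stdbool.h>

/* Hypothetical driver-side state; the real T3 driver is not
 * structured like this.  The point is only the ordering:
 * quiesce DMA, let the VM change the mapping, resume. */
struct demo_mr {
    unsigned long pfn;   /* current page backing this memory region */
    bool dma_enabled;    /* whether the HW may DMA to/from the MR */
};

/* Callback the VM would invoke before a mapping change. */
static void demo_invalidate_begin(struct demo_mr *mr)
{
    mr->dma_enabled = false;             /* disable DMA to/from it */
}

/* Callback invoked once the mapping change is complete. */
static void demo_invalidate_end(struct demo_mr *mr, unsigned long new_pfn)
{
    mr->pfn = new_pfn;                   /* pick up the new translation */
    mr->dma_enabled = true;              /* resume DMA */
}

/* The whole handshake for one page move; returns 1 on success. */
int demo_remap(struct demo_mr *mr, unsigned long new_pfn)
{
    demo_invalidate_begin(mr);
    /* ...the VM migrates the page while DMA is quiesced... */
    demo_invalidate_end(mr, new_pfn);
    return mr->dma_enabled && mr->pfn == new_pfn;
}
```

The two callbacks bracket the VM's work, which is the same begin/end shape the patchset uses so that the device never sees a stale translation mid-change.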
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 23:14 ` Felix Marti 2008-02-13 0:57 ` Christoph Lameter @ 2008-02-14 15:09 ` Steve Wise 2008-02-14 15:53 ` Robin Holt 2008-02-14 19:39 ` Christoph Lameter 1 sibling, 2 replies; 150+ messages in thread From: Steve Wise @ 2008-02-14 15:09 UTC (permalink / raw) To: Felix Marti Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter Felix Marti wrote: > > That is correct, not a change we can make for T3. We could, in theory, > deal with changing mappings though. The change would need to be > synchronized though: the VM would need to tell us which mapping were > about to change and the driver would then need to disable DMA to/from > it, do the change and resume DMA. > Note that for T3, this involves suspending _all_ rdma connections that are in the same PD as the MR being remapped. This is because the driver doesn't know who the application advertised the rkey/stag to. So without that knowledge, all connections that _might_ rdma into the MR must be suspended. If the MR was only setup for local access, then the driver could track the connections with references to the MR and only quiesce those connections. Point being, it will stop probably all connections that an application is using (assuming the application uses a single PD). Steve. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 15:09 ` Steve Wise @ 2008-02-14 15:53 ` Robin Holt 2008-02-14 16:23 ` Steve Wise 2008-02-14 19:39 ` Christoph Lameter 1 sibling, 1 reply; 150+ messages in thread From: Robin Holt @ 2008-02-14 15:53 UTC (permalink / raw) To: Steve Wise Cc: Felix Marti, Roland Dreier, Christoph Lameter, Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: > Note that for T3, this involves suspending _all_ rdma connections that are > in the same PD as the MR being remapped. This is because the driver > doesn't know who the application advertised the rkey/stag to. So without Is there a reason the driver cannot track these? > Point being, it will stop probably all connections that an application is > using (assuming the application uses a single PD). It seems like the need to not stop all would be a compelling enough reason to modify the driver to track which processes have received the rkey/stag. Thanks, Robin ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 15:53 ` Robin Holt @ 2008-02-14 16:23 ` Steve Wise 2008-02-14 17:48 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Steve Wise @ 2008-02-14 16:23 UTC (permalink / raw) To: Robin Holt Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, general, Andrew Morton, Christoph Lameter Robin Holt wrote: > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: >> Note that for T3, this involves suspending _all_ rdma connections that are >> in the same PD as the MR being remapped. This is because the driver >> doesn't know who the application advertised the rkey/stag to. So without > > Is there a reason the driver can not track these. > Because advertising of an MR (i.e., telling the peer about your rkey/stag, offset and length) is application-specific and can be done out of band, or in band as simple SEND/RECV payload. Either way, the driver has no way of tracking this because the protocol used is application-specific. >> Point being, it will stop probably all connections that an application is >> using (assuming the application uses a single PD). > > It seems like the need to not stop all would be a compelling enough reason > to modify the driver to track which processes have received the rkey/stag. > Yes, _if_ the driver could track this. And _if_ the rdma API and paradigm was such that the kernel/driver could keep track, then remote revocations of MR tags could be supported. Stevo ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 16:23 ` Steve Wise @ 2008-02-14 17:48 ` Caitlin Bestler 0 siblings, 0 replies; 150+ messages in thread From: Caitlin Bestler @ 2008-02-14 17:48 UTC (permalink / raw) To: Steve Wise Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, linux-mm, izike, Roland Dreier, steiner, linux-kernel, avi, kvm-devel, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter On Thu, Feb 14, 2008 at 8:23 AM, Steve Wise <swise@opengridcomputing.com> wrote: > Robin Holt wrote: > > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: > >> Note that for T3, this involves suspending _all_ rdma connections that are > >> in the same PD as the MR being remapped. This is because the driver > >> doesn't know who the application advertised the rkey/stag to. So without > > > > Is there a reason the driver can not track these. > > > > Because advertising of a MR (ie telling the peer about your rkey/stag, > offset and length) is application-specific and can be done out of band, > or in band as simple SEND/RECV payload. Either way, the driver has no > way of tracking this because the protocol used is application-specific. > > I fully agree. If there is one important thing about RDMA and other fastpath solutions that must be understood is that the driver does not see the payload. This is a fundamental strength, but it means that you have to identify what if any intercept points there are in advance. You also raise a good point on the scope of any suspend/resume API. Device reporting of this capability would not be a simple boolean, but more of a suspend/resume scope. A minimal scope would be any connection that actually attempts to use the suspended MR. Slightly wider would be any connection *allowed* to use the MR, which could expand all the way to any connection under the same PD. 
Conceivably I could imagine an RDMA device reporting that it could support suspend/resume, but only at the scope of the entire device. But even at such a wide scope, suspend/resume could be useful to a Memory Manager. The pages could be fully migrated to the new location, and the only work that was still required during the critical suspend/resume region was to actually shift to the new map. That might be short enough that not accepting *any* incoming RDMA packet would be acceptable. And if the goal is to replace a memory card the alternative might be migrating the applications to other physical servers, which would mean a much longer period of not accepting incoming RDMA packets. But the broader question is what the goal is here. Allowing memory to be shuffled is valuable, and perhaps even ultimately a requirement for high availability systems. RDMA and other direct-access APIs should be evolving their interfaces to accommodate these needs. Oversubscribing memory is a totally different matter. If an application is working with memory that is oversubscribed by a factor of 2 or more, can it really benefit from zero-copy direct placement? At first glance I can't see what RDMA could be bringing of value when the overhead of swapping is going to be that large. If it really does make sense, then explicitly registering the portion of memory that should be enabled to receive incoming traffic while the application is swapped out actually makes sense. Current Memory Registration methods force applications to either register too much or too often. They register too much when the cost of registration is high, and the application responds by registering its entire buffer pool permanently. This is a problem when it overstates the amount of memory that the application needs to have resident, or when the device imposes limits on the size of memory maps that it can know. The alternative is to register too often, that is on a per-operation basis. 
To me that suggests the solutions lie in making it more reasonable to register more memory, or in making it practical to register memory on-the-fly on a per-operation basis with low enough overhead that applications don't feel the need to build elaborate registration caching schemes. As has been pointed out a few times in this thread, the RDMA and transport layers simply do not have enough information to know which portion of registered memory *really* had to be registered. So any back-pressure scheme where the Memory Manager is asking for pinned memory to be "given back" would have to go all the way to the application. Only the application knows what it is "really" using. I also suspect that most applications that are interested in using RDMA would rather be told they can allocate 200M indefinitely (and with real memory backing it) than be given 1GB of virtual memory that is backed by 200-300M of physical memory, especially if it meant dealing with memory pressure upcalls. > >> Point being, it will stop probably all connections that an application is > >> using (assuming the application uses a single PD). > > > > It seems like the need to not stop all would be a compelling enough reason > > to modify the driver to track which processes have received the rkey/stag. > > > > Yes, _if_ the driver could track this. > > And _if_ the rdma API and paradigm was such that the kernel/driver could > keep track, then remote revokations of MR tags could be supported. > > Stevo ^ permalink raw reply [flat|nested] 150+ messages in thread
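The "register too much vs. register too often" trade-off Caitlin describes is what drives the registration caches found in RDMA-using middleware: pay the expensive registration once, then reuse it for later operations on the same buffer. A toy sketch of such a cache follows; all names are invented, the registration call is a stand-in for the real verbs API, and eviction is omitted:

```c
#include <stdlib.h>

/* One cached registration.  Illustrative only; a real cache (as in
 * MPI stacks) must also handle eviction and buffers being freed. */
struct cached_mr {
    void *addr;
    size_t len;
    int refcount;            /* in-flight operations using this MR */
    struct cached_mr *next;
};

static struct cached_mr *mr_cache;
static int registrations;    /* how often we paid the registration cost */

/* Stand-in for the (expensive) device registration call. */
static struct cached_mr *register_mr(void *addr, size_t len)
{
    struct cached_mr *mr = calloc(1, sizeof(*mr));
    mr->addr = addr;
    mr->len = len;
    registrations++;
    return mr;
}

/* Per-operation lookup: reuse any cached MR covering [addr, addr+len). */
struct cached_mr *get_mr(void *addr, size_t len)
{
    for (struct cached_mr *mr = mr_cache; mr; mr = mr->next)
        if ((char *)addr >= (char *)mr->addr &&
            (char *)addr + len <= (char *)mr->addr + mr->len) {
            mr->refcount++;          /* cache hit: no new registration */
            return mr;
        }
    struct cached_mr *mr = register_mr(addr, len);
    mr->refcount = 1;
    mr->next = mr_cache;
    mr_cache = mr;
    return mr;
}
```

The cache is exactly the "elaborate registration caching scheme" applications build today; the point of the discussion above is that cheaper registration or on-the-fly registration would make it unnecessary.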
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 15:09 ` Steve Wise 2008-02-14 15:53 ` Robin Holt @ 2008-02-14 19:39 ` Christoph Lameter 2008-02-14 20:17 ` Caitlin Bestler 1 sibling, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-14 19:39 UTC (permalink / raw) To: Steve Wise Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Thu, 14 Feb 2008, Steve Wise wrote: > Note that for T3, this involves suspending _all_ rdma connections that are in > the same PD as the MR being remapped. This is because the driver doesn't know > who the application advertised the rkey/stag to. So without that knowledge, > all connections that _might_ rdma into the MR must be suspended. If the MR > was only setup for local access, then the driver could track the connections > with references to the MR and only quiesce those connections. > > Point being, it will stop probably all connections that an application is > using (assuming the application uses a single PD). Right but if the system starts reclaiming pages of the application then we have a memory shortage. So the user should address that by not running other apps concurrently. The stopping of all connections is still better than the VM getting into major trouble. And the stopping of connections in order to move the process memory into a more advantageous memory location (f.e. using page migration) or stopping of connections in order to be able to move the process memory out of a range of failing memory is certainly good. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 19:39 ` Christoph Lameter @ 2008-02-14 20:17 ` Caitlin Bestler 2008-02-14 20:20 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-14 20:17 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Thu, Feb 14, 2008 at 11:39 AM, Christoph Lameter <clameter@sgi.com> wrote: > On Thu, 14 Feb 2008, Steve Wise wrote: > > > Note that for T3, this involves suspending _all_ rdma connections that are in > > the same PD as the MR being remapped. This is because the driver doesn't know > > who the application advertised the rkey/stag to. So without that knowledge, > > all connections that _might_ rdma into the MR must be suspended. If the MR > > was only setup for local access, then the driver could track the connections > > with references to the MR and only quiesce those connections. > > > > Point being, it will stop probably all connections that an application is > > using (assuming the application uses a single PD). > > Right but if the system starts reclaiming pages of the application then we > have a memory shortage. So the user should address that by not running > other apps concurrently. The stopping of all connections is still better > than the VM getting into major trouble. And the stopping of connections in > order to move the process memory into a more advantageous memory location > (f.e. using page migration) or stopping of connections in order to be able > to move the process memory out of a range of failing memory is certainly > good. 
> In that spirit, there are two important aspects of a suspend/resume API that would enable the memory manager to solve problems most effectively: 1) The device should be allowed flexibility to extend the scope of the suspend to what it is capable of implementing -- rather than being forced to say that it does not support suspend/resume merely because it does so at a different granularity. 2) It is very important that users of this API understand that it is only the RDMA device handling of incoming packets and WQEs that is being suspended. The peers are not suspended by this API, or even told that this end is suspending. Unless the suspend is kept *extremely* short there will be adverse impacts. And "short" here is measured in network terms, not human terms. The blink of an eye is *way* too long. Any external dependencies between "suspend" and "resume" will probably mean that things will not work, especially if the external entities involve a disk drive. So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover swapping out pages so they can be reallocated is an exercise in futility. By the time you resume the connections will be broken or at the minimum damaged. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 20:17 ` Caitlin Bestler @ 2008-02-14 20:20 ` Christoph Lameter 2008-02-14 22:43 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-14 20:20 UTC (permalink / raw) To: Caitlin Bestler Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Thu, 14 Feb 2008, Caitlin Bestler wrote: > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover > swapping out pages so they can be reallocated is an exercise in futility. By the > time you resume the connections will be broken or at the minimum damaged. The connections would then have to be torn down before swap out and would have to be reestablished after the pages have been brought back from swap. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 20:20 ` Christoph Lameter @ 2008-02-14 22:43 ` Caitlin Bestler 2008-02-14 22:48 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-14 22:43 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Thu, Feb 14, 2008 at 12:20 PM, Christoph Lameter <clameter@sgi.com> wrote: > On Thu, 14 Feb 2008, Caitlin Bestler wrote: > > > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover > > swapping out pages so they can be reallocated is an exercise in futility. By the > > time you resume the connections will be broken or at the minimum damaged. > > The connections would then have to be torn down before swap out and would > have to be reestablished after the pages have been brought back from swap. > > I have no problem with that, as long as the application layer is responsible for tearing down and re-establishing the connections. The RDMA/transport layers are incapable of tearing down and re-establishing a connection transparently because connections need to be approved above the RDMA layer. Further, the teardown will have visible artifacts that the application must deal with, such as flushed Recv WQEs. This is still "the RDMA device will do X and will not worry about Y". The reasons for not worrying about Y could be that the suspend will be very short, or that other mechanisms have taken care of all the Ys independently. For example, an HPC cluster that suspended the *entire* cluster would not have to worry about dropped packets. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 22:43 ` Caitlin Bestler @ 2008-02-14 22:48 ` Christoph Lameter 2008-02-15 1:26 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-14 22:48 UTC (permalink / raw) To: Caitlin Bestler; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Thu, 14 Feb 2008, Caitlin Bestler wrote: > I have no problem with that, as long as the application layer is responsible for > tearing down and re-establishing the connections. The RDMA/transport layers > are incapable of tearing down and re-establishing a connection transparently > because connections need to be approved above the RDMA layer. I am not that familiar with the RDMA layers but it seems that RDMA has a library that does device-driver-like things, right? So the logic would best fit in there I guess. If you combine mlock with the mmu notifier then you can actually guarantee that a certain memory range will not be swapped out. The notifier will then only be called if the memory range will need to be moved for page migration, memory unplug, etc. There may be a limit on the percentage of memory that you can mlock in the future. This may be done to guarantee that the VM still has memory to work with. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-14 22:48 ` Christoph Lameter @ 2008-02-15 1:26 ` Caitlin Bestler 2008-02-15 2:37 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-15 1:26 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi > -----Original Message----- > From: Christoph Lameter [mailto:clameter@sgi.com] > Sent: Thursday, February 14, 2008 2:49 PM > To: Caitlin Bestler > Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org; > general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net > Subject: Re: [ofa-general] Re: Demand paging for memory regions > > On Thu, 14 Feb 2008, Caitlin Bestler wrote: > > > I have no problem with that, as long as the application layer is > responsible for > > tearing down and re-establishing the connections. The RDMA/transport > layers > > are incapable of tearing down and re-establishing a connection > transparently > > because connections need to be approved above the RDMA layer. > > I am not that familiar with the RDMA layers but it seems that RDMA has > a library that does device driver like things right? So the logic would > best fit in there I guess. > > If you combine mlock with the mmu notifier then you can actually > guarantee that a certain memory range will not be swapped out. The > notifier will then only be called if the memory range will need to be > moved for page migration, memory unplug etc etc. There may be a limit > on > the percentage of memory that you can mlock in the future. This may be > done to guarantee that the VM still has memory to work with. > The problem is that with existing APIs, or even slightly modified APIs, the RDMA layer will not be able to figure out which connections need to be "interrupted" in order to deal with what memory suspensions. 
Further, because any request for a new connection will be handled by the remote *application layer* peer, there is no way for the two RDMA layers to agree to covertly tear down and re-establish the connection. Nor really should there be: connections should be approved by OS layer networking controls. RDMA should not be able to tell the network stack, "trust me, you don't have to check if this connection is legitimate". Another example: if you terminate a connection, pending receive operations complete *to the user* in a Completion Queue. Those completions are NOT seen by the RDMA layer, and especially not by the Connection Manager. It has absolutely no way to repost them transparently to the same connection when the connection is re-established. Even worse, some portions of a receive operation might have been placed in the receive buffer and acknowledged to the remote peer. But there is no mechanism to report this fact in the CQE. A receive operation that is aborted is aborted. There is no concept of partial success. Therefore you cannot covertly terminate a connection mid-operation and covertly re-establish it later. Data will be lost, it will no longer be a reliable connection, and therefore it needs to be torn down anyway. The RDMA layers also cannot tell the other side not to transmit. Flow control is the responsibility of the application layer, not RDMA. What the RDMA layer could do is this: once you tell it to suspend a given memory region, it can either tell you that it doesn't know how to do that or it can instruct the device to stop processing a set of connections that will cease all access for a given Memory Region. When you resume it can guarantee that it is no longer using any cached older mappings for the memory region (assuming it was capable of doing the suspend), and then because RDMA connections are reliable everything will recover unless the connection timed-out. 
The chance that it will time out is probably low, but the chance that the underlying connection will be in slow start or equivalent is much higher. So any solution that requires the upper layers to suspend operations for a brief bit will require explicit interaction with those layers. No RDMA layer can perform the sleight-of-hand tricks that you seem to want it to perform. At the RDMA layer the best you could get is very brief suspensions for the purpose of *re-arranging* memory, not of reducing the amount of registered memory. If you need to reduce the amount of registered memory then you have to talk to the application. Discussions on making it easier for the application to trim a memory region dynamically might be in order, but you will not work around the fact that the application layer needs to determine what pages are registered. And they would really prefer just to be told how much memory they can have up front; they can figure out how to deal with that amount of memory on their own. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 1:26 ` Caitlin Bestler @ 2008-02-15 2:37 ` Christoph Lameter 2008-02-15 18:09 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 2:37 UTC (permalink / raw) To: Caitlin Bestler; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Thu, 14 Feb 2008, Caitlin Bestler wrote: > So any solution that requires the upper layers to suspend operations > for a brief bit will require explicit interaction with those layers. > No RDMA layer can perform the sleight of hand tricks that you seem > to want it to perform. Looks like it has to be up there, right. > AT the RDMA layer the best you could get is very brief suspensions for > the purpose of *re-arranging* memory, not of reducing the amount of > registered memory. If you need to reduce the amount of registered memory > then you have to talk to the application. Discussions on making it > easier for the application to trim a memory region dynamically might be > in order, but you will not work around the fact that the application > layer needs to determine what pages are registered. And they would > really prefer just to be told how much memory they can have up front, > they can figure out how to deal with that amount of memory on their own. What does it mean that the "application layer has to determine what pages are registered"? The application does not know which of its pages are currently in memory. It can only force these pages to stay in memory if they are mlocked. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 2:37 ` Christoph Lameter @ 2008-02-15 18:09 ` Caitlin Bestler 2008-02-15 18:45 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-15 18:09 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi Christoph Lameter asked: > > What does it mean that the "application layer has to be determine what > pages are registered"? The application does not know which of its pages > are currently in memory. It can only force these pages to stay in > memory if their are mlocked. > An application that advertises an RDMA accessible buffer to a remote peer *does* have to know that its pages *are* currently in memory. The application does *not* need for the virtual-to-physical mapping of those pages to be frozen for the lifespan of the Memory Region. But it is issuing an invitation to its peer to perform direct writes to the advertised buffer. When the peer decides to exercise that invitation the pages have to be there. An analogy: when you write a check for $100 you do not have to identify the serial numbers of ten $10 bills, but you are expected to have the funds in your account. Issuing a buffer advertisement for memory you do not have is the network equivalent of writing a check that you do not have funds for. Now, just as your bank may offer overdraft protection, an RDMA device could merely report a page fault rather than tearing down the connection itself. But that does not grant permission for applications to advertise buffer space that they do not have committed, it merely helps recovery from a programming fault. A suspend/resume interface between the Virtual Memory Manager and the RDMA layer allows pages to be re-arranged at the convenience of the Virtual Memory Manager without breaking the application layer peer-to-peer contract. 
The current interfaces that pin exact pages are really the equivalent of having to tell the bank that when Joe cashes this $100 check that you should give him *these* ten $10 bills. It works, but it adds too much overhead and is very inflexible. So there are a lot of good reasons to evolve this interface to better deal with these issues. Other areas of possible evolution include allowing growing or trimming of Memory Regions without invalidating their advertised handles. But the more fundamental issue is recognizing that applications that use direct interfaces need to know that buffers that they enable truly have committed resources. They need a way to ask for twenty *real* pages, not twenty pages of address space. And they need to do it in a way that allows memory to be rearranged or even migrated with them to a new host. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 18:09 ` Caitlin Bestler @ 2008-02-15 18:45 ` Christoph Lameter 2008-02-15 18:53 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 18:45 UTC (permalink / raw) To: Caitlin Bestler; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > What does it mean that the "application layer has to be determine what > > pages are registered"? The application does not know which of its > pages > > are currently in memory. It can only force these pages to stay in > > memory if their are mlocked. > > > > An application that advertises an RDMA accessible buffer > to a remote peer *does* have to know that its pages *are* > currently in memory. Ok that would mean it needs to inform the VM of that issue by mlocking these pages. > But the more fundamental issue is recognizing that applications > that use direct interfaces need to know that buffers that they > enable truly have committed resources. They need a way to > ask for twenty *real* pages, not twenty pages of address > space. And they need to do it in a way that allows memory > to be rearranged or even migrated with them to a new host. mlock will force the pages to stay in memory without requiring the OS to keep them where they are. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 18:45 ` Christoph Lameter @ 2008-02-15 18:53 ` Caitlin Bestler 2008-02-15 20:02 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-15 18:53 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi > -----Original Message----- > From: Christoph Lameter [mailto:clameter@sgi.com] > Sent: Friday, February 15, 2008 10:46 AM > To: Caitlin Bestler > Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org; > general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net > Subject: RE: [ofa-general] Re: Demand paging for memory regions > > On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > > > What does it mean that the "application layer has to be determine > what > > > pages are registered"? The application does not know which of its > > pages > > > are currently in memory. It can only force these pages to stay in > > > memory if their are mlocked. > > > > > > > An application that advertises an RDMA accessible buffer > > to a remote peer *does* have to know that its pages *are* > > currently in memory. > > Ok that would mean it needs to inform the VM of that issue by mlocking > these pages. > > > But the more fundamental issue is recognizing that applications > > that use direct interfaces need to know that buffers that they > > enable truly have committed resources. They need a way to > > ask for twenty *real* pages, not twenty pages of address > > space. And they need to do it in a way that allows memory > > to be rearranged or even migrated with them to a new host. > > mlock will force the pages to stay in memory without requiring the OS > to keep them where they are. So that would mean that mlock is used by the application before it registers memory for direct access, and then it is up to the RDMA layer and the OS to negotiate actual pinning of the addresses for whatever duration is required. 
There is no *protocol* barrier to replacing pages within a Memory Region as long as it is done in a way that keeps the content of those pages coherent. But existing devices have their own ideas on how this is done and existing devices are notoriously poor at learning new tricks. Merely mlocking pages deals with the end-to-end RDMA semantics. What still needs to be addressed is how a fastpath interface would dynamically pin and unpin. Yielding pins for short-term suspensions (and flushing cached translations) deals with the rest. Understanding the range of support that existing devices could provide with software updates would be the next step if you wanted to pursue this. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 18:53 ` Caitlin Bestler @ 2008-02-15 20:02 ` Christoph Lameter 2008-02-15 20:14 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 20:02 UTC (permalink / raw) To: Caitlin Bestler; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Fri, 15 Feb 2008, Caitlin Bestler wrote: > So that would mean that mlock is used by the application before it > registers memory for direct access, and then it is up to the RDMA > layer and the OS to negotiate actual pinning of the addresses for > whatever duration is required. Right. > There is no *protocol* barrier to replacing pages within a Memory > Region as long as it is done in a way that keeps the content of > those page coherent. But existing devices have their own ideas > on how this is done and existing devices are notoriously poor at > learning new tricks. Hmmmm.. Okay. But that is mainly a device driver maintenance issue. > Merely mlocking pages deals with the end-to-end RDMA semantics. > What still needs to be addressed is how a fastpath interface > would dynamically pin and unpin. Yielding pins for short-term > suspensions (and flushing cached translations) deals with the > rest. Understanding the range of support that existing devices > could provide with software updates would be the next step if > you wanted to pursue this. That is addressed on the VM level by the mmu_notifier which started this whole thread. The RDMA layers need to subscribe to this notifier and then do whatever the hardware requires to unpin and pin memory. I can only go as far as dealing with the VM layer. If you have any issues there I'd be glad to help. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-15 20:02 ` Christoph Lameter @ 2008-02-15 20:14 ` Caitlin Bestler 2008-02-15 22:50 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Caitlin Bestler @ 2008-02-15 20:14 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi Christoph Lameter wrote > > > Merely mlocking pages deals with the end-to-end RDMA semantics. > > What still needs to be addressed is how a fastpath interface > > would dynamically pin and unpin. Yielding pins for short-term > > suspensions (and flushing cached translations) deals with the > > rest. Understanding the range of support that existing devices > > could provide with software updates would be the next step if > > you wanted to pursue this. > > That is addressed on the VM level by the mmu_notifier which started > this whole thread. The RDMA layers need to subscribe to this notifier > and then do whatever the hardware requires to unpin and pin memory. > I can only go as far as dealing with the VM layer. If you have any > issues there I'd be glad to help. There isn't much point in the RDMA layer subscribing to mmu notifications if the specific RDMA device will not be able to react appropriately when the notification occurs. I don't see how you get around needing to know which devices are capable of supporting page migration (via suspend/resume or other mechanisms) and which can only respond to a page migration by aborting connections. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 20:14 ` Caitlin Bestler @ 2008-02-15 22:50 ` Christoph Lameter 2008-02-15 23:50 ` Caitlin Bestler 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 22:50 UTC (permalink / raw) To: Caitlin Bestler; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi On Fri, 15 Feb 2008, Caitlin Bestler wrote: > There isn't much point in the RDMA layer subscribing to mmu > notifications > if the specific RDMA device will not be able to react appropriately when > the notification occurs. I don't see how you get around needing to know > which devices are capable of supporting page migration (via > suspend/resume > or other mechanisms) and which can only respond to a page migration by > aborting connections. You either register callbacks if the device can react properly or you don't. If you don't then the device will continue to have the problem with page pinning etc until someone comes around and implements the mmu callbacks to fix these issues. I have doubts regarding the claim that some devices just cannot be made to suspend and resume appropriately. They obviously can be shut down and so it's a matter of sequencing things the right way. I.e. stop the app, wait for a quiet period, then release resources etc. ^ permalink raw reply [flat|nested] 150+ messages in thread
* RE: [ofa-general] Re: Demand paging for memory regions 2008-02-15 22:50 ` Christoph Lameter @ 2008-02-15 23:50 ` Caitlin Bestler 0 siblings, 0 replies; 150+ messages in thread From: Caitlin Bestler @ 2008-02-15 23:50 UTC (permalink / raw) To: Christoph Lameter; +Cc: kvm-devel, linux-mm, general, linux-kernel, avi > -----Original Message----- > From: Christoph Lameter [mailto:clameter@sgi.com] > Sent: Friday, February 15, 2008 2:50 PM > To: Caitlin Bestler > Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org; > general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net > Subject: RE: [ofa-general] Re: Demand paging for memory regions > > On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > > There isn't much point in the RDMA layer subscribing to mmu > > notifications > > if the specific RDMA device will not be able to react appropriately > when > > the notification occurs. I don't see how you get around needing to > know > > which devices are capable of supporting page migration (via > > suspend/resume > > or other mechanisms) and which can only respond to a page migration > by > > aborting connections. > > You either register callbacks if the device can react properly or you > dont. If you dont then the device will continue to have the problem > with > page pinning etc until someone comes around and implements the > mmu callbacks to fix these issues. > > I have doubts regarding the claim that some devices just cannot be made > to > suspend and resume appropriately. They obviously can be shutdown and so > its a matter of sequencing the things the right way. I.e. stop the app > wait for a quiet period then release resources etc. > > That is true. What some devices will be unable to do is suspend and resume in a manner that is transparent to the application. However, for the duration required to re-arrange pages it is definitely feasible to do so transparently to the application. 
Presumably the Virtual Memory Manager would be more willing to take an action that is transparent to the user than one that is disruptive, although obviously as the owner of the physical memory it has the right to do either. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 22:41 ` [ofa-general] Re: Demand paging for memory regions Roland Dreier 2008-02-12 23:14 ` Felix Marti @ 2008-02-12 23:23 ` Jason Gunthorpe 2008-02-13 1:01 ` Christoph Lameter 2008-02-13 0:56 ` Christoph Lameter 2008-02-13 12:11 ` Christoph Raisch 3 siblings, 1 reply; 150+ messages in thread From: Jason Gunthorpe @ 2008-02-12 23:23 UTC (permalink / raw) To: Roland Dreier Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, steiner, linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter On Tue, Feb 12, 2008 at 02:41:48PM -0800, Roland Dreier wrote: > > > Chelsio's T3 HW doesn't support this. > > > Not so far I guess but it could be equipped with these features right? > > I don't know anything about the T3 internals, but it's not clear that > you could do this without a new chip design in general. Lots of RDMA > devices were designed expecting that when a packet arrives, the HW can > look up the bus address for a given memory region/offset and place > the Well, certainly today the memfree IB devices store the page tables in host memory so they are already designed to hang onto packets during the page lookup over PCIE, adding in faulting makes this time larger. But this is not a good thing at all, IB's congestion model is based on the notion that end ports can always accept packets without making input contingent on output. If you take a software interrupt to fill in the page pointer then you could potentially deadlock on the fabric. For example, using this mechanism to allow swap-in of RDMA target pages and then putting the storage over IB would be deadlock prone. Even without deadlock, slowing down the input path will cause network congestion and poor performance for other nodes. It is not a desirable thing to do.. I expect that iwarp running over flow controlled ethernet has similar kinds of problems for similar reasons.. 
In general the best I think you can hope for with RDMA hardware is page migration using some atomic operations with the adaptor and a cpu page copy with retry sort of scheme - but is pure page migration interesting at all? Jason ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 23:23 ` Jason Gunthorpe @ 2008-02-13 1:01 ` Christoph Lameter 2008-02-13 1:26 ` Jason Gunthorpe 2008-02-13 1:55 ` Christian Bell 0 siblings, 2 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 1:01 UTC (permalink / raw) To: Jason Gunthorpe Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > Well, certainly today the memfree IB devices store the page tables in > host memory so they are already designed to hang onto packets during > the page lookup over PCIE, adding in faulting makes this time > larger. You really do not need a page table to use it. What needs to be maintained is knowledge on both sides about what pages are currently shared across RDMA. If the VM decides to reclaim a page then the notification is used to remove the remote entry. If the remote side then tries to access the page again then the page fault on the remote side will stall until the local page has been brought back. RDMA can proceed after both sides again agree on that page now being sharable. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 1:01 ` Christoph Lameter @ 2008-02-13 1:26 ` Jason Gunthorpe 2008-02-13 1:45 ` Steve Wise 2008-02-13 2:35 ` Christoph Lameter 2008-02-13 1:55 ` Christian Bell 1 sibling, 2 replies; 150+ messages in thread From: Jason Gunthorpe @ 2008-02-13 1:26 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote: > On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > > > Well, certainly today the memfree IB devices store the page tables in > > host memory so they are already designed to hang onto packets during > > the page lookup over PCIE, adding in faulting makes this time > > larger. > > You really do not need a page table to use it. What needs to be maintained > is knowledge on both sides about what pages are currently shared across > RDMA. If the VM decides to reclaim a page then the notification is used to > remove the remote entry. If the remote side then tries to access the page > again then the page fault on the remote side will stall until the local > page has been brought back. RDMA can proceed after both sides again agree > on that page now being sharable. The problem is that the existing wire protocols do not have a provision for doing an 'are you ready' or 'I am not ready' exchange and they are not designed to store page tables on both sides as you propose. The remote side can send RDMA WRITE traffic at any time after the RDMA region is established. The local side must be able to handle it. There is no way to signal that a page is not ready and the remote should not send. This means the only possible implementation is to stall/discard at the local adaptor when an RDMA WRITE is received for a page that has been reclaimed. This is what leads to deadlock/poor performance.. 
Jason ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 1:26 ` Jason Gunthorpe @ 2008-02-13 1:45 ` Steve Wise 2008-02-13 2:35 ` Christoph Lameter 1 sibling, 0 replies; 150+ messages in thread From: Steve Wise @ 2008-02-13 1:45 UTC (permalink / raw) To: Jason Gunthorpe Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter Jason Gunthorpe wrote: > On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote: >> On Tue, 12 Feb 2008, Jason Gunthorpe wrote: >> >>> Well, certainly today the memfree IB devices store the page tables in >>> host memory so they are already designed to hang onto packets during >>> the page lookup over PCIE, adding in faulting makes this time >>> larger. >> You really do not need a page table to use it. What needs to be maintained >> is knowledge on both side about what pages are currently shared across >> RDMA. If the VM decides to reclaim a page then the notification is used to >> remove the remote entry. If the remote side then tries to access the page >> again then the page fault on the remote side will stall until the local >> page has been brought back. RDMA can proceed after both sides again agree >> on that page now being sharable. > > The problem is that the existing wire protocols do not have a > provision for doing an 'are you ready' or 'I am not ready' exchange > and they are not designed to store page tables on both sides as you > propose. The remote side can send RDMA WRITE traffic at any time after > the RDMA region is established. The local side must be able to handle > it. There is no way to signal that a page is not ready and the remote > should not send. > > This means the only possible implementation is to stall/discard at the > local adaptor when a RDMA WRITE is recieved for a page that has been > reclaimed. This is what leads to deadlock/poor performance.. 
> If the events are few and far between then this model is probably ok. For iWARP, it means TCP retransmit and slow start and all that, but if it's an infrequent event, then it's ok if it helps the host better manage memory. Maybe... ;-) Steve. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 1:26 ` Jason Gunthorpe 2008-02-13 1:45 ` Steve Wise @ 2008-02-13 2:35 ` Christoph Lameter 2008-02-13 3:25 ` Jason Gunthorpe 2008-02-13 4:09 ` Christian Bell 1 sibling, 2 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 2:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > The problem is that the existing wire protocols do not have a > provision for doing an 'are you ready' or 'I am not ready' exchange > and they are not designed to store page tables on both sides as you > propose. The remote side can send RDMA WRITE traffic at any time after > the RDMA region is established. The local side must be able to handle > it. There is no way to signal that a page is not ready and the remote > should not send. > > This means the only possible implementation is to stall/discard at the > local adaptor when an RDMA WRITE is received for a page that has been > reclaimed. This is what leads to deadlock/poor performance.. You would only use the wire protocols *after* having established the RDMA region. The notifier chain allows an RDMA region (or parts thereof) to be taken down on demand by the VM. The region can be reestablished if one of the sides accesses it. I hope I got that right. Not much exposure to Infiniband so far. Let's say you have two systems A and B. Each has their memory region MemA and MemB. Each side also has page tables for this region PtA and PtB. Now you establish an RDMA connection between both sides. The pages in both MemB and MemA are present and so are entries in PtA and PtB. RDMA traffic can proceed. 
The VM on system A now gets into a situation in which memory becomes heavily used by another (maybe a non-RDMA process) and after checking that there was no recent reference to MemA and MemB (via a notifier aging callback) decides to reclaim the memory from MemA. In that case it will notify the RDMA subsystem on A that it is trying to reclaim a certain page. The RDMA subsystem on A will then send a message to B notifying it that the memory will be going away. B now has to remove its corresponding page from memory (and drop the entry in PtB) and confirm to A that this has happened. RDMA traffic is then stopped for this page. Then A can also remove its page, the corresponding entry in PtA and the page is reclaimed or pushed out to swap completing the page reclaim. If either side then accesses the page again then the reverse process happens. If B accesses the page then it will first of all incur a page fault because the entry in PtB is missing. The fault will then cause a message to be sent to A to establish the page again. A will create an entry in PtA and will then confirm to B that the page was established. At that point RDMA operations can occur again. So the whole scheme does not really need a hardware page table in the RDMA hardware. The page tables of the two systems A and B are sufficient. The scheme can also be applied to a larger range than only a single page. The RDMA subsystem could tear down a large section when reclaim is pushing on it and then reestablish it as needed. Swapping and page reclaim is certainly not something that improves the speed of the application affected by swapping and page reclaim but it allows the VM to manage memory effectively if multiple loads are running on a system. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 2:35 ` Christoph Lameter @ 2008-02-13 3:25 ` Jason Gunthorpe 2008-02-13 18:51 ` Christoph Lameter 2008-02-13 4:09 ` Christian Bell 1 sibling, 1 reply; 150+ messages in thread From: Jason Gunthorpe @ 2008-02-13 3:25 UTC (permalink / raw) To: Christoph Lameter Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote: > On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > > > The problem is that the existing wire protocols do not have a > > provision for doing an 'are you ready' or 'I am not ready' exchange > > and they are not designed to store page tables on both sides as you > > propose. The remote side can send RDMA WRITE traffic at any time after > > the RDMA region is established. The local side must be able to handle > > it. There is no way to signal that a page is not ready and the remote > > should not send. > > > > This means the only possible implementation is to stall/discard at the > > local adaptor when a RDMA WRITE is recieved for a page that has been > > reclaimed. This is what leads to deadlock/poor performance.. > > You would only use the wire protocols *after* having established the RDMA > region. The notifier chains allows a RDMA region (or parts thereof) to be > down on demand by the VM. The region can be reestablished if one of > the side accesses it. I hope I got that right. Not much exposure to > Infiniband so far. [clip explaination] But this isn't how IB or iwarp work at all. What you describe is a significant change to the general RDMA operation and requires changes to both sides of the connection and the wire protocol. 
A few comments on RDMA operation that might clarify things a little bit more: - In RDMA (iwarp and IB versions) the hardware page tables exist to linearize the local memory so the remote does not need to be aware of non-linearities in the physical address space. The main motivation for this is kernel bypass where the user space app wants to instruct the remote side to DMA into memory using user space addresses. Hardware provides the page tables to switch from incoming user space virtual addresses to physical addresses. This greatly simplifies the user space programming model since you don't need to pass around or create s/g lists for memory that is already virtually contiguous. Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables for access control and enforcing the lifetime of the mapping. The page tables in the RDMA hardware exist primarily to support this, and not for other reasons. The pinning of pages is one part to support the HW page tables and one part to support the RDMA lifetime rules; the lifetime rules are what cause problems for the VM. - The wire protocol consists of packets that say 'Write XXX bytes to offset YY in Region RRR'. Creating a region produces the RRR label and currently pins the pages. So long as the RRR label is valid the remote side can issue write packets at any time without any further synchronization. There are no wire level events associated with creating RRR. You can pass RRR to the other machine in any fashion, even using carrier pigeons :) - The RDMA layer is very general (ala TCP), useful protocols (like SCSI) are built on top of it and they specify the lifetime rules and protocol for exchanging RRR. Every protocol is different. In-kernel protocols like SRP and NFS RDMA seem to have very short lifetimes for RRR and work more like pci_map_* in real SCSI hardware. - HPC userspace apps, like MPI apps, have different lifetime rules and tend to be really long lived. 
These people will not want anything that makes their OPs more expensive and also probably don't care too much about the VM problems you are looking at (?) - There is no protocol support to exchange RRR. This is all done by upper level protocols (ala HTTP vs TCP). You cannot assert and revoke RRR in a general way. Every protocol is different and optimized. This is your step 'A will then send a message to B notifying..'. It simply does not exist in the protocol specifications. I don't know much about Quadrics, but I would be hesitant to lump it in too much with these RDMA semantics. Christian's comments sound like they operate closer to what you described and that is why they have an existing patch set. I don't know :) What it boils down to is that to implement true removal of pages in a general way the kernel and HCA must either drop packets or stall incoming packets, both are big performance problems - and I can't see many users wanting this. Enterprise style people using SCSI, NFS, etc already have short pin periods and HPC MPI users probably won't care about the VM issues enough to warrant the performance overhead. Regards, Jason ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 3:25 ` Jason Gunthorpe @ 2008-02-13 18:51 ` Christoph Lameter 2008-02-13 19:51 ` Jason Gunthorpe 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 18:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > But this isn't how IB or iwarp work at all. What you describe is a > significant change to the general RDMA operation and requires changes to > both sides of the connection and the wire protocol. Yes it may require a separate connection between both sides where a kind of VM notification protocol is established to tear these things down and set them up again. That is, if there is nothing in the RDMA protocol that allows a notification to the other side that the mapping is being taken down. > - In RDMA (iwarp and IB versions) the hardware page tables exist to > linearize the local memory so the remote does not need to be aware > of non-linearities in the physical address space. The main > motivation for this is kernel bypass where the user space app wants > to instruct the remote side to DMA into memory using user space > addresses. Hardware provides the page tables to switch from > incoming user space virtual addresses to physical addresses. s/switch/translate I guess. That is good and those page tables could be used for the notification scheme to enable reclaim. But they are optional and are maintaining the driver state. The linearization could be reconstructed from the kernel page tables on demand. > Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables > for access control and enforcing the lifetime of the mapping. Well the mapping would have to be on demand to avoid the issues that we currently have with pinning. The user API could stay the same. 
If the driver tracks the mappings using the notifier then the VM can make sure that the right things happen on exit etc etc. > The page tables in the RDMA hardware exist primarily to support > this, and not for other reasons. The pinning of pages is one part > to support the HW page tables and one part to support the RDMA > lifetime rules, the liftime rules are what cause problems for > the VM. So the driver software can tear down and establish page tables entries at will? I do not see the problem. The RDMA hardware is one thing, the way things are visible to the user another. If the driver can establish and remove mappings as needed via RDMA then the user can have the illusion of persistent RDMA memory. This is the same as virtual memory providing the illusion of a process having lots of memory all for itself. > - The wire protocol consists of packets that say 'Write XXX bytes to > offset YY in Region RRR'. Creating a region produces the RRR label > and currently pins the pages. So long as the RRR label is valid the > remote side can issue write packets at any time without any > further synchronization. There is no wire level events associated > with creating RRR. You can pass RRR to the other machine in any > fashion, even using carrier pigeons :) > - The RDMA layer is very general (ala TCP), useful protocols (like SCSI) > are built on top of it and they specify the lifetime rules and > protocol for exchanging RRR. Well yes of course. What is proposed here is an additional notification mechanism (could even be via tcp/udp to simplify things) that would manage the mappings at a higher level. The writes would not occur if the mapping has not been established. > This is your step 'A will then send a message to B notifying..'. > It simply does not exist in the protocol specifications Of course. You need to create an additional communication layer to get that. 
> What it boils down to is that to implement true removal of pages in a > general way the kernel and HCA must either drop packets or stall > incoming packets, both are big performance problems - and I can't see > many users wanting this. Enterprise style people using SCSI, NFS, etc > already have short pin periods and HPC MPI users probably won't care > about the VM issues enough to warrant the performance overhead. True, maybe you cannot do this by simply staying within the protocol bounds of RDMA that is based on page pinning if the RDMA protocol does not support a notification to the other side that the mapping is going away. If RDMA cannot do this then you would need additional ways of notifying the remote side that pages/mappings are invalidated. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 18:51 ` Christoph Lameter @ 2008-02-13 19:51 ` Jason Gunthorpe 2008-02-13 20:36 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Jason Gunthorpe @ 2008-02-13 19:51 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote: > On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > > > But this isn't how IB or iwarp work at all. What you describe is a > > significant change to the general RDMA operation and requires changes to > > both sides of the connection and the wire protocol. > > Yes it may require a separate connection between both sides where a > kind of VM notification protocol is established to tear these things down and > set them up again. That is if there is nothing in the RDMA protocol that > allows a notification to the other side that the mapping is being down > down. Well, yes, you could build this thing you are describing on top of the RDMA protocol and get some support from some of the hardware - but it is a new set of protocols and they would need to be implemented in several places. It is not transparent to userspace and it is not compatible with existing implementations. Unfortunately it really has little to do with the drivers - changes, for instance, need to be made to support this in the user space MPI libraries. The RDMA ops do not pass through the kernel, userspace talks directly to the hardware which complicates building any sort of abstraction. That is where I think you run into trouble, if you ask the MPI people to add code to their critical path to support swapping they probably will not be too interested. At a minimum to support your idea you need to check on every RDMA if the remote page is mapped... 
Plus the overheads Christian was talking about in the OOB channel(s). Jason ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 19:51 ` Jason Gunthorpe @ 2008-02-13 20:36 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 20:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, 13 Feb 2008, Jason Gunthorpe wrote: > Unfortunately it really has little to do with the drivers - changes, > for instance, need to be made to support this in the user space MPI > libraries. The RDMA ops do not pass through the kernel, userspace > talks directly to the hardware which complicates building any sort of > abstraction. Ok so the notifiers have to be handed over to the user space library that has the function of the device driver here... > That is where I think you run into trouble, if you ask the MPI people > to add code to their critical path to support swapping they probably > will not be too interested. At a minimum to support your idea you need > to check on every RDMA if the remote page is mapped... Plus the > overheads Christian was talking about in the OOB channel(s). You only need to check if a handle has been receiving invalidates. If not then you can just go ahead as now. You can use the notifier to take down the whole region if any reclaim occurs against it (probably the best and simplest approach to implement). Then you mark the handle so that the mapping is reestablished before the next operation. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 2:35 ` Christoph Lameter 2008-02-13 3:25 ` Jason Gunthorpe @ 2008-02-13 4:09 ` Christian Bell 2008-02-13 19:00 ` Christoph Lameter 2008-02-13 23:23 ` Pete Wyckoff 1 sibling, 2 replies; 150+ messages in thread From: Christian Bell @ 2008-02-13 4:09 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Christoph Lameter wrote: > On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > > > The problem is that the existing wire protocols do not have a > > provision for doing an 'are you ready' or 'I am not ready' exchange > > and they are not designed to store page tables on both sides as you > > propose. The remote side can send RDMA WRITE traffic at any time after > > the RDMA region is established. The local side must be able to handle > > it. There is no way to signal that a page is not ready and the remote > > should not send. > > > > This means the only possible implementation is to stall/discard at the > > local adaptor when a RDMA WRITE is received for a page that has been > > reclaimed. This is what leads to deadlock/poor performance.. You're arguing that a HW page table is not needed by describing a use case that is essentially what all RDMA solutions already do above the wire protocols (all solutions except Quadrics, of course). > You would only use the wire protocols *after* having established the RDMA > region. The notifier chain allows an RDMA region (or parts thereof) to be > taken down on demand by the VM. The region can be reestablished if one of > the sides accesses it. I hope I got that right. Not much exposure to > Infiniband so far. RDMA is already always used *after* memory regions are set up -- they are set up out-of-band w.r.t RDMA but essentially this is the "before" part. > Let's say you have two systems A and B. 
Each has their memory region MemA > and MemB. Each side also has page tables for this region PtA and PtB. > > Now you establish a RDMA connection between both sides. The pages in both > MemB and MemA are present and so are entries in PtA and PtB. RDMA > traffic can proceed. > > The VM on system A now gets into a situation in which memory becomes > heavily used by another (maybe non RDMA process) and after checking that > there was no recent reference to MemA and MemB (via a notifier aging > callback) decides to reclaim the memory from MemA. > > In that case it will notify the RDMA subsystem on A that it is trying to > reclaim a certain page. > > The RDMA subsystem on A will then send a message to B notifying it that > the memory will be going away. B now has to remove its corresponding page > from memory (and drop the entry in PtB) and confirm to A that this has > happened. RDMA traffic is then stopped for this page. Then A can also > remove its page, the corresponding entry in PtA and the page is reclaimed > or pushed out to swap completing the page reclaim. > > If either side then accesses the page again then the reverse process > happens. If B accesses the page then it will first of all incur a page > fault because the entry in PtB is missing. The fault will then cause a > message to be sent to A to establish the page again. A will create an > entry in PtA and will then confirm to B that the page was established. At > that point RDMA operations can occur again. The notifier-reclaim cycle you describe is akin to the out-of-band pin-unpin control messages used by existing communication libraries. Also, I think what you are proposing can have problems at scale -- A must keep track of all of the (potentially many systems) of memA and cooperatively get an agreement from all these systems before reclaiming the page. When messages are sufficiently large, the control messaging necessary to setup/teardown the regions is relatively small. 
This is not always the case however -- in programming models that employ smaller messages, the one-sided nature of RDMA is the most attractive part of it. > So the whole scheme does not really need a hardware page table in the RDMA > hardware. The page tables of the two systems A and B are sufficient. > > The scheme can also be applied to a larger range than only a single page. > The RDMA subsystem could tear down a large section when reclaim is > pushing on it and then reestablish it as needed. Nothing any communication/runtime system can't already do today. The point of RDMA demand paging is enabling the possibility of using RDMA without the implied synchronization -- the optimistic part. Using the notifiers to duplicate existing memory region handling for RDMA hardware that doesn't have HW page tables is possible but undermines the more important consumer of your patches in my opinion. One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome. . . christian -- christian.bell@qlogic.com (QLogic Host Solutions Group, formerly Pathscale) ^ permalink raw reply [flat|nested] 150+ messages in thread
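The two-system reclaim/fault handshake Christoph describes above (and that Christian compares to out-of-band pin-unpin messaging) can be reduced to a toy model: PtA and PtB shrink to a single present bit each, and the A<->B messages become direct function calls. Everything here is illustrative only; the ordering constraint is the part that matters -- B must drop its entry and ack *before* A frees the page:

```c
#include <assert.h>
#include <stdbool.h>

/* One-entry "page table" per system. */
struct node { bool pt_present; };

/* Reclaim on A: message B to drop its PtB entry and stop RDMA, wait
 * for the confirmation, and only then drop PtA and free the page. */
static void reclaim_page(struct node *a, struct node *b)
{
    b->pt_present = false;   /* B removes its entry and confirms */
    a->pt_present = false;   /* now A may reclaim or swap the page */
}

/* Fault on B: the missing PtB entry triggers a message to A, which
 * re-creates its PtA entry and confirms; then B fills PtB and RDMA
 * traffic can resume. */
static void fault_in(struct node *a, struct node *b)
{
    if (!b->pt_present) {
        a->pt_present = true;    /* A re-establishes its entry */
        b->pt_present = true;    /* B resumes after A's confirmation */
    }
}
```

Christian's scale concern maps directly onto this sketch: with many accessors, `reclaim_page()` becomes a loop that must collect an ack from every node before A can proceed.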
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 4:09 ` Christian Bell @ 2008-02-13 19:00 ` Christoph Lameter 2008-02-13 19:46 ` Christian Bell 2008-02-13 23:23 ` Pete Wyckoff 1 sibling, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 19:00 UTC (permalink / raw) To: Christian Bell Cc: Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, Jason Gunthorpe, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Christian Bell wrote: > You're arguing that a HW page table is not needed by describing a use > case that is essentially what all RDMA solutions already do above the > wire protocols (all solutions except Quadrics, of course). The HW page table is not essential to the notification scheme. That the RDMA uses the page table for linearization is another issue. A chip could just have a TLB cache and lookup the entries using the OS page table f.e. > > Lets say you have a two systems A and B. Each has their memory region MemA > > and MemB. Each side also has page tables for this region PtA and PtB. > > If either side then accesses the page again then the reverse process > > happens. If B accesses the page then it wil first of all incur a page > > fault because the entry in PtB is missing. The fault will then cause a > > message to be send to A to establish the page again. A will create an > > entry in PtA and will then confirm to B that the page was established. At > > that point RDMA operations can occur again. > > The notifier-reclaim cycle you describe is akin to the out-of-band > pin-unpin control messages used by existing communication libraries. > Also, I think what you are proposing can have problems at scale -- A > must keep track of all of the (potentially many systems) of memA and > cooperatively get an agreement from all these systems before reclaiming > the page. Right. We (SGI) have done something like this for a long time with XPmem and it scales ok. 
> When messages are sufficiently large, the control messaging necessary > to setup/teardown the regions is relatively small. This is not > always the case however -- in programming models that employ smaller > messages, the one-sided nature of RDMA is the most attractive part of > it. The messaging would only be needed if a process comes under memory pressure. As long as there is enough memory nothing like this will occur. > Nothing any communication/runtime system can't already do today. The > point of RDMA demand paging is enabling the possibility of using RDMA > without the implied synchronization -- the optimistic part. Using > the notifiers to duplicate existing memory region handling for RDMA > hardware that doesn't have HW page tables is possible but undermines > the more important consumer of your patches in my opinion. The notifier scheme should integrate into existing memory region handling and not cause a duplication. If you already have library layers that do this then it should be possible to integrate it. > One other area that has not been brought up yet (I think) is the > applicability of notifiers in letting users know when pinned memory > is reclaimed by the kernel. This is useful when a lower-level > library employs lazy deregistration strategies on memory regions that > are subsequently released to the kernel via the application's use of > munmap or sbrk. Ohio Supercomputing Center has work in this area but > a generalized approach in the kernel would certainly be welcome. The driver gets the notifications about memory being reclaimed. The driver could then notify user code about the release as well. Pinned memory currently *cannot* be reclaimed by the kernel. The refcount is elevated. This means that the VM tries to remove the mappings and then sees that it was not able to remove all references. Then it gives up and tries again and again and again.... Thus the potential for livelock. 
^ permalink raw reply [flat|nested] 150+ messages in thread
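The retry loop Christoph describes -- the VM unmapping a page, finding a leftover reference, giving up, and trying again forever -- can be modeled in a few lines. This is a toy userspace illustration, not kernel code: a page is only freeable once its refcount reaches zero, and a pinned page always holds one extra reference that reclaim cannot remove:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: a refcount plus a "mapped into a process" bit. */
struct page_model { int refcount; bool mapped; };

/* One reclaim pass: drop the mapping's reference if still mapped,
 * then check whether the page became free.  For a pinned page this
 * returns false on every pass -- the livelock Christoph describes. */
static bool try_to_reclaim(struct page_model *pg)
{
    if (pg->mapped) {
        pg->mapped = false;
        pg->refcount--;          /* remove the mapping's reference */
    }
    return pg->refcount == 0;    /* pin keeps this false forever */
}
```

A normal page (one reference, from the mapping) reclaims on the first pass; a pinned page (mapping plus pin) never does, no matter how often the VM retries.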
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 19:00 ` Christoph Lameter @ 2008-02-13 19:46 ` Christian Bell 2008-02-13 20:32 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Christian Bell @ 2008-02-13 19:46 UTC (permalink / raw) To: Christoph Lameter Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, 13 Feb 2008, Christoph Lameter wrote: > Right. We (SGI) have done something like this for a long time with XPmem > and it scales ok. I'd dispute this based on experience developing PGAS language support on the Altix but more importantly (and less subjectively), I think that "scales ok" refers to a very specific case. Sure, pages (and/or regions) can be large on some systems and the number of systems may not always be in the thousands but you're still claiming scalability for a mechanism that essentially logs who accesses the regions. Then there's the fact that reclaim becomes a collective communication operation over all region accessors. Makes me nervous. > > When messages are sufficiently large, the control messaging necessary > > to setup/teardown the regions is relatively small. This is not > > always the case however -- in programming models that employ smaller > > messages, the one-sided nature of RDMA is the most attractive part of > > it. > > The messaging would only be needed if a process comes under memory > pressure. As long as there is enough memory nothing like this will occur. > > > Nothing any communication/runtime system can't already do today. The > > point of RDMA demand paging is enabling the possibility of using RDMA > > without the implied synchronization -- the optimistic part. 
Using > > the notifiers to duplicate existing memory region handling for RDMA > > hardware that doesn't have HW page tables is possible but undermines > > the more important consumer of your patches in my opinion. > > The notifier schemet should integrate into existing memory region > handling and not cause a duplication. If you already have library layers > that do this then it should be possible to integrate it. I appreciate that you're trying to make a general case for the applicability of notifiers on all types of existing RDMA hardware and wire protocols. Also, I'm not disagreeing whether a HW page table is required or not: clearly it's not required to make *some* use of the notifier scheme. However, short of providing user-level notifications for pinned pages that are inadvertently released to the O/S, I don't believe that the patchset provides any significant added value for the HPC community that can't optimistically do RDMA demand paging. . . christian ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 19:46 ` Christian Bell @ 2008-02-13 20:32 ` Christoph Lameter 2008-02-13 22:44 ` Kanoj Sarcar 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 20:32 UTC (permalink / raw) To: Christian Bell Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, 13 Feb 2008, Christian Bell wrote: > not always be in the thousands but you're still claiming scalability > for a mechanism that essentially logs who accesses the regions. Then > there's the fact that reclaim becomes a collective communication > operation over all region accessors. Makes me nervous. Well reclaim is not a very fast process (and we usually try to avoid it as much as possible for our HPC). Essentially its only there to allow shifts of processing loads and to allow efficient caching of application data. > However, short of providing user-level notifications for pinned pages > that are inadvertently released to the O/S, I don't believe that the > patchset provides any significant added value for the HPC community > that can't optimistically do RDMA demand paging. We currently also run XPmem with pinning. Its great as long as you just run one load on the system. No reclaim ever occurs. However, if you do things that require lots of allocations etc etc then the page pinning can easily lead to livelock if reclaim is finally triggered and also strange OOM situations since the VM cannot free any pages. So the main issue that is addressed here is reliability of pinned page operations. Better VM integration avoids these issues because we can unpin on request to deal with memory shortages. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 20:32 ` Christoph Lameter @ 2008-02-13 22:44 ` Kanoj Sarcar 2008-02-13 23:02 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Kanoj Sarcar @ 2008-02-13 22:44 UTC (permalink / raw) To: Christoph Lameter, Christian Bell Cc: Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, Jason Gunthorpe, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel --- Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 13 Feb 2008, Christian Bell wrote: > > > not always be in the thousands but you're still > claiming scalability > > for a mechanism that essentially logs who accesses > the regions. Then > > there's the fact that reclaim becomes a collective > communication > > operation over all region accessors. Makes me > nervous. > > Well reclaim is not a very fast process (and we > usually try to avoid it > as much as possible for our HPC). Essentially its > only there to allow > shifts of processing loads and to allow efficient > caching of application > data. > > > However, short of providing user-level > notifications for pinned pages > > that are inadvertently released to the O/S, I > don't believe that the > > patchset provides any significant added value for > the HPC community > > that can't optimistically do RDMA demand paging. > > We currently also run XPmem with pinning. Its great > as long as you just > run one load on the system. No reclaim ever iccurs. > > However, if you do things that require lots of > allocations etc etc then > the page pinning can easily lead to livelock if > reclaim is finally > triggerd and also strange OOM situations since the > VM cannot free any > pages. So the main issue that is addressed here is > reliability of pinned > page operations. Better VM integration avoids these > issues because we can > unpin on request to deal with memory shortages. 
I have a question on the basic need for the mmu notifier stuff wrt rdma hardware and pinning memory. It seems that the need is to solve potential memory shortage and overcommit issues by being able to reclaim pages pinned by rdma driver/hardware. Is my understanding correct? If I do understand correctly, then why is rdma page pinning any different than eg mlock pinning? I imagine Oracle pins lots of memory (using mlock), how come they do not run into vm overcommit issues? Are we up against some kind of breaking c-o-w issue here that is different between mlock and rdma pinning? Asked another way, why should effort be spent on a notifier scheme, and rather not on fixing any memory accounting problems and unifying how pinned pages are accounted for that get pinned via mlock() or rdma drivers? Startup benefits are well understood with the notifier scheme (ie, not all pages need to be faulted in at memory region creation time), especially when most of the memory region is not accessed at all. I would imagine most of HPC does not work this way though. Then again, as rdma hardware is applied (increasingly?) towards apps with short lived connections, the notifier scheme will help with startup times. Kanoj ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 22:44 ` Kanoj Sarcar @ 2008-02-13 23:02 ` Christoph Lameter 2008-02-13 23:43 ` Kanoj Sarcar 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 23:02 UTC (permalink / raw) To: Kanoj Sarcar Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > It seems that the need is to solve potential memory > shortage and overcommit issues by being able to > reclaim pages pinned by rdma driver/hardware. Is my > understanding correct? Correct. > If I do understand correctly, then why is rdma page > pinning any different than eg mlock pinning? I imagine > Oracle pins lots of memory (using mlock), how come > they do not run into vm overcommit issues? Mlocked pages are not pinned. They are movable by f.e. page migration and will potentially be moved by future memory defrag approaches. Currently we have the same issues with mlocked pages as with pinned pages. There is work in progress to put mlocked pages onto a different lru so that reclaim exempts these pages and more work on limiting the percentage of memory that can be mlocked. > Are we up against some kind of breaking c-o-w issue > here that is different between mlock and rdma pinning? Not that I know. > Asked another way, why should effort be spent on a > notifier scheme, and rather not on fixing any memory > accounting problems and unifying how pin pages are > accounted for that get pinned via mlock() or rdma > drivers? There are efforts underway to account for and limit mlocked pages as described above. Page pinning the way it is done by Infiniband through increasing the page refcount is treated by the VM as a temporary condition not as a permanent pin. The VM will continually try to reclaim these pages thinking that the temporary usage of the page must cease soon. 
This is why the use of large amounts of pinned pages can lead to livelock situations. If we want to have pinning behavior then we could mark pinned pages specially so that the VM will not continually try to evict these pages. We could manage them similar to mlocked pages but just not allow page migration, memory unplug and defrag to occur on pinned memory. All of these would have to fail. With the notifier scheme the device driver could be told to get rid of the pinned memory. This would make these 3 techniques work despite having an RDMA memory section. > Startup benefits are well understood with the notifier > scheme (ie, not all pages need to be faulted in at > memory region creation time), specially when most of > the memory region is not accessed at all. I would > imagine most of HPC does not work this way though. No for optimal performance you would want to prefault all pages like it is now. The notifier scheme would only become relevant in memory shortage situations. > Then again, as rdma hardware is applied (increasingly?) towards apps > with short lived connections, the notifier scheme will help with startup > times. The main use of the notifier scheme is for stability and reliability. The "pinned" pages become unpinnable on request by the VM. So the VM can work itself out of memory shortage situations in cooperation with the RDMA logic instead of simply failing. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 23:02 ` Christoph Lameter @ 2008-02-13 23:43 ` Kanoj Sarcar 2008-02-13 23:48 ` Jesse Barnes ` (2 more replies) 0 siblings, 3 replies; 150+ messages in thread From: Kanoj Sarcar @ 2008-02-13 23:43 UTC (permalink / raw) To: Christoph Lameter Cc: Christian Bell, Andrea Arcangeli, a.p.zijlstra, Roland Dreier, steiner, linux-kernel, avi, Jason Gunthorpe, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel --- Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > > > It seems that the need is to solve potential > memory > > shortage and overcommit issues by being able to > > reclaim pages pinned by rdma driver/hardware. Is > my > > understanding correct? > > Correct. > > > If I do understand correctly, then why is rdma > page > > pinning any different than eg mlock pinning? I > imagine > > Oracle pins lots of memory (using mlock), how come > > they do not run into vm overcommit issues? > > Mlocked pages are not pinned. They are movable by > f.e. page migration and > will be potentially be moved by future memory defrag > approaches. Currently > we have the same issues with mlocked pages as with > pinned pages. There is > work in progress to put mlocked pages onto a > different lru so that reclaim > exempts these pages and more work on limiting the > percentage of memory > that can be mlocked. > > > Are we up against some kind of breaking c-o-w > issue > > here that is different between mlock and rdma > pinning? > > Not that I know. > > > Asked another way, why should effort be spent on a > > notifier scheme, and rather not on fixing any > memory > > accounting problems and unifying how pin pages are > > accounted for that get pinned via mlock() or rdma > > drivers? > > There are efforts underway to account for and limit > mlocked pages as > described above. 
Page pinning the way it is done by > Infiniband through > increasing the page refcount is treated by the VM as > a temporary > condition not as a permanent pin. The VM will > continually try to reclaim > these pages thinking that the temporary usage of the > page must cease > soon. This is why the use of large amounts of pinned > pages can lead to > livelock situations. Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see what notifiers bring to the table now (without endorsing it :-)). An orthogonal question is this: is IB/rdma the only "culprit" that elevates page refcounts? Are there no other subsystems which do a similar thing? The example I am thinking about is rawio (Oracle's mlock'ed SHM regions are handed to rawio, isn't it?). My understanding of how rawio works in Linux is quite dated though ... Kanoj > > If we want to have pinning behavior then we could > mark pinned pages > specially so that the VM will not continually try to > evict these pages. We > could manage them similar to mlocked pages but just > not allow page > migration, memory unplug and defrag to occur on > pinned memory. All of > theses would have to fail. With the notifier scheme > the device driver > could be told to get rid of the pinned memory. This > would make these 3 > techniques work despite having an RDMA memory > section. > > > Startup benefits are well understood with the > notifier > > scheme (ie, not all pages need to be faulted in at > > memory region creation time), specially when most > of > > the memory region is not accessed at all. I would > > imagine most of HPC does not work this way though. > > No for optimal performance you would want to > prefault all pages like > it is now. The notifier scheme would only become > relevant in memory > shortage situations. > > > Then again, as rdma hardware is applied > (increasingly?) towards apps > > with short lived connections, the notifier scheme > will help with startup > > times. 
> > The main use of the notifier scheme is for stability > and reliability. The > "pinned" pages become unpinnable on request by the > VM. So the VM can work > itself out of memory shortage situations in > cooperation with the > RDMA logic instead of simply failing. ^ permalink raw reply [flat|nested] 150+ messages in thread
* [ofa-general] Re: Demand paging for memory regions 2008-02-13 23:43 ` Kanoj Sarcar @ 2008-02-13 23:48 ` Jesse Barnes 2008-02-14 0:56 ` Andrea Arcangeli 2008-02-14 19:35 ` Christoph Lameter 2 siblings, 0 replies; 150+ messages in thread From: Jesse Barnes @ 2008-02-13 23:48 UTC (permalink / raw) To: Kanoj Sarcar Cc: Dave Airlie, Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, kvm-devel, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter On Wednesday, February 13, 2008 3:43 pm Kanoj Sarcar wrote: > Oh ok, yes, I did see the discussion on this; sorry I > missed it. I do see what notifiers bring to the table > now (without endorsing it :-)). > > An orthogonal question is this: is IB/rdma the only > "culprit" that elevates page refcounts? Are there no > other subsystems which do a similar thing? > > The example I am thinking about is rawio (Oracle's > mlock'ed SHM regions are handed to rawio, isn't it?). > My understanding of how rawio works in Linux is quite > dated though ... We're doing something similar in the DRM these days... We need big chunks of memory to be pinned so that the GPU can operate on them, but when the operation completes we can allow them to be swappable again. I think with the current implementation, allocations are always pinned, but we'll definitely want to change that soon. Dave? Jesse ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 23:43 ` Kanoj Sarcar 2008-02-13 23:48 ` Jesse Barnes @ 2008-02-14 0:56 ` Andrea Arcangeli 2008-02-14 19:35 ` Christoph Lameter 2 siblings, 0 replies; 150+ messages in thread From: Andrea Arcangeli @ 2008-02-14 0:56 UTC (permalink / raw) To: Kanoj Sarcar Cc: Rik van Riel, kvm-devel, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter Hi Kanoj, On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote: > Oh ok, yes, I did see the discussion on this; sorry I > missed it. I do see what notifiers bring to the table > now (without endorsing it :-)). I'm not sure livelocks are really the big issue here. I'm running N 1G VM on a 1G ram system, with N-1G swapped out. Combining this with auto-ballooning, rss limiting, and ksm ram sharing, provides really advanced and lowlevel virtualization VM capabilities to the linux kernel while at the same time guaranteeing no oom failures as long as the guest pages are lower than ram+swap (just slower runtime if too many pages are unshared or if the balloons are deflated etc..). Swapping the virtual machine in the host may be more efficient than having the guest swapping over a virtual swap paravirt storage for example. As more management features are added admins will gain more experience in handling those new features and they'll find what's best for them. mmu notifiers and real reliable swapping are the enabler for those more advanced VM features. oom livelocks wouldn't happen anyway with KVM as long as the maximal amount of guest physical memory is lower than RAM. > An orthogonal question is this: is IB/rdma the only > "culprit" that elevates page refcounts? Are there no > other subsystems which do a similar thing? > > The example I am thinking about is rawio (Oracle's > mlock'ed SHM regions are handed to rawio, isn't it?). 
> My understanding of how rawio works in Linux is quite > dated though ... rawio in flight I/O shall be limited. As long as each task can't pin more than X ram, and the ram is released when the task is oom killed, and the first get_user_pages/alloc_pages/slab_alloc that returns -ENOMEM takes an oom fail path that returns failure to userland, everything is ok. Even with IB deadlock could only happen if IB would allow unlimited memory to be pinned down by unprivileged users. If IB is insecure and DoSable without mmu notifiers, then I'm not sure how enabling swapping of the IB memory could be enough to fix the DoS. Keep in mind that even tmpfs can't be safe allowing all ram+swap to be allocated in a tmpfs file (despite the tmpfs file storage includes swap and not only ram). Pinning the whole ram+swap with tmpfs livelocks the same way as pinning the whole ram with ramfs. So if you add mmu notifier support to IB, you only need to RDMA an area as large as ram+swap to livelock again as before... no difference at all. I don't think livelocks have anything to do with mmu notifiers (other than deferring the livelock to the "swap+ram" point of no return instead of the current "ram" point of no return). Livelocks have to be solved the usual way: handling alloc_pages/get_user_pages/slab allocation failures with a fail path that returns to userland and allows the ram to be released if the task was selected for oom-killage. The real benefit of the mmu notifiers for IB would be to allow the rdma region to be larger than RAM without triggering the oom killer (or without triggering a livelock if it's DoSable but then the livelock would need fixing to be converted in a regular oom-killing by some other means not related to the mmu-notifier, it's really an orthogonal problem). So suppose you've a MPI simulation that requires a 10G array and you've only 1G of ram, then you can rdma over 10G as if you had 10G of ram. 
Things will perform ok only if there's some huge locality of the computations. For virtualization it's orders of magnitude more useful than for computer clusters but certain simulations really swap so I don't exclude certain RDMA apps will also need this (dunno about IB). ^ permalink raw reply [flat|nested] 150+ messages in thread
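Andrea's "each task can't pin more than X ram" bound can be sketched as simple per-task accounting. The quota constant and helper names below are made up for illustration; the point is the shape of the fail path: when the quota is exhausted the pin fails with -ENOMEM and the caller returns failure to userland instead of livelocking the VM:

```c
#include <assert.h>

/* Hypothetical per-task cap on pinned pages ("X ram"). */
#define PIN_LIMIT   4
#define ERR_NOMEM (-12)   /* -ENOMEM */

struct task_acct { int pinned; };

/* Pin npages, or fail cleanly once the task's quota is exhausted. */
static int pin_pages(struct task_acct *t, int npages)
{
    if (t->pinned + npages > PIN_LIMIT)
        return ERR_NOMEM;    /* fail path propagated to userland */
    t->pinned += npages;
    return 0;
}

/* Released when the I/O completes -- or when the task is oom-killed,
 * which is what makes the bound safe. */
static void unpin_pages(struct task_acct *t, int npages)
{
    t->pinned -= npages;
}
```

With the accounting released on oom-kill, the VM can always recover the pinned ram by killing the task, which is Andrea's argument for why bounded pinning avoids the livelock entirely.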
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 23:43 ` Kanoj Sarcar 2008-02-13 23:48 ` Jesse Barnes 2008-02-14 0:56 ` Andrea Arcangeli @ 2008-02-14 19:35 ` Christoph Lameter 2 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-14 19:35 UTC (permalink / raw) To: Kanoj Sarcar Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > Oh ok, yes, I did see the discussion on this; sorry I > missed it. I do see what notifiers bring to the table > now (without endorsing it :-)). > > An orthogonal question is this: is IB/rdma the only > "culprit" that elevates page refcounts? Are there no > other subsystems which do a similar thing? Yes there are actually two projects by SGI that also ran into the same issue that motivated the work on this. One is XPmem which allows sharing of process memory between different Linux instances and then there is the GRU which is a kind of DMA engine. Then there is KVM and probably multiple other drivers. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 4:09 ` Christian Bell 2008-02-13 19:00 ` Christoph Lameter @ 2008-02-13 23:23 ` Pete Wyckoff 2008-02-14 0:01 ` Jason Gunthorpe 1 sibling, 1 reply; 150+ messages in thread From: Pete Wyckoff @ 2008-02-13 23:23 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

christian.bell@qlogic.com wrote on Tue, 12 Feb 2008 20:09 -0800:
> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel.  This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

    register buf
    post send from buf
    wait for completion
    deregister buf

This tends to get hidden via userspace software libraries into a single call:

    MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations:

    if buf is not registered:
        register buf
    post send from buf
    wait for completion

The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it.
Some libraries have schemes to limit the size of the registration cache too. But there are plenty of ways to hurt yourself with such a scheme. The first is a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this. Sure the physical page is still there and pinned, but the app cannot get at it. Later if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page. The whole situation is rather ridiculous, but we are quite stuck with it for current generation IB and iWarp hardware. If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use. Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases. More text on the registration caching problem is here: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf with an approach using vm_ops open and close operations in a kernel module here: http://www.osc.edu/~pw/dreg/ There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set.
If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes. -- Pete ^ permalink raw reply [flat|nested] 150+ messages in thread
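The failure mode Pete describes (a cache hit on a buffer the app has since munmap()ed) falls directly out of how these userspace caches key their entries. A toy model keyed only by virtual address (all names here are illustrative; a real cache would hold device MR handles and do interval matching):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 16

/* Toy registration cache of the kind MPI-style middleware keeps. */
struct reg_entry {
	void  *addr;   /* virtual start of the registered buffer */
	size_t len;
	int    handle; /* stands in for a device MR handle */
	int    valid;
};

static struct reg_entry cache[CACHE_SLOTS];
static int next_handle = 1;

/* Return a cached handle covering [addr, addr+len), or 0 if none. */
int reg_cache_lookup(void *addr, size_t len)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].valid && cache[i].addr == addr &&
		    cache[i].len >= len)
			return cache[i].handle;
	return 0;
}

/* The "if buf is not registered: register buf" step from the text. */
int reg_cache_register(void *addr, size_t len)
{
	int h = reg_cache_lookup(addr, len);
	if (h)
		return h;	/* fast path: reuse the old registration */
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (!cache[i].valid) {
			cache[i].addr = addr;
			cache[i].len = len;
			cache[i].handle = next_handle++;
			cache[i].valid = 1;
			return cache[i].handle;
		}
	return 0;		/* full; a real cache would evict */
}
```

The hazard: if the app munmap()s `addr` and later gets new memory at the same virtual address, `reg_cache_register()` still returns the stale handle, which points at the old, now-unmapped but still pinned physical page. The library has no hook to learn of the unmap; that hook is exactly what a userspace-visible notifier interface would supply.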
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 23:23 ` Pete Wyckoff @ 2008-02-14 0:01 ` Jason Gunthorpe 2008-02-27 22:11 ` Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Jason Gunthorpe @ 2008-02-14 0:01 UTC (permalink / raw) To: Pete Wyckoff Cc: Christoph Lameter, Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:
> christian.bell@qlogic.com wrote on Tue, 12 Feb 2008 20:09 -0800:
> > One other area that has not been brought up yet (I think) is the
> > applicability of notifiers in letting users know when pinned memory
> > is reclaimed by the kernel.  This is useful when a lower-level
> > library employs lazy deregistration strategies on memory regions that
> > are subsequently released to the kernel via the application's use of
> > munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> > a generalized approach in the kernel would certainly be welcome.
>
> The whole need for memory registration is a giant pain. There is no
> motivating application need for it---it is simply a hack around
> virtual memory and the lack of full VM support in current hardware.
> There are real hardware issues that interact poorly with virtual
> memory, as discussed previously in this thread.

Well, the registrations also exist to provide protection against rogue/faulty remotes, but for the purposes of MPI that is probably not important.

Here is a thought.. Some RDMA hardware can change the page tables on the fly. What if the kernel had a mechanism to dynamically maintain a full registration of the process's entire address space ('mlocked' but able to be migrated)?
MPI would never need to register a buffer, and all the messy cases with munmap/sbrk/etc go away - the risk is that other MPI nodes can randomly scribble all over the process :) Christoph: It seemed to me you were first talking about freeing/swapping/faulting RDMA'able pages - but would pure migration as a special hardware supported case be useful like Catilan suggested? Regards, Jason ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-14 0:01 ` Jason Gunthorpe @ 2008-02-27 22:11 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-27 22:11 UTC (permalink / raw) To: Jason Gunthorpe Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, kvm-devel, linux-mm, daniel.blueman, avi, general, Andrew Morton, Robin Holt On Wed, 13 Feb 2008, Jason Gunthorpe wrote: > Christoph: It seemed to me you were first talking about > freeing/swapping/faulting RDMA'able pages - but would pure migration > as a special hardware supported case be useful like Catilan suggested? That is a special case of the proposed solution. You could mlock the regions of interest. Those can then only be migrated but not swapped out. However, I think we need some limit on the number of pages one can mlock. Otherwise the VM can get into a situation where reclaim is not possible because the majority of memory is either mlocked or pinned by I/O etc. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 1:01 ` Christoph Lameter 2008-02-13 1:26 ` Jason Gunthorpe @ 2008-02-13 1:55 ` Christian Bell 2008-02-13 2:19 ` Christoph Lameter 1 sibling, 1 reply; 150+ messages in thread From: Christian Bell @ 2008-02-13 1:55 UTC (permalink / raw) To: Christoph Lameter Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel On Tue, 12 Feb 2008, Christoph Lameter wrote: > On Tue, 12 Feb 2008, Jason Gunthorpe wrote: > > > Well, certainly today the memfree IB devices store the page tables in > > host memory so they are already designed to hang onto packets during > > the page lookup over PCIE, adding in faulting makes this time > > larger. > > You really do not need a page table to use it. What needs to be maintained > is knowledge on both side about what pages are currently shared across > RDMA. If the VM decides to reclaim a page then the notification is used to > remove the remote entry. If the remote side then tries to access the page > again then the page fault on the remote side will stall until the local > page has been brought back. RDMA can proceed after both sides again agree > on that page now being sharable. HPC environments won't be amenable to a pessimistic approach of synchronizing before every data transfer. RDMA is assumed to be a low-level data movement mechanism that has no implied synchronization. In some parallel programming models, it's not uncommon to use RDMA to send 8-byte messages. It can be difficult to make and hold guarantees about in-memory pages when many concurrent RDMA operations are in flight (not uncommon in reasonably large machines). Some of the in-memory page information could be shared with some form of remote caching strategy but then it's a different problem with its own scalability challenges. 
I think there are very real potential clients of the interface when an optimistic approach is used. Part of the trick, however, has to do with being able to re-start transfers instead of buffering the data or making guarantees about delivery that could cause deadlock (as was alluded to earlier in this thread). InfiniBand is constrained in this regard since it requires message-ordering between endpoints (or queue pairs). One could argue that this is still possible with IB, at the cost of throwing more packets away when a referenced page is not in memory. With this approach, the worst-case demand paging scenario is met when the active working set of referenced pages is larger than the amount of physical memory -- but HPC applications are already bound by this anyway. You'll find that Quadrics has the most experience in this area and that their entire architecture is adapted to being optimistic about demand paging in RDMA transfers -- they've been maintaining a patchset to do this for years. . . christian ^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 1:55 ` Christian Bell @ 2008-02-13 2:19 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 2:19 UTC (permalink / raw) To: Christian Bell Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Christian Bell wrote:
> I think there are very real potential clients of the interface when an
> optimistic approach is used.  Part of the trick, however, has to do
> with being able to re-start transfers instead of buffering the data
> or making guarantees about delivery that could cause deadlock (as was
> alluded to earlier in this thread).  InfiniBand is constrained in
> this regard since it requires message-ordering between endpoints (or
> queue pairs).  One could argue that this is still possible with IB,
> at the cost of throwing more packets away when a referenced page is
> not in memory.  With this approach, the worst-case demand paging
> scenario is met when the active working set of referenced pages is
> larger than the amount of physical memory -- but HPC applications are
> already bound by this anyway.
>
> You'll find that Quadrics has the most experience in this area and
> that their entire architecture is adapted to being optimistic about
> demand paging in RDMA transfers -- they've been maintaining a patchset
> to do this for years.

The notifier patchset that we are discussing here was mostly inspired by their work. There is no need to restart transfers that you have never started in the first place. The remote side would never start a transfer if the page reference has been torn down. In order to start the transfer a fault handler on the remote side would have to setup the association between the memory on both ends again. ^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 22:41 ` [ofa-general] Re: Demand paging for memory regions Roland Dreier 2008-02-12 23:14 ` Felix Marti 2008-02-12 23:23 ` Jason Gunthorpe @ 2008-02-13 0:56 ` Christoph Lameter 2008-02-13 12:11 ` Christoph Raisch 3 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 0:56 UTC (permalink / raw) To: Roland Dreier Cc: steiner, Andrea Arcangeli, a.p.zijlstra, Steve Wise, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Roland Dreier wrote:
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general.  Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately.  It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

Well if the VM wants to invalidate a page then the remote end first has to remove its mapping. If a page has been removed then the remote end would encounter a fault and then would have to wait for the local end to reestablish its mapping before proceeding. So the packet would only be generated when both ends are in sync. ^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-12 22:41 ` [ofa-general] Re: Demand paging for memory regions Roland Dreier ` (2 preceding siblings ...) 2008-02-13 0:56 ` Christoph Lameter @ 2008-02-13 12:11 ` Christoph Raisch 2008-02-13 19:02 ` Christoph Lameter 3 siblings, 1 reply; 150+ messages in thread From: Christoph Raisch @ 2008-02-13 12:11 UTC (permalink / raw) To: Roland Dreier Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi, kvm-devel, general-bounces, daniel.blueman, Robin Holt, general, Andrew Morton, Christoph Lameter > > > Chelsio's T3 HW doesn't support this. For ehca we currently can't modify a large MR when it has been allocated. EHCA Hardware expects the pages to be there (MRs must not have "holes"). This is also true for the global MR covering all kernel space. Therefore we still need the memory to be "pinned" if ib_umem_get() is called. So with the current implementation we don't have much use for a notifier. "It is difficult to make predictions, especially about the future" Gruss / Regards Christoph Raisch + Hoang-Nam Nguyen ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [ofa-general] Re: Demand paging for memory regions 2008-02-13 12:11 ` Christoph Raisch @ 2008-02-13 19:02 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-13 19:02 UTC (permalink / raw) To: Christoph Raisch Cc: Rik van Riel, Andrea Arcangeli, a.p.zijlstra, linux-mm, izike, Roland Dreier, steiner, linux-kernel, avi, kvm-devel, general-bounces, daniel.blueman, Robin Holt, general, Andrew Morton

On Wed, 13 Feb 2008, Christoph Raisch wrote:
> For ehca we currently can't modify a large MR when it has been allocated.
> EHCA Hardware expects the pages to be there (MRs must not have "holes").
> This is also true for the global MR covering all kernel space.
> Therefore we still need the memory to be "pinned" if ib_umem_get() is
> called.

It cannot be freed and then reallocated? What happens when a process exits? ^ permalink raw reply	[flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6 2008-02-09 0:05 ` Christoph Lameter 2008-02-09 0:12 ` Roland Dreier @ 2008-02-09 0:12 ` Andrew Morton 2008-02-09 0:18 ` Christoph Lameter 1 sibling, 1 reply; 150+ messages in thread From: Andrew Morton @ 2008-02-09 0:12 UTC (permalink / raw) To: Christoph Lameter Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi, kvm-devel, daniel.blueman, Robin Holt, general On Fri, 8 Feb 2008 16:05:00 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote: > On Fri, 8 Feb 2008, Andrew Morton wrote: > > > You took it correctly, and I didn't understand the answer ;) > > We have done several rounds of discussion on linux-kernel about this so > far and the IB folks have not shown up to join in. I have tried to make > this as general as possible. infiniband would appear to be the major present in-kernel client of this new interface. So as a part of proving its usefulness, correctness, etc we should surely work on converting infiniband to use it, and prove its goodness. Quite possibly none of the infiniband developers even know about it.. ^ permalink raw reply [flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 0/6] MMU Notifiers V6 2008-02-09 0:12 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Andrew Morton @ 2008-02-09 0:18 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-09 0:18 UTC (permalink / raw) To: Andrew Morton Cc: andrea, a.p.zijlstra, linux-mm, izike, steiner, linux-kernel, avi, kvm-devel, daniel.blueman, Robin Holt, general On Fri, 8 Feb 2008, Andrew Morton wrote: > Quite possibly none of the infiniband developers even know about it.. Well Andrea's initial approach was even featured on LWN a couple of weeks back. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [patch 0/6] MMU Notifiers V6 2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter ` (6 preceding siblings ...) 2008-02-08 22:23 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Andrew Morton @ 2008-02-13 14:31 ` Jack Steiner 7 siblings, 0 replies; 150+ messages in thread From: Jack Steiner @ 2008-02-13 14:31 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman > GRU > - Simple additional hardware TLB (possibly covering multiple instances of > Linux) > - Needs TLB shootdown when the VM unmaps pages. > - Determines page address via follow_page (from interrupt context) but can > fall back to get_user_pages(). > - No page reference possible since no page status is kept.. I applied the latest mmuops patch to a 2.6.24 kernel & updated the GRU driver to use it. As far as I can tell, everything works ok. Although more testing is needed, all current tests of driver functionality are working on both a system simulator and a hardware simulator. The driver itself is still a few weeks from being ready to post but I can send code fragments of the portions related to mmuops or external TLB management if anyone is interested. --- jack ^ permalink raw reply [flat|nested] 150+ messages in thread
* [ofa-general] [patch 0/6] MMU Notifiers V7 @ 2008-02-15 6:48 Christoph Lameter 2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 6:48 UTC (permalink / raw) To: akpm Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus, Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman, Robin Holt, general

This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced by something other than what is tracked by the rmaps of the kernel (an external MMU). MMU notifiers allow us to get rid of the page pinning for RDMA and various other purposes. They get rid of the broken use of mlock for page pinning and avoid having to lock pages by increasing the refcount. (mlock really does *not* pin pages....)

More information on the rationale and the technical details can be found in the first patch and the README provided by that patch in Documentation/mmu_notifiers.

The known immediate users are

KVM
- Establishes a refcount to the page via get_user_pages().
- External references are called sptes.
- Has page tables to track pages whose refcount was elevated but no reverse maps.

GRU
- Simple additional hardware TLB (possibly covering multiple instances of Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can fall back to get_user_pages().
- No page reference possible since no page status is kept.

XPmem
- Allows use of a process's memory by remote instances of Linux.
- Provides its own reverse mappings to track remote ptes.
- Establishes refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being cleared.
Andrea's mmu_notifier #4 -> RFC V1
- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are called.
- Develop a patch sequence that separates out the different types of hooks so that we can review their use.
- Avoid adding an include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fix up aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that we are the only executing thread. This is also true for invalidate_all(), so we could drop off the mmu_notifier there early. Use hlist_del_init instead of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver holds off new references in between.
- Fix up filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip the cc list down a bit.
- Drop Peter's new rcu list macro
- Add a description to the core patch

V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it ourselves.
  In some contexts where we want to register/unregister we are already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply all patches for KVM and GRU.

V5->V6:
- Provide missing range callouts for mprotect
- Fix do_wp_page control path sequencing
- Clarify locking conventions
- GRU and XPmem confirmed to work with this patchset.
- Provide skeleton code for GRU/KVM type callbacks and for XPmem type.
- Rework documentation and put it into Documentation/mmu_notifier.

V6->V7:
- Code our own page table traversal in the skeletons so that we can perform the insertion of a remote pte under pte lock.
- Discuss page pinning by increasing the page refcount

--

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-02-15 6:48 [ofa-general] [patch 0/6] MMU Notifiers V7 Christoph Lameter @ 2008-02-15 6:49 ` Christoph Lameter 2008-02-19 8:54 ` [ofa-general] " Nick Piggin 2008-02-19 23:08 ` [ofa-general] " Nick Piggin 0 siblings, 2 replies; 150+ messages in thread From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw) To: akpm Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman, Robin Holt, general [-- Attachment #1: mmu_invalidate_range_callbacks --] [-- Type: text/plain, Size: 11465 bytes --] The invalidation of address ranges in a mm_struct needs to be performed when pages are removed or permissions etc change. If invalidate_range_begin() is called with locks held then we pass a flag into invalidate_range() to indicate that no sleeping is possible. Locks are only held for truncate and huge pages. In two cases we use invalidate_range_begin/end to invalidate single pages because the pair allows holding off new references (idea by Robin Holt). do_wp_page(): We hold off new references while we update the pte. xip_unmap: We are not taking the PageLock so we cannot use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end stands in. 
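The begin/end contract just described, seen from a hypothetical subsystem's side: tear down external references at `invalidate_range_begin`, hold off new ones until `invalidate_range_end`. All names below are invented for illustration; only the begin/end pairing and the atomic flag follow the patch description.

```c
#include <assert.h>

/* Hypothetical external-MMU (GRU/KVM/XPmem-style) driver state. */
struct ext_mmu {
	int blocked;	/* new external references are held off */
	int mapped;	/* external ptes currently established */
};

/* ->invalidate_range_begin: flush external ptes for the range and
 * hold off new references until the matching ..._end. */
void drv_invalidate_range_begin(struct ext_mmu *m,
				unsigned long start, unsigned long end,
				int atomic)
{
	(void)start; (void)end;
	m->blocked = 1;		/* refuse to establish new external ptes */
	m->mapped = 0;		/* flush external TLB / remote ptes */
	if (!atomic) {
		/* may sleep here, e.g. waiting for remote acks (XPmem) */
	}
}

/* ->invalidate_range_end: external references may be set up again. */
void drv_invalidate_range_end(struct ext_mmu *m,
			      unsigned long start, unsigned long end,
			      int atomic)
{
	(void)start; (void)end; (void)atomic;
	m->blocked = 0;
}

/* External fault path: may only establish a pte while not blocked;
 * otherwise the caller retries after invalidate_range_end. */
int drv_fault(struct ext_mmu *m)
{
	if (m->blocked)
		return -1;
	m->mapped++;
	return 0;
}
```

This is why the pair can stand in for invalidate_page in do_wp_page and xip_unmap: anything the external MMU loses at begin cannot be re-established until end, so the pte update in between is safe.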
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap_xip.c |    5 +++++
 mm/fremap.c      |    3 +++
 mm/hugetlb.c     |    3 +++
 mm/memory.c      |   35 +++++++++++++++++++++++++++++------
 mm/mmap.c        |    2 ++
 mm/mprotect.c    |    3 +++
 mm/mremap.c      |    7 ++++++-
 7 files changed, 51 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-02-14 18:45:07.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-02-14 18:45:07.000000000 -0800
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
 						vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm,
+						vma->vm_start, end, 0);
+
 	return 0;
 }
 
@@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range_end, mm, address, end, atomic);
 	return end;
 }
 
@@ -1339,7 +1351,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1373,6 +1385,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1380,6 +1393,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1463,10 +1477,11 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1474,6 +1489,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1614,8 +1630,10 @@ static int do_wp_page(struct mm_struct *
 			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
-				goto unlock;
+			if (!pte_same(*page_table, orig_pte)) {
+				pte_unmap_unlock(page_table, ptl);
+				goto check_dirty;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -1631,7 +1649,8 @@ static int do_wp_page(struct mm_struct *
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
-		goto unlock;
+		pte_unmap_unlock(page_table, ptl);
+		goto check_dirty;
 	}
 
 	/*
@@ -1653,6 +1672,8 @@ gotten:
 	if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
+	mmu_notifier(invalidate_range_begin, mm, address,
+				address + PAGE_SIZE, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1691,8 +1712,10 @@ gotten:
 		page_cache_release(new_page);
 	if (old_page)
 		page_cache_release(old_page);
-unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range_end, mm,
+				address, address + PAGE_SIZE, 0);
+check_dirty:
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);

Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-14 18:44:56.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-14 18:45:07.000000000 -0800
@@ -1748,11 +1748,13 @@ static void unmap_region(struct mm_struc
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 }
 
 /*

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-02-14 18:45:07.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -755,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_begin, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -775,6 +777,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
 	}

Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-02-14 18:45:07.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
@@ -190,6 +191,8 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		mmu_notifier(invalidate_range_begin, mm, address,
+					address + PAGE_SIZE, 1);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -201,6 +204,8 @@ __xip_unmap (struct address_space * mapp
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
 		}
+		mmu_notifier(invalidate_range_end, mm,
+					address, address + PAGE_SIZE, 1);
 	}
 	spin_unlock(&mapping->i_mmap_lock);
 }

Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/mremap.c	2008-02-14 18:45:07.000000000 -0800
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len)
 {
-	unsigned long extent, next, old_end;
+	unsigned long extent, next, old_start, old_end;
 	pmd_t *old_pmd, *new_pmd;
 
+	old_start = old_addr;
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	mmu_notifier(invalidate_range_begin, vma->vm_mm,
+					old_addr, old_end, 0);
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 				new_vma, new_pmd, new_addr);
 	}
+	mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0);
 
 	return len + old_addr - old_end;	/* how much done */
 }

Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/mprotect.c	2008-02-14 18:45:07.000000000 -0800
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
 		dirty_accountable = 1;
 	}
 
+
mmu_notifier(invalidate_range_begin, mm, start, end, 0); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier(invalidate_range_end, mm, start, end, 0); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; -- ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-15  6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-19  8:54   ` Nick Piggin
  2008-02-19 13:34     ` Andrea Arcangeli
  2008-02-19 23:08   ` [ofa-general] " Nick Piggin
  1 sibling, 1 reply; 150+ messages in thread
From: Nick Piggin @ 2008-02-19  8:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus,
	Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
>
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) have some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to "invalidate" ranges when switching to a _more_
permissive state? This stuff should basically be the same as (a subset
of) the TLB flushing API AFAIKS. Anything more is a pretty big burden to
put in the core VM. See my alternative patch I posted -- I can't see why
it won't work just like a TLB.
As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-19  8:54   ` [ofa-general] " Nick Piggin
@ 2008-02-19 13:34     ` Andrea Arcangeli
  2008-02-27 22:23       ` Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-19 13:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, general, akpm, Christoph Lameter

On Tue, Feb 19, 2008 at 07:54:14PM +1100, Nick Piggin wrote:
> As far as sleeping inside callbacks goes... I think there are big
> problems with the patch (the sleeping patch and the external rmap
> patch). I don't think it is workable in its current state. Either
> we have to make some big changes to the core VM, or we have to turn
> some locks into sleeping locks to do it properly AFAIKS. Neither
> one is good.

Agreed. The thing is quite simple: the moment we support xpmem, the
complexity in the mmu notifier patch starts and there are hacks,
duplicated functionality through the same xpmem callbacks etc... GRU
can already be 100% supported (in fact simpler and safer) with my
patch.

> But anyway, I don't really think the two approaches (Andrea's
> notifiers vs sleeping/xrmap) should be tangled up too much. I
> think Andrea's can possibly be quite unintrusive and useful very
> soon.

Yes, that's why I kept maintaining my patch and I posted the last
revision to Andrew. I use pte/tlb locking of the core VM, it's
unintrusive and obviously safe. Furthermore it can be extended with
Christoph's stuff in a 100% backwards compatible fashion later if
needed.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-19 13:34     ` Andrea Arcangeli
@ 2008-02-27 22:23       ` Christoph Lameter
  2008-02-27 23:57         ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Tue, 19 Feb 2008, Andrea Arcangeli wrote:

> Yes, that's why I kept maintaining my patch and I posted the last
> revision to Andrew. I use pte/tlb locking of the core VM, it's
> unintrusive and obviously safe. Furthermore it can be extended with
> Christoph's stuff in a 100% backwards compatible fashion later if needed.

How would that work? You rely on the pte locking. Thus calls are all in
an atomic context. I think we need a general scheme that allows sleeping
when references are invalidated. Even the GRU has performance issues
when using the KVM patch.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-27 22:23       ` Christoph Lameter
@ 2008-02-27 23:57         ` Andrea Arcangeli
  0 siblings, 0 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-27 23:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Wed, Feb 27, 2008 at 02:23:29PM -0800, Christoph Lameter wrote:
> How would that work? You rely on the pte locking. Thus calls are all in an

I don't rely on the pte locking in #v7, exactly to satisfy GRU (so far
purely theoretical) performance complaints.

> atomic context. I think we need a general scheme that allows sleeping when

Calls are still in atomic context until we change the i_mmap_lock to a
mutex under a CONFIG_XPMEM, or unless we boost mm_users, drop the lock
and restart the loop at every different mm. In any case those changes
should be under CONFIG_XPMEM IMHO, given desktop users definitely don't
need this (regular non-blocking mmu notifiers in my patch are all that a
desktop user needs as far as I can tell).

> references are invalidated. Even the GRU has performance issues when using
> the KVM patch.

GRU will perform the same with #v7 or V8.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-15  6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  2008-02-19  8:54   ` [ofa-general] " Nick Piggin
@ 2008-02-19 23:08   ` Nick Piggin
  2008-02-20  1:00     ` Andrea Arcangeli
  2008-02-27 22:35     ` [ofa-general] " Christoph Lameter
  1 sibling, 2 replies; 150+ messages in thread
From: Nick Piggin @ 2008-02-19 23:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus,
	Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).

> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
>
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
>  mm/filemap_xip.c |    5 +++++
>  mm/fremap.c      |    3 +++
>  mm/hugetlb.c     |    3 +++
>  mm/memory.c      |   35 +++++++++++++++++++++++++++++------
>  mm/mmap.c        |    2 ++
>  mm/mprotect.c    |    3 +++
>  mm/mremap.c      |    7 ++++++-
>  7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c	2008-02-14 18:43:31.000000000 -0800
> +++ linux-2.6/mm/fremap.c	2008-02-14 18:45:07.000000000 -0800
> @@ -15,6 +15,7 @@
>  #include <linux/rmap.h>
>  #include <linux/module.h>
>  #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
>  #include <asm/mmu_context.h>
>  #include <asm/cacheflush.h>
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
>  		spin_unlock(&mapping->i_mmap_lock);
>  	}
>
> +	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
>  	err = populate_range(mm, vma, start, size, pgoff);
> +	mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
>  	if (!err && !(flags & MAP_NONBLOCK)) {
>  		if (unlikely(has_write_lock)) {
>  			downgrade_write(&mm->mmap_sem);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c	2008-02-14 18:43:31.000000000 -0800
> +++ linux-2.6/mm/memory.c	2008-02-14 18:45:07.000000000 -0800
> @@ -51,6 +51,7 @@
>  #include <linux/init.h>
>  #include <linux/writeback.h>
>  #include <linux/memcontrol.h>
> +#include <linux/mmu_notifier.h>
>
>  #include <asm/pgalloc.h>
>  #include <asm/uaccess.h>
> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
>  	if (is_vm_hugetlb_page(vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> +	if (is_cow_mapping(vma->vm_flags))
> +		mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
> +
>  	dst_pgd = pgd_offset(dst_mm, addr);
>  	src_pgd = pgd_offset(src_mm, addr);
>  	do {
> @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
>  					vma, addr, next))
>  			return -ENOMEM;
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> +	if (is_cow_mapping(vma->vm_flags))
> +		mmu_notifier(invalidate_range_end, src_mm,
> +						vma->vm_start, end, 0);
> +
>  	return 0;
>  }
>
> @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
>  	struct mmu_gather *tlb;
>  	unsigned long end = address + size;
>  	unsigned long nr_accounted = 0;
> +	int atomic = details ? (details->i_mmap_lock != 0) : 0;
>
>  	lru_add_drain();
>  	tlb = tlb_gather_mmu(mm, 0);
>  	update_hiwater_rss(mm);
> +	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
>  	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
>  	if (tlb)
>  		tlb_finish_mmu(tlb, address, end);
> +	mmu_notifier(invalidate_range_end, mm, address, end, atomic);
>  	return end;
>  }
>

Where do you invalidate for munmap()?

Also, how do you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-19 23:08   ` [ofa-general] " Nick Piggin
@ 2008-02-20  1:00     ` Andrea Arcangeli
  2008-02-20  3:00       ` Robin Holt
  2008-02-27 22:39       ` Christoph Lameter
  1 sibling, 2 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-20  1:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, general, akpm, Christoph Lameter

On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> You can't sleep inside rcu_read_lock()!
>
> I must say that for a patch that is up to v8 or whatever and is
> posted twice a week to such a big cc list, it is kind of slack to
> not even test it and expect other people to review it.

Well, xpmem requirements are complex. As a side effect of the
simplicity of my approach, my patch is 100% safe since #v1. Now it
also works for GRU and it cluster invalidates.

> Also, what we are going to need here are not skeleton drivers
> that just do all the *easy* bits (of registering their callbacks),
> but actual fully working examples that do everything that any
> real driver will need to do. If not for the sanity of the driver

I've a fully working scenario for my patch; in fact I didn't post the
mmu notifier patch until I got KVM to swap 100% reliably, to be sure I
would post something that works well. mmu notifiers are already used
in KVM for:

1) 100% reliable and efficient swapping of guest physical memory
2) copy-on-writes of writeprotect faults after ksm page sharing of guest
   physical memory
3) ballooning using madvise to give the guest memory back to the host

My implementation is the most handy because it requires zero changes
to the ksm code too (no explicit mmu notifier calls after
ptep_clear_flush) and it's also 100% safe (no mess with schedules over
rcu_read_lock), no "atomic" parameters, and it doesn't open a window
where sptes have a view on older pages and the linux pte has a view on
newer pages (this can happen with remap_file_pages with my KVM swapping
patch to use V8 Christoph's patch).

> Also, how do you resolve the case where you are not allowed to sleep?
> I would have thought either you have to handle it, in which case nobody
> needs to sleep; or you can't handle it, in which case the code is
> broken.

I also asked exactly this, glad you reasked this too.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-20  1:00     ` Andrea Arcangeli
@ 2008-02-20  3:00       ` Robin Holt
  2008-02-20  3:11         ` Nick Piggin
  0 siblings, 1 reply; 150+ messages in thread
From: Robin Holt @ 2008-02-20  3:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm, Christoph Lameter

On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> > You can't sleep inside rcu_read_lock()!
> >
> > I must say that for a patch that is up to v8 or whatever and is
> > posted twice a week to such a big cc list, it is kind of slack to
> > not even test it and expect other people to review it.
>
> Well, xpmem requirements are complex. As a side effect of the
> simplicity of my approach, my patch is 100% safe since #v1. Now it
> also works for GRU and it cluster invalidates.
>
> > Also, what we are going to need here are not skeleton drivers
> > that just do all the *easy* bits (of registering their callbacks),
> > but actual fully working examples that do everything that any
> > real driver will need to do. If not for the sanity of the driver
>
> I've a fully working scenario for my patch; in fact I didn't post the
> mmu notifier patch until I got KVM to swap 100% reliably, to be sure I
> would post something that works well. mmu notifiers are already used
> in KVM for:
>
> 1) 100% reliable and efficient swapping of guest physical memory
> 2) copy-on-writes of writeprotect faults after ksm page sharing of guest
>    physical memory
> 3) ballooning using madvise to give the guest memory back to the host
>
> My implementation is the most handy because it requires zero changes
> to the ksm code too (no explicit mmu notifier calls after
> ptep_clear_flush) and it's also 100% safe (no mess with schedules over
> rcu_read_lock), no "atomic" parameters, and it doesn't open a window
> where sptes have a view on older pages and linux pte has view on newer
> pages (this can happen with remap_file_pages with my KVM swapping
> patch to use V8 Christoph's patch).
>
> > Also, how do you resolve the case where you are not allowed to sleep?
> > I would have thought either you have to handle it, in which case nobody
> > needs to sleep; or you can't handle it, in which case the code is
> > broken.
>
> I also asked exactly this, glad you reasked this too.

Currently, we BUG_ON having a PFN in our tables and not being able
to sleep. These are mappings which MPT has never supported in the past
and XPMEM was already not allowing page faults for VMAs which are not
anonymous so it should never happen. If the file-backed operations can
ever get changed to allow for sleeping and a customer has a need for it,
we would need to change XPMEM to allow those types of faults to succeed.

Thanks, Robin

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-20  3:00       ` Robin Holt
@ 2008-02-20  3:11         ` Nick Piggin
  0 siblings, 0 replies; 150+ messages in thread
From: Nick Piggin @ 2008-02-20  3:11 UTC (permalink / raw)
  To: Robin Holt
  Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, general, akpm, Christoph Lameter

On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> > > Also, how do you resolve the case where you are not allowed to sleep?
> > > I would have thought either you have to handle it, in which case nobody
> > > needs to sleep; or you can't handle it, in which case the code is
> > > broken.
> >
> > I also asked exactly this, glad you reasked this too.
>
> Currently, we BUG_ON having a PFN in our tables and not being able
> to sleep. These are mappings which MPT has never supported in the past
> and XPMEM was already not allowing page faults for VMAs which are not
> anonymous so it should never happen. If the file-backed operations can
> ever get changed to allow for sleeping and a customer has a need for it,
> we would need to change XPMEM to allow those types of faults to succeed.

Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-20  1:00     ` Andrea Arcangeli
  2008-02-20  3:00       ` Robin Holt
@ 2008-02-27 22:39       ` Christoph Lameter
  2008-02-28  0:38         ` Andrea Arcangeli
  1 sibling, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Wed, 20 Feb 2008, Andrea Arcangeli wrote:

> Well, xpmem requirements are complex. As a side effect of the
> simplicity of my approach, my patch is 100% safe since #v1. Now it
> also works for GRU and it cluster invalidates.

The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we
have a KVM-only solution that works 100% (which makes me just ignore the
rest of the argument, because 100% solutions usually do not exist).

> rcu_read_lock), no "atomic" parameters, and it doesn't open a window
> where sptes have a view on older pages and linux pte has view on newer
> pages (this can happen with remap_file_pages with my KVM swapping
> patch to use V8 Christoph's patch).

Ok, so you are now getting away from keeping the refcount elevated? That
was your design decision....

> > Also, how do you resolve the case where you are not allowed to sleep?
> > I would have thought either you have to handle it, in which case nobody
> > needs to sleep; or you can't handle it, in which case the code is
> > broken.
>
> I also asked exactly this, glad you reasked this too.

It would have helped if you would have repeated my answers that you had
already gotten before. You knew I was on vacation....

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-27 22:39       ` Christoph Lameter
@ 2008-02-28  0:38         ` Andrea Arcangeli
  0 siblings, 0 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-28  0:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Wed, Feb 27, 2008 at 02:39:46PM -0800, Christoph Lameter wrote:
> On Wed, 20 Feb 2008, Andrea Arcangeli wrote:
>
> > Well, xpmem requirements are complex. As a side effect of the
> > simplicity of my approach, my patch is 100% safe since #v1. Now it
> > also works for GRU and it cluster invalidates.
>
> The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we
> have a KVM-only solution that works 100% (which makes me just ignore the
> rest of the argument, because 100% solutions usually do not exist).

I only said 100% safe, I didn't imply anything other than it won't
crash the kernel ;). #v6 and #v7 only leave XPMEM out AFAIK, and that
can be supported later with a CONFIG_XPMEM that purely changes some VM
locking. #v7 also provides maximum performance to GRU.

> > rcu_read_lock), no "atomic" parameters, and it doesn't open a window
> > where sptes have a view on older pages and linux pte has view on newer
> > pages (this can happen with remap_file_pages with my KVM swapping
> > patch to use V8 Christoph's patch).
>
> Ok, so you are now getting away from keeping the refcount elevated? That
> was your design decision....

No, I'm not getting away from it. If I were getting away from it, I
would be forced to implement invalidate_range_begin. However even if I
don't get away from it, the fact I only implement invalidate_range_end,
and that it's called after the PT lock is dropped, opens a little window
with lost coherency (which may not be detectable by userland anyway).
But this little window is fine for KVM and it doesn't impose any
security risk. But clearly proving the locking safe becomes a bit more
complex in #v7 than in #v6.

> It would have helped if you would have repeated my answers that you had
> already gotten before. You knew I was on vacation....

I didn't remember the BUG_ON crystal clear, sorry, but I'm not sure why
you think it was your call; this was a lowlevel XPMEM question and Robin
promptly answered/reminded about it in fact.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-19 23:08   ` [ofa-general] " Nick Piggin
  2008-02-20  1:00     ` Andrea Arcangeli
@ 2008-02-27 22:35     ` Christoph Lameter
  2008-02-28  0:10       ` Christoph Lameter
  ` (2 more replies)
  1 sibling, 3 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus,
	Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Wed, 20 Feb 2008, Nick Piggin wrote:

> On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> > The invalidation of address ranges in a mm_struct needs to be
> > performed when pages are removed or permissions etc change.
> >
> > If invalidate_range_begin() is called with locks held then we
> > pass a flag into invalidate_range() to indicate that no sleeping is
> > possible. Locks are only held for truncate and huge pages.
>
> You can't sleep inside rcu_read_lock()!

Could you be specific? This refers to page migration? Hmmm... Guess we
would need to inc the refcount there instead?

> I must say that for a patch that is up to v8 or whatever and is
> posted twice a week to such a big cc list, it is kind of slack to
> not even test it and expect other people to review it.

It was tested with the GRU and XPmem. Andrea also reported success.

> Also, what we are going to need here are not skeleton drivers
> that just do all the *easy* bits (of registering their callbacks),
> but actual fully working examples that do everything that any
> real driver will need to do. If not for the sanity of the driver
> writer, then for the sanity of the VM developers (I don't want
> to have to understand xpmem or infiniband in order to understand
> how the VM works).

There are 3 different drivers that can already use it but the code is
complex and not easy to review. Skeletons are easy to allow people to
get started with it.

> >  	lru_add_drain();
> >  	tlb = tlb_gather_mmu(mm, 0);
> >  	update_hiwater_rss(mm);
> > +	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
> >  	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> >  	if (tlb)
> >  		tlb_finish_mmu(tlb, address, end);
> > +	mmu_notifier(invalidate_range_end, mm, address, end, atomic);
> >  	return end;
> >  }
> >
>
> Where do you invalidate for munmap()?

zap_page_range() called from unmap_vmas().

> Also, how do you resolve the case where you are not allowed to sleep?
> I would have thought either you have to handle it, in which case nobody
> needs to sleep; or you can't handle it, in which case the code is
> broken.

That can be done in a variety of ways:

1. Change VM locking

2. Not handle file backed mappings (XPmem could work mostly in such a
   config)

3. Keep the refcount elevated until pages are freed in another execution
   context.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-27 22:35 ` [ofa-general] " Christoph Lameter
@ 2008-02-28  0:10   ` Christoph Lameter
  2008-02-28  0:11   ` [ofa-general] " Andrea Arcangeli
  2008-03-03  5:11   ` Nick Piggin
  2 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-28 0:10 UTC (permalink / raw)
To: Nick Piggin
Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Wed, 27 Feb 2008, Christoph Lameter wrote:

> Could you be specific? This refers to page migration? Hmmm... Guess we
> would need to inc the refcount there instead?

Argh. No, it's the callback list scanning. Yuck. No one noticed.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-27 22:35 ` [ofa-general] " Christoph Lameter
  2008-02-28  0:10   ` Christoph Lameter
@ 2008-02-28  0:11   ` Andrea Arcangeli
  2008-02-28  0:14     ` Christoph Lameter
  2008-03-03  5:11   ` Nick Piggin
  2 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Wed, Feb 27, 2008 at 02:35:59PM -0800, Christoph Lameter wrote:

> Could you be specific? This refers to page migration? Hmmm... Guess we

If the reader schedules, synchronize_rcu() will return on the other
cpu, the objects in the list will be freed and overwritten, and when
the task is scheduled back in, it'll follow dangling pointers... You
can't use RCU if you want any of your invalidate methods to schedule.
Otherwise it's like having zero locking.

> 2. Not handle file backed mappings (XPmem could work mostly in such a
>    config)

IMHO that fits under your definition of "hacking something in now and
then having to modify it later".

> 3. Keep the refcount elevated until pages are freed in another
>    execution context.

Page refcount is not enough (the mmu_notifier_release will run on
another cpu the moment after i_mmap_lock is unlocked), but mm_users
may prevent us from changing the i_mmap_lock to a mutex. It'll slow
down truncate, though, as it'll have to drop the lock and restart the
radix tree walk every time, so a change like this better fits under a
separate CONFIG_XPMEM IMHO.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28  0:11 ` [ofa-general] " Andrea Arcangeli
@ 2008-02-28  0:14   ` Christoph Lameter
  2008-02-28  0:52     ` [ofa-general] " Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-28 0:14 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

> > 3. Keep the refcount elevated until pages are freed in another
> >    execution context.
>
> Page refcount is not enough (the mmu_notifier_release will run on
> another cpu the moment after i_mmap_lock is unlocked), but mm_users
> may prevent us from changing the i_mmap_lock to a mutex. It'll slow
> down truncate, though, as it'll have to drop the lock and restart the
> radix tree walk every time, so a change like this better fits under a
> separate CONFIG_XPMEM IMHO.

Erm. This would also be needed by RDMA etc.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28  0:14 ` Christoph Lameter
@ 2008-02-28  0:52   ` Andrea Arcangeli
  2008-02-28  1:03     ` Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:

> Erm. This would also be needed by RDMA etc.

The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
to schedule inside the invalidate methods AFAIK, so I doubt the above
is true. It'd be interesting to know if IB is like Quadrics and also
doesn't require blocking to invalidate certain remote mappings.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28  0:52 ` [ofa-general] " Andrea Arcangeli
@ 2008-02-28  1:03   ` Christoph Lameter
  2008-02-28  1:10     ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-28 1:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

> On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
> > Erm. This would also be needed by RDMA etc.
>
> The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
> to schedule inside the invalidate methods AFAIK, so I doubt the above
> is true. It'd be interesting to know if IB is like Quadrics and also
> doesn't require blocking to invalidate certain remote mappings.

RDMA works across a network and I would assume that it needs
confirmation that a connection has been torn down before pages can be
unmapped.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28  1:03 ` Christoph Lameter
@ 2008-02-28  1:10   ` Andrea Arcangeli
  2008-02-28 18:43     ` [ofa-general] " Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 1:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:

> RDMA works across a network and I would assume that it needs
> confirmation that a connection has been torn down before pages can be
> unmapped.

Depends on the latency of the network; for example, with page pinning
it can even try to reduce the wait time by tearing down the mapping in
range_begin and spin-waiting for the ack only later in range_end.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28  1:10 ` Andrea Arcangeli
@ 2008-02-28 18:43   ` Christoph Lameter
  2008-02-29  0:55     ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-28 18:43 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

> On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:
> > RDMA works across a network and I would assume that it needs
> > confirmation that a connection has been torn down before pages can
> > be unmapped.
>
> Depends on the latency of the network; for example, with page pinning
> it can even try to reduce the wait time by tearing down the mapping in
> range_begin and spin-waiting for the ack only later in range_end.

What about invalidate_page()?
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-28 18:43 ` [ofa-general] " Christoph Lameter
@ 2008-02-29  0:55   ` Andrea Arcangeli
  2008-02-29  0:59     ` [ofa-general] " Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 0:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:

> What about invalidate_page()?

That would just spin waiting for an ack (just like the
smp-tlb-flushing invalidates on numa already do).

Thinking more about this, we could also parallelize it with an
invalidate_page_before/end. If it takes 1usec to flush remotely,
scheduling would be overkill, but spending 1usec in a while loop isn't
nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
if it makes sense... it certainly would be quick to add it (especially
thanks to _notify ;).
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29  0:55 ` Andrea Arcangeli
@ 2008-02-29  0:59   ` Christoph Lameter
  2008-02-29 13:13     ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 0:59 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:
> > What about invalidate_page()?
>
> That would just spin waiting for an ack (just like the
> smp-tlb-flushing invalidates on numa already do).

And thus the device driver may stop receiving data on a UP system? It
will never get the ack.

> Thinking more about this, we could also parallelize it with an
> invalidate_page_before/end. If it takes 1usec to flush remotely,
> scheduling would be overkill, but spending 1usec in a while loop isn't
> nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
> if it makes sense... it certainly would be quick to add it (especially
> thanks to _notify ;).

invalidate_page_before/end could be realized as an
invalidate_range_begin/end on a page sized range?
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29  0:59 ` [ofa-general] " Christoph Lameter
@ 2008-02-29 13:13   ` Andrea Arcangeli
  2008-02-29 19:55     ` [ofa-general] " Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 13:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:

> And thus the device driver may stop receiving data on a UP system? It
> will never get the ack.

Not sure I follow, sorry.

My idea was:

	post the invalidate in the mmio region of the device
	smp_call_function()
	while (mmio device wait-bitflag is on);

Instead of the current:

	smp_call_function()
	post the invalidate in the mmio region of the device
	while (mmio device wait-bitflag is on);

To decrease the wait loop time.

> invalidate_page_before/end could be realized as an
> invalidate_range_begin/end on a page sized range?

If we go this route, once you add support to xpmem, you'll have to
make the anon_vma lock a mutex too; that would be fine with me though.
The main reason invalidate_page exists is to allow you to leave it
non-sleep-capable even after you make invalidate_range sleep-capable,
and to implement the mmu_rmap_notifiers as sleep-capable in all the
paths where invalidate_page would be called. That was the strategy you
had in your patch. I'll try to drop invalidate_page. I wonder if then
you won't need the mmu_rmap_notifiers anymore.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 13:13 ` Andrea Arcangeli
@ 2008-02-29 19:55   ` Christoph Lameter
  2008-02-29 20:17     ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 19:55 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:
> > And thus the device driver may stop receiving data on a UP system?
> > It will never get the ack.
>
> Not sure I follow, sorry.
>
> My idea was:
>
> 	post the invalidate in the mmio region of the device
> 	smp_call_function()
> 	while (mmio device wait-bitflag is on);

So the device driver on UP can only operate through interrupts? If you
are hogging the only cpu then driver operations may not be possible.

> > invalidate_page_before/end could be realized as an
> > invalidate_range_begin/end on a page sized range?
>
> If we go this route, once you add support to xpmem, you'll have to
> make the anon_vma lock a mutex too; that would be fine with me though.
> The main reason invalidate_page exists is to allow you to leave it
> non-sleep-capable even after you make invalidate_range sleep-capable,
> and to implement the mmu_rmap_notifiers as sleep-capable in all the
> paths where invalidate_page would be called. That was the strategy you
> had in your patch. I'll try to drop invalidate_page. I wonder if then
> you won't need the mmu_rmap_notifiers anymore.

I am mainly concerned with making the mmu notifier a generally useful
feature for multiple users. XPmem is one example of a different type of
callback user. It is not the gold standard that you make it out to be.
RDMA is another, and there are likely scores of others (DMA engines
etc) once it becomes clear that such a feature is available. In general
the mmu notifier will allow us to fix the problems caused by memory
pinning and mlock by various devices and other mechanisms that need to
directly access memory.

And yes, I would like to get rid of the mmu_rmap_notifiers altogether.
It would be much cleaner with just one mmu_notifier that can sleep in
all functions.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 19:55 ` [ofa-general] " Christoph Lameter
@ 2008-02-29 20:17   ` Andrea Arcangeli
  2008-02-29 21:03     ` [ofa-general] " Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 20:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, Feb 29, 2008 at 11:55:17AM -0800, Christoph Lameter wrote:

> > 	post the invalidate in the mmio region of the device
> > 	smp_call_function()
> > 	while (mmio device wait-bitflag is on);
>
> So the device driver on UP can only operate through interrupts? If you
> are hogging the only cpu then driver operations may not be possible.

There was no irq involved in the above pseudocode; the irq, if
anything, would run on the remote system. Still, irqs can run fine
during the while loop, just as they run fine on top of
smp_call_function. The send-irq and the following spin-on-a-bitflag
work exactly like smp_call_function, except that it isn't a NUMA cpu
being invalidated.

> And yes, I would like to get rid of the mmu_rmap_notifiers altogether.
> It would be much cleaner with just one mmu_notifier that can sleep in
> all functions.

Agreed. I just thought xpmem needed an invalidate-by-page, but I'm
glad if xpmem can go in sync with the KVM/GRU/DRI model in this regard.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 20:17 ` Andrea Arcangeli
@ 2008-02-29 21:03   ` Christoph Lameter
  2008-02-29 21:23     ` Andrea Arcangeli
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> Agreed. I just thought xpmem needed an invalidate-by-page, but I'm
> glad if xpmem can go in sync with the KVM/GRU/DRI model in this
> regard.

That means we need both the anon_vma locks and the i_mmap_lock to
become semaphores. I think semaphores are better than mutexes. Rik and
Lee saw some performance improvements because the lists can be
traversed in parallel when the anon_vma lock is switched to be a rw
lock.

Sounds like we get to a conceptually clean version here?
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 21:03 ` [ofa-general] " Christoph Lameter
@ 2008-02-29 21:23   ` Andrea Arcangeli
  2008-02-29 21:29     ` Christoph Lameter
  2008-02-29 21:34     ` Christoph Lameter
  0 siblings, 2 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 21:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:

> That means we need both the anon_vma locks and the i_mmap_lock to
> become semaphores. I think semaphores are better than mutexes. Rik and
> Lee saw some performance improvements because the lists can be
> traversed in parallel when the anon_vma lock is switched to be a rw
> lock.

The improvement was with a rw spinlock IIRC, so I don't see how it's
related to this.

Perhaps the rwlock spinlock can be changed to a rw semaphore without
measurable overscheduling in the fast path. However, theoretically
speaking the rw_lock spinlock is more efficient than a rw semaphore in
case of a little contention during the page fault fast path, because
the critical section is just a list_add, so it'd be overkill to
schedule while waiting. That's why currently it's a spinlock (or rw
spinlock).

> Sounds like we get to a conceptually clean version here?

I don't have a strong opinion on whether it should become a semaphore
unconditionally or only with CONFIG_XPMEM=y. But keep in mind
preempt-rt runs quite a bit slower, or we could rip spinlocks out of
the kernel in the first place ;)
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 21:23 ` Andrea Arcangeli
@ 2008-02-29 21:29   ` Christoph Lameter
  0 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> I don't have a strong opinion on whether it should become a semaphore
> unconditionally or only with CONFIG_XPMEM=y. But keep in mind
> preempt-rt runs quite a bit slower, or we could rip spinlocks out of
> the kernel in the first place ;)

Do you just skip people's comments on the mmu_notifier? It took a
reminder from me before you took note of Andrew's comments. And I just
responded on the XPmem issue this morning. Again, for the gazillionth
time: there will be no CONFIG_XPMEM, because the functionality needs
to be generic and not XPMEM-specific.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 21:23 ` Andrea Arcangeli
  2008-02-29 21:29   ` Christoph Lameter
@ 2008-02-29 21:34   ` Christoph Lameter
  2008-02-29 21:48     ` [ofa-general] " Andrea Arcangeli
  1 sibling, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:34 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
> > That means we need both the anon_vma locks and the i_mmap_lock to
> > become semaphores. I think semaphores are better than mutexes. Rik
> > and Lee saw some performance improvements because the lists can be
> > traversed in parallel when the anon_vma lock is switched to be a rw
> > lock.
>
> The improvement was with a rw spinlock IIRC, so I don't see how it's
> related to this.

AFAICT the rw semaphore fastpath is similar in performance to a rw
spinlock.

> Perhaps the rwlock spinlock can be changed to a rw semaphore without
> measurable overscheduling in the fast path. However theoretically

Overscheduling? You mean overhead?

> speaking the rw_lock spinlock is more efficient than a rw semaphore in
> case of a little contention during the page fault fast path because
> the critical section is just a list_add so it'd be overkill to
> schedule while waiting. That's why currently it's a spinlock (or rw
> spinlock).

On the other hand a semaphore puts the process to sleep and may
actually improve performance because there is less time spent in a busy
loop. Other processes may do something useful and we stay off the
contended cacheline, reducing traffic on the interconnect.

> preempt-rt runs quite a bit slower, or we could rip spinlocks out of
> the kernel in the first place ;)

The question is why that is the case, and it seems that there are
issues with interrupt on/off that are important here and particularly
significant with the SLAB allocator (significant hacks there to deal
with that issue). The fastpath that we have in the works for SLUB may
address a large part of that issue because it no longer relies on
disabling interrupts.
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 21:34 ` Christoph Lameter
@ 2008-02-29 21:48   ` Andrea Arcangeli
  2008-02-29 22:12     ` Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 21:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Izik Eidus,
    Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, Feb 29, 2008 at 01:34:34PM -0800, Christoph Lameter wrote:

> On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> > The improvement was with a rw spinlock IIRC, so I don't see how it's
> > related to this.
>
> AFAICT the rw semaphore fastpath is similar in performance to a rw
> spinlock.

The read side is taken in the slow path. The write side is taken in the
fast path. The pagefault is the fast path; the VM during swapping is
the slow path.

> > Perhaps the rwlock spinlock can be changed to a rw semaphore without
> > measurable overscheduling in the fast path. However theoretically
>
> Overscheduling? You mean overhead?

The only possible overhead that a rw semaphore could ever generate vs
a rw lock is overscheduling.

> > speaking the rw_lock spinlock is more efficient than a rw semaphore
> > in case of a little contention during the page fault fast path
> > because the critical section is just a list_add so it'd be overkill
> > to schedule while waiting. That's why currently it's a spinlock (or
> > rw spinlock).
>
> On the other hand a semaphore puts the process to sleep and may
> actually improve performance because there is less time spent in a
> busy loop. Other processes may do something useful and we stay off the
> contended cacheline, reducing traffic on the interconnect.

Yes, that's the positive side; the negative side is that you'll put
the task in uninterruptible sleep and call schedule() and require a
wakeup, because a list_add taking <1usec is running on the other cpu.
No other downside. But that's the only reason it's a spinlock right
now; in fact there can't be any other reason.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-29 21:48 ` [ofa-general] " Andrea Arcangeli
@ 2008-02-29 22:12   ` Christoph Lameter
  0 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-02-29 22:12 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, steiner, Peter Zijlstra, linux-mm, Kanoj Sarcar,
    Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
    daniel.blueman, Robin Holt, general, akpm

On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

> > AFAICT the rw semaphore fastpath is similar in performance to a rw
> > spinlock.
>
> The read side is taken in the slow path.

Slowpath meaning the VM slowpath or the lock slow path? It seems that
the rwsem read side path is pretty efficient:

static inline void __down_read(struct rw_semaphore *sem)
{
	__asm__ __volatile__(
		"# beginning down_read\n\t"
LOCK_PREFIX	"  incl      (%%eax)\n\t"
		/* adds 0x00000001, returns the old value */
		"  jns        1f\n"
		"  call call_rwsem_down_read_failed\n"
		"1:\n\t"
		"# ending down_read\n\t"
		: "+m" (sem->count)
		: "a" (sem)
		: "memory", "cc");
}

> The write side is taken in the fast path.
>
> The pagefault is the fast path; the VM during swapping is the slow
> path.

Not sure what you are saying here. A pagefault should be considered a
fast path, and swapping is not performance critical?

> > > Perhaps the rwlock spinlock can be changed to a rw semaphore
> > > without measurable overscheduling in the fast path. However
> > > theoretically
> >
> > Overscheduling? You mean overhead?
>
> The only possible overhead that a rw semaphore could ever generate vs
> a rw lock is overscheduling.

Ok, too many calls to schedule() because the slow path (of the
semaphore) is taken?

> > On the other hand a semaphore puts the process to sleep and may
> > actually improve performance because there is less time spent in a
> > busy loop. Other processes may do something useful and we stay off
> > the contended cacheline, reducing traffic on the interconnect.
>
> Yes, that's the positive side; the negative side is that you'll put
> the task in uninterruptible sleep and call schedule() and require a
> wakeup, because a list_add taking <1usec is running on the other cpu.
> No other downside. But that's the only reason it's a spinlock right
> now; in fact there can't be any other reason.

But that only happens in the contended case. Certainly a spinlock is
better for a 2p system, but the more processors contend for the lock
(and the longer the holdoff is, typical for systems with 4p or 8p or
more) the better a semaphore will work.
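The x86 fastpath quoted above is just an atomic increment plus a sign test: readers add 1 and fall through unless a writer has pushed the count negative. A portable C11 rendering of the same idea (a simplified model of the rwsem count encoding, with the slowpath stubbed out, not the real queueing code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified model of the __down_read fastpath above: readers add 1;
 * the count is negative only when a writer holds or waits for the lock
 * (writers apply a large negative bias).  The real slowpath queues and
 * sleeps; here it is merely counted. */
#define RWSEM_WRITER_BIAS (-0x10000L)

struct rwsem { atomic_long count; };

static int slowpath_calls;

static void rwsem_down_read_slowpath(struct rwsem *sem) {
    (void)sem;
    slowpath_calls++;         /* real code would block/queue here */
}

static void down_read(struct rwsem *sem) {
    /* "lock incl (%eax); jns 1f": increment, slowpath if negative */
    if (atomic_fetch_add(&sem->count, 1) + 1 < 0)
        rwsem_down_read_slowpath(sem);
}

static void up_read(struct rwsem *sem) {
    atomic_fetch_sub(&sem->count, 1);
}
```

The uncontended reader thus pays one locked add and a predicted-not-taken branch, which is why the fastpath is comparable to a rw spinlock; the argument in the thread is entirely about what happens once the branch is taken.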
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-02-27 22:35 ` [ofa-general] " Christoph Lameter 2008-02-28 0:10 ` Christoph Lameter 2008-02-28 0:11 ` [ofa-general] " Andrea Arcangeli @ 2008-03-03 5:11 ` Nick Piggin 2008-03-03 19:28 ` [ofa-general] " Christoph Lameter 2 siblings, 1 reply; 150+ messages in thread From: Nick Piggin @ 2008-03-03 5:11 UTC (permalink / raw) To: Christoph Lameter Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Kanoj Sarcar, Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman, Robin Holt, general, akpm On Thursday 28 February 2008 09:35, Christoph Lameter wrote: > On Wed, 20 Feb 2008, Nick Piggin wrote: > > On Friday 15 February 2008 17:49, Christoph Lameter wrote: > > Also, what we are going to need here are not skeleton drivers > > that just do all the *easy* bits (of registering their callbacks), > > but actual fully working examples that do everything that any > > real driver will need to do. If not for the sanity of the driver > > writer, then for the sanity of the VM developers (I don't want > > to have to understand xpmem or infiniband in order to understand > > how the VM works). > > There are 3 different drivers that can already use it but the code is > complex and not easy to review. Skeletons are easy to allow people to get > started with it. 
Your skeleton is just registering notifiers and saying /* you fill the hard part in */ If somebody needs a skeleton in order just to register the notifiers, then almost by definition they are unqualified to write the hard part ;) > > > lru_add_drain(); > > > tlb = tlb_gather_mmu(mm, 0); > > > update_hiwater_rss(mm); > > > + mmu_notifier(invalidate_range_begin, mm, address, end, atomic); > > > end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); > > > if (tlb) > > > tlb_finish_mmu(tlb, address, end); > > > + mmu_notifier(invalidate_range_end, mm, address, end, atomic); > > > return end; > > > } > > > > Where do you invalidate for munmap()? > > zap_page_range() called from unmap_vmas(). But it is not allowed to sleep. Where do you call the sleepable one from? > > Also, how to you resolve the case where you are not allowed to sleep? > > I would have thought either you have to handle it, in which case nobody > > needs to sleep; or you can't handle it, in which case the code is > > broken. > > That can be done in a variety of ways: > > 1. Change VM locking > > 2. Not handle file backed mappings (XPmem could work mostly in such a > config) > > 3. Keep the refcount elevated until pages are freed in another execution > context. OK, there are ways to solve it or hack around it. But this is exactly why I think the implementations should be kept seperate. Andrea's notifiers are coherent, work on all types of mappings, and will hopefully match closely the regular TLB invalidation sequence in the Linux VM (at the moment it is quite close, but I hope to make it a bit closer) so that it requires almost no changes to the mm. All the other things to try to make it sleep are either hacking holes in it (eg by removing coherency). So I don't think it is reasonable to require that any patch handle all cases. I actually think Andrea's patch is quite nice and simple itself, wheras I am against the patches that you posted. What about a completely different approach... 
XPmem runs over NUMAlink, right? Why not provide some non-sleeping way
to basically IPI remote nodes over the NUMAlink where they can process
the invalidation? If your intra-node cache coherency has to run over
this link anyway, then presumably it is capable.

Or another idea, why don't you LD_PRELOAD in the MPT library to also
intercept munmap, mprotect, mremap etc as well as just fork()? That
would give you similarly "good enough" coherency as the mmu notifier
patches except that you can't swap (which Robin said was not a big
problem).

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [ofa-general] Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-03-03  5:11           ` Nick Piggin
@ 2008-03-03 19:28             ` Christoph Lameter
  2008-03-03 19:50               ` Nick Piggin
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-03-03 19:28 UTC (permalink / raw)
To: Nick Piggin
Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus,
	Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Mon, 3 Mar 2008, Nick Piggin wrote:

> Your skeleton is just registering notifiers and saying
>
> /* you fill the hard part in */
>
> If somebody needs a skeleton in order just to register the notifiers,
> then almost by definition they are unqualified to write the hard
> part ;)

It's also providing a locking scheme.

> OK, there are ways to solve it or hack around it. But this is exactly
> why I think the implementations should be kept separate. Andrea's
> notifiers are coherent, work on all types of mappings, and will
> hopefully match closely the regular TLB invalidation sequence in the
> Linux VM (at the moment it is quite close, but I hope to make it a
> bit closer) so that it requires almost no changes to the mm.

Then put it into the arch code for TLB invalidation. Paravirt ops gives
good examples on how to do that.

> What about a completely different approach... XPmem runs over NUMAlink,
> right? Why not provide some non-sleeping way to basically IPI remote
> nodes over the NUMAlink where they can process the invalidation? If your
> intra-node cache coherency has to run over this link anyway, then
> presumably it is capable.

There is another Linux instance at the remote end that first has to
remove its own ptes. Also would not work for Infiniband and other
solutions. All the approaches that require evictions in an atomic
context are limiting the approach and do not allow the generic
functionality that we want in order to not add alternate APIs for this.
> Or another idea, why don't you LD_PRELOAD in the MPT library to also
> intercept munmap, mprotect, mremap etc as well as just fork()? That
> would give you similarly "good enough" coherency as the mmu notifier
> patches except that you can't swap (which Robin said was not a big
> problem).

The good enough solution right now is to pin pages by elevating
refcounts.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-03-03 19:28             ` [ofa-general] " Christoph Lameter
@ 2008-03-03 19:50               ` Nick Piggin
  2008-03-04 18:58                 ` Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Nick Piggin @ 2008-03-03 19:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

On Tuesday 04 March 2008 06:28, Christoph Lameter wrote:
> On Mon, 3 Mar 2008, Nick Piggin wrote:
> > Your skeleton is just registering notifiers and saying
> >
> > /* you fill the hard part in */
> >
> > If somebody needs a skeleton in order just to register the notifiers,
> > then almost by definition they are unqualified to write the hard
> > part ;)
>
> It's also providing a locking scheme.

Not the full locking scheme. If you have a look at the real code
required to do it, it is non-trivial.

> > OK, there are ways to solve it or hack around it. But this is exactly
> > why I think the implementations should be kept separate. Andrea's
> > notifiers are coherent, work on all types of mappings, and will
> > hopefully match closely the regular TLB invalidation sequence in the
> > Linux VM (at the moment it is quite close, but I hope to make it a
> > bit closer) so that it requires almost no changes to the mm.
>
> Then put it into the arch code for TLB invalidation. Paravirt ops gives
> good examples on how to do that.

Put what into arch code?

> > What about a completely different approach... XPmem runs over NUMAlink,
> > right? Why not provide some non-sleeping way to basically IPI remote
> > nodes over the NUMAlink where they can process the invalidation? If your
> > intra-node cache coherency has to run over this link anyway, then
> > presumably it is capable.
>
> There is another Linux instance at the remote end that first has to
> remove its own ptes.
Yeah, what's the problem?

> Also would not work for Infiniband and other
> solutions.

Infiniband doesn't want it. Other solutions is just handwaving,
because if we don't know what the other solutions are, then we can't
make any sort of informed choices.

> All the approaches that require evictions in an atomic context
> are limiting the approach and do not allow the generic functionality that
> we want in order to not add alternate APIs for this.

The only generic way to do this that I have seen (and the only proposed
way that doesn't add alternate APIs for that matter) is turning VM locks
into sleeping locks. In which case, Andrea's notifiers will work just
fine (except for relatively minor details like rcu list scanning).

So I don't see what you're arguing for. There is no requirement that we
support sleeping notifiers in the same patch as non-sleeping ones.
Considering the simplicity of the non-sleeping notifiers and the
problems with sleeping ones, I think it is pretty clear that they are
different beasts (unless VM locking is changed).

> > Or another idea, why don't you LD_PRELOAD in the MPT library to also
> > intercept munmap, mprotect, mremap etc as well as just fork()? That
> > would give you similarly "good enough" coherency as the mmu notifier
> > patches except that you can't swap (which Robin said was not a big
> > problem).
>
> The good enough solution right now is to pin pages by elevating
> refcounts.

Which kind of leads to the question of why do you need any further
kernel patches if that is good enough?

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-03-03 19:50               ` Nick Piggin
@ 2008-03-04 18:58                 ` Christoph Lameter
  2008-03-05  0:52                   ` Nick Piggin
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-03-04 18:58 UTC (permalink / raw)
To: Nick Piggin
Cc: steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, kvm-devel,
	daniel.blueman, Robin Holt, general, akpm

On Tue, 4 Mar 2008, Nick Piggin wrote:

> > Then put it into the arch code for TLB invalidation. Paravirt ops gives
> > good examples on how to do that.
>
> Put what into arch code?

The mmu notifier code.

> > > What about a completely different approach... XPmem runs over NUMAlink,
> > > right? Why not provide some non-sleeping way to basically IPI remote
> > > nodes over the NUMAlink where they can process the invalidation? If your
> > > intra-node cache coherency has to run over this link anyway, then
> > > presumably it is capable.
> >
> > There is another Linux instance at the remote end that first has to
> > remove its own ptes.
>
> Yeah, what's the problem?

The remote end has to invalidate the page which involves locking etc.

> > Also would not work for Infiniband and other
> > solutions.
>
> Infiniband doesn't want it. Other solutions is just handwaving,
> because if we don't know what the other solutions are, then we can't
> make any sort of informed choices.

We need a solution in general to avoid the pinning problems. Infiniband
has those too.

> > All the approaches that require evictions in an atomic context
> > are limiting the approach and do not allow the generic functionality
> > that we want in order to not add alternate APIs for this.
>
> The only generic way to do this that I have seen (and the only proposed
> way that doesn't add alternate APIs for that matter) is turning VM locks
> into sleeping locks.
> In which case, Andrea's notifiers will work just
> fine (except for relatively minor details like rcu list scanning).

No they won't. As you pointed out, the callbacks need RCU locking.

> > The good enough solution right now is to pin pages by elevating
> > refcounts.
>
> Which kind of leads to the question of why do you need any further
> kernel patches if that is good enough?

Well it's good enough with severe problems during reclaim, livelocks
etc. One could improve on that scheme through Rik's work trying to add
a new page flag that marks pinned pages and then keep them off the LRUs
and limit their number. Having pinned pages would limit the ability of
the VM to reclaim and make page migration, memory unplug etc impossible.

It is better to have a notifier scheme that allows telling a device
driver to free up the memory it has mapped.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-03-04 18:58                 ` Christoph Lameter
@ 2008-03-05  0:52                   ` Nick Piggin
  0 siblings, 0 replies; 150+ messages in thread
From: Nick Piggin @ 2008-03-05  0:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

On Wednesday 05 March 2008 05:58, Christoph Lameter wrote:
> On Tue, 4 Mar 2008, Nick Piggin wrote:
> > > Then put it into the arch code for TLB invalidation. Paravirt ops gives
> > > good examples on how to do that.
> >
> > Put what into arch code?
>
> The mmu notifier code.

It isn't arch specific.

> > > > What about a completely different approach... XPmem runs over
> > > > NUMAlink, right? Why not provide some non-sleeping way to basically
> > > > IPI remote nodes over the NUMAlink where they can process the
> > > > invalidation? If your intra-node cache coherency has to run over this
> > > > link anyway, then presumably it is capable.
> > >
> > > There is another Linux instance at the remote end that first has to
> > > remove its own ptes.
> >
> > Yeah, what's the problem?
>
> The remote end has to invalidate the page which involves locking etc.

I don't see what the problem is.

> > > Also would not work for Infiniband and other
> > > solutions.
> >
> > Infiniband doesn't want it. Other solutions is just handwaving,
> > because if we don't know what the other solutions are, then we can't
> > make any sort of informed choices.
>
> We need a solution in general to avoid the pinning problems. Infiniband
> has those too.
>
> > > All the approaches that require evictions in an atomic context
> > > are limiting the approach and do not allow the generic functionality
> > > that we want in order to not add alternate APIs for this.
> >
> > The only generic way to do this that I have seen (and the only proposed
> > way that doesn't add alternate APIs for that matter) is turning VM locks
> > into sleeping locks. In which case, Andrea's notifiers will work just
> > fine (except for relatively minor details like rcu list scanning).
>
> No they won't. As you pointed out, the callbacks need RCU locking.

That can be fixed easily.

> > > The good enough solution right now is to pin pages by elevating
> > > refcounts.
> >
> > Which kind of leads to the question of why do you need any further
> > kernel patches if that is good enough?
>
> Well it's good enough with severe problems during reclaim, livelocks etc.
> One could improve on that scheme through Rik's work trying to add a new
> page flag that marks pinned pages and then keep them off the LRUs and
> limit their number. Having pinned pages would limit the ability of the
> VM to reclaim and make page migration, memory unplug etc impossible.

Well not impossible. You could have a callback to invalidate the remote
TLB and drop the pin on a given page.

> It is better to have a notifier scheme that allows telling a device
> driver to free up the memory it has mapped.

Yeah, it would be nice for those people with clusters of Altixes.
Doesn't mean it has to go upstream, though.

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 0/6] [RFC] MMU Notifiers V3
@ 2008-01-30  2:29 Christoph Lameter
  2008-01-30  2:29 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-30  2:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

This is a patchset implementing MMU notifier callbacks based on
Andrea's earlier work. These are needed if Linux pages are referenced
from something else than tracked by the rmaps of the kernel.

The known immediate users are

KVM	(establishes a refcount to the page. External references called spte)
GRU	(simple TLB shootdown without refcount. Has its own pagetable/tlb)
XPmem	(uses its own reverse mappings and refcount. Remote ptes, needs to
	sleep when sending messages)

Issues:

- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU.
  Early tests with the GRU were successful.

- Pages may be freed before the external mappings are torn down through
  invalidate_range() if no refcount on the page is taken. There is the
  chance that page content may be visible after the pages have been
  reallocated (mainly an issue for the GRU that takes no refcount).

- invalidate_range() callbacks are sometimes called under i_mmap_lock.
  These need to be dealt with or XPmem needs to be able to work around
  them.

- filemap_xip.c does not follow conventions for rmap callbacks. We could
  depend on XIP support not being active to avoid the issue.

Things that we leave as is:

- RCU quiescent periods are required on registering and unregistering
  notifiers to guarantee visibility to other processors. Currently only
  mmu_notifier_release() does the correct thing.
  It is up to the user to provide RCU quiescent periods for
  register/unregister functions if they are called outside of the
  ->release method.

Andrea's mmu_notifier #4 -> RFC V1:

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are called.
- Develop a patch sequence that separates out the different types of hooks
  so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:

- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock is held.
- Add invalidate_all()

V2->V3:

- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in
  do_wp_page and sys_remap_file_pages() after the pte clearing.

--

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
@ 2008-01-30  2:29 ` Christoph Lameter
  0 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-01-30  2:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 4940 bytes --]

The invalidation of address ranges in a mm_struct needs to be performed
when pages are removed or permissions etc change.

Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but no
spinlocks are active. If invalidate_range() is called with locks held
then we pass a flag into invalidate_range().

Comments state that mmap_sem must be held for remap_pfn_range() but
various drivers do not seem to do this.
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
Signed-off-by: Robin Holt <holt-sJ/iWh9BUns@public.gmane.org>
Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

---
 mm/fremap.c  |    2 ++
 mm/hugetlb.c |    2 ++
 mm/memory.c  |   11 +++++++++--
 mm/mmap.c    |    1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-29 16:59:24.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -212,6 +213,7 @@ asmlinkage long sys_remap_file_pages(uns
 	}
 
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range, mm, start, start + size, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-29 16:59:24.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL) : 0));
 	return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1669,6 +1674,8 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range, mm, address,
+		address + PAGE_SIZE - 1, 0);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-29 16:56:36.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-29 16:58:15.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-29 16:58:15.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);

--

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 0/6] [RFC] MMU Notifiers V2
@ 2008-01-28 20:28 Christoph Lameter
  2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

This is a patchset implementing MMU notifier callbacks based on
Andrea's earlier work. These are needed if Linux pages are referenced
from something else than tracked by the rmaps of the kernel.

Issues:

- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU

- RCU quiescent periods are required on registering and unregistering
  notifiers to guarantee visibility to other processors. Currently only
  mmu_notifier_release() does the correct thing. It is up to the user
  to provide RCU quiescent periods for register/unregister functions if
  they are called outside of the ->release method.

Andrea's mmu_notifier #4 -> RFC V1:

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are called.
- Develop a patch sequence that separates out the different types of hooks
  so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:

- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock is held.
- Add invalidate_all()

--

^ permalink raw reply	[flat|nested] 150+ messages in thread
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  [not found]   ` <20080128202923.849058104-sJ/iWh9BUns@public.gmane.org>
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 4970 bytes --]

The invalidation of address ranges in a mm_struct needs to be performed
when pages are removed or permissions etc change.

Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but no
spinlocks are active. If invalidate_range() is called with locks held
then we pass a flag into invalidate_range().

Comments state that mmap_sem must be held for remap_pfn_range() but
various drivers do not seem to do this.
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
Signed-off-by: Robin Holt <holt-sJ/iWh9BUns@public.gmane.org>
Signed-off-by: Christoph Lameter <clameter-sJ/iWh9BUns@public.gmane.org>

---
 mm/fremap.c  |    2 ++
 mm/hugetlb.c |    2 ++
 mm/memory.c  |   11 +++++++++--
 mm/mmap.c    |    1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-25 19:32:49.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-25 19:32:49.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL) : 0));
 	return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1639,8 @@ gotten:
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
+	mmu_notifier(invalidate_range, mm, address,
+		address + PAGE_SIZE - 1, 0);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-25 19:32:49.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-25 19:33:58.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-25 19:34:13.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);

--

^ permalink raw reply	[flat|nested] 150+ messages in thread
[parent not found: <20080128202923.849058104-sJ/iWh9BUns@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080128202923.849058104-sJ/iWh9BUns@public.gmane.org>
@ 2008-01-29 16:20   ` Andrea Arcangeli
  [not found]     ` <20080129162004.GL7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 16:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:42PM -0800, Christoph Lameter wrote:
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c	2008-01-25 19:31:05.000000000 -0800
> +++ linux-2.6/mm/fremap.c	2008-01-25 19:32:49.000000000 -0800
> @@ -15,6 +15,7 @@
>  #include <linux/rmap.h>
>  #include <linux/module.h>
>  #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
> 
>  #include <asm/mmu_context.h>
>  #include <asm/cacheflush.h>
> @@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
>  		spin_unlock(&mapping->i_mmap_lock);
>  	}
> 
> +	mmu_notifier(invalidate_range, mm, start, start + size, 0);
>  	err = populate_range(mm, vma, start, size, pgoff);

How can it be right to invalidate_range _before_ ptep_clear_flush?

> @@ -1634,6 +1639,8 @@ gotten:
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> +	mmu_notifier(invalidate_range, mm, address,
> +		address + PAGE_SIZE - 1, 0);
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {

What's the point of invalidate_range when the size is PAGE_SIZE? And
how can it be right to invalidate_range _before_ ptep_clear_flush?
^ permalink raw reply	[flat|nested] 150+ messages in thread
[parent not found: <20080129162004.GL7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129162004.GL7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 18:28 ` Andrea Arcangeli [not found] ` <20080129182831.GS7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-29 19:55 ` Christoph Lameter 1 sibling, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 18:28 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins Christoph, the below patch should fix the current leak of the pinned pages. I hope the page-pin that should be dropped by the invalidate_range op is enough to prevent the "physical page" mapped on that "mm+address" from changing before invalidate_range returns. If that would ever happen, there would be a coherency loss between the guest VM writes and the writes coming from userland on the same mm+address from a different thread (qemu, whatever). invalidate_page before the PT lock was obviously safe. Now we entirely rely on the pin to prevent the page from changing before invalidate_range returns. If the pte is unmapped and the page is mapped back in with a minor fault that's ok, as long as the physical page remains the same for that mm+address, until all sptes are gone. 
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org> diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + err = populate_range(mm, vma, start, size, pgoff); mmu_notifier(invalidate_range, mm, start, start + size, 0); - err = populate_range(mm, vma, start, size, pgoff); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1639,8 +1639,6 @@ gotten: /* * Re-check the pte - we dropped the lock */ - mmu_notifier(invalidate_range, mm, address, - address + PAGE_SIZE - 1, 0); page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if (likely(pte_same(*page_table, orig_pte))) { if (old_page) { @@ -1676,6 +1674,8 @@ gotten: page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + mmu_notifier(invalidate_range, mm, address, + address + PAGE_SIZE - 1, 0); if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); ^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <20080129182831.GS7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129182831.GS7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 20:30 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291219030.25629-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 20:30 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > diff --git a/mm/fremap.c b/mm/fremap.c > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns > spin_unlock(&mapping->i_mmap_lock); > } > > + err = populate_range(mm, vma, start, size, pgoff); > mmu_notifier(invalidate_range, mm, start, start + size, 0); > - err = populate_range(mm, vma, start, size, pgoff); > if (!err && !(flags & MAP_NONBLOCK)) { > if (unlikely(has_write_lock)) { > downgrade_write(&mm->mmap_sem); We invalidate the range *after* populating it? Isn't it okay to establish references while populate_range() runs? > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1639,8 +1639,6 @@ gotten: > /* > * Re-check the pte - we dropped the lock > */ > - mmu_notifier(invalidate_range, mm, address, > - address + PAGE_SIZE - 1, 0); > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > if (likely(pte_same(*page_table, orig_pte))) { > if (old_page) { What we did is to invalidate the page (?!) before taking the pte lock. In the lock we replace the pte to point to another page. This means that we need to clear stale information. So we zap it before. 
If another reference is established after taking the spinlock then the pte contents have changed and the critical section fails. Before the critical section starts we have gotten an extra refcount on the original page so the page cannot vanish from under us. > @@ -1676,6 +1674,8 @@ gotten: > page_cache_release(old_page); > unlock: > pte_unmap_unlock(page_table, ptl); > + mmu_notifier(invalidate_range, mm, address, > + address + PAGE_SIZE - 1, 0); > if (dirty_page) { > if (vma->vm_file) > file_update_time(vma->vm_file); Now we invalidate the page after the transaction is complete. This means external pte can persist while we change the pte? Possibly even dirty the page? ^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <Pine.LNX.4.64.0801291219030.25629-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291219030.25629-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-29 21:36 ` Andrea Arcangeli [not found] ` <20080129213604.GW7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 21:36 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 12:30:06PM -0800, Christoph Lameter wrote: > On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > > diff --git a/mm/fremap.c b/mm/fremap.c > > --- a/mm/fremap.c > > +++ b/mm/fremap.c > > @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns > > spin_unlock(&mapping->i_mmap_lock); > > } > > > > + err = populate_range(mm, vma, start, size, pgoff); > > mmu_notifier(invalidate_range, mm, start, start + size, 0); > > - err = populate_range(mm, vma, start, size, pgoff); > > if (!err && !(flags & MAP_NONBLOCK)) { > > if (unlikely(has_write_lock)) { > > downgrade_write(&mm->mmap_sem); > > We invalidate the range *after* populating it? Isnt it okay to establish > references while populate_range() runs? It's not ok because that function can very well overwrite existing and present ptes (it's actually the nonlinear common case fast path for db). With your code the sptes created between invalidate_range and populate_range, will keep pointing forever to the old physical page instead of the newly populated one. I'm also asking myself if it's a smp race not to call mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at in install_file_pte. 
Probably not because the guest VM running in a different thread would need to serialize outside the install_file_pte code with the task running install_file_pte, if it wants to be sure to write either all its data to the old or the new page. Certainly doing the invalidate_page inside the PT lock was obviously safe but I hope this is safe and this can accommodate your needs too. > > diff --git a/mm/memory.c b/mm/memory.c > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -1639,8 +1639,6 @@ gotten: > > /* > > * Re-check the pte - we dropped the lock > > */ > > - mmu_notifier(invalidate_range, mm, address, > > - address + PAGE_SIZE - 1, 0); > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > if (likely(pte_same(*page_table, orig_pte))) { > > if (old_page) { > > What we did is to invalidate the page (?!) before taking the pte lock. In > the lock we replace the pte to point to another page. This means that we > need to clear stale information. So we zap it before. If another reference > is established after taking the spinlock then the pte contents have > changed at the cirtical section fails. > > Before the critical section starts we have gotten an extra refcount on the > original page so the page cannot vanish from under us. The problem is the missing invalidate_page/range _after_ ptep_clear_flush. If a spte is built between invalidate_range and pte_offset_map_lock, it will remain pointing to the old page forever. Nothing will be called to invalidate that stale spte built between invalidate_page/range and ptep_clear_flush. This is why for the last few days I kept saying the mmu notifiers have to be invoked _after_ ptep_clear_flush and never before (remember the export notifier?). 
No idea how you can deal with this in your code; certainly for KVM sptes that's a backwards and unworkable ordering of operations (exactly as backwards as doing the tlb flush before pte_clear in ptep_clear_flush; think of the spte as a tlb: you can't flush the tlb before clearing/updating the pte or it's smp unsafe). > > @@ -1676,6 +1674,8 @@ gotten: > > page_cache_release(old_page); > > unlock: > > pte_unmap_unlock(page_table, ptl); > > + mmu_notifier(invalidate_range, mm, address, > > + address + PAGE_SIZE - 1, 0); > > if (dirty_page) { > > if (vma->vm_file) > > file_update_time(vma->vm_file); > > Now we invalidate the page after the transaction is complete. This means > external pte can persist while we change the pte? Possibly even dirty the > page? Yes, and the only reason this can be safe is for the reason explained at the top of the email: if the other cpu wants to serialize to be sure to write in the "new" page, it has to serialize with the page-fault, but to serialize it has to wait for the page fault to return (example: we're not going to call futex code until the page fault returns). ^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <20080129213604.GW7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129213604.GW7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 21:53 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291343530.26824-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 21:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > We invalidate the range *after* populating it? Isnt it okay to establish > > references while populate_range() runs? > > It's not ok because that function can very well overwrite existing and > present ptes (it's actually the nonlinear common case fast path for > db). With your code the sptes created between invalidate_range and > populate_range, will keep pointing forever to the old physical page > instead of the newly populated one. Seems though that the mmap_sem is taken for regular vmas writably and will hold off new mappings. > I'm also asking myself if it's a smp race not to call > mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at > in install_file_pte. Probably not because the guest VM running in a > different thread would need to serialize outside the install_file_pte > code with the task running install_file_pte, if it wants to be sure to > write either all its data to the old or the new page. Certainly doing > the invalidate_page inside the PT lock was obviously safe but I hope > this is safe and this can accommodate your needs too. But that would be doing two invalidates on one pte. One range and one page invalidate. 
> > > diff --git a/mm/memory.c b/mm/memory.c > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -1639,8 +1639,6 @@ gotten: > > > /* > > > * Re-check the pte - we dropped the lock > > > */ > > > - mmu_notifier(invalidate_range, mm, address, > > > - address + PAGE_SIZE - 1, 0); > > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > > if (likely(pte_same(*page_table, orig_pte))) { > > > if (old_page) { > > > > What we did is to invalidate the page (?!) before taking the pte lock. In > > the lock we replace the pte to point to another page. This means that we > > need to clear stale information. So we zap it before. If another reference > > is established after taking the spinlock then the pte contents have > > changed at the cirtical section fails. > > > > Before the critical section starts we have gotten an extra refcount on the > > original page so the page cannot vanish from under us. > > The problem is the missing invalidate_page/range _after_ > ptep_clear_flush. If a spte is built between invalidate_range and > pte_offset_map_lock, it will remain pointing to the old page > forever. Nothing will be called to invalidate that stale spte built > between invalidate_page/range and ptep_clear_flush. This is why for > the last few days I kept saying the mmu notifiers have to be invoked > _after_ ptep_clear_flush and never before (remember the export > notifier?). No idea how you can deal with this in your code, certainly > for KVM sptes that's backwards and unworkable ordering of operation > (exactly as backwards are doing the tlb flush before pte_clear in > ptep_clear_flush, think spte as a tlb, you can't flush the tlb before > clearing/updating the pte or it's smp unsafe). Hmmm... So we could only do an invalidate_page here? Drop the strange invalidate_range()? 
> > > > @@ -1676,6 +1674,8 @@ gotten: > > > page_cache_release(old_page); > > > unlock: > > > pte_unmap_unlock(page_table, ptl); > > > + mmu_notifier(invalidate_range, mm, address, > > > + address + PAGE_SIZE - 1, 0); > > > if (dirty_page) { > > > if (vma->vm_file) > > > file_update_time(vma->vm_file); > > > > Now we invalidate the page after the transaction is complete. This means > > external pte can persist while we change the pte? Possibly even dirty the > > page? > > Yes, and the only reason this can be safe is for the reason explained > at the top of the email, if the other cpu wants to serialize to be > sure to write in the "new" page, it has to serialize with the > page-fault but to serialize it has to wait the page fault to return > (example: we're not going to call futex code until the page fault > returns). Serialize how? mmap_sem? ^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <Pine.LNX.4.64.0801291343530.26824-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291343530.26824-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-29 22:35 ` Andrea Arcangeli [not found] ` <20080129223503.GY7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 22:35 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 01:53:05PM -0800, Christoph Lameter wrote: > On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > > > We invalidate the range *after* populating it? Isnt it okay to establish > > > references while populate_range() runs? > > > > It's not ok because that function can very well overwrite existing and > > present ptes (it's actually the nonlinear common case fast path for > > db). With your code the sptes created between invalidate_range and > > populate_range, will keep pointing forever to the old physical page > > instead of the newly populated one. > > Seems though that the mmap_sem is taken for regular vmas writably and will > hold off new mappings. It's taken writable due to the code being inefficient the first time; all later times remap_populate_range overwrites ptes with the mmap_sem in readonly mode (finally rightfully so). The first remap_file_pages I guess it's irrelevant to optimize; the whole point of nonlinear is to call remap_file_pages zillions of times on the same vma, overwriting present ptes the whole time, so if the first time the mutex is not readonly it probably doesn't make a difference. get_user_pages invoked by the kvm spte-fault can happen between invalidate_range and populate_range. 
If it can't happen, for sure nobody pointed out a good reason why it can't happen. The kvm page faults as well rightfully only takes the mmap_sem in readonly mode, so get_user_pages is only called internally to gfn_to_page with the readonly semaphore. With my approach ptep_clear_flush was not only invalidating sptes after ptep_clear_flush, but it was also invalidating them inside the PT lock, so it was totally obvious there could be no race vs get_user_pages. > > I'm also asking myself if it's a smp race not to call > > mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at > > in install_file_pte. Probably not because the guest VM running in a > > different thread would need to serialize outside the install_file_pte > > code with the task running install_file_pte, if it wants to be sure to > > write either all its data to the old or the new page. Certainly doing > > the invalidate_page inside the PT lock was obviously safe but I hope > > this is safe and this can accommodate your needs too. > > But that would be doing two invalidates on one pte. One range and one page > invalidate. Yes, but it would have been micro-optimized later if you really cared, by simply changing ptep_clear_flush to __ptep_clear_flush, no big deal. Definitely all methods must be robust about them being called multiple times, even if the rmap finds no spte mapping such host virtual address. > Hmmm... So we could only do an invalidate_page here? Drop the strange > invalidate_range()? That's a question you should answer. > > > > @@ -1676,6 +1674,8 @@ gotten: > > > > page_cache_release(old_page); > > > > unlock: > > > > pte_unmap_unlock(page_table, ptl); > > > > + mmu_notifier(invalidate_range, mm, address, > > > > + address + PAGE_SIZE - 1, 0); > > > > if (dirty_page) { > > > > if (vma->vm_file) > > > > file_update_time(vma->vm_file); > > > > > > Now we invalidate the page after the transaction is complete. This means > > > external pte can persist while we change the pte? 
Possibly even dirty the > > > page? > > > > Yes, and the only reason this can be safe is for the reason explained > > at the top of the email, if the other cpu wants to serialize to be > > sure to write in the "new" page, it has to serialize with the > > page-fault but to serialize it has to wait the page fault to return > > (example: we're not going to call futex code until the page fault > > returns). > > Serialize how? mmap_sem? No, that's a different angle. But now I think there may be an issue with a third thread that may make the removal of invalidate_page from ptep_clear_flush unsafe. A third thread writing to a page through the linux-pte and the guest VM writing to the same page through the sptes, will be writing on the same physical page concurrently and using a userspace spinlock w/o ever entering the kernel. With your patch that does invalidate_range after dropping the PT lock, the third thread may start writing on the new page, when the guest is still writing to the old page through the sptes. While this couldn't happen with my patch. So really in the light of the third thread, it seems your approach is smp racy and ptep_clear_flush should call invalidate_page as the last thing before returning. My patch was enforcing that ptep_clear_flush would stop the third thread in a linux page fault, and drop the spte, before the new mapping could be instantiated in both the linux pte and in the sptes. The PT lock provided the needed serialization. This ensured the third thread and the guest VM would always write on the same physical page even if the first thread runs a flood of remap_file_pages on that same page moving it around the pagecache. So it seems I found an unfixable smp race in pretending to invalidate in a sleeping place. Perhaps you want to change the PT lock to a mutex instead of a spinlock; that may be your only chance to sleep while maintaining 100% memory coherency with threads. 
^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <20080129223503.GY7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129223503.GY7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 22:55 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291440170.27327-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 22:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > But now I think there may be an issue with a third thread that may > show unsafe the removal of invalidate_page from ptep_clear_flush. > > A third thread writing to a page through the linux-pte and the guest > VM writing to the same page through the sptes, will be writing on the > same physical page concurrently and using an userspace spinlock w/o > ever entering the kernel. With your patch that invalidate_range after > dropping the PT lock, the third thread may start writing on the new > page, when the guest is still writing to the old page through the > sptes. While this couldn't happen with my patch. A user space spinlock plays into this??? That is irrelevant to the kernel. And we are discussing "your" placement of the invalidate_range not mine. This is the scenario that I described before. You just need two threads. One thread is in do_wp_page and the other is writing through the spte. We are in do_wp_page. Meaning the page is not writable. The writer will have to take a fault which will properly serialize access. It's a bug if the spte would allow write. 
^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <Pine.LNX.4.64.0801291440170.27327-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291440170.27327-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-29 23:43 ` Andrea Arcangeli [not found] ` <20080129234353.GZ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 23:43 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 02:55:56PM -0800, Christoph Lameter wrote: > On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > > But now I think there may be an issue with a third thread that may > > show unsafe the removal of invalidate_page from ptep_clear_flush. > > > > A third thread writing to a page through the linux-pte and the guest > > VM writing to the same page through the sptes, will be writing on the > > same physical page concurrently and using an userspace spinlock w/o > > ever entering the kernel. With your patch that invalidate_range after > > dropping the PT lock, the third thread may start writing on the new > > page, when the guest is still writing to the old page through the > > sptes. While this couldn't happen with my patch. > > A user space spinlock plays into this??? That is irrelevant to the kernel. > And we are discussing "your" placement of the invalidate_range not mine. With "my" code, invalidate_range wasn't placed there at all; my modification to ptep_clear_flush already covered it in an automatic way. Grep for the word fremap in my latest patch: you won't find it, like you won't find any change to do_wp_page. Not sure why you keep thinking I added those invalidate_range when in fact you did. 
The user space spinlock also plays into declaring rdtscp unworkable to provide a monotone vgettimeofday w/o kernel locking. My patch, by calling invalidate_page inside ptep_clear_flush, guaranteed that both the thread writing through sptes and the thread writing through linux ptes couldn't possibly simultaneously write to two different physical pages. Your patch allows the thread writing through the linux-pte to write to a newly populated page while the old thread writing through sptes still writes to the old page. Is that safe? I don't know for sure. The fact the physical page backing the virtual address could change back and forth perhaps invalidates the theory that somebody could possibly do some useful locking out of it relying on all threads seeing the same physical page at the same time. Anyway as long as invalidate_page/range happens after ptep_clear_flush things are mostly ok. > This is the scenario that I described before. You just need two threads. > One thread is in do_wp_page and the other is writing through the spte. > We are in do_wp_page. Meaning the page is not writable. The writer will Actually above I was describing remap_file_pages not do_wp_page. > have to take fault which will properly serialize access. It a bug if the > spte would allow write. In that scenario, because write is forbidden (unlike remap_file_pages), like you said things should be ok. The spte reader will eventually see the updates happening in the new page, as long as the spte invalidate happens after ptep_clear_flush (i.e. with my incremental fix applied to your code, or with my latest patch). ^ permalink raw reply [flat|nested] 150+ messages in thread
[parent not found: <20080129234353.GZ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129234353.GZ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 0:34 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 0:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > A user space spinlock plays into this??? That is irrelevant to the kernel. > > And we are discussing "your" placement of the invalidate_range not mine. > > With "my" code, invalidate_range wasn't placed there at all, my > modification to ptep_clear_flush already covered it in a automatic > way, grep from the word fremap in my latest patch you won't find it, > like you won't find any change to do_wp_page. Not sure why you keep > thinking I added those invalidate_range when infact you did. Well you moved the code at minimum. Hmmm... according to http://marc.info/?l=linux-kernel&m=120114755620891&w=2 it was Robin. > The user space spinlock plays also in declaring rdtscp unworkable to > provide a monotone vgettimeofday w/o kernel locking. No idea what you are talking about. > My patch by calling invalidate_page inside ptep_clear_flush guaranteed > that both the thread writing through sptes and the thread writing > through linux ptes, couldn't possibly simultaneously write to two > different physical pages. But then the ptep_clear_flush will issue invalidate_page() for ranges that were already covered by invalidate_range(). There are multiple calls to clear the same spte. > > Your patch allows the thread writing through linux-pte to write to a > new populated page while the old thread writing through sptes still > writes to the old page. 
Is that safe? I don't know for sure. The fact > the physical page backing the virtual address could change back and > forth, perhaps invalidates the theory that somebody could possibly do > some useful locking out of it relaying on all threads seeing the same > physical page at the same time. This is referring to the remap issue not do_wp_page right? > Actually above I was describing remap_file_pages not do_wp_page. Ok. The serialization of remap_file_pages does not seem that critical since we only take a read lock on mmap_sem here. There may already be concurrent access to pages from other processors while the ptes are remapped. So there is already some overlap. We could take mmap_sem there writably and keep it writably for the case that we have an mmu notifier in the mm. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129162004.GL7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-29 18:28 ` Andrea Arcangeli @ 2008-01-29 19:55 ` Christoph Lameter 2008-01-29 21:17 ` Andrea Arcangeli 1 sibling, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 19:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > + mmu_notifier(invalidate_range, mm, address, > > + address + PAGE_SIZE - 1, 0); > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > if (likely(pte_same(*page_table, orig_pte))) { > > if (old_page) { > > What's the point of invalidate_range when the size is PAGE_SIZE? And > how can it be right to invalidate_range _before_ ptep_clear_flush? I am not sure. AFAICT you wrote that code. It seems to be okay to invalidate range if you hold mmap_sem writably. In that case no additional faults can happen that would create new ptes. ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-01-29 19:55 ` Christoph Lameter @ 2008-01-29 21:17 ` Andrea Arcangeli [not found] ` <20080129211759.GV7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 21:17 UTC (permalink / raw) To: Christoph Lameter Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, Jan 29, 2008 at 11:55:10AM -0800, Christoph Lameter wrote: > I am not sure. AFAICT you wrote that code. Actually I didn't need to change a single line in do_wp_page because ptep_clear_flush was already doing everything transparently for me. This was the memory.c part of my last patch I posted, it only touches zap_page_range, remap_pfn_range and apply_to_page_range. diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range, mm, address, end); return end; } @@ -1317,7 +1318,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, start, end); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1441,7 +1443,7 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); @@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, 
addr != end); + mmu_notifier(invalidate_range, mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); > It seems to be okay to invalidate range if you hold mmap_sem writably. In > that case no additional faults can happen that would create new ptes. In that place the mmap_sem is taken but in readonly mode. I never rely on the mmap_sem in the mmu notifier methods. Not invoking the notifier before releasing the PT lock adds quite some uncertainty on the smp safety of the spte invalidates, because the pte may be unmapped and remapped by a minor fault before invalidate_range is invoked, but I didn't figure out a kernel crashing race yet thanks to the pin we take through get_user_pages (and only thanks to it). The requirement is that invalidate_range is invoked after the last ptep_clear_flush or it leaks pins that's why I had to move it at the end. ^ permalink raw reply [flat|nested] 150+ messages in thread
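Andrea's ordering requirement — invalidate_range must run after the last ptep_clear_flush or a pin leaks — can be illustrated with a toy model. The real race is concurrent; here it is sequentialized into the worst-case interleaving, and every name is hypothetical, not kernel code:

```c
#include <stddef.h>

/* Toy model of the pin-leak argument: a secondary MMU holds one refcount
 * ("pin") per spte, and a get_user_pages-style fault can re-create the
 * spte from any pte that is still visible. */
struct toy_page { int refcount; };
struct toy_as  { struct toy_page *pte; struct toy_page *spte; };

/* secondary MMU drops its spte and its pin when notified */
void invalidate_range(struct toy_as *as)
{
    if (as->spte) {
        as->spte->refcount--;
        as->spte = NULL;
    }
}

/* gup-style minor fault: re-creates the spte from a still-visible pte */
void remote_fault(struct toy_as *as)
{
    if (as->pte && !as->spte) {
        as->spte = as->pte;
        as->spte->refcount++;   /* takes a new pin */
    }
}

void ptep_clear_flush(struct toy_as *as) { as->pte = NULL; }

/* Notifier BEFORE the pte clear: a fault in the window re-pins the page,
 * and nothing ever drops that pin again. Returns 1 if a spte survived. */
int unmap_notify_first(struct toy_as *as)
{
    invalidate_range(as);
    remote_fault(as);           /* pte still visible: spte comes back */
    ptep_clear_flush(as);
    return as->spte != NULL;
}

/* Notifier AFTER the last ptep_clear_flush: the fault finds no pte,
 * so the final invalidate releases every pin. */
int unmap_notify_last(struct toy_as *as)
{
    ptep_clear_flush(as);
    remote_fault(as);           /* no pte: nothing to re-map */
    invalidate_range(as);
    return as->spte != NULL;
}
```

With the backwards ordering the page keeps an elevated refcount forever; with the ordering Andrea's patch uses, the refcount returns to its base value.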
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129211759.GV7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 21:35 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291327330.26649-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 21:35 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > It seems to be okay to invalidate range if you hold mmap_sem writably. In > > that case no additional faults can happen that would create new ptes. > > In that place the mmap_sem is taken but in readonly mode. I never rely > on the mmap_sem in the mmu notifier methods. Not invoking the notifier Well it seems that we have to rely on mmap_sem otherwise concurrent faults can occur. The mmap_sem seems to be acquired for write there. if (!has_write_lock) { up_read(&mm->mmap_sem); down_write(&mm->mmap_sem); has_write_lock = 1; goto retry; } > before releasing the PT lock adds quite some uncertainty on the smp > safety of the spte invalidates, because the pte may be unmapped and > remapped by a minor fault before invalidate_range is invoked, but I > didn't figure out a kernel crashing race yet thanks to the pin we take > through get_user_pages (and only thanks to it). The requirement is > that invalidate_range is invoked after the last ptep_clear_flush or it > leaks pins that's why I had to move it at the end. So "pins" means a reference count right? I still do not get why you have refcount problems. You take a refcount when you export the page through KVM and then drop the refcount in invalidate page right? 
So you walk through the KVM ptes and drop the refcount for each spte you encounter?
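The locking idiom quoted above from the remap_file_pages path (up_read, down_write, goto retry) is worth spelling out: a read lock cannot be upgraded in place, so the code drops it, takes the write side, and retries the whole operation because the address space may have changed while no lock was held. A toy sequential sketch (hypothetical names, not kernel code):

```c
/* Toy model of the read->write upgrade-by-retry idiom. */
struct toy_rwsem { int readers; int writer; };

void down_read(struct toy_rwsem *s)  { s->readers++; }
void up_read(struct toy_rwsem *s)    { s->readers--; }
void down_write(struct toy_rwsem *s) { s->writer = 1; }
void up_write(struct toy_rwsem *s)   { s->writer = 0; }

/* Returns the number of passes through the operation: 2, because the
 * first (read-locked) pass only discovers that write access is needed,
 * and all checks must be redone after reacquiring in write mode. */
int remap_with_upgrade(struct toy_rwsem *mmap_sem)
{
    int has_write_lock = 0, passes = 0;
retry:
    passes++;
    if (!has_write_lock) {
        down_read(mmap_sem);
        /* ...discover the vma must be converted to nonlinear... */
        up_read(mmap_sem);          /* cannot upgrade: must drop first */
        down_write(mmap_sem);
        has_write_lock = 1;
        goto retry;                 /* world may have changed meanwhile */
    }
    /* ...mangle the vma and populate the range under the write lock... */
    up_write(mmap_sem);
    return passes;
}
```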
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291327330.26649-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-29 22:02 ` Andrea Arcangeli [not found] ` <20080129220212.GX7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-29 22:02 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 01:35:58PM -0800, Christoph Lameter wrote: > On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > > > It seems to be okay to invalidate range if you hold mmap_sem writably. In > > > that case no additional faults can happen that would create new ptes. > > > > In that place the mmap_sem is taken but in readonly mode. I never rely > > on the mmap_sem in the mmu notifier methods. Not invoking the notifier > > Well it seems that we have to rely on mmap_sem otherwise concurrent faults > can occur. The mmap_sem seems to be acquired for write there. ^^^^^ > > if (!has_write_lock) { > up_read(&mm->mmap_sem); > down_write(&mm->mmap_sem); > has_write_lock = 1; > goto retry; > } hmm, "there" where? When I said it was taken in readonly mode I meant for the quoted code (it would be at the top if it wasn't cut), so I quote below again: > > + mmu_notifier(invalidate_range, mm, address, > > + address + PAGE_SIZE - 1, 0); > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > if (likely(pte_same(*page_table, orig_pte))) { > > if (old_page) { The "there" for me was do_wp_page. Even for the code you quoted in freemap.c, the has_write_lock is set to 1 _only_ for the very first time you call sys_remap_file_pages on a VMA. 
Only the transition of the VMA between linear to nonlinear requires the mmap in write mode. So you can be sure all freemap code 99% of the time is populating (overwriting) already present ptes with only the mmap_sem in readonly mode like do_wp_page. It would be unnecessary to populate the nonlinear range with the mmap in write mode. Only the "vma" mangling requires the mmap_sem in write mode, the pte modifications only requires the PT_lock + mmap_sem in read mode. Effectively the first invocation of populate_range runs with the mmap_sem in write mode, I wonder why, there seem to be no good reason for that. I guess it's a bit that should be optimized, by calling downgrade_write before calling populate_range even for the first time the vma switches from linear to nonlinear (after the vma has been fully updated to the new status). But for sure all later invocations runs populate_range with the semaphore readonly like the rest of the VM does when instantiating ptes in the page faults. > > before releasing the PT lock adds quite some uncertainty on the smp > > safety of the spte invalidates, because the pte may be unmapped and > > remapped by a minor fault before invalidate_range is invoked, but I > > didn't figure out a kernel crashing race yet thanks to the pin we take > > through get_user_pages (and only thanks to it). The requirement is > > that invalidate_range is invoked after the last ptep_clear_flush or it > > leaks pins that's why I had to move it at the end. > > So "pins" means a reference count right? I still do not get why you Yes. > have refcount problems. You take a refcount when you export the page > through KVM and then drop the refcount in invalidate page right? Yes. > So you walk through the KVM ptes and drop the refcount for each spte you > encounter? Yes. All pins are gone by the time invalidate_page/range returns. But there is no critical section between invalidate_page and the _later_ ptep_clear_flush. 
So get_user_pages is free to run and take the PT lock before the ptep_clear_flush, find the linux pte still instantiated, and create a new spte, before ptep_clear_flush runs. Think of why the tlb flushes are being called at the end of ptep_clear_flush. The mmu notifier invalidate has to be called after for the exact same reason. Perhaps somebody else should explain this; I started exposing this smp race the moment after I saw the backwards ordering being proposed in export-notifier-v1, sorry if I'm not clear enough.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080129220212.GX7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-29 22:39 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291407380.27104-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-29 22:39 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > hmm, "there" where? When I said it was taken in readonly mode I meant > for the quoted code (it would be at the top if it wasn't cut), so I > quote below again: > > > > + mmu_notifier(invalidate_range, mm, address, > > > + address + PAGE_SIZE - 1, 0); > > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl); > > > if (likely(pte_same(*page_table, orig_pte))) { > > > if (old_page) { > > The "there" for me was do_wp_page. Maybe we better focus on one call at a time? > Even for the code you quoted in freemap.c, the has_write_lock is set > to 1 _only_ for the very first time you call sys_remap_file_pages on a > VMA. Only the transition of the VMA between linear to nonlinear > requires the mmap in write mode. So you can be sure all freemap code > 99% of the time is populating (overwriting) already present ptes with > only the mmap_sem in readonly mode like do_wp_page. It would be > unnecessary to populate the nonlinear range with the mmap in write > mode. Only the "vma" mangling requires the mmap_sem in write mode, the > pte modifications only requires the PT_lock + mmap_sem in read mode. 
> > Effectively the first invocation of populate_range runs with the > mmap_sem in write mode, I wonder why, there seem to be no good reason > for that. I guess it's a bit that should be optimized, by calling > downgrade_write before calling populate_range even for the first time > the vma switches from linear to nonlinear (after the vma has been > fully updated to the new status). But for sure all later invocations > runs populate_range with the semaphore readonly like the rest of the > VM does when instantiating ptes in the page faults. If it does not run in write mode then concurrent faults are permissible while we remap pages. Weird. Maybe we better handle this like individual page operations? Put the invalidate_page back into zap_pte. But then there would be no callback w/o lock as required by Robin. Doing the invalidate_range after populate allows access to memory for which ptes were zapped and the refcount was released. > All pins are gone by the time invalidate_page/range returns. But there > is no critical section between invalidate_page and the _later_ > ptep_clear_flush. So get_user_pages is free to run and take the PT > lock before the ptep_clear_flush, find the linux pte still > instantiated, and to create a new spte, before ptep_clear_flush runs. Hmmm... Right. Did not consider get_user_pages. A write to the page that is not marked dirty would typically require a fault that will serialize.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291407380.27104-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 0:00 ` Andrea Arcangeli [not found] ` <20080130000039.GA7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 0:00 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 02:39:00PM -0800, Christoph Lameter wrote: > If it does not run in write mode then concurrent faults are permissible > while we remap pages. Weird. Maybe we better handle this like individual > page operations? Put the invalidate_page back into zap_pte. But then there > would be no callback w/o lock as required by Robin. Doing the The Robin requirements and the need to schedule are the source of the complications indeed. I posted all the KVM patches using mmu notifiers, today I reposted the ones to work with your V2 (which crashes my host unlike my last simpler mmu notifier patch but I also changed a few other variable besides your mmu notifier changes, so I can't yet be sure it's a bug in your V2, and the SMP regressions I fixed so far sure can't explain the crashes because my KVM setup could never run in do_wp_page nor remap_file_pages so it's something else I need to find ASAP). Robin, if you don't mind, could you please post or upload somewhere your GPLv2 code that registers itself in Christoph's V2 notifiers? Or is it top secret? I wouldn't mind to have a look so I can better understand what's the exact reason you're sleeping besides attempting GFP_KERNEL allocations. Thanks! 
> invalidate_range after populate allows access to memory for which ptes > were zapped and the refcount was released. The last refcount is released by the invalidate_range itself. > > All pins are gone by the time invalidate_page/range returns. But there > > is no critical section between invalidate_page and the _later_ > > ptep_clear_flush. So get_user_pages is free to run and take the PT > > lock before the ptep_clear_flush, find the linux pte still > > instantiated, and to create a new spte, before ptep_clear_flush runs. > > Hmmm... Right. Did not consider get_user_pages. A write to the page that > is not marked dirty would typically require a fault that will serialize. The pte is already marked dirty (and this is the case only for get_user_pages, regular linux writes don't fault unless it's explicitly writeprotect, which is mandatory in a few archs, x86 not).
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130000039.GA7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 0:05 ` Andrea Arcangeli [not found] ` <20080130000559.GB7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-30 0:20 ` Christoph Lameter 2008-01-30 16:11 ` Robin Holt 2 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 0:05 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote: > get_user_pages, regular linux writes don't fault unless it's > explicitly writeprotect, which is mandatory in a few archs, x86 not). actually get_user_pages doesn't fault either but it calls into set_page_dirty, however get_user_pages (unlike a userland-write) at least requires mmap_sem in read mode and the PT lock as serialization, userland writes don't, they just go ahead and mark the pte in hardware w/o faults. Anyway anonymous memory these days always mapped with dirty bit set regardless, even for read-faults, after Nick finally rightfully cleaned up the zero-page trick.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130000559.GB7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 0:22 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291621380.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 0:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote: > > get_user_pages, regular linux writes don't fault unless it's > > explicitly writeprotect, which is mandatory in a few archs, x86 not). > > actually get_user_pages doesn't fault either but it calls into > set_page_dirty, however get_user_pages (unlike a userland-write) at > least requires mmap_sem in read mode and the PT lock as serialization, > userland writes don't, they just go ahead and mark the pte in hardware > w/o faults. Anyway anonymous memory these days always mapped with > dirty bit set regardless, even for read-faults, after Nick finally > rightfully cleaned up the zero-page trick. That is only partially true. ptes are created write-protected in order to track dirty state these days. The first write will lead to a fault that switches the pte to writable. When the page undergoes writeback the page again becomes write protected. Thus our need to effectively deal with page_mkclean.
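The dirty-tracking cycle Christoph describes — create the pte write-protected, let the first store fault and mark it writable plus dirty, then write-protect again at writeback via page_mkclean — can be sketched with toy pte bits. The flag values and function names below are hypothetical, not the real page-table flags:

```c
/* Toy pte-bit model of write-protect-based dirty tracking. */
enum { PTE_PRESENT = 1, PTE_WRITE = 2, PTE_DIRTY = 4 };

/* file-backed ptes start present but write-protected, so the first
 * store takes a write-protect fault */
unsigned make_file_pte(void) { return PTE_PRESENT; }

/* a store through the pte: returns 1 if it took a wp fault */
int store(unsigned *pte)
{
    if (!(*pte & PTE_WRITE)) {
        *pte |= PTE_WRITE | PTE_DIRTY;  /* fault path: writable + dirty */
        return 1;
    }
    *pte |= PTE_DIRTY;                  /* hardware sets dirty, no fault */
    return 0;
}

/* writeback: page_mkclean-style, clear dirty and write-protect again */
void writeback(unsigned *pte) { *pte &= ~(unsigned)(PTE_WRITE | PTE_DIRTY); }
```

Andrea's counterpoint then maps onto the model trivially: anonymous memory is created with PTE_WRITE and PTE_DIRTY already set, so no wp fault ever serializes writers.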
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291621380.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 0:59 ` Andrea Arcangeli 2008-01-30 8:26 ` Peter Zijlstra 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 0:59 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote: > That is only partially true. pte are created wronly in order to track > dirty state these days. The first write will lead to a fault that switches > the pte to writable. When the page undergoes writeback the page again > becomes write protected. Thus our need to effectively deal with > page_mkclean. Well I was talking about anonymous memory.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-01-30 0:59 ` Andrea Arcangeli @ 2008-01-30 8:26 ` Peter Zijlstra 0 siblings, 0 replies; 150+ messages in thread From: Peter Zijlstra @ 2008-01-30 8:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, 2008-01-30 at 01:59 +0100, Andrea Arcangeli wrote: > On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote: > > That is only partially true. pte are created wronly in order to track > > dirty state these days. The first write will lead to a fault that switches > > the pte to writable. When the page undergoes writeback the page again > > becomes write protected. Thus our need to effectively deal with > > page_mkclean. > > Well I was talking about anonymous memory. Just to be absolutely clear on this (I lost track of what exactly we are talking about here), nonlinear mappings do not do the dirty accounting, and are not allowed on a backing store that would require dirty accounting.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130000039.GA7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-30 0:05 ` Andrea Arcangeli @ 2008-01-30 0:20 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801291620170.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 2008-01-30 16:11 ` Robin Holt 2 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 0:20 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > invalidate_range after populate allows access to memory for which ptes > > were zapped and the refcount was released. > > The last refcount is released by the invalidate_range itself. That is true for your implementation and to address Robin's issues. Jack: Is that true for the GRU?
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801291620170.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 0:28 ` Jack Steiner [not found] ` <20080130002804.GA13840-sJ/iWh9BUns@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Jack Steiner @ 2008-01-30 0:28 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote: > On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > > > invalidate_range after populate allows access to memory for which ptes > > > were zapped and the refcount was released. > > > > The last refcount is released by the invalidate_range itself. > > That is true for your implementation and to address Robin's issues. Jack: > Is that true for the GRU? I'm not sure I understand the question. The GRU never (currently) takes a reference on a page. It has no mechanism for tracking pages that were exported to the external TLBs. --- jack
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130002804.GA13840-sJ/iWh9BUns@public.gmane.org> @ 2008-01-30 0:35 ` Christoph Lameter 2008-01-30 13:37 ` Andrea Arcangeli 1 sibling, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 0:35 UTC (permalink / raw) To: Jack Steiner Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Tue, 29 Jan 2008, Jack Steiner wrote: > > That is true for your implementation and to address Robin's issues. Jack: > > Is that true for the GRU? > > I'm not sure I understand the question. The GRU never (currently) takes > a reference on a page. It has no mechanism for tracking pages that > were exported to the external TLBs. That's what I was looking for. Thanks. KVM takes a refcount and so does XPmem.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130002804.GA13840-sJ/iWh9BUns@public.gmane.org> 2008-01-30 0:35 ` Christoph Lameter @ 2008-01-30 13:37 ` Andrea Arcangeli [not found] ` <20080130133720.GM7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 13:37 UTC (permalink / raw) To: Jack Steiner Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins, Christoph Lameter On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote: > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote: > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > > > > > invalidate_range after populate allows access to memory for which ptes > > > > were zapped and the refcount was released. > > > > > > The last refcount is released by the invalidate_range itself. > > > > That is true for your implementation and to address Robin's issues. Jack: > > Is that true for the GRU? > > I'm not sure I understand the question. The GRU never (currently) takes > a reference on a page. It has no mechanism for tracking pages that > were exported to the external TLBs. If you don't have a pin, then things like invalidate_range in remap_file_pages can't be safe as writes through the external TLBs can keep going on pages in the freelist. For you to be safe w/o a page-pin, you need to return in the direction of invalidate_page inside ptep_clear_flush (or anyway before page_cache_release/__free_page/put_page...). You're generally not safe with any invalidate_range that may run after the page pointed by the pte has been freed (or can be freed by the VM anytime because of being unpinned cache). 
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130133720.GM7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 14:43 ` Jack Steiner [not found] ` <20080130144305.GA25193-sJ/iWh9BUns@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Jack Steiner @ 2008-01-30 14:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins, Christoph Lameter On Wed, Jan 30, 2008 at 02:37:20PM +0100, Andrea Arcangeli wrote: > On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote: > > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote: > > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > > > > > > > invalidate_range after populate allows access to memory for which ptes > > > > > were zapped and the refcount was released. > > > > > > > > The last refcount is released by the invalidate_range itself. > > > > > > That is true for your implementation and to address Robin's issues. Jack: > > > Is that true for the GRU? > > > > I'm not sure I understand the question. The GRU never (currently) takes > > a reference on a page. It has no mechanism for tracking pages that > > were exported to the external TLBs. > > If you don't have a pin, then things like invalidate_range in > remap_file_pages can't be safe as writes through the external TLBs can > keep going on pages in the freelist. For you to be safe w/o a > page-pin, you need to return in the direction of invalidate_page > inside ptep_clear_flush (or anyway before > page_cache_release/__free_page/put_page...). You're generally not safe > with any invalidate_range that may run after the page pointed by the > pte has been freed (or can be freed by the VM anytime because of being > unpinned cache). Yuck.... 
I see what you mean. I need to review the mail to see why this changed, but in the original discussions with Christoph, the invalidate_range callouts were supposed to be made BEFORE the pages were put on the freelist. --- jack ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 150+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130144305.GA25193-sJ/iWh9BUns@public.gmane.org> @ 2008-01-30 19:41 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801301140320.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 19:41 UTC (permalink / raw) To: Jack Steiner Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Jack Steiner wrote: > I see what you mean. I need to review to mail to see why this changed > but in the original discussions with Christoph, the invalidate_range > callouts were suppose to be made BEFORE the pages were put on the freelist. Seems that we cannot rely on the invalidate_ranges for correctness at all? We need to have invalidate_page() always. invalidate_range() is only an optimization.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801301140320.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 20:29 ` Jack Steiner [not found] ` <20080130202918.GB11324-sJ/iWh9BUns@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Jack Steiner @ 2008-01-30 20:29 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, Jan 30, 2008 at 11:41:29AM -0800, Christoph Lameter wrote: > On Wed, 30 Jan 2008, Jack Steiner wrote: > > > I see what you mean. I need to review to mail to see why this changed > > but in the original discussions with Christoph, the invalidate_range > > callouts were suppose to be made BEFORE the pages were put on the freelist. > > Seems that we cannot rely on the invalidate_ranges for correctness at all? > We need to have invalidate_page() always. invalidate_range() is only an > optimization. > I don't understand your point "an optimization". How would invalidate_range as currently defined be correctly used? It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on the page & drops it when invalidate_range is called. That may work (not sure) for xpmem but not for the GRU. --- jack
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130202918.GB11324-sJ/iWh9BUns@public.gmane.org> @ 2008-01-30 20:55 ` Christoph Lameter 0 siblings, 0 replies; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 20:55 UTC (permalink / raw) To: Jack Steiner Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Jack Steiner wrote: > > Seems that we cannot rely on the invalidate_ranges for correctness at all? > > We need to have invalidate_page() always. invalidate_range() is only an > > optimization. > > > > I don't understand your point "an optimization". How would invalidate_range > as currently defined be correctly used? We are changing definitions. The original patch by Andrea calls invalidate_page for each pte that is cleared. So strictly you would not need an invalidate_range. > It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on > the page & drops it when invalidate_range is called. That may work (not sure) > for xpmem but not for the GRU. The refcount is not necessary if we adopt Andrea's approach of a callback on the clearing of each pte. At that point the page is still guaranteed to exist. If we do the range_invalidate later (as in V3) then the page may have been released (see sys_remap_file_pages() f.e.) before we zap the GRU ptes. So there will be a time when the GRU may write to a page that has been freed and used for another purpose. Taking a refcount on the page defers the free until the range_invalidate runs. I would prefer a solution that does not require taking refcounts (pins) for establishing an external pte and for release (like what the GRU does). 
If we could effectively determine that there are no external ptes in a range then the invalidate_page() call may return immediately. Maybe it is then effective to do these gazillions of invalidate_page() calls when a process terminates or a remap is performed.
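Christoph's point about the refcount deferring the free can be made concrete with a tiny user-space model (all names here are illustrative, not actual kernel interfaces): the external user pins the page when it maps it, so the VM dropping its own reference cannot recycle the page until the notifier callout releases the pin.

```c
#include <assert.h>

/* Toy model of reference counting around an external pte.
 * Illustrative only -- not the kernel's struct page or APIs. */
struct page_model {
	int refcount;
	int freed;
};

static void get_page_model(struct page_model *p)
{
	p->refcount++;
}

static void put_page_model(struct page_model *p)
{
	if (--p->refcount == 0)
		p->freed = 1;	/* page goes back to the allocator */
}

/* The external MMU pins the page when establishing its pte... */
static void external_map(struct page_model *p)
{
	get_page_model(p);
}

/* ...and drops the pin only from its invalidate_range() callout, so
 * the page cannot be reused while external TLBs may still reach it. */
static void external_invalidate_range(struct page_model *p)
{
	put_page_model(p);
}
```

With this ordering, a GRU-style scheme with no pin at all is only safe if the callout happens before the VM's own release of the page, which is exactly the invalidate_page-inside-ptep_clear_flush placement discussed above.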
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130000039.GA7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-30 0:05 ` Andrea Arcangeli 2008-01-30 0:20 ` Christoph Lameter @ 2008-01-30 16:11 ` Robin Holt [not found] ` <20080130161123.GS26420-sJ/iWh9BUns@public.gmane.org> 2 siblings, 1 reply; 150+ messages in thread From: Robin Holt @ 2008-01-30 16:11 UTC (permalink / raw) To: Andrea Arcangeli, Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins, Christoph Lameter > Robin, if you don't mind, could you please post or upload somewhere > your GPLv2 code that registers itself in Christoph's V2 notifiers? Or > is it top secret? I wouldn't mind to have a look so I can better > understand what's the exact reason you're sleeping besides attempting > GFP_KERNEL allocations. Thanks! Dean is still actively working on updating the xpmem patch posted here a few months ago, reworked for the mmu_notifiers. I am sure we can give you an early look, but it is in a really rough state. http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t The need to sleep comes from the fact that these PFNs are sent to other hosts on the same NUMA fabric which have direct access to the pages; they are then placed into the remote processes' page tables and filled into their TLBs. Our only means of communicating the recall is async. I think I need to straighten this discussion out in my head a little bit. Am I correct in assuming Andrea's original patch set did not have any SMP race conditions for KVM? If so, then we need to start looking at how to implement Christoph's and my changes in a safe fashion. Andrea, I agree completely that our introduction of the range callouts has introduced SMP races.
The three issues we need to simultaneously solve are revoking the remote page table/tlb information while still in a sleepable context and not having the remote faulters become out of sync with the granting process. Currently, I don't see a way to do that cleanly with a single callout. Could we consider doing a range-based recall and lock callout before clearing the process's page tables/TLBs, then using the _page or _range callouts from Andrea's patch to clear the mappings, and finally making a range-based unlock callout? The mmu_notifier user would usually use ops for either the recall+lock/unlock family of callouts or the _page/_range family of callouts. Thanks, Robin
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130161123.GS26420-sJ/iWh9BUns@public.gmane.org> @ 2008-01-30 17:04 ` Andrea Arcangeli [not found] ` <20080130170451.GP7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 2008-01-30 19:35 ` Christoph Lameter 1 sibling, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 17:04 UTC (permalink / raw) To: Robin Holt Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Hugh Dickins, Christoph Lameter On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote: > > Robin, if you don't mind, could you please post or upload somewhere > > your GPLv2 code that registers itself in Christoph's V2 notifiers? Or > > is it top secret? I wouldn't mind to have a look so I can better > > understand what's the exact reason you're sleeping besides attempting > > GFP_KERNEL allocations. Thanks! > > Dean is still actively working on updating the xpmem patch posted > here a few months ago reworked for the mmu_notifiers. I am sure > we can give you a early look, but it is in a really rough state. > > http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t > > The need to sleep comes from the fact that these PFNs are sent to other > hosts on the same NUMA fabric which have direct access to the pages > and then placed into remote process's page tables and then filled into > their TLBs. Our only means of communicating the recall is async. > > I think I need to straighten this discussion out in my head a little bit. > Am I correct in assuming Andrea's original patch set did not have any SMP > race conditions for KVM? If so, then we need to start looking at how to Yes my last patch was SMP safe, stable and feature complete for KVM. 
I tested it for 1 week on my smp workstation with real desktop load and everything loaded, with 3G non-linux guest running on 2G of ram. Now for whatever reason I adapted the KVM side to Christoph's V2/V3 and it hangs the moment it hits swap. However in the meantime I changed test hardware, upgraded host to 2.6.24-hg, and upgraded kvm kernel and userland. All patches applied cleanly (with a minor nit in a .h include in V2 on top of current git). Swapping of regular tasks on the test system is 100% solid or I wouldn't even waste time mentioning this. By code inspection I didn't expect a stability regression or I wouldn't have changed all variables at the same time (taking the opportunity to move everything to bleeding edge while moving to V2 turned out to be a bad idea). I already audited the mmu notifiers a few times; in fact I already went back to call invalidate_page and age_page inside ptep_clear_flush/young in case the page-pin wasn't enough to prevent the page from changing under the sptes, as I thought yesterday. Christoph's V3 notably still misses the needed range flushes in mremap for example, but that's not my problem. (Jack instead will certainly kernel crash due to the missing invalidate_page after ptep_clear_flush in mremap; such an invalidate_page wasn't missing with my last patch.) I'm now going to run the same binaries that are still stable on my workstation on the test system too, to rule out timings and hardware differences. > implement Christoph's and my changes in a safe fashion. Andrea, I agree > complete that our introduction of the range callouts have introduced > SMP races. I think for KVM basic swapping both V2 and V3 should be safe. V2 had race conditions that would later break KSM yes, I fixed it and V3 should be already ok and I'm not testing KSM. This is all thanks to the pin of the page in get_user_page that KVM does for every page mapped in any spte.
> The three issues we need to simultaneously solve is revoking the remote > page table/tlb information while still in a sleepable context and not > having the remote faulters become out of sync with the granting process. > Currently, I don't see a way to do that cleanly with a single callout. Agreed. > Could we consider doing a range-based recall and lock callout before > clearing the processes page tables/TLBs, then use the _page or _range > callouts from Andrea's patch to clear the mappings, finally make a > range-based unlock callout. The mmu_notifier user would usually use ops > for either the recall+lock/unlock family of callouts or the _page/_range > family of callouts. invalidate_page/age_page can return inside ptep_clear_flush/young and Jack will need that too. In fact Jack will need an invalidate_page also inside ptep_get_and_clear. And the range callout will always be done in a sleeping context and it'll rely on the page-pin to be safe (when details->i_mmap_lock != NULL, invalidate_range shouldn't be called inside zap_page_range but before returning from unmap_mapping_range_vma, before cond_resched).
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130170451.GP7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 17:30 ` Robin Holt [not found] ` <20080130173009.GT26420-sJ/iWh9BUns@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Robin Holt @ 2008-01-30 17:30 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins, Christoph Lameter On Wed, Jan 30, 2008 at 06:04:52PM +0100, Andrea Arcangeli wrote: > On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote: ... > > The three issues we need to simultaneously solve is revoking the remote > > page table/tlb information while still in a sleepable context and not > > having the remote faulters become out of sync with the granting process. ... > > Could we consider doing a range-based recall and lock callout before > > clearing the processes page tables/TLBs, then use the _page or _range > > callouts from Andrea's patch to clear the mappings, finally make a > > range-based unlock callout. The mmu_notifier user would usually use ops > > for either the recall+lock/unlock family of callouts or the _page/_range > > family of callouts. > > invalidate_page/age_page can return inside ptep_clear_flush/young and > Jack will need that too. Infact Jack will need an invalidate_page also > inside ptep_get_and_clear. And the range callout will be done always > in a sleeping context and it'll relay on the page-pin to be safe (when > details->i_mmap_lock != NULL invalidate_range it shouldn't be called > inside zap_page_range but before returning from > unmap_mapping_range_vma before cond_resched). 
> This will make everything a bit simpler and less prone to breakage IMHO, plus it'll have a chance to work for Jack w/o page-pin without additional cluttering of mm/*.c. I don't think I saw the answer to my original question. I assume your original patch, extended in a way similar to what Christoph has done, can be made to work to cover both the KVM and GRU (Jack's) case. XPMEM, however, does not look to be solvable due to the three simultaneous issues above. To address that, I think I am coming to the conclusion that we need an accompanying but separate pair of callouts. The first will ensure the remote page tables and TLBs are cleared and all page information is returned to the process that is granting access to its address space. That will include an implicit block on the address range so no further faults will be satisfied by the remote accessor (forgot the KVM name for this, sorry). Any faults will be held off and only the process's page tables/TLBs are in play. Once the normal processing of the kernel is complete, an unlock callout would be made for the range and then faulting may occur on behalf of the process again. Currently, this is the only direct solution that I can see as a possibility. My question is twofold. Does this seem like a reasonable means to solve the three simultaneous issues above and if so, does it seem like the most reasonable means? Thanks, Robin
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130173009.GT26420-sJ/iWh9BUns@public.gmane.org> @ 2008-01-30 18:25 ` Andrea Arcangeli [not found] ` <20080130182506.GQ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 18:25 UTC (permalink / raw) To: Robin Holt Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Hugh Dickins, Christoph Lameter On Wed, Jan 30, 2008 at 11:30:09AM -0600, Robin Holt wrote: > I don't think I saw the answer to my original question. I assume your > original patch, extended in a way similar to what Christoph has done, > can be made to work to cover both the KVM and GRU (Jack's) case. Yes, I think so. > XPMEM, however, does not look to be solvable due to the three simultaneous > issues above. To address that, I think I am coming to the conclusion > that we need an accompanying but seperate pair of callouts. The first The mmu_rmap_notifiers are already one separate pair of callouts and we can add more of them of course. > will ensure the remote page tables and TLBs are cleared and all page > information is returned back to the process that is granting access to > its address space. That will include an implicit block on the address > range so no further faults will be satisfied by the remote accessor > (forgot the KVM name for this, sorry). Any faults will be held off > and only the processes page tables/TLBs are in play. Once the normal Good, this "block" is how you close the race condition, and you need the second callout to "unblock" (this is why it could hardly work well before with a single invalidate_range). 
> processing of the kernel is complete, an unlock callout would be made > for the range and then faulting may occur on behalf of the process again. This sounds good. > Currently, this is the only direct solution that I can see as a > possibility. My question is two fold. Does this seem like a reasonable > means to solve the three simultaneous issues above and if so, does it > seem like the most reasonable means? Yes. KVM can deal with both invalidate_page (atomic) and invalidate_range (sleepy). GRU can only deal with invalidate_page (atomic). XPMEM requires invalidate_range (sleepy) + before_invalidate_range (sleepy). invalidate_all should also be called before_release (both sleepy). It sounds like we need full overlap of the information provided by invalidate_page and invalidate_range to fit all three models (the opposite of the zero-overlap objective that current V3 is taking). And swap will be handled only by invalidate_page, either through the linux rmap or the external rmap (the latter can sleep, so it's ok for you; the former cannot). GRU can safely use either the linux rmap notifier or the external rmap notifier equally well, because when try_to_unmap is called the page is locked and obviously pinned by the VM itself.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130182506.GQ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-30 19:50 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801301147330.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 2008-01-30 23:52 ` Andrea Arcangeli 0 siblings, 2 replies; 150+ messages in thread From: Christoph Lameter @ 2008-01-30 19:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > XPMEM requires with invalidate_range (sleepy) + > before_invalidate_range (sleepy). invalidate_all should also be called > before_release (both sleepy). > > It sounds we need full overlap of information provided by > invalidate_page and invalidate_range to fit all three models (the > opposite of the zero objective that current V3 is taking). And the > swap will be handled only by invalidate_page either through linux rmap > or external rmap (with the latter that can sleep so it's ok for you, > the former not). GRU can safely use the either the linux rmap notifier > or the external rmap notifier equally well, because when try_to_unmap > is called the page is locked and obviously pinned by the VM itself. So put the invalidate_page() callbacks in everywhere. Then we have invalidate_range_start(mm) and invalidate_range_finish(mm, start, end) in addition to the invalidate rmap_notifier? 
---
 include/linux/mmu_notifier.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 11:49:02.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 11:49:57.000000000 -0800
@@ -69,10 +69,13 @@ struct mmu_notifier_ops {
 	/*
 	 * lock indicates that the function is called under spinlock.
 	 */
-	void (*invalidate_range)(struct mmu_notifier *mn,
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end,
 				int lock);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long start, unsigned long end);
 };

 struct mmu_rmap_notifier_ops;
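The begin/end pairing in the patch is meant to bracket the VM operation: begin blocks remote faults on the range, end re-enables them. A minimal user-space sketch of that invariant follows (hypothetical names and a single-range simplification; the real callbacks take an mmu_notifier and mm_struct as in the patch):

```c
#include <assert.h>

/* One blocked range at a time, for illustration only. */
struct blocked_range {
	unsigned long start, end;
	int active;
};

/* Called before the VM starts zapping ptes in [start, end). */
static void range_begin_model(struct blocked_range *b,
			      unsigned long start, unsigned long end)
{
	b->start = start;
	b->end = end;
	b->active = 1;
}

/* Called after the ptes are gone and secondary TLBs are flushed. */
static void range_end_model(struct blocked_range *b)
{
	b->active = 0;
}

/* A remote fault must not be satisfied while the range is blocked,
 * or the remote side would re-instantiate a mapping the VM is in
 * the middle of tearing down. */
static int remote_fault_allowed(const struct blocked_range *b,
				unsigned long addr)
{
	return !(b->active && addr >= b->start && addr < b->end);
}
```

This is the same protocol Robin describes as recall+lock/unlock: remote faulters stall between begin and end, so they can never race with the pte teardown in between.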
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <Pine.LNX.4.64.0801301147330.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 22:18 ` Robin Holt 0 siblings, 0 replies; 150+ messages in thread From: Robin Holt @ 2008-01-30 22:18 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote: > On Wed, 30 Jan 2008, Andrea Arcangeli wrote: > > > XPMEM requires with invalidate_range (sleepy) + > > before_invalidate_range (sleepy). invalidate_all should also be called > > before_release (both sleepy). > > > > It sounds we need full overlap of information provided by > > invalidate_page and invalidate_range to fit all three models (the > > opposite of the zero objective that current V3 is taking). And the > > swap will be handled only by invalidate_page either through linux rmap > > or external rmap (with the latter that can sleep so it's ok for you, > > the former not). GRU can safely use the either the linux rmap notifier > > or the external rmap notifier equally well, because when try_to_unmap > > is called the page is locked and obviously pinned by the VM itself. > > So put the invalidate_page() callbacks in everywhere. The way I am envisioning it, we essentially drop back to Andrea's original patch. We then introduce an invalidate_range_begin (I was really thinking of it as invalidate_and_lock_range()) and an invalidate_range_end (again I was thinking of unlock_range). Thanks, Robin
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-01-30 19:50 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801301147330.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> @ 2008-01-30 23:52 ` Andrea Arcangeli [not found] ` <20080130235214.GC7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 150+ messages in thread From: Andrea Arcangeli @ 2008-01-30 23:52 UTC (permalink / raw) To: Christoph Lameter Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote: > Then we have > > invalidate_range_start(mm) > > and > > invalidate_range_finish(mm, start, end) > > in addition to the invalidate rmap_notifier? > > --- > include/linux/mmu_notifier.h | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > Index: linux-2.6/include/linux/mmu_notifier.h > =================================================================== > --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 11:49:02.000000000 -0800 > +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 11:49:57.000000000 -0800 > @@ -69,10 +69,13 @@ struct mmu_notifier_ops { > /* > * lock indicates that the function is called under spinlock. > */ > - void (*invalidate_range)(struct mmu_notifier *mn, > + void (*invalidate_range_begin)(struct mmu_notifier *mn, > struct mm_struct *mm, > - unsigned long start, unsigned long end, > int lock); > + > + void (*invalidate_range_end)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > }; start/finish/begin/end/before/after? ;) I'd drop the 'int lock', you should skip the before/after if i_mmap_lock isn't null and offload it to the caller before taking the lock. At least for the "after" call that looks a few liner change, didn't figure out the "before" yet. 
Given the amount of changes that are going on in design terms to cover both XPMEM and GRU, can we split out the minimal invalidate_page that provides an obviously safe and feature complete mmu notifier for KVM, and merge that first patch? It will cover KVM 100% and GRU 90%, and then we add invalidate_range_before/after in a separate patch and close the remaining 10% for GRU, covering ptep_get_and_clear or whatever else ptep_*. The mmu notifiers are made so that they are extensible in a backwards compatible way. I think invalidate_page inside ptep_clear_flush is the first fundamental block of the mmu notifiers. Then once the fundamental is in and obviously safe and feature complete for KVM, the rest can be added very easily with incremental patches as far as I can tell. That would be my preferred route ;)
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges [not found] ` <20080130235214.GC7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org> @ 2008-01-31 0:01 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0801301555550.1722-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 150+ messages in thread From: Christoph Lameter @ 2008-01-31 0:01 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt, steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > > - void (*invalidate_range)(struct mmu_notifier *mn, > > + void (*invalidate_range_begin)(struct mmu_notifier *mn, > > struct mm_struct *mm, > > - unsigned long start, unsigned long end, > > int lock); > > + > > + void (*invalidate_range_end)(struct mmu_notifier *mn, > > + struct mm_struct *mm, > > + unsigned long start, unsigned long end); > > }; > > start/finish/begin/end/before/after? ;) Well, let's pick one and then stick to it. > I'd drop the 'int lock', you should skip the before/after if > i_mmap_lock isn't null and offload it to the caller before taking the > lock. At least for the "after" call that looks a few liner change, > didn't figure out the "before" yet. How would we offload that? Before the scan of the rmaps we do not have the mmstruct. So we'd need another notifier_rmap_callback. > Given the amount of changes that are going on in design terms to cover > both XPMEM and GRE, can we split the minimal invalidate_page that > provides an obviously safe and feature complete mmu notifier code for > KVM, and merge that first patch that will cover KVM 100%, it will The obvious solution does not scale. You will have a callback for every page and there may be a million of those if you have a 4GB process.
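Christoph's "million callbacks" figure is simple arithmetic, assuming 4KB pages:

```c
/* Number of ptes -- and hence per-pte invalidate_page() callouts --
 * needed to cover a given address-space size. */
static unsigned long long ptes_in(unsigned long long bytes,
				  unsigned long long page_size)
{
	return bytes / page_size;
}
/* 4GB / 4KB = 1048576, i.e. roughly a million callouts for a
 * process exit or a full-range remap. */
```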
> made so that are extendible in backwards compatible way. I think > invalidate_page inside ptep_clear_flush is the first fundamental block > of the mmu notifiers. Then once the fundamental is in and obviously > safe and feature complete for KVM, the rest can be added very easily > with incremental patches as far as I can tell. That would be my > preferred route ;) We need to have a coherent notifier solution that works for multiple scenarios. I think a working invalidate_range would also be required for KVM. KVM and GRU are very similar so they should be able to use the same mechanisms and we need to properly document how that mechanism is safe. Either both take a page refcount or none.
[parent not found: <Pine.LNX.4.64.0801301555550.1722-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <Pine.LNX.4.64.0801301555550.1722-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
@ 2008-01-31  0:34 ` Andrea Arcangeli
  [not found]   ` <20080131003434.GE7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-01-31  0:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> How do we offload that? Before the scan of the rmaps we do not have
> the mm_struct. So we'd need another notifier_rmap_callback.

My assumption is that that "int lock" exists just because
unmap_mapping_range_vma exists. If I'm right, then my suggestion was to
move the invalidate_range after dropping the i_mmap_lock and not to
invoke it inside zap_page_range.

> The obvious solution does not scale. You will have a callback for every

Scale is the wrong word. The PT lock will prevent any other cpu from
thrashing on the mmu_lock, so it's a fixed cost for each pte_clear with
no scalability risk, nor any complexity issue. Certainly we could
average certain fixed costs over more than one pte_clear to boost
performance, and that's a good idea. Not really a short term concern,
we need to swap reliably first ;).

> page and there may be a million of those if you have a 4GB process.

That can be optimized by adding a __ptep_clear_flush and an
invalidate_pages (let's call it "pages" to better show it's a clustered
version of invalidate_page, to avoid the confusion with
_range_before/after, which does an entirely different thing).
Also for _range I tend to like before/after, as a means to say "before
the pte_clear" and "after the pte_clear", but any other meaning is ok
with me.

We add invalidate_page and invalidate_pages immediately.
invalidate_pages may never be called initially by the linux VM; we can
start calling it later as we replace ptep_clear_flush with
__ptep_clear_flush (or local_ptep_clear_flush). I don't see any problem
with this approach, it looks quite clean to me, and it leaves you full
room for experimenting in practice with range_before/after while
knowing those range_before/after won't require many changes.

And for things like the age_page it will never happen that you could
call the respective ptep_clear_flush_young w/o the mmu notifier
age_page after it, so you won't ever risk having to add an age_pages or
a __ptep_clear_flush_young.

> We need to have a coherent notifier solution that works for multiple
> scenarios. I think a working invalidate_range would also be required for
> KVM. KVM and GRU are very similar so they should be able to use the same
> mechanisms and we need to properly document how that mechanism is safe.
> Either both take a page refcount or none.

There's no reason why KVM should take any risk of corrupting memory due
to a single missing mmu notifier by not taking the refcount.
get_user_pages will take it for us, so we have to pay the atomic-op
anyway. It's sure worth doing the atomic_dec inside the mmu notifier,
and not immediately like this:

	get_user_pages(pages)
	__free_page(pages[0])

The idea is that what works for GRU works for KVM too. So we do a
single invalidate_page and a clustered invalidate_pages, we add that,
and then we make sure all places are covered so GRU will not
kernel-crash, and KVM won't risk running oom or generating _userland_
corruption.
[parent not found: <20080131003434.GE7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080131003434.GE7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
@ 2008-01-31  1:46 ` Christoph Lameter
  [not found]   ` <Pine.LNX.4.64.0801301728110.2454-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
  2008-01-31  2:08 ` Christoph Lameter
  1 sibling, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-31  1:46 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> > How do we offload that? Before the scan of the rmaps we do not have
> > the mm_struct. So we'd need another notifier_rmap_callback.
>
> My assumption is that that "int lock" exists just because
> unmap_mapping_range_vma exists. If I'm right then my suggestion was to
> move the invalidate_range after dropping the i_mmap_lock and not to
> invoke it inside zap_page_range.

There is still no pointer to the mm_struct available there, because
pages of a mapping may belong to multiple processes. So we need to add
another rmap method? The same issue also occurs for unmap_hugepages().

> There's no reason why KVM should take any risk of corrupting memory
> due to a single missing mmu notifier, with not taking the
> refcount. get_user_pages will take it for us, so we have to pay the
> atomic-op anyway. It sure worth doing the atomic_dec inside the mmu
> notifier, and not immediately like this:

Well, the GRU uses follow_page() instead of get_user_pages().
Performance is a major issue for the GRU.

> get_user_pages(pages)
> __free_page(pages[0])
>
> The idea is that what works for GRU, works for KVM too.
> So we do a single invalidate_page and clustered invalidate_pages, we
> add that, and then we make sure all places are covered so GRU will not
> kernel-crash, and KVM won't risk to run oom or to generate _userland_
> corruption.

Hmmmm.. Could we go to a scheme where we do not have to increase the
page count? Modifications of the page struct require dirtying a cache
line, and it seems that we do not need an increased page count if we
have an invalidate_range_start() that clears all the external
references and stops the establishment of new ones, and an
invalidate_range_end() that reenables new external references?

Then we do not need the frequent invalidate_page() calls. The typical
case would be anyway that invalidate_all() is called before anything
else on exit. invalidate_all() would remove all pages and disable the
creation of new references to the memory in the mm_struct.
[parent not found: <Pine.LNX.4.64.0801301728110.2454-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <Pine.LNX.4.64.0801301728110.2454-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
@ 2008-01-31  2:34 ` Robin Holt
  [not found]   ` <20080131023401.GY26420-sJ/iWh9BUns@public.gmane.org>
  2008-01-31 10:52 ` Andrea Arcangeli
  1 sibling, 1 reply; 150+ messages in thread
From: Robin Holt @ 2008-01-31  2:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

> Well the GRU uses follow_page() instead of get_user_pages. Performance is
> a major issue for the GRU.

Worse, the GRU takes its TLB faults from within an interrupt, so we use
follow_page to avoid going to sleep. That said, I think we could
probably use follow_page() with FOLL_GET set to accomplish the
requirements of the mmu_notifier invalidate_range call. It doesn't look
too promising for hugetlb pages, though.

Thanks, Robin
[parent not found: <20080131023401.GY26420-sJ/iWh9BUns@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080131023401.GY26420-sJ/iWh9BUns@public.gmane.org>
@ 2008-01-31  2:37 ` Christoph Lameter
  0 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:37 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> > Well the GRU uses follow_page() instead of get_user_pages. Performance is
> > a major issue for the GRU.
>
> Worse, the GRU takes its TLB faults from within an interrupt so we
> use follow_page to prevent going to sleep. That said, I think we
> could probably use follow_page() with FOLL_GET set to accomplish the
> requirements of mmu_notifier invalidate_range call. Doesn't look too
> promising for hugetlb pages.

There may be no need to with the range_start/end scheme. The driver can
have its own lock to make follow_page() safe. The lock needs to
serialize the follow_page handler and the range_start/end calls as well
as the invalidate_page callouts. I think that avoids the need for
get_user_pages().
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <Pine.LNX.4.64.0801301728110.2454-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
  2008-01-31  2:34 ` Robin Holt
@ 2008-01-31 10:52 ` Andrea Arcangeli
  1 sibling, 0 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 10:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 05:46:21PM -0800, Christoph Lameter wrote:
> Well the GRU uses follow_page() instead of get_user_pages. Performance is
> a major issue for the GRU.

The GRU is an external TLB; we have to allocate RAM instead, but we do
it through the regular userland paging mechanism. Performance is a
major issue for kvm too, but the result of get_user_pages is used to
fill a spte, so the cpu will then use the spte in hardware to fill its
tlb. We won't have to keep calling follow_page in software to fill the
tlb like the GRU has to do, so you can imagine the difference in cpu
utilization spent in those paths (plus our requirement to allocate
memory).

> Hmmmm.. Could we go to a scheme where we do not have to increase the page
> count? Modifications of the page struct require dirtying a cache line and

I doubt the atomic_inc is measurable given the rest of the overhead,
like building the rmap for each new spte. There's no technical reason
for not wanting proper reference counting other than microoptimization.
What will work for GRU will work for KVM too, regardless of the
reference counting. Each mmu-notifier user should be free to do what it
thinks is better/safer or more convenient (and for anybody calling
get_user_pages, having the refcounting on external references is
natural and comes at zero additional cost).
> it seems that we do not need an increased page count if we have an
> invalidate_range_start() that clears all the external references
> and stops the establishment of new ones and invalidate_range_end() that
> reenables new external references?
>
> Then we do not need the frequent invalidate_page() calls.

The increased page count is _mandatory_ to safely use range_start/end
called outside the locks, with _end called after releasing the old
page. sptes will build themselves the whole time until the pte_clear is
called on the main linux pte. We don't want to clutter the VM fast
paths with additional locks to stop the kvm page fault while the VM is
in the _range_start/end critical section, like xpmem has to do to be
safe. So you're contradicting yourself by suggesting not to use
invalidate_page and not to use an increased page count at the same
time. And I need invalidate_page anyway for rmap.c, which can't be
provided as an invalidate_range and can't sleep either.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080131003434.GE7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
  2008-01-31  1:46 ` Christoph Lameter
@ 2008-01-31  2:08 ` Christoph Lameter
  [not found]   ` <Pine.LNX.4.64.0801301805200.14071-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
  1 sibling, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:08 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

Patch to

1. Remove sync on notifier_release. Must be called when only a single
   process remains.

2. Add invalidate_range_start/end. This should allow safe removal of
   ranges of external ptes without having to resort to a callback for
   every individual page. This must be able to nest, so the driver
   needs to keep a refcount of range invalidates and wait if the
   refcount != 0.

---
 include/linux/mmu_notifier.h |   11 +++++++++--
 mm/fremap.c                  |    3 ++-
 mm/hugetlb.c                 |    3 ++-
 mm/memory.c                  |   16 ++++++++++------
 mm/mmu_notifier.c            |    9 ++++-----
 5 files changed, 27 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 18:00:26.000000000 -0800
@@ -13,23 +13,22 @@
 #include <linux/mm.h>
 #include <linux/mmu_notifier.h>
 
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
 void mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
-		down_write(&mm->mmap_sem);
-		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
 			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
 		}
-		rcu_read_unlock();
-		up_write(&mm->mmap_sem);
-		synchronize_rcu();
 	}
 }
 
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 18:00:26.000000000 -0800
@@ -67,15 +67,22 @@ struct mmu_notifier_ops {
 			int dummy);
 
 	/*
+	 * invalidate_range_begin() and invalidate_range_end() are paired.
+	 *
+	 * invalidate_range_begin must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
 	 * lock indicates that the function is called under spinlock.
 	 */
 	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
 				 int lock);
 
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
-				 struct mm_struct *mm,
-				 unsigned long start, unsigned long end);
+				 struct mm_struct *mm);
 };
 
 struct mmu_rmap_notifier_ops;
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-30 18:00:26.000000000 -0800
@@ -212,8 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_start, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
-	mmu_notifier(invalidate_range, mm, start, start + size, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-30 18:00:26.000000000 -0800
@@ -744,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_start, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -764,7 +765,7 @@
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
-	mmu_notifier(invalidate_range, mm, start, end, 1);
+	mmu_notifier(invalidate_range_end, mm);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-30 18:00:51.000000000 -0800
@@ -888,11 +888,12 @@ unsigned long zap_page_range(struct vm_a
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_start, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL) : 0));
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
-	mmu_notifier(invalidate_range, mm, address, end,
-		(details ? (details->i_mmap_lock != NULL) : 0));
+	mmu_notifier(invalidate_range_end, mm);
 	return end;
 }
 
@@ -1355,6 +1356,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_start, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1362,7 +1364,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	mmu_notifier(invalidate_range, mm, start, end, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1450,6 +1452,7 @@ int apply_to_page_range(struct mm_struct
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_start, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1457,7 +1460,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	mmu_notifier(invalidate_range, mm, start, end, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1635,6 +1638,8 @@ gotten:
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
 
+	mmu_notifier(invalidate_range_start, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1673,8 +1678,7 @@ gotten:
 	page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier(invalidate_range, mm, address,
-				address + PAGE_SIZE - 1, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
[parent not found: <Pine.LNX.4.64.0801301805200.14071-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <Pine.LNX.4.64.0801301805200.14071-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
@ 2008-01-31  2:42 ` Andrea Arcangeli
  [not found]   ` <20080131024252.GF7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 150+ messages in thread
From: Andrea Arcangeli @ 2008-01-31  2:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
> 		hlist_for_each_entry_safe_rcu(mn, n, t,
		                         ^^^^
> 					  &mm->mmu_notifier.head, hlist) {
> 			hlist_del_rcu(&mn->hlist);
			         ^^^^

_rcu can go away from both, if hlist_del_rcu can be called w/o locks.
[parent not found: <20080131024252.GF7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080131024252.GF7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
@ 2008-01-31  2:51 ` Christoph Lameter
  [not found]   ` <Pine.LNX.4.64.0801301848550.14263-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 150+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
> > 		hlist_for_each_entry_safe_rcu(mn, n, t,
> > 					  &mm->mmu_notifier.head, hlist) {
> > 			hlist_del_rcu(&mn->hlist);
>
> _rcu can go away from both, if hlist_del_rcu can be called w/o locks.

True. Is hlist_del_init ok? That would allow the driver to check that
the mmu_notifier is already linked in, using !hlist_unhashed(). The
driver then needs to properly initialize the mmu_notifier list with
INIT_HLIST_NODE().
[parent not found: <Pine.LNX.4.64.0801301848550.14263-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>]
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <Pine.LNX.4.64.0801301848550.14263-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
@ 2008-01-31 13:39 ` Andrea Arcangeli
  0 siblings, 0 replies; 150+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 13:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Peter Zijlstra,
	kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	Benjamin Herrenschmidt, steiner-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Avi Kivity,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 06:51:26PM -0800, Christoph Lameter wrote:
> True. hlist_del_init ok? That would allow to check the driver that the
> mmu_notifier is already linked in using !hlist_unhashed(). Driver then
> needs to properly initialize the mmu_notifier list with INIT_HLIST_NODE().

A driver couldn't possibly care about the mmu notifier anymore at that
point. We just agreed a moment ago that the list can't change under
mmu_notifier_release, and in turn no driver could possibly call
mmu_notifier_unregister/register at that point anymore, regardless of
the outcome of hlist_unhashed; external serialization must let the
driver know it's done with the notifiers.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  [not found] ` <20080130161123.GS26420-sJ/iWh9BUns@public.gmane.org>
  2008-01-30 17:04 ` Andrea Arcangeli
@ 2008-01-30 19:35 ` Christoph Lameter
  1 sibling, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:35 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Andrea Arcangeli, Peter Zijlstra,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Benjamin Herrenschmidt,
	steiner-sJ/iWh9BUns, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Avi Kivity, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	daniel.blueman-xqY44rlHlBpWk0Htik3J/w, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> I think I need to straighten this discussion out in my head a little bit.
> Am I correct in assuming Andrea's original patch set did not have any SMP
> race conditions for KVM? If so, then we need to start looking at how to
> implement Christoph's and my changes in a safe fashion. Andrea, I agree
> completely that our introduction of the range callouts has introduced
> SMP races.

The original patch drew the clearing of the sptes into
ptep_clear_flush(), so invalidate_page was called for each page
regardless of whether we had been doing an invalidate_range before or
not. It seems that the invalidate_range() was just there for
optimization.

> The three issues we need to simultaneously solve is revoking the remote
> page table/tlb information while still in a sleepable context and not
> having the remote faulters become out of sync with the granting process.
> Currently, I don't see a way to do that cleanly with a single callout.

You could use the invalidate_page callouts to set a flag that no
additional rmap entries may be added until the invalidate_range has
occurred? We could add back all the original invalidate_pages() and
pass a flag that specifies that an invalidate_range will follow. The
notifier can then decide what to do with that information. If it's okay
to defer, then do nothing and wait for the range_invalidate.
XPmem could stop allowing external references to be established until
the invalidate_range was successful. Jack had a concern that multiple
callouts for the same pte could cause problems.
end of thread, other threads:[~2008-03-05 0:52 UTC | newest]
Thread overview: 150+ messages
2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem) Christoph Lameter
2008-02-08 22:06 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
2008-02-08 22:23 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Andrew Morton
[not found] ` <20080208142315.7fe4b95e.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2008-02-08 23:32 ` Christoph Lameter
2008-02-08 23:36 ` [ofa-general] " Robin Holt
2008-02-08 23:41 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0802081540180.4291-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-02-08 23:43 ` Robin Holt
2008-02-08 23:56 ` [ofa-general] " Andrew Morton
2008-02-09 0:05 ` Christoph Lameter
2008-02-09 0:12 ` Roland Dreier
2008-02-09 0:16 ` Christoph Lameter
2008-02-09 0:21 ` [ofa-general] trying to get of all lists R S
2008-02-09 0:22 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Roland Dreier
[not found] ` <adalk5v0yi6.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2008-02-09 0:36 ` Christoph Lameter
2008-02-09 1:24 ` Andrea Arcangeli
2008-02-09 1:27 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0802081725200.5445-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-02-09 1:56 ` Andrea Arcangeli
2008-02-09 2:16 ` Christoph Lameter
2008-02-09 12:55 ` Rik van Riel
2008-02-09 21:46 ` Christoph Lameter
2008-02-11 22:40 ` [ofa-general] Demand paging for memory regions (was Re: MMU Notifiers V6) Roland Dreier
2008-02-12 22:01 ` [ofa-general] " Steve Wise
2008-02-12 22:10 ` Christoph Lameter
2008-02-12 22:41 ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
2008-02-12 23:14 ` Felix Marti
2008-02-13 0:57 ` Christoph Lameter
2008-02-14 15:09 ` Steve Wise
2008-02-14 15:53 ` Robin Holt
2008-02-14 16:23 ` Steve Wise
2008-02-14 17:48 ` Caitlin Bestler
2008-02-14 19:39 ` Christoph Lameter
2008-02-14 20:17 ` Caitlin Bestler
2008-02-14 20:20 ` Christoph Lameter
2008-02-14 22:43 ` Caitlin Bestler
2008-02-14 22:48 ` Christoph Lameter
2008-02-15 1:26 ` Caitlin Bestler
2008-02-15 2:37 ` Christoph Lameter
2008-02-15 18:09 ` Caitlin Bestler
2008-02-15 18:45 ` Christoph Lameter
2008-02-15 18:53 ` Caitlin Bestler
2008-02-15 20:02 ` Christoph Lameter
2008-02-15 20:14 ` Caitlin Bestler
2008-02-15 22:50 ` Christoph Lameter
2008-02-15 23:50 ` Caitlin Bestler
2008-02-12 23:23 ` Jason Gunthorpe
2008-02-13 1:01 ` Christoph Lameter
2008-02-13 1:26 ` Jason Gunthorpe
2008-02-13 1:45 ` Steve Wise
2008-02-13 2:35 ` Christoph Lameter
2008-02-13 3:25 ` Jason Gunthorpe
2008-02-13 18:51 ` Christoph Lameter
2008-02-13 19:51 ` Jason Gunthorpe
2008-02-13 20:36 ` Christoph Lameter
2008-02-13 4:09 ` Christian Bell
2008-02-13 19:00 ` Christoph Lameter
2008-02-13 19:46 ` Christian Bell
2008-02-13 20:32 ` Christoph Lameter
2008-02-13 22:44 ` Kanoj Sarcar
2008-02-13 23:02 ` Christoph Lameter
2008-02-13 23:43 ` Kanoj Sarcar
2008-02-13 23:48 ` Jesse Barnes
2008-02-14 0:56 ` Andrea Arcangeli
2008-02-14 19:35 ` Christoph Lameter
2008-02-13 23:23 ` Pete Wyckoff
2008-02-14 0:01 ` Jason Gunthorpe
2008-02-27 22:11 ` Christoph Lameter
2008-02-13 1:55 ` Christian Bell
2008-02-13 2:19 ` Christoph Lameter
2008-02-13 0:56 ` Christoph Lameter
2008-02-13 12:11 ` Christoph Raisch
2008-02-13 19:02 ` Christoph Lameter
2008-02-09 0:12 ` [ofa-general] Re: [patch 0/6] MMU Notifiers V6 Andrew Morton
2008-02-09 0:18 ` Christoph Lameter
2008-02-13 14:31 ` Jack Steiner
-- strict thread matches above, loose matches on Subject: below --
2008-02-15 6:48 [ofa-general] [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-19 8:54 ` [ofa-general] " Nick Piggin
2008-02-19 13:34 ` Andrea Arcangeli
2008-02-27 22:23 ` Christoph Lameter
2008-02-27 23:57 ` Andrea Arcangeli
2008-02-19 23:08 ` [ofa-general] " Nick Piggin
2008-02-20 1:00 ` Andrea Arcangeli
2008-02-20 3:00 ` Robin Holt
2008-02-20 3:11 ` Nick Piggin
2008-02-27 22:39 ` Christoph Lameter
2008-02-28 0:38 ` Andrea Arcangeli
2008-02-27 22:35 ` [ofa-general] " Christoph Lameter
2008-02-28 0:10 ` Christoph Lameter
2008-02-28 0:11 ` [ofa-general] " Andrea Arcangeli
2008-02-28 0:14 ` Christoph Lameter
2008-02-28 0:52 ` [ofa-general] " Andrea Arcangeli
2008-02-28 1:03 ` Christoph Lameter
2008-02-28 1:10 ` Andrea Arcangeli
2008-02-28 18:43 ` [ofa-general] " Christoph Lameter
2008-02-29 0:55 ` Andrea Arcangeli
2008-02-29 0:59 ` [ofa-general] " Christoph Lameter
2008-02-29 13:13 ` Andrea Arcangeli
2008-02-29 19:55 ` [ofa-general] " Christoph Lameter
2008-02-29 20:17 ` Andrea Arcangeli
2008-02-29 21:03 ` [ofa-general] " Christoph Lameter
2008-02-29 21:23 ` Andrea Arcangeli
2008-02-29 21:29 ` Christoph Lameter
2008-02-29 21:34 ` Christoph Lameter
2008-02-29 21:48 ` [ofa-general] " Andrea Arcangeli
2008-02-29 22:12 ` Christoph Lameter
2008-03-03 5:11 ` Nick Piggin
2008-03-03 19:28 ` [ofa-general] " Christoph Lameter
2008-03-03 19:50 ` Nick Piggin
2008-03-04 18:58 ` Christoph Lameter
2008-03-05 0:52 ` Nick Piggin
2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
2008-01-30 2:29 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
[not found] ` <20080128202923.849058104-sJ/iWh9BUns@public.gmane.org>
2008-01-29 16:20 ` Andrea Arcangeli
[not found] ` <20080129162004.GL7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 18:28 ` Andrea Arcangeli
[not found] ` <20080129182831.GS7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 20:30 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291219030.25629-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-29 21:36 ` Andrea Arcangeli
[not found] ` <20080129213604.GW7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 21:53 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291343530.26824-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-29 22:35 ` Andrea Arcangeli
[not found] ` <20080129223503.GY7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 22:55 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291440170.27327-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-29 23:43 ` Andrea Arcangeli
[not found] ` <20080129234353.GZ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 0:34 ` Christoph Lameter
2008-01-29 19:55 ` Christoph Lameter
2008-01-29 21:17 ` Andrea Arcangeli
[not found] ` <20080129211759.GV7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 21:35 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291327330.26649-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-29 22:02 ` Andrea Arcangeli
[not found] ` <20080129220212.GX7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-29 22:39 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291407380.27104-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-30 0:00 ` Andrea Arcangeli
[not found] ` <20080130000039.GA7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 0:05 ` Andrea Arcangeli
[not found] ` <20080130000559.GB7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 0:22 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291621380.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-30 0:59 ` Andrea Arcangeli
2008-01-30 8:26 ` Peter Zijlstra
2008-01-30 0:20 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801291620170.28027-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-30 0:28 ` Jack Steiner
[not found] ` <20080130002804.GA13840-sJ/iWh9BUns@public.gmane.org>
2008-01-30 0:35 ` Christoph Lameter
2008-01-30 13:37 ` Andrea Arcangeli
[not found] ` <20080130133720.GM7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 14:43 ` Jack Steiner
[not found] ` <20080130144305.GA25193-sJ/iWh9BUns@public.gmane.org>
2008-01-30 19:41 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301140320.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-30 20:29 ` Jack Steiner
[not found] ` <20080130202918.GB11324-sJ/iWh9BUns@public.gmane.org>
2008-01-30 20:55 ` Christoph Lameter
2008-01-30 16:11 ` Robin Holt
[not found] ` <20080130161123.GS26420-sJ/iWh9BUns@public.gmane.org>
2008-01-30 17:04 ` Andrea Arcangeli
[not found] ` <20080130170451.GP7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 17:30 ` Robin Holt
[not found] ` <20080130173009.GT26420-sJ/iWh9BUns@public.gmane.org>
2008-01-30 18:25 ` Andrea Arcangeli
[not found] ` <20080130182506.GQ7233-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-30 19:50 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301147330.30568-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-30 22:18 ` Robin Holt
2008-01-30 23:52 ` Andrea Arcangeli
[not found] ` <20080130235214.GC7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-31 0:01 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301555550.1722-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-31 0:34 ` Andrea Arcangeli
[not found] ` <20080131003434.GE7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-31 1:46 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301728110.2454-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-31 2:34 ` Robin Holt
[not found] ` <20080131023401.GY26420-sJ/iWh9BUns@public.gmane.org>
2008-01-31 2:37 ` Christoph Lameter
2008-01-31 10:52 ` Andrea Arcangeli
2008-01-31 2:08 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301805200.14071-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-31 2:42 ` Andrea Arcangeli
[not found] ` <20080131024252.GF7185-lysg2Xt5kKMAvxtiuMwx3w@public.gmane.org>
2008-01-31 2:51 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0801301848550.14263-RYO/mD75kfhx2SFC9UQUAuF7EQX82lMiAL8bYrjMMd8@public.gmane.org>
2008-01-31 13:39 ` Andrea Arcangeli
2008-01-30 19:35 ` Christoph Lameter