* [PATCH] ummunotify: Userspace support for MMU notifications
@ 2010-04-12 6:22 ` Eric B Munson
0 siblings, 0 replies; 30+ messages in thread
From: Eric B Munson @ 2010-04-12 6:22 UTC (permalink / raw)
To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w,
peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I,
mingo-X9Un+BFzKDI, Eric B Munson
Andrew,
I am resubmitting this patch because I believe that the discussion
has shown this to be an acceptable solution. I have fixed the 32 bit
build errors, but other than that change, the code is the same as
Roland's V3 patch.
From: Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc. Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.
We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts. Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node. A userspace process can open this node read-only and use the fd
as follows:
1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunotify_event in
<linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are
handled for this IO.
3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated. This allows userspace to have a fast path that checks
that no events have occurred without a system call.
Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
suggestions on the interface design. Also thanks to Jeff Squyres
<jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.
Signed-off-by: Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Eric B Munson <ebmunson-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
Changes since v3:
- Fixed replaced [get|put] user with copy_[from|to]_user to fix x86
builds
---
Documentation/Makefile | 3 +-
Documentation/ummunotify/Makefile | 7 +
Documentation/ummunotify/ummunotify.txt | 150 ++++++++
Documentation/ummunotify/umn-test.c | 200 +++++++++++
drivers/char/Kconfig | 12 +
drivers/char/Makefile | 1 +
drivers/char/ummunotify.c | 567 +++++++++++++++++++++++++++++++
include/linux/ummunotify.h | 121 +++++++
8 files changed, 1060 insertions(+), 1 deletions(-)
create mode 100644 Documentation/ummunotify/Makefile
create mode 100644 Documentation/ummunotify/ummunotify.txt
create mode 100644 Documentation/ummunotify/umn-test.c
create mode 100644 drivers/char/ummunotify.c
create mode 100644 include/linux/ummunotify.h
diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fc7ea1..27ba76a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
- pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
+ pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
+ watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 0000000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 0000000..78a79c2
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,150 @@
+UMMUNOTIFY
+
+ Ummunotify relays MMU notifier events to userspace. This is useful
+ for libraries that need to track the memory mapping of applications;
+ for example, MPI implementations using RDMA want to cache memory
+ registrations for performance, but tracking all possible crazy cases
+ such as when, say, the FORTRAN runtime frees memory is impossible
+ without kernel help.
+
+Basic Model
+
+ A userspace process uses it by opening /dev/ummunotify, which
+ returns a file descriptor. Interest in address ranges is registered
+ using ioctl() and MMU notifier events are retrieved using read(), as
+ described in more detail below. Userspace can register multiple
+ address ranges to watch, and can unregister individual ranges.
+
+ Userspace can also mmap() a single read-only page at offset 0 on
+ this file descriptor. This page contains (at offest 0) a single
+ 64-bit generation counter that the kernel increments each time an
+ MMU notifier event occurs. Userspace can use this to very quickly
+ check if there are any events to retrieve without needing to do a
+ system call.
+
+Control
+
+ To start using ummunotify, a process opens /dev/ummunotify in
+ read-only mode. Control from userspace is done via ioctl(); the
+ defined ioctls are:
+
+ UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit
+ word of feature flags as input, and the kernel updates the
+ features flags word to contain only features requested by
+ userspace and also supported by the kernel.
+
+ This ioctl is only included for forward compatibility; no
+ feature flags are currently defined, and the kernel will simply
+ update any requested feature mask to 0. The kernel will always
+ default to a feature mask of 0 if this ioctl is not used, so
+ current userspace does not need to perform this ioctl.
+
+ UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the
+ kernel to start delivering events for an address range. The
+ range is described using struct ummunotify_register_ioctl:
+
+ struct ummunotify_register_ioctl {
+ __u64 start;
+ __u64 end;
+ __u64 user_cookie;
+ __u32 flags;
+ __u32 reserved;
+ };
+
+ start and end give the range of userspace virtual addresses;
+ start is included in the range and end is not, so an example of
+ a 4 KB range would be start=0x1000, end=0x2000.
+
+ user_cookie is an opaque 64-bit quantity that is returned by the
+ kernel in events involving the range, and used by userspace to
+ stop watching the range. Each registered address range must
+ have a distinct user_cookie.
+
+ It is fine with the kernel if userspace registers multiple
+ overlapping or even duplicate address ranges, as long as a
+ different cookie is used for each registration.
+
+ flags and reserved are included for forward compatibility;
+ userspace should simply set them to 0 for the current interface.
+
+ UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit
+ user_cookie used to register a range to tell the kernel to stop
+ watching an address range. Once this ioctl completes, the
+ kernel will not deliver any further events for the range that is
+ unregistered.
+
+Events
+
+ When an event occurs that invalidates some of a process's memory
+ mapping in an address range being watched, ummunotify queues an
+ event report for that address range. If more than one event
+ invalidates parts of the same address range before userspace
+ retrieves the queued report, then further reports for the same range
+ will not be queued -- when userspace does read the queue, only a
+ single report for a given range will be returned.
+
+ If multiple ranges being watched are invalidated by a single event
+ (which is especially likely if userspace registers overlapping
+ ranges), then an event report structure will be queued for each
+ address range registration.
+
+ Userspace retrieves queued events via read() on the ummunotify file
+ descriptor; a buffer that is at least as big as struct
+ ummunotify_event should be used to retrieve event reports, and if a
+ larger buffer is passed to read(), multiple reports will be returned
+ (if available).
+
+ If the ummunotify file descriptor is in blocking mode, a read() call
+ will wait for an event report to be available. Userspace may also
+ set the ummunotify file descriptor to non-blocking mode and use all
+ standard ways of waiting for data to be available on the ummunotify
+ file descriptor, including epoll/poll()/select() and SIGIO.
+
+ The format of event reports is:
+
+ struct ummunotify_event {
+ __u32 type;
+ __u32 flags;
+ __u64 hint_start;
+ __u64 hint_end;
+ __u64 user_cookie_counter;
+ };
+
+ where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or
+ UMMUNOTIFY_EVENT_TYPE_LAST. Events of type INVAL describe
+ invalidation events as follows: user_cookie_counter contains the
+ cookie passed in when userspace registered the range that the event
+ is for. hint_start and hint_end contain the start address and end
+ address that were invalidated.
+
+ The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT
+ defined at the moment. If HINT is set, then the invalidation event
+ invalidated less than the full address range and the kernel returns
+ the exact range invalidated; if HINT is not sent then hint_start and
+ hint_end are set to the original range registered by userspace.
+ (HINT will not be set if, for example, multiple events invalidated
+ disjoint parts of the range and so a single start/end pair cannot
+ represent the parts of the range that were invalidated)
+
+ If the event type is LAST, then the read operation has emptied the
+ list of invalidated regions, and the flags, hint_start and hint_end
+ fields are not used. user_cookie_counter holds the value of the
+ kernel's generation counter (see below of more details) when the
+ empty list occurred.
+
+Generation Count
+
+ Userspace may mmap() a page on a ummunotify file descriptor via
+
+ mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0);
+
+ to get a read-only mapping of the kernel's 64-bit generation
+ counter. The kernel will increment this generation counter each
+ time an event report is queued.
+
+ Userspace can use the generation counter as a quick check to avoid
+ system calls; if the value read from the mapped kernel counter is
+ still equal to the value returned in user_cookie_counter for the
+ most recent LAST event retrieved, then no further events have been
+ queued and there is no need to try a read() on the ummunotify file
+ descriptor.
diff --git a/Documentation/ummunotify/umn-test.c b/Documentation/ummunotify/umn-test.c
new file mode 100644
index 0000000..143db2c
--- /dev/null
+++ b/Documentation/ummunotify/umn-test.c
@@ -0,0 +1,200 @@
+/*
+ * Copyright (c) 2009 Cisco Systems. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <stdint.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <linux/ummunotify.h>
+
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+
+#define UMN_TEST_COOKIE 123
+
+static int umn_fd;
+static volatile __u64 *umn_counter;
+
+static int umn_init(void)
+{
+ __u32 flags;
+
+ umn_fd = open("/dev/ummunotify", O_RDONLY);
+ if (umn_fd < 0) {
+ perror("open");
+ return 1;
+ }
+
+ if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) {
+ perror("exchange ioctl");
+ return 1;
+ }
+
+ printf("kernel feature flags: 0x%08x\n", flags);
+
+ umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
+ MAP_SHARED, umn_fd, 0);
+ if (umn_counter == MAP_FAILED) {
+ perror("mmap");
+ return 1;
+ }
+
+ return 0;
+}
+
+static int umn_register(void *buf, size_t size, __u64 cookie)
+{
+ struct ummunotify_register_ioctl r = {
+ .start = (unsigned long) buf,
+ .end = (unsigned long) buf + size,
+ .user_cookie = cookie,
+ };
+
+ if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) {
+ perror("register ioctl");
+ return 1;
+ }
+
+ return 0;
+}
+
+static int umn_unregister(__u64 cookie)
+{
+ if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) {
+ perror("unregister ioctl");
+ return 1;
+ }
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int page_size;
+ __u64 old_counter;
+ void *t;
+ int got_it;
+
+ if (umn_init())
+ return 1;
+
+ printf("\n");
+
+ old_counter = *umn_counter;
+ if (old_counter != 0) {
+ fprintf(stderr, "counter = %lld (expected 0)\n", old_counter);
+ return 1;
+ }
+
+ page_size = sysconf(_SC_PAGESIZE);
+ t = mmap(NULL, 3 * page_size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+
+ if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE))
+ return 1;
+
+ munmap(t + page_size, page_size);
+
+ old_counter = *umn_counter;
+ if (old_counter != 1) {
+ fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
+ return 1;
+ }
+
+ got_it = 0;
+ while (1) {
+ struct ummunotify_event ev;
+ int len;
+
+ len = read(umn_fd, &ev, sizeof ev);
+ if (len < 0) {
+ perror("read event");
+ return 1;
+ }
+ if (len != sizeof ev) {
+ fprintf(stderr, "Read gave %d bytes (!= event size %zd)\n",
+ len, sizeof ev);
+ return 1;
+ }
+
+ switch (ev.type) {
+ case UMMUNOTIFY_EVENT_TYPE_INVAL:
+ if (got_it) {
+ fprintf(stderr, "Extra invalidate event\n");
+ return 1;
+ }
+ if (ev.user_cookie_counter != UMN_TEST_COOKIE) {
+ fprintf(stderr, "Invalidate event for cookie %lld (expected %d)\n",
+ ev.user_cookie_counter,
+ UMN_TEST_COOKIE);
+ return 1;
+ }
+
+ printf("Invalidate event:\tcookie %lld\n",
+ ev.user_cookie_counter);
+
+ if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) {
+ fprintf(stderr, "Hint flag not set\n");
+ return 1;
+ }
+
+ if (ev.hint_start != (uintptr_t) t + page_size ||
+ ev.hint_end != (uintptr_t) t + page_size * 2) {
+ fprintf(stderr, "Got hint %llx..%llx, expected %p..%p\n",
+ ev.hint_start, ev.hint_end,
+ t + page_size, t + page_size * 2);
+ return 1;
+ }
+
+ printf("\t\t\thint %llx...%llx\n",
+ ev.hint_start, ev.hint_end);
+
+ got_it = 1;
+ break;
+
+ case UMMUNOTIFY_EVENT_TYPE_LAST:
+ if (!got_it) {
+ fprintf(stderr, "Last event without invalidate event\n");
+ return 1;
+ }
+
+ printf("Empty event:\t\tcounter %lld\n",
+ ev.user_cookie_counter);
+ goto done;
+
+ default:
+ fprintf(stderr, "unknown event type %d\n",
+ ev.type);
+ return 1;
+ }
+ }
+
+done:
+ umn_unregister(123);
+ munmap(t, page_size);
+
+ old_counter = *umn_counter;
+ if (old_counter != 1) {
+ fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 3141dd3..cf26019 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1111,6 +1111,18 @@ config DEVPORT
depends on ISA || PCI
default y
+config UMMUNOTIFY
+ tristate "Userspace MMU notifications"
+ select MMU_NOTIFIER
+ help
+ The ummunotify (userspace MMU notification) driver creates a
+ character device that can be used by userspace libraries to
+ get notifications when an application's memory mapping
+ changed. This is used, for example, by RDMA libraries to
+ improve the reliability of memory registration caching, since
+ the kernel's MMU notifications can be used to know precisely
+ when to shoot down a cached registration.
+
source "drivers/s390/char/Kconfig"
endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index f957edf..521e5de 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO) += nsc_gpio.o
obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o
obj-$(CONFIG_GPIO_TB0219) += tb0219.o
obj-$(CONFIG_TELCLOCK) += tlclk.o
+obj-$(CONFIG_UMMUNOTIFY) += ummunotify.o
obj-$(CONFIG_MWAVE) += mwave/
obj-$(CONFIG_AGP) += agp/
diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
new file mode 100644
index 0000000..c14df3f
--- /dev/null
+++ b/drivers/char/ummunotify.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright (c) 2009 Cisco Systems. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunotify.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("GPL v2");
+
+/*
+ * Information about an address range userspace has asked us to watch.
+ *
+ * user_cookie: Opaque cookie given to us when userspace registers the
+ * address range.
+ *
+ * start, end: Address range; start is inclusive, end is exclusive.
+ *
+ * hint_start, hint_end: If a single MMU notification event
+ * invalidates the address range, we hold the actual range of
+ * addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
+ * If another event hits this range before userspace reads the
+ * event, we give up and don't try to keep track of which subsets
+ * got invalidated.
+ *
+ * flags: Holds the INVALID flag for ranges that are on the invalid
+ * list and/or the HINT flag for ranges where the hint range holds
+ * good information.
+ *
+ * node: Used to put the range into an rbtree we use to be able to
+ * scan address ranges in order.
+ *
+ * list: Used to put the range on the invalid list when an MMU
+ * notification event hits the range.
+ */
+enum {
+ UMMUNOTIFY_FLAG_INVALID = 1,
+ UMMUNOTIFY_FLAG_HINT = 2,
+};
+
+struct ummunotify_reg {
+ u64 user_cookie;
+ unsigned long start;
+ unsigned long end;
+ unsigned long hint_start;
+ unsigned long hint_end;
+ unsigned long flags;
+ struct rb_node node;
+ struct list_head list;
+};
+
+/*
+ * Context attached to each file that userspace opens.
+ *
+ * mmu_notifier: MMU notifier registered for this context.
+ *
+ * mm: mm_struct for process that created the context; we use this to
+ * hold a reference to the mm to make sure it doesn't go away until
+ * we're done with it.
+ *
+ * reg_tree: RB tree of address ranges being watched, sorted by start
+ * address.
+ *
+ * invalid_list: List of address ranges that have been invalidated by
+ * MMU notification events; as userspace reads events, the address
+ * range corresponding to the event is removed from the list.
+ *
+ * counter: Page that can be mapped read-only by userspace, which
+ * holds a generation count that is incremented each time an event
+ * occurs.
+ *
+ * lock: Spinlock used to protect all context.
+ *
+ * read_wait: Wait queue used to wait for data to become available in
+ * blocking read()s.
+ *
+ * async_queue: Used to implement fasync().
+ *
+ * need_empty: Set when userspace reads an invalidation event, so that
+ * read() knows it must generate an "empty" event when userspace
+ * drains the invalid_list.
+ *
+ * used: Set after userspace does anything with the file, so that the
+ * "exchange flags" ioctl() knows it's too late to change anything.
+ */
+struct ummunotify_file {
+ struct mmu_notifier mmu_notifier;
+ struct mm_struct *mm;
+ struct rb_root reg_tree;
+ struct list_head invalid_list;
+ u64 *counter;
+ spinlock_t lock;
+ wait_queue_head_t read_wait;
+ struct fasync_struct *async_queue;
+ int need_empty;
+ int used;
+};
+
+static void ummunotify_handle_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct ummunotify_file *priv =
+ container_of(mn, struct ummunotify_file, mmu_notifier);
+ struct rb_node *n;
+ struct ummunotify_reg *reg;
+ unsigned long flags;
+ int hit = 0;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+ reg = rb_entry(n, struct ummunotify_reg, node);
+
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one. So if both disjointness comparisons
+ * fail then the ranges overlap.
+ *
+ * Since we keep the tree of regions we're watching
+ * sorted by start address, we can end this loop as
+ * soon as we hit a region that starts past the end of
+ * the range for the event we're handling.
+ */
+ if (reg->start >= end)
+ break;
+
+ /*
+ * Just go to the next region if the start of the
+ * range is after the end of the region -- there
+ * might still be more overlapping ranges that have a
+ * greater start.
+ */
+ if (start >= reg->end)
+ continue;
+
+ hit = 1;
+
+ if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, ®->flags)) {
+ /* Already on invalid list */
+ clear_bit(UMMUNOTIFY_FLAG_HINT, ®->flags);
+ } else {
+ list_add_tail(®->list, &priv->invalid_list);
+ set_bit(UMMUNOTIFY_FLAG_HINT, ®->flags);
+ reg->hint_start = start;
+ reg->hint_end = end;
+ }
+ }
+
+ if (hit) {
+ ++(*priv->counter);
+ flush_dcache_page(virt_to_page(priv->counter));
+ wake_up_interruptible(&priv->read_wait);
+ kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+ }
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunotify_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr)
+{
+ ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ ummunotify_handle_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
+ .invalidate_page = ummunotify_invalidate_page,
+ .invalidate_range_start = ummunotify_invalidate_range_start,
+};
+
+static int ummunotify_open(struct inode *inode, struct file *filp)
+{
+ struct ummunotify_file *priv;
+ int ret;
+
+ if (filp->f_mode & FMODE_WRITE)
+ return -EINVAL;
+
+ priv = kmalloc(sizeof *priv, GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+ if (!priv->counter) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ priv->reg_tree = RB_ROOT;
+ INIT_LIST_HEAD(&priv->invalid_list);
+ spin_lock_init(&priv->lock);
+ init_waitqueue_head(&priv->read_wait);
+ priv->async_queue = NULL;
+ priv->need_empty = 0;
+ priv->used = 0;
+
+ priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
+ /*
+ * Register notifier last, since notifications can occur as
+ * soon as we register....
+ */
+ ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+ if (ret)
+ goto err_page;
+
+ priv->mm = current->mm;
+ atomic_inc(&priv->mm->mm_count);
+
+ filp->private_data = priv;
+
+ return 0;
+
+err_page:
+ free_page((unsigned long) priv->counter);
+
+err:
+ kfree(priv);
+ return ret;
+}
+
+static int ummunotify_close(struct inode *inode, struct file *filp)
+{
+ struct ummunotify_file *priv = filp->private_data;
+ struct rb_node *n;
+ struct ummunotify_reg *reg;
+
+ mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+ mmdrop(priv->mm);
+ free_page((unsigned long) priv->counter);
+
+ for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+ reg = rb_entry(n, struct ummunotify_reg, node);
+ kfree(reg);
+ }
+
+ kfree(priv);
+
+ return 0;
+}
+
+static bool ummunotify_readable(struct ummunotify_file *priv)
+{
+ return priv->need_empty || !list_empty(&priv->invalid_list);
+}
+
+static ssize_t ummunotify_read(struct file *filp, char __user *buf,
+ size_t count, loff_t *pos)
+{
+ struct ummunotify_file *priv = filp->private_data;
+ struct ummunotify_reg *reg;
+ ssize_t ret;
+ struct ummunotify_event *events;
+ int max;
+ int n;
+
+ priv->used = 1;
+
+ events = (void *) get_zeroed_page(GFP_KERNEL);
+ if (!events) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_irq(&priv->lock);
+
+ while (!ummunotify_readable(priv)) {
+ spin_unlock_irq(&priv->lock);
+
+ if (filp->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
+ if (wait_event_interruptible(priv->read_wait,
+ ummunotify_readable(priv))) {
+ ret = -ERESTARTSYS;
+ goto out;
+ }
+
+ spin_lock_irq(&priv->lock);
+ }
+
+ max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
+
+ for (n = 0; n < max; ++n) {
+ if (list_empty(&priv->invalid_list)) {
+ events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST;
+ events[n].user_cookie_counter = *priv->counter;
+ ++n;
+ priv->need_empty = 0;
+ break;
+ }
+
+ reg = list_first_entry(&priv->invalid_list,
+ struct ummunotify_reg, list);
+
+ events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
+ if (test_bit(UMMUNOTIFY_FLAG_HINT, ®->flags)) {
+ events[n].flags = UMMUNOTIFY_EVENT_FLAG_HINT;
+ events[n].hint_start = max(reg->start, reg->hint_start);
+ events[n].hint_end = min(reg->end, reg->hint_end);
+ } else {
+ events[n].hint_start = reg->start;
+ events[n].hint_end = reg->end;
+ }
+ events[n].user_cookie_counter = reg->user_cookie;
+
+ list_del(®->list);
+ reg->flags = 0;
+ priv->need_empty = 1;
+ }
+
+ spin_unlock_irq(&priv->lock);
+
+ if (copy_to_user(buf, events, n * sizeof *events))
+ ret = -EFAULT;
+ else
+ ret = n * sizeof *events;
+
+out:
+ free_page((unsigned long) events);
+ return ret;
+}
+
+static unsigned int ummunotify_poll(struct file *filp,
+ struct poll_table_struct *wait)
+{
+ struct ummunotify_file *priv = filp->private_data;
+
+ poll_wait(filp, &priv->read_wait, wait);
+
+ return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static long ummunotify_exchange_features(struct ummunotify_file *priv,
+ __u32 __user *arg)
+{
+ u32 feature_mask;
+
+ if (priv->used)
+ return -EINVAL;
+
+ priv->used = 1;
+
+ if (copy_from_user(&feature_mask, arg, sizeof(feature_mask)))
+ return -EFAULT;
+
+ /* No extensions defined at present. */
+ feature_mask = 0;
+
+ if (copy_to_user(arg, &feature_mask, sizeof(feature_mask)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static long ummunotify_register_region(struct ummunotify_file *priv,
+ void __user *arg)
+{
+ struct ummunotify_register_ioctl parm;
+ struct ummunotify_reg *reg, *treg;
+ struct rb_node **n = &priv->reg_tree.rb_node;
+ struct rb_node *pn;
+ int ret = 0;
+
+ if (copy_from_user(&parm, arg, sizeof parm))
+ return -EFAULT;
+
+ priv->used = 1;
+
+ reg = kmalloc(sizeof *reg, GFP_KERNEL);
+ if (!reg)
+ return -ENOMEM;
+
+ reg->user_cookie = parm.user_cookie;
+ reg->start = parm.start;
+ reg->end = parm.end;
+ reg->flags = 0;
+
+ spin_lock_irq(&priv->lock);
+
+ for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
+ treg = rb_entry(pn, struct ummunotify_reg, node);
+
+ if (treg->user_cookie == parm.user_cookie) {
+ kfree(reg);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ pn = NULL;
+ while (*n) {
+ pn = *n;
+ treg = rb_entry(pn, struct ummunotify_reg, node);
+
+ if (reg->start <= treg->start)
+ n = &pn->rb_left;
+ else
+ n = &pn->rb_right;
+ }
+
+ rb_link_node(®->node, pn, n);
+ rb_insert_color(®->node, &priv->reg_tree);
+
+out:
+ spin_unlock_irq(&priv->lock);
+
+ return ret;
+}
+
+static long ummunotify_unregister_region(struct ummunotify_file *priv,
+ __u64 __user *arg)
+{
+ u64 user_cookie;
+ struct rb_node *n;
+ struct ummunotify_reg *reg;
+ int ret = -EINVAL;
+
+ if (copy_from_user(&user_cookie, arg, sizeof(user_cookie)))
+ return -EFAULT;
+
+ spin_lock_irq(&priv->lock);
+
+ for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+ reg = rb_entry(n, struct ummunotify_reg, node);
+
+ if (reg->user_cookie == user_cookie) {
+ rb_erase(n, &priv->reg_tree);
+ if (test_bit(UMMUNOTIFY_FLAG_INVALID, ®->flags))
+ list_del(®->list);
+ kfree(reg);
+ ret = 0;
+ break;
+ }
+ }
+
+ spin_unlock_irq(&priv->lock);
+
+ return ret;
+}
+
+static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
+ unsigned long arg)
+{
+ struct ummunotify_file *priv = filp->private_data;
+ void __user *argp = (void __user *) arg;
+
+ switch (cmd) {
+ case UMMUNOTIFY_EXCHANGE_FEATURES:
+ return ummunotify_exchange_features(priv, argp);
+ case UMMUNOTIFY_REGISTER_REGION:
+ return ummunotify_register_region(priv, argp);
+ case UMMUNOTIFY_UNREGISTER_REGION:
+ return ummunotify_unregister_region(priv, argp);
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
+static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct ummunotify_file *priv = vma->vm_private_data;
+
+ if (vmf->pgoff != 0)
+ return VM_FAULT_SIGBUS;
+
+ vmf->page = virt_to_page(priv->counter);
+ get_page(vmf->page);
+
+ return 0;
+
+}
+
+static struct vm_operations_struct ummunotify_vm_ops = {
+ .fault = ummunotify_fault,
+};
+
+static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ struct ummunotify_file *priv = filp->private_data;
+
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ vma->vm_ops = &ummunotify_vm_ops;
+ vma->vm_private_data = priv;
+
+ return 0;
+}
+
+static int ummunotify_fasync(int fd, struct file *filp, int on)
+{
+ struct ummunotify_file *priv = filp->private_data;
+
+ return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunotify_fops = {
+ .owner = THIS_MODULE,
+ .open = ummunotify_open,
+ .release = ummunotify_close,
+ .read = ummunotify_read,
+ .poll = ummunotify_poll,
+ .unlocked_ioctl = ummunotify_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ummunotify_ioctl,
+#endif
+ .mmap = ummunotify_mmap,
+ .fasync = ummunotify_fasync,
+};
+
+static struct miscdevice ummunotify_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "ummunotify",
+ .fops = &ummunotify_fops,
+};
+
+static int __init ummunotify_init(void)
+{
+ return misc_register(&ummunotify_misc);
+}
+
+static void __exit ummunotify_cleanup(void)
+{
+ misc_deregister(&ummunotify_misc);
+}
+
+module_init(ummunotify_init);
+module_exit(ummunotify_cleanup);
diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
new file mode 100644
index 0000000..21b0d03
--- /dev/null
+++ b/include/linux/ummunotify.h
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2009 Cisco Systems. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOTIFY_H
+#define _LINUX_UMMUNOTIFY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Ummunotify relays MMU notifier events to userspace. A userspace
+ * process uses it by opening /dev/ummunotify, which returns a file
+ * descriptor. Interest in address ranges is registered using ioctl()
+ * and MMU notifier events are retrieved using read(), as described in
+ * more detail below.
+ *
+ * Userspace can also mmap() a single read-only page at offset 0 on
+ * this file descriptor. This page contains (at offest 0) a single
+ * 64-bit generation counter that the kernel increments each time an
+ * MMU notifier event occurs. Userspace can use this to very quickly
+ * check if there are any events to retrieve without needing to do a
+ * system call.
+ */
+
+/*
+ * struct ummunotify_register_ioctl describes an address range from
+ * start to end (including start but not including end) to be
+ * monitored. user_cookie is an opaque handle that userspace assigns,
+ * and which is used to unregister. flags and reserved are currently
+ * unused and should be set to 0 for forward compatibility.
+ */
+struct ummunotify_register_ioctl {
+ __u64 start;
+ __u64 end;
+ __u64 user_cookie;
+ __u32 flags;
+ __u32 reserved;
+};
+
+#define UMMUNOTIFY_MAGIC 'U'
+
+/*
+ * Forward compatibility: Userspace passes in a 32-bit feature mask
+ * with feature flags set indicating which extensions it wishes to
+ * use. The kernel will return a feature mask with the bits of
+ * userspace's mask that the kernel implements; from that point on
+ * both userspace and the kernel should behave as described by the
+ * kernel's feature mask.
+ *
+ * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
+ * then the kernel will use a feature mask of 0.
+ *
+ * No feature flags are currently defined, so the kernel will always
+ * return a feature mask of 0 at present.
+ */
+#define UMMUNOTIFY_EXCHANGE_FEATURES _IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
+
+/*
+ * Register interest in an address range; userspace should pass in a
+ * struct ummunotify_register_ioctl describing the region.
+ */
+#define UMMUNOTIFY_REGISTER_REGION _IOW(UMMUNOTIFY_MAGIC, 2, \
+ struct ummunotify_register_ioctl)
+/*
+ * Unregister interest in an address range; userspace should pass in
+ * the user_cookie value that was used to register the address range.
+ * No events for the address range will be reported once it is
+ * unregistered.
+ */
+#define UMMUNOTIFY_UNREGISTER_REGION _IOW(UMMUNOTIFY_MAGIC, 3, __u64)
+
+/*
+ * Invalidation events are returned whenever the kernel changes the
+ * mapping for a monitored address. These events are retrieved by
+ * read() on the ummunotify file descriptor, which will fill the
+ * read() buffer with struct ummunotify_event.
+ *
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated. (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * and set to the start/end of the whole registered range)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred. The
+ * other fields are not filled in for this event.
+ */
+enum {
+ UMMUNOTIFY_EVENT_TYPE_INVAL = 0,
+ UMMUNOTIFY_EVENT_TYPE_LAST = 1,
+};
+
+enum {
+ UMMUNOTIFY_EVENT_FLAG_HINT = 1 << 0,
+};
+
+struct ummunotify_event {
+ __u32 type;
+ __u32 flags;
+ __u64 hint_start;
+ __u64 hint_end;
+ __u64 user_cookie_counter;
+};
+
+#endif /* _LINUX_UMMUNOTIFY_H */
--
1.6.3.3
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 30+ messages in thread* [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-12 6:22 ` Eric B Munson 0 siblings, 0 replies; 30+ messages in thread From: Eric B Munson @ 2010-04-12 6:22 UTC (permalink / raw) To: akpm; +Cc: linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo, Eric B Munson Andrew, I am resubmitting this patch because I believe that the discussion has shown this to be an acceptable solution. I have fixed the 32 bit build errors, but other than that change, the code is the same as Roland's V3 patch. From: Roland Dreier <rolandd@cisco.com> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help. We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunotify, that creates a /dev/ummunotify node. A userspace process can open this node read-only and use the fd as follows: 1. ioctl() to register/unregister an address range to watch in the kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>). 2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf struct ummunotify_event in <linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are handled for this IO. 3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call. Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for suggestions on the interface design. Also thanks to Jeff Squyres <jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which helped find several bugs during development. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Eric B Munson <ebmunson@us.ibm.com> --- Changes since v3: - Fixed replaced [get|put] user with copy_[from|to]_user to fix x86 builds --- Documentation/Makefile | 3 +- Documentation/ummunotify/Makefile | 7 + Documentation/ummunotify/ummunotify.txt | 150 ++++++++ Documentation/ummunotify/umn-test.c | 200 +++++++++++ drivers/char/Kconfig | 12 + drivers/char/Makefile | 1 + drivers/char/ummunotify.c | 567 +++++++++++++++++++++++++++++++ include/linux/ummunotify.h | 121 +++++++ 8 files changed, 1060 insertions(+), 1 deletions(-) create mode 100644 Documentation/ummunotify/Makefile create mode 100644 Documentation/ummunotify/ummunotify.txt create mode 100644 Documentation/ummunotify/umn-test.c create mode 100644 drivers/char/ummunotify.c create mode 100644 include/linux/ummunotify.h diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fc7ea1..27ba76a 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -1,3 +1,4 @@ obj-m := DocBook/ accounting/ auxdisplay/ connector/ \ filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \ - pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/ + pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \ + watchdog/src/ diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile new file mode 100644 index 0000000..89f31a0 --- /dev/null +++ b/Documentation/ummunotify/Makefile @@ -0,0 +1,7 @@ +# List of programs to build +hostprogs-y := umn-test + +# Tell kbuild to always build the programs +always := $(hostprogs-y) + +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt new file mode 100644 index 0000000..78a79c2 --- /dev/null +++ b/Documentation/ummunotify/ummunotify.txt @@ -0,0 +1,150 @@ +UMMUNOTIFY + + Ummunotify relays MMU notifier events to userspace. This is useful + for libraries that need to track the memory mapping of applications; + for example, MPI implementations using RDMA want to cache memory + registrations for performance, but tracking all possible crazy cases + such as when, say, the FORTRAN runtime frees memory is impossible + without kernel help. + +Basic Model + + A userspace process uses it by opening /dev/ummunotify, which + returns a file descriptor. Interest in address ranges is registered + using ioctl() and MMU notifier events are retrieved using read(), as + described in more detail below. Userspace can register multiple + address ranges to watch, and can unregister individual ranges. + + Userspace can also mmap() a single read-only page at offset 0 on + this file descriptor. This page contains (at offest 0) a single + 64-bit generation counter that the kernel increments each time an + MMU notifier event occurs. Userspace can use this to very quickly + check if there are any events to retrieve without needing to do a + system call. + +Control + + To start using ummunotify, a process opens /dev/ummunotify in + read-only mode. Control from userspace is done via ioctl(); the + defined ioctls are: + + UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit + word of feature flags as input, and the kernel updates the + features flags word to contain only features requested by + userspace and also supported by the kernel. + + This ioctl is only included for forward compatibility; no + feature flags are currently defined, and the kernel will simply + update any requested feature mask to 0. The kernel will always + default to a feature mask of 0 if this ioctl is not used, so + current userspace does not need to perform this ioctl. + + UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the + kernel to start delivering events for an address range. The + range is described using struct ummunotify_register_ioctl: + + struct ummunotify_register_ioctl { + __u64 start; + __u64 end; + __u64 user_cookie; + __u32 flags; + __u32 reserved; + }; + + start and end give the range of userspace virtual addresses; + start is included in the range and end is not, so an example of + a 4 KB range would be start=0x1000, end=0x2000. + + user_cookie is an opaque 64-bit quantity that is returned by the + kernel in events involving the range, and used by userspace to + stop watching the range. Each registered address range must + have a distinct user_cookie. + + It is fine with the kernel if userspace registers multiple + overlapping or even duplicate address ranges, as long as a + different cookie is used for each registration. + + flags and reserved are included for forward compatibility; + userspace should simply set them to 0 for the current interface. + + UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit + user_cookie used to register a range to tell the kernel to stop + watching an address range. Once this ioctl completes, the + kernel will not deliver any further events for the range that is + unregistered. + +Events + + When an event occurs that invalidates some of a process's memory + mapping in an address range being watched, ummunotify queues an + event report for that address range. If more than one event + invalidates parts of the same address range before userspace + retrieves the queued report, then further reports for the same range + will not be queued -- when userspace does read the queue, only a + single report for a given range will be returned. + + If multiple ranges being watched are invalidated by a single event + (which is especially likely if userspace registers overlapping + ranges), then an event report structure will be queued for each + address range registration. + + Userspace retrieves queued events via read() on the ummunotify file + descriptor; a buffer that is at least as big as struct + ummunotify_event should be used to retrieve event reports, and if a + larger buffer is passed to read(), multiple reports will be returned + (if available). + + If the ummunotify file descriptor is in blocking mode, a read() call + will wait for an event report to be available. Userspace may also + set the ummunotify file descriptor to non-blocking mode and use all + standard ways of waiting for data to be available on the ummunotify + file descriptor, including epoll/poll()/select() and SIGIO. + + The format of event reports is: + + struct ummunotify_event { + __u32 type; + __u32 flags; + __u64 hint_start; + __u64 hint_end; + __u64 user_cookie_counter; + }; + + where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or + UMMUNOTIFY_EVENT_TYPE_LAST. Events of type INVAL describe + invalidation events as follows: user_cookie_counter contains the + cookie passed in when userspace registered the range that the event + is for. hint_start and hint_end contain the start address and end + address that were invalidated. + + The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT + defined at the moment. If HINT is set, then the invalidation event + invalidated less than the full address range and the kernel returns + the exact range invalidated; if HINT is not sent then hint_start and + hint_end are set to the original range registered by userspace. + (HINT will not be set if, for example, multiple events invalidated + disjoint parts of the range and so a single start/end pair cannot + represent the parts of the range that were invalidated) + + If the event type is LAST, then the read operation has emptied the + list of invalidated regions, and the flags, hint_start and hint_end + fields are not used. user_cookie_counter holds the value of the + kernel's generation counter (see below of more details) when the + empty list occurred. + +Generation Count + + Userspace may mmap() a page on a ummunotify file descriptor via + + mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0); + + to get a read-only mapping of the kernel's 64-bit generation + counter. The kernel will increment this generation counter each + time an event report is queued. + + Userspace can use the generation counter as a quick check to avoid + system calls; if the value read from the mapped kernel counter is + still equal to the value returned in user_cookie_counter for the + most recent LAST event retrieved, then no further events have been + queued and there is no need to try a read() on the ummunotify file + descriptor. diff --git a/Documentation/ummunotify/umn-test.c b/Documentation/ummunotify/umn-test.c new file mode 100644 index 0000000..143db2c --- /dev/null +++ b/Documentation/ummunotify/umn-test.c @@ -0,0 +1,200 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include <stdint.h> +#include <fcntl.h> +#include <stdio.h> +#include <unistd.h> + +#include <linux/ummunotify.h> + +#include <sys/mman.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/ioctl.h> + +#define UMN_TEST_COOKIE 123 + +static int umn_fd; +static volatile __u64 *umn_counter; + +static int umn_init(void) +{ + __u32 flags; + + umn_fd = open("/dev/ummunotify", O_RDONLY); + if (umn_fd < 0) { + perror("open"); + return 1; + } + + if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) { + perror("exchange ioctl"); + return 1; + } + + printf("kernel feature flags: 0x%08x\n", flags); + + umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ, + MAP_SHARED, umn_fd, 0); + if (umn_counter == MAP_FAILED) { + perror("mmap"); + return 1; + } + + return 0; +} + +static int umn_register(void *buf, size_t size, __u64 cookie) +{ + struct ummunotify_register_ioctl r = { + .start = (unsigned long) buf, + .end = (unsigned long) buf + size, + .user_cookie = cookie, + }; + + if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) { + perror("register ioctl"); + return 1; + } + + return 0; +} + +static int umn_unregister(__u64 cookie) +{ + if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) { + perror("unregister ioctl"); + return 1; + } + + return 0; +} + +int main(int argc, char *argv[]) +{ + int page_size; + __u64 old_counter; + void *t; + int got_it; + + if (umn_init()) + return 1; + + printf("\n"); + + old_counter = *umn_counter; + if (old_counter != 0) { + fprintf(stderr, "counter = %lld (expected 0)\n", old_counter); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + t = mmap(NULL, 3 * page_size, PROT_READ, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0); + + if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE)) + return 1; + + munmap(t + page_size, page_size); + + old_counter = *umn_counter; + if (old_counter != 1) { + fprintf(stderr, "counter = %lld (expected 1)\n", old_counter); + return 1; + } + + got_it = 0; + while (1) { + struct ummunotify_event ev; + int len; + + len = read(umn_fd, &ev, sizeof ev); + if (len < 0) { + perror("read event"); + return 1; + } + if (len != sizeof ev) { + fprintf(stderr, "Read gave %d bytes (!= event size %zd)\n", + len, sizeof ev); + return 1; + } + + switch (ev.type) { + case UMMUNOTIFY_EVENT_TYPE_INVAL: + if (got_it) { + fprintf(stderr, "Extra invalidate event\n"); + return 1; + } + if (ev.user_cookie_counter != UMN_TEST_COOKIE) { + fprintf(stderr, "Invalidate event for cookie %lld (expected %d)\n", + ev.user_cookie_counter, + UMN_TEST_COOKIE); + return 1; + } + + printf("Invalidate event:\tcookie %lld\n", + ev.user_cookie_counter); + + if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) { + fprintf(stderr, "Hint flag not set\n"); + return 1; + } + + if (ev.hint_start != (uintptr_t) t + page_size || + ev.hint_end != (uintptr_t) t + page_size * 2) { + fprintf(stderr, "Got hint %llx..%llx, expected %p..%p\n", + ev.hint_start, ev.hint_end, + t + page_size, t + page_size * 2); + return 1; + } + + printf("\t\t\thint %llx...%llx\n", + ev.hint_start, ev.hint_end); + + got_it = 1; + break; + + case UMMUNOTIFY_EVENT_TYPE_LAST: + if (!got_it) { + fprintf(stderr, "Last event without invalidate event\n"); + return 1; + } + + printf("Empty event:\t\tcounter %lld\n", + ev.user_cookie_counter); + goto done; + + default: + fprintf(stderr, "unknown event type %d\n", + ev.type); + return 1; + } + } + +done: + umn_unregister(123); + munmap(t, page_size); + + old_counter = *umn_counter; + if (old_counter != 1) { + fprintf(stderr, "counter = %lld (expected 1)\n", old_counter); + return 1; + } + + return 0; +} diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 3141dd3..cf26019 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1111,6 +1111,18 @@ config DEVPORT depends on ISA || PCI default y +config UMMUNOTIFY + tristate "Userspace MMU notifications" + select MMU_NOTIFIER + help + The ummunotify (userspace MMU notification) driver creates a + character device that can be used by userspace libraries to + get notifications when an application's memory mapping + changed. This is used, for example, by RDMA libraries to + improve the reliability of memory registration caching, since + the kernel's MMU notifications can be used to know precisely + when to shoot down a cached registration. + source "drivers/s390/char/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index f957edf..521e5de 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO) += nsc_gpio.o obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_UMMUNOTIFY) += ummunotify.o obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c new file mode 100644 index 0000000..c14df3f --- /dev/null +++ b/drivers/char/ummunotify.c @@ -0,0 +1,567 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include <linux/fs.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/miscdevice.h> +#include <linux/mm.h> +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/poll.h> +#include <linux/rbtree.h> +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/uaccess.h> +#include <linux/ummunotify.h> + +#include <asm/cacheflush.h> + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Userspace MMU notifiers"); +MODULE_LICENSE("GPL v2"); + +/* + * Information about an address range userspace has asked us to watch. + * + * user_cookie: Opaque cookie given to us when userspace registers the + * address range. + * + * start, end: Address range; start is inclusive, end is exclusive. + * + * hint_start, hint_end: If a single MMU notification event + * invalidates the address range, we hold the actual range of + * addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT). + * If another event hits this range before userspace reads the + * event, we give up and don't try to keep track of which subsets + * got invalidated. + * + * flags: Holds the INVALID flag for ranges that are on the invalid + * list and/or the HINT flag for ranges where the hint range holds + * good information. + * + * node: Used to put the range into an rbtree we use to be able to + * scan address ranges in order. + * + * list: Used to put the range on the invalid list when an MMU + * notification event hits the range. + */ +enum { + UMMUNOTIFY_FLAG_INVALID = 1, + UMMUNOTIFY_FLAG_HINT = 2, +}; + +struct ummunotify_reg { + u64 user_cookie; + unsigned long start; + unsigned long end; + unsigned long hint_start; + unsigned long hint_end; + unsigned long flags; + struct rb_node node; + struct list_head list; +}; + +/* + * Context attached to each file that userspace opens. + * + * mmu_notifier: MMU notifier registered for this context. + * + * mm: mm_struct for process that created the context; we use this to + * hold a reference to the mm to make sure it doesn't go away until + * we're done with it. + * + * reg_tree: RB tree of address ranges being watched, sorted by start + * address. + * + * invalid_list: List of address ranges that have been invalidated by + * MMU notification events; as userspace reads events, the address + * range corresponding to the event is removed from the list. + * + * counter: Page that can be mapped read-only by userspace, which + * holds a generation count that is incremented each time an event + * occurs. + * + * lock: Spinlock used to protect all context. + * + * read_wait: Wait queue used to wait for data to become available in + * blocking read()s. + * + * async_queue: Used to implement fasync(). + * + * need_empty: Set when userspace reads an invalidation event, so that + * read() knows it must generate an "empty" event when userspace + * drains the invalid_list. + * + * used: Set after userspace does anything with the file, so that the + * "exchange flags" ioctl() knows it's too late to change anything. + */ +struct ummunotify_file { + struct mmu_notifier mmu_notifier; + struct mm_struct *mm; + struct rb_root reg_tree; + struct list_head invalid_list; + u64 *counter; + spinlock_t lock; + wait_queue_head_t read_wait; + struct fasync_struct *async_queue; + int need_empty; + int used; +}; + +static void ummunotify_handle_notify(struct mmu_notifier *mn, + unsigned long start, unsigned long end) +{ + struct ummunotify_file *priv = + container_of(mn, struct ummunotify_file, mmu_notifier); + struct rb_node *n; + struct ummunotify_reg *reg; + unsigned long flags; + int hit = 0; + + spin_lock_irqsave(&priv->lock, flags); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunotify_reg, node); + + /* + * Ranges overlap if they're not disjoint; and they're + * disjoint if the end of one is before the start of + * the other one. So if both disjointness comparisons + * fail then the ranges overlap. + * + * Since we keep the tree of regions we're watching + * sorted by start address, we can end this loop as + * soon as we hit a region that starts past the end of + * the range for the event we're handling. + */ + if (reg->start >= end) + break; + + /* + * Just go to the next region if the start of the + * range is after the end of the region -- there + * might still be more overlapping ranges that have a + * greater start. + */ + if (start >= reg->end) + continue; + + hit = 1; + + if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, ®->flags)) { + /* Already on invalid list */ + clear_bit(UMMUNOTIFY_FLAG_HINT, ®->flags); + } else { + list_add_tail(®->list, &priv->invalid_list); + set_bit(UMMUNOTIFY_FLAG_HINT, ®->flags); + reg->hint_start = start; + reg->hint_end = end; + } + } + + if (hit) { + ++(*priv->counter); + flush_dcache_page(virt_to_page(priv->counter)); + wake_up_interruptible(&priv->read_wait); + kill_fasync(&priv->async_queue, SIGIO, POLL_IN); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static void ummunotify_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long addr) +{ + ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE); +} + +static void ummunotify_invalidate_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + ummunotify_handle_notify(mn, start, end); +} + +static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = { + .invalidate_page = ummunotify_invalidate_page, + .invalidate_range_start = ummunotify_invalidate_range_start, +}; + +static int ummunotify_open(struct inode *inode, struct file *filp) +{ + struct ummunotify_file *priv; + int ret; + + if (filp->f_mode & FMODE_WRITE) + return -EINVAL; + + priv = kmalloc(sizeof *priv, GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->counter = (void *) get_zeroed_page(GFP_KERNEL); + if (!priv->counter) { + ret = -ENOMEM; + goto err; + } + + priv->reg_tree = RB_ROOT; + INIT_LIST_HEAD(&priv->invalid_list); + spin_lock_init(&priv->lock); + init_waitqueue_head(&priv->read_wait); + priv->async_queue = NULL; + priv->need_empty = 0; + priv->used = 0; + + priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops; + /* + * Register notifier last, since notifications can occur as + * soon as we register.... + */ + ret = mmu_notifier_register(&priv->mmu_notifier, current->mm); + if (ret) + goto err_page; + + priv->mm = current->mm; + atomic_inc(&priv->mm->mm_count); + + filp->private_data = priv; + + return 0; + +err_page: + free_page((unsigned long) priv->counter); + +err: + kfree(priv); + return ret; +} + +static int ummunotify_close(struct inode *inode, struct file *filp) +{ + struct ummunotify_file *priv = filp->private_data; + struct rb_node *n; + struct ummunotify_reg *reg; + + mmu_notifier_unregister(&priv->mmu_notifier, priv->mm); + mmdrop(priv->mm); + free_page((unsigned long) priv->counter); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunotify_reg, node); + kfree(reg); + } + + kfree(priv); + + return 0; +} + +static bool ummunotify_readable(struct ummunotify_file *priv) +{ + return priv->need_empty || !list_empty(&priv->invalid_list); +} + +static ssize_t ummunotify_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ummunotify_file *priv = filp->private_data; + struct ummunotify_reg *reg; + ssize_t ret; + struct ummunotify_event *events; + int max; + int n; + + priv->used = 1; + + events = (void *) get_zeroed_page(GFP_KERNEL); + if (!events) { + ret = -ENOMEM; + goto out; + } + + spin_lock_irq(&priv->lock); + + while (!ummunotify_readable(priv)) { + spin_unlock_irq(&priv->lock); + + if (filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + goto out; + } + + if (wait_event_interruptible(priv->read_wait, + ummunotify_readable(priv))) { + ret = -ERESTARTSYS; + goto out; + } + + spin_lock_irq(&priv->lock); + } + + max = min_t(size_t, PAGE_SIZE, count) / sizeof *events; + + for (n = 0; n < max; ++n) { + if (list_empty(&priv->invalid_list)) { + events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST; + events[n].user_cookie_counter = *priv->counter; + ++n; + priv->need_empty = 0; + break; + } + + reg = list_first_entry(&priv->invalid_list, + struct ummunotify_reg, list); + + events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL; + if (test_bit(UMMUNOTIFY_FLAG_HINT, ®->flags)) { + events[n].flags = UMMUNOTIFY_EVENT_FLAG_HINT; + events[n].hint_start = max(reg->start, reg->hint_start); + events[n].hint_end = min(reg->end, reg->hint_end); + } else { + events[n].hint_start = reg->start; + events[n].hint_end = reg->end; + } + events[n].user_cookie_counter = reg->user_cookie; + + list_del(®->list); + reg->flags = 0; + priv->need_empty = 1; + } + + spin_unlock_irq(&priv->lock); + + if (copy_to_user(buf, events, n * sizeof *events)) + ret = -EFAULT; + else + ret = n * sizeof *events; + +out: + free_page((unsigned long) events); + return ret; +} + +static unsigned int ummunotify_poll(struct file *filp, + struct poll_table_struct *wait) +{ + struct ummunotify_file *priv = filp->private_data; + + poll_wait(filp, &priv->read_wait, wait); + + return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0; +} + +static long ummunotify_exchange_features(struct ummunotify_file *priv, + __u32 __user *arg) +{ + u32 feature_mask; + + if (priv->used) + return -EINVAL; + + priv->used = 1; + + if (copy_from_user(&feature_mask, arg, sizeof(feature_mask))) + return -EFAULT; + + /* No extensions defined at present. */ + feature_mask = 0; + + if (copy_to_user(arg, &feature_mask, sizeof(feature_mask))) + return -EFAULT; + + return 0; +} + +static long ummunotify_register_region(struct ummunotify_file *priv, + void __user *arg) +{ + struct ummunotify_register_ioctl parm; + struct ummunotify_reg *reg, *treg; + struct rb_node **n = &priv->reg_tree.rb_node; + struct rb_node *pn; + int ret = 0; + + if (copy_from_user(&parm, arg, sizeof parm)) + return -EFAULT; + + priv->used = 1; + + reg = kmalloc(sizeof *reg, GFP_KERNEL); + if (!reg) + return -ENOMEM; + + reg->user_cookie = parm.user_cookie; + reg->start = parm.start; + reg->end = parm.end; + reg->flags = 0; + + spin_lock_irq(&priv->lock); + + for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) { + treg = rb_entry(pn, struct ummunotify_reg, node); + + if (treg->user_cookie == parm.user_cookie) { + kfree(reg); + ret = -EINVAL; + goto out; + } + } + + pn = NULL; + while (*n) { + pn = *n; + treg = rb_entry(pn, struct ummunotify_reg, node); + + if (reg->start <= treg->start) + n = &pn->rb_left; + else + n = &pn->rb_right; + } + + rb_link_node(®->node, pn, n); + rb_insert_color(®->node, &priv->reg_tree); + +out: + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunotify_unregister_region(struct ummunotify_file *priv, + __u64 __user *arg) +{ + u64 user_cookie; + struct rb_node *n; + struct ummunotify_reg *reg; + int ret = -EINVAL; + + if (copy_from_user(&user_cookie, arg, sizeof(user_cookie))) + return -EFAULT; + + spin_lock_irq(&priv->lock); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunotify_reg, node); + + if (reg->user_cookie == user_cookie) { + rb_erase(n, &priv->reg_tree); + if (test_bit(UMMUNOTIFY_FLAG_INVALID, ®->flags)) + list_del(®->list); + kfree(reg); + ret = 0; + break; + } + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunotify_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + struct ummunotify_file *priv = filp->private_data; + void __user *argp = (void __user *) arg; + + switch (cmd) { + case UMMUNOTIFY_EXCHANGE_FEATURES: + return ummunotify_exchange_features(priv, argp); + case UMMUNOTIFY_REGISTER_REGION: + return ummunotify_register_region(priv, argp); + case UMMUNOTIFY_UNREGISTER_REGION: + return ummunotify_unregister_region(priv, argp); + default: + return -ENOIOCTLCMD; + } +} + +static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct ummunotify_file *priv = vma->vm_private_data; + + if (vmf->pgoff != 0) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(priv->counter); + get_page(vmf->page); + + return 0; + +} + +static struct vm_operations_struct ummunotify_vm_ops = { + .fault = ummunotify_fault, +}; + +static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ummunotify_file *priv = filp->private_data; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0) + return -EINVAL; + + vma->vm_ops = &ummunotify_vm_ops; + vma->vm_private_data = priv; + + return 0; +} + +static int ummunotify_fasync(int fd, struct file *filp, int on) +{ + struct ummunotify_file *priv = filp->private_data; + + return fasync_helper(fd, filp, on, &priv->async_queue); +} + +static const struct file_operations ummunotify_fops = { + .owner = THIS_MODULE, + .open = ummunotify_open, + .release = ummunotify_close, + .read = ummunotify_read, + .poll = ummunotify_poll, + .unlocked_ioctl = ummunotify_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ummunotify_ioctl, +#endif + .mmap = ummunotify_mmap, + .fasync = ummunotify_fasync, +}; + +static struct miscdevice ummunotify_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ummunotify", + .fops = &ummunotify_fops, +}; + +static int __init ummunotify_init(void) +{ + return misc_register(&ummunotify_misc); +} + +static void __exit ummunotify_cleanup(void) +{ + misc_deregister(&ummunotify_misc); +} + +module_init(ummunotify_init); +module_exit(ummunotify_cleanup); diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h new file mode 100644 index 0000000..21b0d03 --- /dev/null +++ b/include/linux/ummunotify.h @@ -0,0 +1,121 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _LINUX_UMMUNOTIFY_H +#define _LINUX_UMMUNOTIFY_H + +#include <linux/types.h> +#include <linux/ioctl.h> + +/* + * Ummunotify relays MMU notifier events to userspace. A userspace + * process uses it by opening /dev/ummunotify, which returns a file + * descriptor. Interest in address ranges is registered using ioctl() + * and MMU notifier events are retrieved using read(), as described in + * more detail below. + * + * Userspace can also mmap() a single read-only page at offset 0 on + * this file descriptor. This page contains (at offest 0) a single + * 64-bit generation counter that the kernel increments each time an + * MMU notifier event occurs. Userspace can use this to very quickly + * check if there are any events to retrieve without needing to do a + * system call. + */ + +/* + * struct ummunotify_register_ioctl describes an address range from + * start to end (including start but not including end) to be + * monitored. user_cookie is an opaque handle that userspace assigns, + * and which is used to unregister. flags and reserved are currently + * unused and should be set to 0 for forward compatibility. + */ +struct ummunotify_register_ioctl { + __u64 start; + __u64 end; + __u64 user_cookie; + __u32 flags; + __u32 reserved; +}; + +#define UMMUNOTIFY_MAGIC 'U' + +/* + * Forward compatibility: Userspace passes in a 32-bit feature mask + * with feature flags set indicating which extensions it wishes to + * use. The kernel will return a feature mask with the bits of + * userspace's mask that the kernel implements; from that point on + * both userspace and the kernel should behave as described by the + * kernel's feature mask. + * + * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl, + * then the kernel will use a feature mask of 0. + * + * No feature flags are currently defined, so the kernel will always + * return a feature mask of 0 at present. + */ +#define UMMUNOTIFY_EXCHANGE_FEATURES _IOWR(UMMUNOTIFY_MAGIC, 1, __u32) + +/* + * Register interest in an address range; userspace should pass in a + * struct ummunotify_register_ioctl describing the region. + */ +#define UMMUNOTIFY_REGISTER_REGION _IOW(UMMUNOTIFY_MAGIC, 2, \ + struct ummunotify_register_ioctl) +/* + * Unregister interest in an address range; userspace should pass in + * the user_cookie value that was used to register the address range. + * No events for the address range will be reported once it is + * unregistered. + */ +#define UMMUNOTIFY_UNREGISTER_REGION _IOW(UMMUNOTIFY_MAGIC, 3, __u64) + +/* + * Invalidation events are returned whenever the kernel changes the + * mapping for a monitored address. These events are retrieved by + * read() on the ummunotify file descriptor, which will fill the + * read() buffer with struct ummunotify_event. + * + * If type field is INVAL, then user_cookie_counter holds the + * user_cookie for the region being reported; if the HINT flag is set + * then hint_start/hint_end hold the start and end of the mapping that + * was invalidated. (If HINT is not set, then multiple events + * invalidated parts of the registered range and hint_start/hint_end + * and set to the start/end of the whole registered range) + * + * If type is LAST, then the read operation has emptied the list of + * invalidated regions, and user_cookie_counter holds the value of the + * kernel's generation counter when the empty list occurred. The + * other fields are not filled in for this event. + */ +enum { + UMMUNOTIFY_EVENT_TYPE_INVAL = 0, + UMMUNOTIFY_EVENT_TYPE_LAST = 1, +}; + +enum { + UMMUNOTIFY_EVENT_FLAG_HINT = 1 << 0, +}; + +struct ummunotify_event { + __u32 type; + __u32 flags; + __u64 hint_start; + __u64 hint_end; + __u64 user_cookie_counter; +}; + +#endif /* _LINUX_UMMUNOTIFY_H */ -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 30+ messages in thread
[parent not found: <1271053337-7121-1-git-send-email-ebmunson-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-12 6:22 ` Eric B Munson @ 2010-04-12 20:20 ` Pavel Machek -1 siblings, 0 replies; 30+ messages in thread From: Pavel Machek @ 2010-04-12 20:20 UTC (permalink / raw) To: Eric B Munson Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-X9Un+BFzKDI Hi! > I am resubmitting this patch because I believe that the discussion > has shown this to be an acceptable solution. I have fixed the 32 bit > build errors, but other than that change, the code is the same as > Roland's V3 patch. > > From: Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > and follow-up messages, libraries using RDMA would like to track > precisely when application code changes memory mapping via free(), > munmap(), etc. Current pure-userspace solutions using malloc hooks > and other tricks are not robust, and the feeling among experts is that > the issue is unfixable without kernel help. I do not know. I still believe that this does not belong in the kernel; application should not need to trace itself to know what it does. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-12 20:20 ` Pavel Machek 0 siblings, 0 replies; 30+ messages in thread From: Pavel Machek @ 2010-04-12 20:20 UTC (permalink / raw) To: Eric B Munson; +Cc: akpm, linux-kernel, linux-rdma, rolandd, peterz, mingo Hi! > I am resubmitting this patch because I believe that the discussion > has shown this to be an acceptable solution. I have fixed the 32 bit > build errors, but other than that change, the code is the same as > Roland's V3 patch. > > From: Roland Dreier <rolandd@cisco.com> > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > and follow-up messages, libraries using RDMA would like to track > precisely when application code changes memory mapping via free(), > munmap(), etc. Current pure-userspace solutions using malloc hooks > and other tricks are not robust, and the feeling among experts is that > the issue is unfixable without kernel help. I do not know. I still believe that this does not belong in the kernel; application should not need to trace itself to know what it does. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] ummunotify: fix umn-test build 2010-04-12 6:22 ` Eric B Munson @ 2010-04-14 16:43 ` Randy Dunlap -1 siblings, 0 replies; 30+ messages in thread From: Randy Dunlap @ 2010-04-14 16:43 UTC (permalink / raw) To: Eric B Munson Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI From: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Add ummunotify.h to Kbuild list for export to userspace, fixing 27 build errors in umn-test.c when O=builddir is used. Signed-off-by: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> --- include/linux/Kbuild | 1 + 1 file changed, 1 insertion(+) maybe another item to add to SubmitChecklist :( --- lnx-2634-rc4.orig/include/linux/Kbuild +++ lnx-2634-rc4/include/linux/Kbuild @@ -163,6 +163,7 @@ header-y += tipc_config.h header-y += toshiba.h header-y += udf_fs_i.h header-y += ultrasound.h +header-y += ummunotify.h header-y += un.h header-y += utime.h header-y += veth.h -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] ummunotify: fix umn-test build @ 2010-04-14 16:43 ` Randy Dunlap 0 siblings, 0 replies; 30+ messages in thread From: Randy Dunlap @ 2010-04-14 16:43 UTC (permalink / raw) To: Eric B Munson Cc: akpm, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo From: Randy Dunlap <randy.dunlap@oracle.com> Add ummunotify.h to Kbuild list for export to userspace, fixing 27 build errors in umn-test.c when O=builddir is used. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> --- include/linux/Kbuild | 1 + 1 file changed, 1 insertion(+) maybe another item to add to SubmitChecklist :( --- lnx-2634-rc4.orig/include/linux/Kbuild +++ lnx-2634-rc4/include/linux/Kbuild @@ -163,6 +163,7 @@ header-y += tipc_config.h header-y += toshiba.h header-y += udf_fs_i.h header-y += ultrasound.h +header-y += ummunotify.h header-y += un.h header-y += utime.h header-y += veth.h ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20100414094341.ba69842d.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] ummunotify: fix umn-test build 2010-04-14 16:43 ` Randy Dunlap @ 2010-04-17 17:44 ` Eric B Munson -1 siblings, 0 replies; 30+ messages in thread From: Eric B Munson @ 2010-04-17 17:44 UTC (permalink / raw) To: Randy Dunlap Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI [-- Attachment #1: Type: text/plain, Size: 383 bytes --] On Wed, 14 Apr 2010, Randy Dunlap wrote: > From: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > > Add ummunotify.h to Kbuild list for export to userspace, > fixing 27 build errors in umn-test.c when O=builddir is used. > > Signed-off-by: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Thanks, will wrap this into V2. Eric [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: fix umn-test build @ 2010-04-17 17:44 ` Eric B Munson 0 siblings, 0 replies; 30+ messages in thread From: Eric B Munson @ 2010-04-17 17:44 UTC (permalink / raw) To: Randy Dunlap Cc: akpm, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo [-- Attachment #1: Type: text/plain, Size: 325 bytes --] On Wed, 14 Apr 2010, Randy Dunlap wrote: > From: Randy Dunlap <randy.dunlap@oracle.com> > > Add ummunotify.h to Kbuild list for export to userspace, > fixing 27 build errors in umn-test.c when O=builddir is used. > > Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Thanks, will wrap this into V2. Eric [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20100417174404.GB3579-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] ummunotify: fix umn-test build 2010-04-17 17:44 ` Eric B Munson @ 2010-04-18 14:38 ` Roland Dreier -1 siblings, 0 replies; 30+ messages in thread From: Roland Dreier @ 2010-04-18 14:38 UTC (permalink / raw) To: Eric B Munson Cc: Randy Dunlap, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI Eric, sorry for not seeing this earlier, but it looks like the last patch I emailed out did not include all the tiny changes I made prior to asking Linus to pull this (which he ended up not doing). For example this fix to the include Kbuild file was there. You could look at: http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=cfe60a43c3434907a47323fb82d2960e915a9538 to see what was actually in my pull request. There may be other trivial tweaks worth including. Thanks for resurrecting this work BTW. - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: fix umn-test build @ 2010-04-18 14:38 ` Roland Dreier 0 siblings, 0 replies; 30+ messages in thread From: Roland Dreier @ 2010-04-18 14:38 UTC (permalink / raw) To: Eric B Munson Cc: Randy Dunlap, akpm, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo Eric, sorry for not seeing this earlier, but it looks like the last patch I emailed out did not include all the tiny changes I made prior to asking Linus to pull this (which he ended up not doing). For example this fix to the include Kbuild file was there. You could look at: http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=cfe60a43c3434907a47323fb82d2960e915a9538 to see what was actually in my pull request. There may be other trivial tweaks worth including. Thanks for resurrecting this work BTW. - R. -- Roland Dreier <rolandd@cisco.com> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-12 6:22 ` Eric B Munson (?) (?) @ 2010-04-12 23:03 ` Andrew Morton [not found] ` <20100412160359.1d9074dc.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> -1 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2010-04-12 23:03 UTC (permalink / raw) To: Eric B Munson; +Cc: linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo On Mon, 12 Apr 2010 07:22:17 +0100 Eric B Munson <ebmunson@us.ibm.com> wrote: > Andrew, > > I am resubmitting this patch because I believe that the discussion > has shown this to be an acceptable solution. To whom? Some acked-by's would clarify. > I have fixed the 32 bit > build errors, but other than that change, the code is the same as > Roland's V3 patch. > > From: Roland Dreier <rolandd@cisco.com> > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > and follow-up messages, libraries using RDMA would like to track > precisely when application code changes memory mapping via free(), > munmap(), etc. Current pure-userspace solutions using malloc hooks > and other tricks are not robust, and the feeling among experts is that > the issue is unfixable without kernel help. But this info could be reassembled by tracking syscall activity, yes? Perhaps some discussion here explaining why the (possibly enhanced) ptrace, audit, etc interfaces are unsuitable. > We solve this not by implementing the full API proposed in the email > linked above but rather with a simpler and more generic interface, > which may be useful in other contexts. Specifically, we implement a > new character device driver, ummunotify, that creates a /dev/ummunotify > node. A userspace process can open this node read-only and use the fd > as follows: > > 1. ioctl() to register/unregister an address range to watch in the > kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>). > > 2. read() to retrieve events generated when a mapping in a watched > address range is invalidated (cf struct ummunotify_event in > <linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are > handled for this IO. > > 3. mmap() one page at offset 0 to map a kernel page that contains a > generation counter that is incremented each time an event is > generated. This allows userspace to have a fast path that checks > that no events have occurred without a system call. OK, what's missing from this whole description and from ummunotify.txt is: how does one specify the target process? Does /dev/ummunotify implicitly attach to current->mm? If so, why, and what are the implications of this? If instead it is possible to attach to some other process's mmu activity (/proc/<pid>/ummunotity?) then how is that done and what are the security/permissions implications? Also, the whole thing is obviously racy: by the time userspace finds out that something has happened, it might have changed. This inevitably reduces the applicability/usefulness of the whole thing as compared to some synchronous mechanism which halts the monitored thread until the request has been processed and acked. All this should (IMO) be explored, explained and justified. Also, what prevents the obvious DoS which occurs when I register for events and just let them queue up until the kernel runs out of memory? presumably events get dropped - what are the reliability implications of this and how is the max queue length managed? Also, ioctls are unpopular. Were other intefaces considered? > Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for > suggestions on the interface design. Also thanks to Jeff Squyres > <jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which > helped find several bugs during development. > > Signed-off-by: Roland Dreier <rolandd@cisco.com> > Signed-off-by: Eric B Munson <ebmunson@us.ibm.com> > > --- > > Changes since v3: > - Fixed replaced [get|put] user with copy_[from|to]_user to fix x86 > builds > --- > Documentation/Makefile | 3 +- > Documentation/ummunotify/Makefile | 7 + > Documentation/ummunotify/ummunotify.txt | 150 ++++++++ > Documentation/ummunotify/umn-test.c | 200 +++++++++++ > drivers/char/Kconfig | 12 + > drivers/char/Makefile | 1 + > drivers/char/ummunotify.c | 567 +++++++++++++++++++++++++++++++ > include/linux/ummunotify.h | 121 +++++++ > 8 files changed, 1060 insertions(+), 1 deletions(-) > create mode 100644 Documentation/ummunotify/Makefile > create mode 100644 Documentation/ummunotify/ummunotify.txt > create mode 100644 Documentation/ummunotify/umn-test.c > create mode 100644 drivers/char/ummunotify.c > create mode 100644 include/linux/ummunotify.h > > diff --git a/Documentation/Makefile b/Documentation/Makefile > index 6fc7ea1..27ba76a 100644 > --- a/Documentation/Makefile > +++ b/Documentation/Makefile > @@ -1,3 +1,4 @@ > obj-m := DocBook/ accounting/ auxdisplay/ connector/ \ > filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \ > - pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/ > + pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \ > + watchdog/src/ > diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile > new file mode 100644 > index 0000000..89f31a0 > --- /dev/null > +++ b/Documentation/ummunotify/Makefile > @@ -0,0 +1,7 @@ > +# List of programs to build > +hostprogs-y := umn-test > + > +# Tell kbuild to always build the programs > +always := $(hostprogs-y) > + > +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include > diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt > new file mode 100644 > index 0000000..78a79c2 > --- /dev/null > +++ b/Documentation/ummunotify/ummunotify.txt > @@ -0,0 +1,150 @@ > +UMMUNOTIFY > + > + Ummunotify relays MMU notifier events to userspace. This is useful > + for libraries that need to track the memory mapping of applications; > + for example, MPI implementations using RDMA want to cache memory > + registrations for performance, but tracking all possible crazy cases > + such as when, say, the FORTRAN runtime frees memory is impossible > + without kernel help. > + > +Basic Model > + > + A userspace process uses it by opening /dev/ummunotify, which > + returns a file descriptor. Interest in address ranges is registered > + using ioctl() and MMU notifier events are retrieved using read(), as > + described in more detail below. Userspace can register multiple > + address ranges to watch, and can unregister individual ranges. > + > + Userspace can also mmap() a single read-only page at offset 0 on > + this file descriptor. This page contains (at offest 0) a single > + 64-bit generation counter that the kernel increments each time an > + MMU notifier event occurs. Userspace can use this to very quickly > + check if there are any events to retrieve without needing to do a > + system call. > + > +Control > + > + To start using ummunotify, a process opens /dev/ummunotify in > + read-only mode. Control from userspace is done via ioctl(); the > + defined ioctls are: > + > + UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit > + word of feature flags as input, and the kernel updates the > + features flags word to contain only features requested by > + userspace and also supported by the kernel. > + > + This ioctl is only included for forward compatibility; no > + feature flags are currently defined, and the kernel will simply > + update any requested feature mask to 0. The kernel will always > + default to a feature mask of 0 if this ioctl is not used, so > + current userspace does not need to perform this ioctl. > + > + UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the > + kernel to start delivering events for an address range. The > + range is described using struct ummunotify_register_ioctl: > + > + struct ummunotify_register_ioctl { > + __u64 start; > + __u64 end; > + __u64 user_cookie; > + __u32 flags; > + __u32 reserved; > + }; > + > + start and end give the range of userspace virtual addresses; > + start is included in the range and end is not, so an example of > + a 4 KB range would be start=0x1000, end=0x2000. > + > + user_cookie is an opaque 64-bit quantity that is returned by the > + kernel in events involving the range, and used by userspace to > + stop watching the range. Each registered address range must > + have a distinct user_cookie. > + > + It is fine with the kernel if userspace registers multiple > + overlapping or even duplicate address ranges, as long as a > + different cookie is used for each registration. > + > + flags and reserved are included for forward compatibility; > + userspace should simply set them to 0 for the current interface. > + > + UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit > + user_cookie used to register a range to tell the kernel to stop > + watching an address range. Once this ioctl completes, the > + kernel will not deliver any further events for the range that is > + unregistered. What happens if I register two regions with the same cookie? Will UMMUNOTIFY_UNREGISTER_REGION unregister both, or did I cause a kernel leak? If it's possible to register for events from a different process then how does the lifetime management happen? What happens when the target process exits? What happens if I close() the fd when there are outstanding events? What happens if I close() the fd when there are still-registered regions? > +Events > + > + When an event occurs that invalidates some of a process's memory > + mapping in an address range being watched, ummunotify queues an > + event report for that address range. If more than one event > + invalidates parts of the same address range before userspace > + retrieves the queued report, then further reports for the same range > + will not be queued -- when userspace does read the queue, only a > + single report for a given range will be returned. So the implementation keeps track of queued events down the operation/start-address/end-address level and, for each new event, checks whether that event is wholly contained within one of the preceding start-address/end-address ranges and if that queued event "invalidated" the new event's range, the new event is suppressed? Sounds complex and expensive. Why was this added? Can't competent userspace handle this situation? And what does "invalidate" mean? munmap? Methinks this paragraph needs some fleshing out. > + If multiple ranges being watched are invalidated by a single event > + (which is especially likely if userspace registers overlapping > + ranges), then an event report structure will be queued for each > + address range registration. > + > + Userspace retrieves queued events via read() on the ummunotify file > + descriptor; a buffer that is at least as big as struct > + ummunotify_event should be used to retrieve event reports, and if a > + larger buffer is passed to read(), multiple reports will be returned > + (if available). What happens if I try to read() three bytes from that fd? > + If the ummunotify file descriptor is in blocking mode, Done how? Omitting O_NONBLOCK at open() time? > + a read() call > + will wait for an event report to be available. Under which circumstances will that read() terminate? signals, presumably? > + Userspace may also > + set the ummunotify file descriptor to non-blocking mode and use all > + standard ways of waiting for data to be available on the ummunotify > + file descriptor, including epoll/poll()/select() and SIGIO. > + > + The format of event reports is: > + > + struct ummunotify_event { > + __u32 type; > + __u32 flags; > + __u64 hint_start; > + __u64 hint_end; > + __u64 user_cookie_counter; > + }; > + > + where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or > + UMMUNOTIFY_EVENT_TYPE_LAST. Events of type INVAL describe > + invalidation events as follows: As follows where? Confused. > + user_cookie_counter contains the > + cookie passed in when userspace registered the range that the event > + is for. Why does it have "_counter" in its name? > + hint_start and hint_end contain the start address and end > + address that were invalidated. > + > + The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT > + defined at the moment. If HINT is set, then the invalidation event > + invalidated less than the full address range and the kernel returns > + the exact range invalidated; So my registration now effectively covers a shorter address range? it may end up being very holey? I wish you'd told us what "invalidate" means. If it means munmap() then presumably the monitored target can later start to fill those holes in again. "hint" seems a strange name to use here. I don't understand why it is appropriate instead of, say, UMMUNOTIFY_EVENT_INVALIDATE. > + if HINT is not sent then hint_start and > + hint_end are set to the original range registered by userspace. > + (HINT will not be set if, for example, multiple events invalidated > + disjoint parts of the range and so a single start/end pair cannot > + represent the parts of the range that were invalidated) > + > + If the event type is LAST, then the read operation has emptied the > + list of invalidated regions, and the flags, hint_start and hint_end > + fields are not used. user_cookie_counter holds the value of the > + kernel's generation counter (see below of more details) when the > + empty list occurred. Ah. So user_cookie_counter is a union. C has support for unions - why not use one?? > +Generation Count > + > + Userspace may mmap() a page on a ummunotify file descriptor via > + > + mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0); > + > + to get a read-only mapping of the kernel's 64-bit generation > + counter. The kernel will increment this generation counter each > + time an event report is queued. So we have one CPU writing a memory location and another CPU reading it without any locking? What are the weird-architecture memory-barrier requirements here and how are they handled? How does the code handle the non-atomicity of u64 writes on 32-bit CPUs? > + Userspace can use the generation counter as a quick check to avoid > + system calls; if the value read from the mapped kernel counter is > + still equal to the value returned in user_cookie_counter for the > + most recent LAST event retrieved, then no further events have been > + queued and there is no need to try a read() on the ummunotify file > + descriptor. I _guess_ that works OK on 32-bit, as long as userspace _only_ compares this value with some previous one. umm, no, there's still a race I think. If the counter increases from 0x00000000ffffffff to 0x0000000100000000 then userspace could see this as two events when using this scheme. I didn't get around to reading the code yet ;) ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20100412160359.1d9074dc.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-12 23:03 ` [PATCH] ummunotify: Userspace support for MMU notifications Andrew Morton @ 2010-04-12 23:59 ` Jason Gunthorpe 0 siblings, 0 replies; 30+ messages in thread From: Jason Gunthorpe @ 2010-04-12 23:59 UTC (permalink / raw) To: Andrew Morton Cc: Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI On Mon, Apr 12, 2010 at 04:03:59PM -0700, Andrew Morton wrote: > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > > and follow-up messages, libraries using RDMA would like to track > > precisely when application code changes memory mapping via free(), > > munmap(), etc. Current pure-userspace solutions using malloc hooks > > and other tricks are not robust, and the feeling among experts is that > > the issue is unfixable without kernel help. > > But this info could be reassembled by tracking syscall activity, yes? > Perhaps some discussion here explaining why the (possibly enhanced) > ptrace, audit, etc interfaces are unsuitable. Just to summarize some of the key points of this thingy, as related to your comments: 1) It is really very narrowly focused on a particular problem MPI and RDMA have due to the way their APIs don't really match. Roland tried to make the interface general.. Maybe that is a mistake .. 2) A 'self-tracing' scheme is used, again, because of an API mistmatching between a MPI library and it's own applications. Attempting to hook the appropriate calls has proven unsatisfactory (missing cases, and slow). 3) Being intended for MPI applications, performance is a huge concern. Synchronous operation is very undesirable. Tracing APIs are lossy - and there is no recovery option if an event is lost. 4) Realistically the only thing MPI cares about is if a virtual page is unmapped/remapped. Loosing events is unacceptable. 5) This isn't really tracing. There is no queue. There aren't really events. This works more like the diry/access bit in a page table, it doesn't matter how many times something has been modified, only that it has at least once since last time you looked. This means the memory used is proportional to the number of page-ranges you watch, and the number of events against those page-ranges doesn't matter. No other API has this property. Basically, this entire scheme is designed to detect that when a == b, the internal state held by some_mpi_call is no longer valid, in this kind of situation: a = mmap(ONE_PAGE); some_mpi_call(a); munmap(a); b = mmap(ONE_PAGE); // Kernel picks b == a some_mpi_call(b); All the races you point out, just don't matter for the MPI use case. Essentially, if the app hits those races, then it is using the MPI library in a buggy way. That said, this could be explained better in the documentation file. :) I'm sure Eric can go through the rest of your questions in greater detail.. > > + Userspace can use the generation counter as a quick check to avoid > > + system calls; if the value read from the mapped kernel counter is > > + still equal to the value returned in user_cookie_counter for the > > + most recent LAST event retrieved, then no further events have been > > + queued and there is no need to try a read() on the ummunotify file > > + descriptor. > > I _guess_ that works OK on 32-bit, as long as userspace _only_ compares > this value with some previous one. > > umm, no, there's still a race I think. If the counter increases from > 0x00000000ffffffff to 0x0000000100000000 then userspace could see this > as two events when using this scheme. The only case that matters for the generation counter optimization is a false negative. As long as user space does: u64 val = *counter; if (val != last_counter) last_counter = val; Then you can get false positives as you point out, but never a false negative. A false positive results in an extra syscall and the kernel just returns no data. Regards, Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-12 23:59 ` Jason Gunthorpe 0 siblings, 0 replies; 30+ messages in thread From: Jason Gunthorpe @ 2010-04-12 23:59 UTC (permalink / raw) To: Andrew Morton Cc: Eric B Munson, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo On Mon, Apr 12, 2010 at 04:03:59PM -0700, Andrew Morton wrote: > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > > and follow-up messages, libraries using RDMA would like to track > > precisely when application code changes memory mapping via free(), > > munmap(), etc. Current pure-userspace solutions using malloc hooks > > and other tricks are not robust, and the feeling among experts is that > > the issue is unfixable without kernel help. > > But this info could be reassembled by tracking syscall activity, yes? > Perhaps some discussion here explaining why the (possibly enhanced) > ptrace, audit, etc interfaces are unsuitable. Just to summarize some of the key points of this thingy, as related to your comments: 1) It is really very narrowly focused on a particular problem MPI and RDMA have due to the way their APIs don't really match. Roland tried to make the interface general.. Maybe that is a mistake .. 2) A 'self-tracing' scheme is used, again, because of an API mistmatching between a MPI library and it's own applications. Attempting to hook the appropriate calls has proven unsatisfactory (missing cases, and slow). 3) Being intended for MPI applications, performance is a huge concern. Synchronous operation is very undesirable. Tracing APIs are lossy - and there is no recovery option if an event is lost. 4) Realistically the only thing MPI cares about is if a virtual page is unmapped/remapped. Loosing events is unacceptable. 5) This isn't really tracing. There is no queue. There aren't really events. This works more like the diry/access bit in a page table, it doesn't matter how many times something has been modified, only that it has at least once since last time you looked. This means the memory used is proportional to the number of page-ranges you watch, and the number of events against those page-ranges doesn't matter. No other API has this property. Basically, this entire scheme is designed to detect that when a == b, the internal state held by some_mpi_call is no longer valid, in this kind of situation: a = mmap(ONE_PAGE); some_mpi_call(a); munmap(a); b = mmap(ONE_PAGE); // Kernel picks b == a some_mpi_call(b); All the races you point out, just don't matter for the MPI use case. Essentially, if the app hits those races, then it is using the MPI library in a buggy way. That said, this could be explained better in the documentation file. :) I'm sure Eric can go through the rest of your questions in greater detail.. > > + Userspace can use the generation counter as a quick check to avoid > > + system calls; if the value read from the mapped kernel counter is > > + still equal to the value returned in user_cookie_counter for the > > + most recent LAST event retrieved, then no further events have been > > + queued and there is no need to try a read() on the ummunotify file > > + descriptor. > > I _guess_ that works OK on 32-bit, as long as userspace _only_ compares > this value with some previous one. > > umm, no, there's still a race I think. If the counter increases from > 0x00000000ffffffff to 0x0000000100000000 then userspace could see this > as two events when using this scheme. The only case that matters for the generation counter optimization is a false negative. As long as user space does: u64 val = *counter; if (val != last_counter) last_counter = val; Then you can get false positives as you point out, but never a false negative. A false positive results in an extra syscall and the kernel just returns no data. Regards, Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20100412235937.GF15629-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-12 23:59 ` Jason Gunthorpe @ 2010-04-13 8:29 ` Håkon Bugge -1 siblings, 0 replies; 30+ messages in thread From: Håkon Bugge @ 2010-04-13 8:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI On Apr 13, 2010, at 1:59 , Jason Gunthorpe wrote: > On Mon, Apr 12, 2010 at 04:03:59PM -0700, Andrew Morton wrote: > >>> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> >>> and follow-up messages, libraries using RDMA would like to track >>> precisely when application code changes memory mapping via free(), >>> munmap(), etc. Current pure-userspace solutions using malloc hooks >>> and other tricks are not robust, and the feeling among experts is that >>> the issue is unfixable without kernel help. I am not sure I agree with the premises here. ptMalloc and malloc hooks are not related to the issue in my opinion. User space library calls do not change virtual to physical mapping, system calls do. The following sys calls might change virtual to physical mapping: munmap(), mremap(), sbrk(), madvice(). What we need is glibc to provide hooks for these 4 sys calls and the general syscall() when its argument is one of the four mentioned syscalls. To me, that is what is needed, and the ummunotify direction seems way too complicated to me. It is further claimed that "… other tricks are not robust". I wrote the code used in Scali/Platform MPI handling the issue. I do not think its fair to claim that this MPI is not robust in this matter nor that is performance is bad. Thanks, Håkon -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-13 8:29 ` Håkon Bugge 0 siblings, 0 replies; 30+ messages in thread From: Håkon Bugge @ 2010-04-13 8:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo On Apr 13, 2010, at 1:59 , Jason Gunthorpe wrote: > On Mon, Apr 12, 2010 at 04:03:59PM -0700, Andrew Morton wrote: > >>> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> >>> and follow-up messages, libraries using RDMA would like to track >>> precisely when application code changes memory mapping via free(), >>> munmap(), etc. Current pure-userspace solutions using malloc hooks >>> and other tricks are not robust, and the feeling among experts is that >>> the issue is unfixable without kernel help. I am not sure I agree with the premises here. ptMalloc and malloc hooks are not related to the issue in my opinion. User space library calls do not change virtual to physical mapping, system calls do. The following sys calls might change virtual to physical mapping: munmap(), mremap(), sbrk(), madvice(). What we need is glibc to provide hooks for these 4 sys calls and the general syscall() when its argument is one of the four mentioned syscalls. To me, that is what is needed, and the ummunotify direction seems way too complicated to me. It is further claimed that "… other tricks are not robust". I wrote the code used in Scali/Platform MPI handling the issue. I do not think its fair to claim that this MPI is not robust in this matter nor that is performance is bad. Thanks, Håkon ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <3251DDDA-D705-4B1E-9595-9C24709EF146-U0mLk4xYmo8@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-13 8:29 ` Håkon Bugge @ 2010-04-13 17:57 ` Roland Dreier -1 siblings, 0 replies; 30+ messages in thread From: Roland Dreier @ 2010-04-13 17:57 UTC (permalink / raw) To: Håkon Bugge Cc: Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w > I am not sure I agree with the premises here. ptMalloc and malloc > hooks are not related to the issue in my opinion. User space library > calls do not change virtual to physical mapping, system calls do. The > following sys calls might change virtual to physical mapping: > munmap(), mremap(), sbrk(), madvice(). What we need is glibc to > provide hooks for these 4 sys calls and the general syscall() when > its argument is one of the four mentioned syscalls. To me, that is > what is needed, and the ummunotify direction seems way too > complicated to me. Are those system calls the only possible way that virtual to physical mappings can change? Can't page migration or something like that potentially affect things? And even if you did have hooks into every system call that mattered (keep in mind that relying on glibc is not enough, since an MPI application may not use glibc) would decoding them and figuring out what happened really be preferable to a single event type that tells you exactly what address range was affected? > It is further claimed that "… other tricks are not robust". I wrote > the code used in Scali/Platform MPI handling the issue. I do not > think its fair to claim that this MPI is not robust in this matter > nor that is performance is bad. The Open MPI developers have spent a lot of effort trying to handle this purely in userspace and still do not believe that a truly robust solution is possible without kernel help. Perhaps they can expand on what the obstacles are. - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-13 17:57 ` Roland Dreier 0 siblings, 0 replies; 30+ messages in thread From: Roland Dreier @ 2010-04-13 17:57 UTC (permalink / raw) To: Håkon Bugge Cc: Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo, jsquyres > I am not sure I agree with the premises here. ptMalloc and malloc > hooks are not related to the issue in my opinion. User space library > calls do not change virtual to physical mapping, system calls do. The > following sys calls might change virtual to physical mapping: > munmap(), mremap(), sbrk(), madvice(). What we need is glibc to > provide hooks for these 4 sys calls and the general syscall() when > its argument is one of the four mentioned syscalls. To me, that is > what is needed, and the ummunotify direction seems way too > complicated to me. Are those system calls the only possible way that virtual to physical mappings can change? Can't page migration or something like that potentially affect things? And even if you did have hooks into every system call that mattered (keep in mind that relying on glibc is not enough, since an MPI application may not use glibc) would decoding them and figuring out what happened really be preferable to a single event type that tells you exactly what address range was affected? > It is further claimed that "… other tricks are not robust". I wrote > the code used in Scali/Platform MPI handling the issue. I do not > think its fair to claim that this MPI is not robust in this matter > nor that is performance is bad. The Open MPI developers have spent a lot of effort trying to handle this purely in userspace and still do not believe that a truly robust solution is possible without kernel help. Perhaps they can expand on what the obstacles are. - R. -- Roland Dreier <rolandd@cisco.com> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <adahbnfme0j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-13 17:57 ` Roland Dreier @ 2010-04-13 18:02 ` Peter Zijlstra -1 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2010-04-13 18:02 UTC (permalink / raw) To: Roland Dreier Cc: Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w On Tue, 2010-04-13 at 10:57 -0700, Roland Dreier wrote: > Are those system calls the only possible way that virtual to physical > mappings can change? Can't page migration or something like that > potentially affect things? And even if you did have hooks into every > system call that mattered (keep in mind that relying on glibc is not > enough, since an MPI application may not use glibc) would decoding them > and figuring out what happened really be preferable to a single event > type that tells you exactly what address range was affected? Yeah, virtual<->physical maps can change through swapping, page migration, memory compaction, huge-page aggregation (the latter two not yet being upstream). Even mlock() doesn't pin virtual<->physical maps. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-13 18:02 ` Peter Zijlstra 0 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2010-04-13 18:02 UTC (permalink / raw) To: Roland Dreier Cc: Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, pavel, mingo, jsquyres On Tue, 2010-04-13 at 10:57 -0700, Roland Dreier wrote: > Are those system calls the only possible way that virtual to physical > mappings can change? Can't page migration or something like that > potentially affect things? And even if you did have hooks into every > system call that mattered (keep in mind that relying on glibc is not > enough, since an MPI application may not use glibc) would decoding them > and figuring out what happened really be preferable to a single event > type that tells you exactly what address range was affected? Yeah, virtual<->physical maps can change through swapping, page migration, memory compaction, huge-page aggregation (the latter two not yet being upstream). Even mlock() doesn't pin virtual<->physical maps. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-13 18:02 ` Peter Zijlstra @ 2010-04-14 5:18 ` Håkon Bugge -1 siblings, 0 replies; 30+ messages in thread From: Håkon Bugge @ 2010-04-14 5:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Roland Dreier, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w On Apr 13, 2010, at 20:02 , Peter Zijlstra wrote: > > Yeah, virtual<->physical maps can change through swapping, page > migration, memory compaction, huge-page aggregation (the latter two not > yet being upstream). Assuming this holds true, RDMA will not work. And with no RDMA, we do not need ummunotify. Is that your argument? Seriously, RDMA requires the virtual to physical mapping to remain constant for a period of time. And that time period is from memory registration to de-registration. If the virtual to physical mapping changes in that period for the registered memory area, I can't see how an HCA can handle that. For MPI applications, the MPI API defines that the buffers used in communication cannot be changed or freed while a non-blocking transfer is in progress. So far, we are good. But memory registration (and in particular de-registration) is a costly process. Since MPI is about performance, the MPI library would like to re-use earlier memory registrations for other transfers. The MPI library records earlier memory mappings (VA+bound) and checks if the VA+bound of a new transfer is already contained in memory revisions registered earlier. The problem with this approach is that _normal_ activity _between_ the transfers changes the virtual to physical mapping and ruins previous memory registrations. The problem is to catch these. I simply argue that they should be caught the simplest possible way; a call-back from the system calls affecting the virtual to physical mapping. > > Even mlock() doesn't pin virtual<->physical maps. Can you elaborate on what you mean here? Thanks, Håkon -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-14 5:18 ` Håkon Bugge 0 siblings, 0 replies; 30+ messages in thread From: Håkon Bugge @ 2010-04-14 5:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Roland Dreier, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, pavel, mingo, jsquyres On Apr 13, 2010, at 20:02 , Peter Zijlstra wrote: > > Yeah, virtual<->physical maps can change through swapping, page > migration, memory compaction, huge-page aggregation (the latter two not > yet being upstream). Assuming this holds true, RDMA will not work. And with no RDMA, we do not need ummunotify. Is that your argument? Seriously, RDMA requires the virtual to physical mapping to remain constant for a period of time. And that time period is from memory registration to de-registration. If the virtual to physical mapping changes in that period for the registered memory area, I can't see how an HCA can handle that. For MPI applications, the MPI API defines that the buffers used in communication cannot be changed or freed while a non-blocking transfer is in progress. So far, we are good. But memory registration (and in particular de-registration) is a costly process. Since MPI is about performance, the MPI library would like to re-use earlier memory registrations for other transfers. The MPI library records earlier memory mappings (VA+bound) and checks if the VA+bound of a new transfer is already contained in memory revisions registered earlier. The problem with this approach is that _normal_ activity _between_ the transfers changes the virtual to physical mapping and ruins previous memory registrations. The problem is to catch these. I simply argue that they should be caught the simplest possible way; a call-back from the system calls affecting the virtual to physical mapping. > > Even mlock() doesn't pin virtual<->physical maps. Can you elaborate on what you mean here? Thanks, Håkon ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-13 18:02 ` Peter Zijlstra @ 2010-04-14 8:52 ` Gleb Natapov -1 siblings, 0 replies; 30+ messages in thread From: Gleb Natapov @ 2010-04-14 8:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Roland Dreier, Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w On Tue, Apr 13, 2010 at 08:02:54PM +0200, Peter Zijlstra wrote: > On Tue, 2010-04-13 at 10:57 -0700, Roland Dreier wrote: > > Are those system calls the only possible way that virtual to physical > > mappings can change? Can't page migration or something like that > > potentially affect things? And even if you did have hooks into every > > system call that mattered (keep in mind that relying on glibc is not > > enough, since an MPI application may not use glibc) would decoding them > > and figuring out what happened really be preferable to a single event > > type that tells you exactly what address range was affected? > > Yeah, virtual<->physical maps can change through swapping, page > migration, memory compaction, huge-page aggregation (the latter two not > yet being upstream). > > Even mlock() doesn't pin virtual<->physical maps. Pages registered for RDMA are GUPed so no method above should touch them. Fork+cow or unmap/map on the other hand can change virtual<->physical maps. GUPed pages are still GUPed, but they are no longer mapped into process' virtual address space. MPI copes with Fork+cow by marking registered memory as MADV_DONTFORK. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-14 8:52 ` Gleb Natapov 0 siblings, 0 replies; 30+ messages in thread From: Gleb Natapov @ 2010-04-14 8:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Roland Dreier, Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, pavel, mingo, jsquyres On Tue, Apr 13, 2010 at 08:02:54PM +0200, Peter Zijlstra wrote: > On Tue, 2010-04-13 at 10:57 -0700, Roland Dreier wrote: > > Are those system calls the only possible way that virtual to physical > > mappings can change? Can't page migration or something like that > > potentially affect things? And even if you did have hooks into every > > system call that mattered (keep in mind that relying on glibc is not > > enough, since an MPI application may not use glibc) would decoding them > > and figuring out what happened really be preferable to a single event > > type that tells you exactly what address range was affected? > > Yeah, virtual<->physical maps can change through swapping, page > migration, memory compaction, huge-page aggregation (the latter two not > yet being upstream). > > Even mlock() doesn't pin virtual<->physical maps. Pages registered for RDMA are GUPed so no method above should touch them. Fork+cow or unmap/map on the other hand can change virtual<->physical maps. GUPed pages are still GUPed, but they are no longer mapped into process' virtual address space. MPI copes with Fork+cow by marking registered memory as MADV_DONTFORK. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-14 8:52 ` Gleb Natapov (?) @ 2010-04-15 13:48 ` Peter Zijlstra -1 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2010-04-15 13:48 UTC (permalink / raw) To: Gleb Natapov Cc: Roland Dreier, Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, pavel, mingo, jsquyres On Wed, 2010-04-14 at 11:52 +0300, Gleb Natapov wrote: > On Tue, Apr 13, 2010 at 08:02:54PM +0200, Peter Zijlstra wrote: > > On Tue, 2010-04-13 at 10:57 -0700, Roland Dreier wrote: > > > Are those system calls the only possible way that virtual to physical > > > mappings can change? Can't page migration or something like that > > > potentially affect things? And even if you did have hooks into every > > > system call that mattered (keep in mind that relying on glibc is not > > > enough, since an MPI application may not use glibc) would decoding them > > > and figuring out what happened really be preferable to a single event > > > type that tells you exactly what address range was affected? > > > > Yeah, virtual<->physical maps can change through swapping, page > > migration, memory compaction, huge-page aggregation (the latter two not > > yet being upstream). > > > > Even mlock() doesn't pin virtual<->physical maps. > Pages registered for RDMA are GUPed so no method above should touch > them. Fork+cow or unmap/map on the other hand can change > virtual<->physical maps. GUPed pages are still GUPed, but they are no > longer mapped into process' virtual address space. MPI copes with > Fork+cow by marking registered memory as MADV_DONTFORK. Sure, holding a page-ref will pin the physical page, but that is not something userspace usually has means to. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-13 17:57 ` Roland Dreier @ 2010-04-14 9:06 ` Gleb Natapov -1 siblings, 0 replies; 30+ messages in thread From: Gleb Natapov @ 2010-04-14 9:06 UTC (permalink / raw) To: Roland Dreier Cc: Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w On Tue, Apr 13, 2010 at 10:57:32AM -0700, Roland Dreier wrote: > > It is further claimed that "… other tricks are not robust". I wrote > > the code used in Scali/Platform MPI handling the issue. I do not > > think its fair to claim that this MPI is not robust in this matter > > nor that is performance is bad. > > The Open MPI developers have spent a lot of effort trying to handle this > purely in userspace and still do not believe that a truly robust > solution is possible without kernel help. Perhaps they can expand on > what the obstacles are. > The problem is that glibc doesn't provide correct type of hooks for MPI to use. You can hook into free(), but the hook is called when application frees memory, not when memory is returned back to the kernel and since MPI wants to cache registration across free()/malloc() if possible those hooks are not good enough. To overcome this MPI tries to provide its own memory management library (luckily glibc defines most/all memory management functions as weak) with proper hooks present, but that poses a whole lot of other problems and memory management is really not MPI's job. Even if glibc will provide proper hooks some day HPC users may want to use other, more performance oriented, memory management libraries instead of using build in glibc one. Relying on glibc hooks will prevent them from doing so. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-14 9:06 ` Gleb Natapov 0 siblings, 0 replies; 30+ messages in thread From: Gleb Natapov @ 2010-04-14 9:06 UTC (permalink / raw) To: Roland Dreier Cc: Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo, jsquyres On Tue, Apr 13, 2010 at 10:57:32AM -0700, Roland Dreier wrote: > > It is further claimed that "… other tricks are not robust". I wrote > > the code used in Scali/Platform MPI handling the issue. I do not > > think its fair to claim that this MPI is not robust in this matter > > nor that is performance is bad. > > The Open MPI developers have spent a lot of effort trying to handle this > purely in userspace and still do not believe that a truly robust > solution is possible without kernel help. Perhaps they can expand on > what the obstacles are. > The problem is that glibc doesn't provide correct type of hooks for MPI to use. You can hook into free(), but the hook is called when application frees memory, not when memory is returned back to the kernel and since MPI wants to cache registration across free()/malloc() if possible those hooks are not good enough. To overcome this MPI tries to provide its own memory management library (luckily glibc defines most/all memory management functions as weak) with proper hooks present, but that poses a whole lot of other problems and memory management is really not MPI's job. Even if glibc will provide proper hooks some day HPC users may want to use other, more performance oriented, memory management libraries instead of using build in glibc one. Relying on glibc hooks will prevent them from doing so. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <20100414090623.GM23554-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-14 9:06 ` Gleb Natapov @ 2010-04-14 14:36 ` Jeff Squyres -1 siblings, 0 replies; 30+ messages in thread From: Jeff Squyres @ 2010-04-14 14:36 UTC (permalink / raw) To: Gleb Natapov Cc: Roland Dreier (rdreier), Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI On Apr 14, 2010, at 5:06 AM, Gleb Natapov wrote: > > The Open MPI developers have spent a lot of effort trying to handle this > > purely in userspace and still do not believe that a truly robust > > solution is possible without kernel help. Perhaps they can expand on > > what the obstacles are. By "truly robust" we mean that some other user-level code can't override the hooks installed by the MPI (user level) middleware. All current glibc hooks are overridable by other user-level code -- and sometimes real applications do this (for their own good reasons). Most of the time, apps blithely override our hooks because they either don't know or can't know that our hooks are installed. It can be dicey to know what you can and cannot override pre-main(), for example (e.g., via the __malloc_initialize_hook). Opening up a direct channel to the kernel and saying "hey, tell me when something changes" is robust because no other entity can hijack your notifications. It also allows us to avoid using pre-main hooks, and makes it so that we don't have to hook into the memory subsystem (usually replacing it with our own). Both of these things are extremely distasteful -- fixing these two things alone make doing something like ummunotify worthwhile, IMHO. -- Jeff Squyres jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-14 14:36 ` Jeff Squyres 0 siblings, 0 replies; 30+ messages in thread From: Jeff Squyres @ 2010-04-14 14:36 UTC (permalink / raw) To: Gleb Natapov Cc: Roland Dreier (rdreier), Håkon Bugge, Jason Gunthorpe, Andrew Morton, Eric B Munson, linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo On Apr 14, 2010, at 5:06 AM, Gleb Natapov wrote: > > The Open MPI developers have spent a lot of effort trying to handle this > > purely in userspace and still do not believe that a truly robust > > solution is possible without kernel help. Perhaps they can expand on > > what the obstacles are. By "truly robust" we mean that some other user-level code can't override the hooks installed by the MPI (user level) middleware. All current glibc hooks are overridable by other user-level code -- and sometimes real applications do this (for their own good reasons). Most of the time, apps blithely override our hooks because they either don't know or can't know that our hooks are installed. It can be dicey to know what you can and cannot override pre-main(), for example (e.g., via the __malloc_initialize_hook). Opening up a direct channel to the kernel and saying "hey, tell me when something changes" is robust because no other entity can hijack your notifications. It also allows us to avoid using pre-main hooks, and makes it so that we don't have to hook into the memory subsystem (usually replacing it with our own). Both of these things are extremely distasteful -- fixing these two things alone make doing something like ummunotify worthwhile, IMHO. -- Jeff Squyres jsquyres@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications 2010-04-12 23:03 ` [PATCH] ummunotify: Userspace support for MMU notifications Andrew Morton @ 2010-04-17 17:41 ` Eric B Munson 0 siblings, 0 replies; 30+ messages in thread From: Eric B Munson @ 2010-04-17 17:41 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, rolandd-FYB4Gu1CFyUAvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, pavel-+ZI9xUNit7I, mingo-X9Un+BFzKDI [-- Attachment #1: Type: text/plain, Size: 3587 bytes --] On Mon, 12 Apr 2010, Andrew Morton wrote: > On Mon, 12 Apr 2010 07:22:17 +0100 > Eric B Munson <ebmunson-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > > Andrew, > > > > I am resubmitting this patch because I believe that the discussion > > has shown this to be an acceptable solution. > > To whom? Some acked-by's would clarify. > > > I have fixed the 32 bit > > build errors, but other than that change, the code is the same as > > Roland's V3 patch. > > > > From: Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> > > > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > > and follow-up messages, libraries using RDMA would like to track > > precisely when application code changes memory mapping via free(), > > munmap(), etc. Current pure-userspace solutions using malloc hooks > > and other tricks are not robust, and the feeling among experts is that > > the issue is unfixable without kernel help. > > But this info could be reassembled by tracking syscall activity, yes? > Perhaps some discussion here explaining why the (possibly enhanced) > ptrace, audit, etc interfaces are unsuitable. > > > We solve this not by implementing the full API proposed in the email > > linked above but rather with a simpler and more generic interface, > > which may be useful in other contexts. Specifically, we implement a > > new character device driver, ummunotify, that creates a /dev/ummunotify > > node. A userspace process can open this node read-only and use the fd > > as follows: > > > > 1. ioctl() to register/unregister an address range to watch in the > > kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>). > > > > 2. read() to retrieve events generated when a mapping in a watched > > address range is invalidated (cf struct ummunotify_event in > > <linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are > > handled for this IO. > > > > 3. mmap() one page at offset 0 to map a kernel page that contains a > > generation counter that is incremented each time an event is > > generated. This allows userspace to have a fast path that checks > > that no events have occurred without a system call. > > OK, what's missing from this whole description and from ummunotify.txt > is: how does one specify the target process? Does /dev/ummunotify > implicitly attach to current->mm? If so, why, and what are the > implications of this? > > If instead it is possible to attach to some other process's mmu > activity (/proc/<pid>/ummunotity?) then how is that done and what are > the security/permissions implications? > > Also, the whole thing is obviously racy: by the time userspace finds > out that something has happened, it might have changed. This > inevitably reduces the applicability/usefulness of the whole thing as > compared to some synchronous mechanism which halts the monitored thread > until the request has been processed and acked. All this should (IMO) > be explored, explained and justified. > > Also, what prevents the obvious DoS which occurs when I register for > events and just let them queue up until the kernel runs out of memory? > presumably events get dropped - what are the reliability implications > of this and how is the max queue length managed? > > Also, ioctls are unpopular. Were other intefaces considered? > I am reworking the Documentation to address all these questions and will resubmit when finished. Thanks for the feedback, Eric [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] ummunotify: Userspace support for MMU notifications @ 2010-04-17 17:41 ` Eric B Munson 0 siblings, 0 replies; 30+ messages in thread From: Eric B Munson @ 2010-04-17 17:41 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-rdma, rolandd, peterz, pavel, mingo [-- Attachment #1: Type: text/plain, Size: 3528 bytes --] On Mon, 12 Apr 2010, Andrew Morton wrote: > On Mon, 12 Apr 2010 07:22:17 +0100 > Eric B Munson <ebmunson@us.ibm.com> wrote: > > > Andrew, > > > > I am resubmitting this patch because I believe that the discussion > > has shown this to be an acceptable solution. > > To whom? Some acked-by's would clarify. > > > I have fixed the 32 bit > > build errors, but other than that change, the code is the same as > > Roland's V3 patch. > > > > From: Roland Dreier <rolandd@cisco.com> > > > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925> > > and follow-up messages, libraries using RDMA would like to track > > precisely when application code changes memory mapping via free(), > > munmap(), etc. Current pure-userspace solutions using malloc hooks > > and other tricks are not robust, and the feeling among experts is that > > the issue is unfixable without kernel help. > > But this info could be reassembled by tracking syscall activity, yes? > Perhaps some discussion here explaining why the (possibly enhanced) > ptrace, audit, etc interfaces are unsuitable. > > > We solve this not by implementing the full API proposed in the email > > linked above but rather with a simpler and more generic interface, > > which may be useful in other contexts. Specifically, we implement a > > new character device driver, ummunotify, that creates a /dev/ummunotify > > node. A userspace process can open this node read-only and use the fd > > as follows: > > > > 1. ioctl() to register/unregister an address range to watch in the > > kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>). > > > > 2. read() to retrieve events generated when a mapping in a watched > > address range is invalidated (cf struct ummunotify_event in > > <linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are > > handled for this IO. > > > > 3. mmap() one page at offset 0 to map a kernel page that contains a > > generation counter that is incremented each time an event is > > generated. This allows userspace to have a fast path that checks > > that no events have occurred without a system call. > > OK, what's missing from this whole description and from ummunotify.txt > is: how does one specify the target process? Does /dev/ummunotify > implicitly attach to current->mm? If so, why, and what are the > implications of this? > > If instead it is possible to attach to some other process's mmu > activity (/proc/<pid>/ummunotity?) then how is that done and what are > the security/permissions implications? > > Also, the whole thing is obviously racy: by the time userspace finds > out that something has happened, it might have changed. This > inevitably reduces the applicability/usefulness of the whole thing as > compared to some synchronous mechanism which halts the monitored thread > until the request has been processed and acked. All this should (IMO) > be explored, explained and justified. > > Also, what prevents the obvious DoS which occurs when I register for > events and just let them queue up until the kernel runs out of memory? > presumably events get dropped - what are the reliability implications > of this and how is the max queue length managed? > > Also, ioctls are unpopular. Were other intefaces considered? > I am reworking the Documentation to address all these questions and will resubmit when finished. Thanks for the feedback, Eric [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2010-04-18 14:38 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-12 6:22 [PATCH] ummunotify: Userspace support for MMU notifications Eric B Munson
2010-04-12 6:22 ` Eric B Munson
[not found] ` <1271053337-7121-1-git-send-email-ebmunson-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-04-12 20:20 ` Pavel Machek
2010-04-12 20:20 ` Pavel Machek
2010-04-14 16:43 ` [PATCH] ummunotify: fix umn-test build Randy Dunlap
2010-04-14 16:43 ` Randy Dunlap
[not found] ` <20100414094341.ba69842d.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2010-04-17 17:44 ` Eric B Munson
2010-04-17 17:44 ` Eric B Munson
[not found] ` <20100417174404.GB3579-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-04-18 14:38 ` Roland Dreier
2010-04-18 14:38 ` Roland Dreier
2010-04-12 23:03 ` [PATCH] ummunotify: Userspace support for MMU notifications Andrew Morton
[not found] ` <20100412160359.1d9074dc.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2010-04-12 23:59 ` Jason Gunthorpe
2010-04-12 23:59 ` Jason Gunthorpe
[not found] ` <20100412235937.GF15629-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-04-13 8:29 ` Håkon Bugge
2010-04-13 8:29 ` Håkon Bugge
[not found] ` <3251DDDA-D705-4B1E-9595-9C24709EF146-U0mLk4xYmo8@public.gmane.org>
2010-04-13 17:57 ` Roland Dreier
2010-04-13 17:57 ` Roland Dreier
[not found] ` <adahbnfme0j.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-04-13 18:02 ` Peter Zijlstra
2010-04-13 18:02 ` Peter Zijlstra
2010-04-14 5:18 ` Håkon Bugge
2010-04-14 5:18 ` Håkon Bugge
2010-04-14 8:52 ` Gleb Natapov
2010-04-14 8:52 ` Gleb Natapov
2010-04-15 13:48 ` Peter Zijlstra
2010-04-14 9:06 ` Gleb Natapov
2010-04-14 9:06 ` Gleb Natapov
[not found] ` <20100414090623.GM23554-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-04-14 14:36 ` Jeff Squyres
2010-04-14 14:36 ` Jeff Squyres
2010-04-17 17:41 ` Eric B Munson
2010-04-17 17:41 ` Eric B Munson
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.