Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
* Re: [PATCH] ummunotify: Userspace support for MMU notifications V2
From: Jeff Squyres @ 2010-05-07 12:01 UTC (permalink / raw)
  To: Eric B Munson
  Cc: akpm, linux-kernel, linux-rdma, linux-mm, rolandd, peterz, mingo,
	pavel, randy.dunlap
In-Reply-To: <1271943493-12120-1-git-send-email-ebmunson@us.ibm.com>

On Apr 22, 2010, at 9:38 AM, Eric B Munson wrote:

> From: Roland Dreier <rolandd@cisco.com>
> 
> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.

Sorry for not replying earlier -- just to throw in my $0.02 here: the MPI community is *very interested* in having this stuff in upstream kernels.  It solves a fairly major problem for us. 

Open MPI (www.open-mpi.org) is ready to pretty much immediately take advantage of these capabilities.  The code to use ummunotify is in a Mercurial branch; we're only waiting for ummunotify to go upstream before committing our support for it to our main SVN development trunk.

> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
> 
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
> 
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
> 
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
> 
> Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which
> helped find several bugs during development.
> 
> Signed-off-by: Roland Dreier <rolandd@cisco.com>
> Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>

Acked-by: Jeff Squyers <jsquyres@cisco.com>

> ---
> 
> Changes from V1:
> - Update Kbuild to handle test program build properly
> - Update documentation to cover questions not addressed in previous
>   thread
> ---
>  Documentation/Makefile                  |    3 +-
>  Documentation/ummunotify/Makefile       |    7 +
>  Documentation/ummunotify/ummunotify.txt |  162 +++++++++
>  Documentation/ummunotify/umn-test.c     |  200 +++++++++++
>  drivers/char/Kconfig                    |   12 +
>  drivers/char/Makefile                   |    1 +
>  drivers/char/ummunotify.c               |  567 +++++++++++++++++++++++++++++++
>  include/linux/Kbuild                    |    1 +
>  include/linux/ummunotify.h              |  121 +++++++
>  9 files changed, 1073 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
> 
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>         filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -       pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +       pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +       watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 0000000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
> diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
> new file mode 100644
> index 0000000..d6c2ccc
> --- /dev/null
> +++ b/Documentation/ummunotify/ummunotify.txt
> @@ -0,0 +1,162 @@
> +UMMUNOTIFY
> +
> +  Ummunotify relays MMU notifier events to userspace.  This is useful
> +  for libraries that need to track the memory mapping of applications;
> +  for example, MPI implementations using RDMA want to cache memory
> +  registrations for performance, but tracking all possible crazy cases
> +  such as when, say, the FORTRAN runtime frees memory is impossible
> +  without kernel help.
> +
> +Basic Model
> +
> +  A userspace process uses it by opening /dev/ummunotify, which
> +  returns a file descriptor.  Interest in address ranges is registered
> +  using ioctl() and MMU notifier events are retrieved using read(), as
> +  described in more detail below.  Userspace can register multiple
> +  address ranges to watch, and can unregister individual ranges.
> +
> +  Userspace can also mmap() a single read-only page at offset 0 on
> +  this file descriptor.  This page contains (at offest 0) a single
> +  64-bit generation counter that the kernel increments each time an
> +  MMU notifier event occurs.  Userspace can use this to very quickly
> +  check if there are any events to retrieve without needing to do a
> +  system call.
> +
> +Control
> +
> +  To start using ummunotify, a process opens /dev/ummunotify in
> +  read-only mode.  This will attach to current->mm because the current
> +  consumers of this functionality do all monitoring in the process
> +  being monitored.  It is currently not possible to use this device to
> +  monitor other processes.  Control from userspace is done via ioctl().
> +  An ioctl was chosen because the number of files required to register
> +  a new address range in sysfs would be unwieldy and new procfs entries
> +  are discouraged.  The defined ioctls are:
> +
> +    UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit
> +      word of feature flags as input, and the kernel updates the
> +      features flags word to contain only features requested by
> +      userspace and also supported by the kernel.
> +
> +      This ioctl is only included for forward compatibility; no
> +      feature flags are currently defined, and the kernel will simply
> +      update any requested feature mask to 0.  The kernel will always
> +      default to a feature mask of 0 if this ioctl is not used, so
> +      current userspace does not need to perform this ioctl.
> +
> +    UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the
> +      kernel to start delivering events for an address range.  The
> +      range is described using struct ummunotify_register_ioctl:
> +
> +       struct ummunotify_register_ioctl {
> +               __u64   start;
> +               __u64   end;
> +               __u64   user_cookie;
> +               __u32   flags;
> +               __u32   reserved;
> +       };
> +
> +      start and end give the range of userspace virtual addresses;
> +      start is included in the range and end is not, so an example of
> +      a 4 KB range would be start=0x1000, end=0x2000.
> +
> +      user_cookie is an opaque 64-bit quantity that is returned by the
> +      kernel in events involving the range, and used by userspace to
> +      stop watching the range.  Each registered address range must
> +      have a distinct user_cookie.
> +
> +      It is fine with the kernel if userspace registers multiple
> +      overlapping or even duplicate address ranges, as long as a
> +      different cookie is used for each registration.
> +
> +      flags and reserved are included for forward compatibility;
> +      userspace should simply set them to 0 for the current interface.
> +
> +    UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit
> +      user_cookie used to register a range to tell the kernel to stop
> +      watching an address range.  Once this ioctl completes, the
> +      kernel will not deliver any further events for the range that is
> +      unregistered.
> +
> +Events
> +
> +  When an event occurs that invalidates some of a process's memory
> +  mapping in an address range being watched, ummunotify queues an
> +  event report for that address range.  If more than one event
> +  invalidates parts of the same address range before userspace
> +  retrieves the queued report, then further reports for the same range
> +  will not be queued -- when userspace does read the queue, only a
> +  single report for a given range will be returned.
> +
> +  If multiple ranges being watched are invalidated by a single event
> +  (which is especially likely if userspace registers overlapping
> +  ranges), then an event report structure will be queued for each
> +  address range registration.
> +
> +  It is possible, if a large enough number of overlapping ranges are
> +  registered and the list of invalidated events is busy enough and
> +  ignored long enough, to cause the kernel to run out of memory.
> +  Because this situation is unlikely to occur, the event queue size
> +  is not bounded in order to avoid dropping events if the queue grows
> +  beyond set bounds.
> +
> +  Userspace retrieves queued events via read() on the ummunotify file
> +  descriptor; a buffer that is at least as big as struct
> +  ummunotify_event should be used to retrieve event reports, and if a
> +  larger buffer is passed to read(), multiple reports will be returned
> +  (if available).
> +
> +  If the ummunotify file descriptor is in blocking mode, a read() call
> +  will wait for an event report to be available.  Userspace may also
> +  set the ummunotify file descriptor to non-blocking mode and use all
> +  standard ways of waiting for data to be available on the ummunotify
> +  file descriptor, including epoll/poll()/select() and SIGIO.
> +
> +  The format of event reports is:
> +
> +       struct ummunotify_event {
> +               __u32   type;
> +               __u32   flags;
> +               __u64   hint_start;
> +               __u64   hint_end;
> +               __u64   user_cookie_counter;
> +       };
> +
> +  where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or
> +  UMMUNOTIFY_EVENT_TYPE_LAST.  Events of type INVAL describe
> +  invalidation events as follows: user_cookie_counter contains the
> +  cookie passed in when userspace registered the range that the event
> +  is for.  hint_start and hint_end contain the start address and end
> +  address that were invalidated.
> +
> +  The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT
> +  defined at the moment.  If HINT is set, then the invalidation event
> +  invalidated less than the full address range and the kernel returns
> +  the exact range invalidated; if HINT is not sent then hint_start and
> +  hint_end are set to the original range registered by userspace.
> +  (HINT will not be set if, for example, multiple events invalidated
> +  disjoint parts of the range and so a single start/end pair cannot
> +  represent the parts of the range that were invalidated)
> +
> +  If the event type is LAST, then the read operation has emptied the
> +  list of invalidated regions, and the flags, hint_start and hint_end
> +  fields are not used.  user_cookie_counter holds the value of the
> +  kernel's generation counter (see below of more details) when the
> +  empty list occurred.
> +
> +Generation Count
> +
> +  Userspace may mmap() a page on a ummunotify file descriptor via
> +
> +       mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0);
> +
> +  to get a read-only mapping of the kernel's 64-bit generation
> +  counter.  The kernel will increment this generation counter each
> +  time an event report is queued.
> +
> +  Userspace can use the generation counter as a quick check to avoid
> +  system calls; if the value read from the mapped kernel counter is
> +  still equal to the value returned in user_cookie_counter for the
> +  most recent LAST event retrieved, then no further events have been
> +  queued and there is no need to try a read() on the ummunotify file
> +  descriptor.
> diff --git a/Documentation/ummunotify/umn-test.c b/Documentation/ummunotify/umn-test.c
> new file mode 100644
> index 0000000..143db2c
> --- /dev/null
> +++ b/Documentation/ummunotify/umn-test.c
> @@ -0,0 +1,200 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <stdint.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <linux/ummunotify.h>
> +
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <sys/ioctl.h>
> +
> +#define UMN_TEST_COOKIE 123
> +
> +static int             umn_fd;
> +static volatile __u64  *umn_counter;
> +
> +static int umn_init(void)
> +{
> +       __u32 flags;
> +
> +       umn_fd = open("/dev/ummunotify", O_RDONLY);
> +       if (umn_fd < 0) {
> +               perror("open");
> +               return 1;
> +       }
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) {
> +               perror("exchange ioctl");
> +               return 1;
> +       }
> +
> +       printf("kernel feature flags: 0x%08x\n", flags);
> +
> +       umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
> +                          MAP_SHARED, umn_fd, 0);
> +       if (umn_counter == MAP_FAILED) {
> +               perror("mmap");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_register(void *buf, size_t size, __u64 cookie)
> +{
> +       struct ummunotify_register_ioctl r = {
> +               .start          = (unsigned long) buf,
> +               .end            = (unsigned long) buf + size,
> +               .user_cookie    = cookie,
> +       };
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) {
> +               perror("register ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_unregister(__u64 cookie)
> +{
> +       if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) {
> +               perror("unregister ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int                     page_size;
> +       __u64                   old_counter;
> +       void                   *t;
> +       int                     got_it;
> +
> +       if (umn_init())
> +               return 1;
> +
> +       printf("\n");
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 0) {
> +               fprintf(stderr, "counter = %lld (expected 0)\n", old_counter);
> +               return 1;
> +       }
> +
> +       page_size = sysconf(_SC_PAGESIZE);
> +       t = mmap(NULL, 3 * page_size, PROT_READ,
> +                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> +
> +       if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE))
> +               return 1;
> +
> +       munmap(t + page_size, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
> +               return 1;
> +       }
> +
> +       got_it = 0;
> +       while (1) {
> +               struct ummunotify_event ev;
> +               int                     len;
> +
> +               len = read(umn_fd, &ev, sizeof ev);
> +               if (len < 0) {
> +                       perror("read event");
> +                       return 1;
> +               }
> +               if (len != sizeof ev) {
> +                       fprintf(stderr, "Read gave %d bytes (!= event size %zd)\n",
> +                               len, sizeof ev);
> +                       return 1;
> +               }
> +
> +               switch (ev.type) {
> +               case UMMUNOTIFY_EVENT_TYPE_INVAL:
> +                       if (got_it) {
> +                               fprintf(stderr, "Extra invalidate event\n");
> +                               return 1;
> +                       }
> +                       if (ev.user_cookie_counter != UMN_TEST_COOKIE) {
> +                               fprintf(stderr, "Invalidate event for cookie %lld (expected %d)\n",
> +                                       ev.user_cookie_counter,
> +                                       UMN_TEST_COOKIE);
> +                               return 1;
> +                       }
> +
> +                       printf("Invalidate event:\tcookie %lld\n",
> +                              ev.user_cookie_counter);
> +
> +                       if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) {
> +                               fprintf(stderr, "Hint flag not set\n");
> +                               return 1;
> +                       }
> +
> +                       if (ev.hint_start != (uintptr_t) t + page_size ||
> +                           ev.hint_end != (uintptr_t) t + page_size * 2) {
> +                               fprintf(stderr, "Got hint %llx..%llx, expected %p..%p\n",
> +                                       ev.hint_start, ev.hint_end,
> +                                       t + page_size, t + page_size * 2);
> +                               return 1;
> +                       }
> +
> +                       printf("\t\t\thint %llx...%llx\n",
> +                              ev.hint_start, ev.hint_end);
> +
> +                       got_it = 1;
> +                       break;
> +
> +               case UMMUNOTIFY_EVENT_TYPE_LAST:
> +                       if (!got_it) {
> +                               fprintf(stderr, "Last event without invalidate event\n");
> +                               return 1;
> +                       }
> +
> +                       printf("Empty event:\t\tcounter %lld\n",
> +                              ev.user_cookie_counter);
> +                       goto done;
> +
> +               default:
> +                       fprintf(stderr, "unknown event type %d\n",
> +                               ev.type);
> +                       return 1;
> +               }
> +       }
> +
> +done:
> +       umn_unregister(123);
> +       munmap(t, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index 3141dd3..cf26019 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -1111,6 +1111,18 @@ config DEVPORT
>         depends on ISA || PCI
>         default y
> 
> +config UMMUNOTIFY
> +       tristate "Userspace MMU notifications"
> +       select MMU_NOTIFIER
> +       help
> +         The ummunotify (userspace MMU notification) driver creates a
> +         character device that can be used by userspace libraries to
> +         get notifications when an application's memory mapping
> +         changed.  This is used, for example, by RDMA libraries to
> +         improve the reliability of memory registration caching, since
> +         the kernel's MMU notifications can be used to know precisely
> +         when to shoot down a cached registration.
> +
>  source "drivers/s390/char/Kconfig"
> 
>  endmenu
> diff --git a/drivers/char/Makefile b/drivers/char/Makefile
> index f957edf..521e5de 100644
> --- a/drivers/char/Makefile
> +++ b/drivers/char/Makefile
> @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)                += nsc_gpio.o
>  obj-$(CONFIG_CS5535_GPIO)      += cs5535_gpio.o
>  obj-$(CONFIG_GPIO_TB0219)      += tb0219.o
>  obj-$(CONFIG_TELCLOCK)         += tlclk.o
> +obj-$(CONFIG_UMMUNOTIFY)       += ummunotify.o
> 
>  obj-$(CONFIG_MWAVE)            += mwave/
>  obj-$(CONFIG_AGP)              += agp/
> diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
> new file mode 100644
> index 0000000..c14df3f
> --- /dev/null
> +++ b/drivers/char/ummunotify.c
> @@ -0,0 +1,567 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/rbtree.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <linux/ummunotify.h>
> +
> +#include <asm/cacheflush.h>
> +
> +MODULE_AUTHOR("Roland Dreier");
> +MODULE_DESCRIPTION("Userspace MMU notifiers");
> +MODULE_LICENSE("GPL v2");
> +
> +/*
> + * Information about an address range userspace has asked us to watch.
> + *
> + * user_cookie: Opaque cookie given to us when userspace registers the
> + *   address range.
> + *
> + * start, end: Address range; start is inclusive, end is exclusive.
> + *
> + * hint_start, hint_end: If a single MMU notification event
> + *   invalidates the address range, we hold the actual range of
> + *   addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
> + *   If another event hits this range before userspace reads the
> + *   event, we give up and don't try to keep track of which subsets
> + *   got invalidated.
> + *
> + * flags: Holds the INVALID flag for ranges that are on the invalid
> + *   list and/or the HINT flag for ranges where the hint range holds
> + *   good information.
> + *
> + * node: Used to put the range into an rbtree we use to be able to
> + *   scan address ranges in order.
> + *
> + * list: Used to put the range on the invalid list when an MMU
> + *   notification event hits the range.
> + */
> +enum {
> +       UMMUNOTIFY_FLAG_INVALID = 1,
> +       UMMUNOTIFY_FLAG_HINT    = 2,
> +};
> +
> +struct ummunotify_reg {
> +       u64                     user_cookie;
> +       unsigned long           start;
> +       unsigned long           end;
> +       unsigned long           hint_start;
> +       unsigned long           hint_end;
> +       unsigned long           flags;
> +       struct rb_node          node;
> +       struct list_head        list;
> +};
> +
> +/*
> + * Context attached to each file that userspace opens.
> + *
> + * mmu_notifier: MMU notifier registered for this context.
> + *
> + * mm: mm_struct for process that created the context; we use this to
> + *   hold a reference to the mm to make sure it doesn't go away until
> + *   we're done with it.
> + *
> + * reg_tree: RB tree of address ranges being watched, sorted by start
> + *   address.
> + *
> + * invalid_list: List of address ranges that have been invalidated by
> + *   MMU notification events; as userspace reads events, the address
> + *   range corresponding to the event is removed from the list.
> + *
> + * counter: Page that can be mapped read-only by userspace, which
> + *   holds a generation count that is incremented each time an event
> + *   occurs.
> + *
> + * lock: Spinlock used to protect all context.
> + *
> + * read_wait: Wait queue used to wait for data to become available in
> + *   blocking read()s.
> + *
> + * async_queue: Used to implement fasync().
> + *
> + * need_empty: Set when userspace reads an invalidation event, so that
> + *   read() knows it must generate an "empty" event when userspace
> + *   drains the invalid_list.
> + *
> + * used: Set after userspace does anything with the file, so that the
> + *   "exchange flags" ioctl() knows it's too late to change anything.
> + */
> +struct ummunotify_file {
> +       struct mmu_notifier     mmu_notifier;
> +       struct mm_struct       *mm;
> +       struct rb_root          reg_tree;
> +       struct list_head        invalid_list;
> +       u64                    *counter;
> +       spinlock_t              lock;
> +       wait_queue_head_t       read_wait;
> +       struct fasync_struct   *async_queue;
> +       int                     need_empty;
> +       int                     used;
> +};
> +
> +static void ummunotify_handle_notify(struct mmu_notifier *mn,
> +                                    unsigned long start, unsigned long end)
> +{
> +       struct ummunotify_file *priv =
> +               container_of(mn, struct ummunotify_file, mmu_notifier);
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       unsigned long flags;
> +       int hit = 0;
> +
> +       spin_lock_irqsave(&priv->lock, flags);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               /*
> +                * Ranges overlap if they're not disjoint; and they're
> +                * disjoint if the end of one is before the start of
> +                * the other one.  So if both disjointness comparisons
> +                * fail then the ranges overlap.
> +                *
> +                * Since we keep the tree of regions we're watching
> +                * sorted by start address, we can end this loop as
> +                * soon as we hit a region that starts past the end of
> +                * the range for the event we're handling.
> +                */
> +               if (reg->start >= end)
> +                       break;
> +
> +               /*
> +                * Just go to the next region if the start of the
> +                * range is after the end of the region -- there
> +                * might still be more overlapping ranges that have a
> +                * greater start.
> +                */
> +               if (start >= reg->end)
> +                       continue;
> +
> +               hit = 1;
> +
> +               if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags)) {
> +                       /* Already on invalid list */
> +                       clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +               } else {
> +                       list_add_tail(&reg->list, &priv->invalid_list);
> +                       set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +                       reg->hint_start = start;
> +                       reg->hint_end   = end;
> +               }
> +       }
> +
> +       if (hit) {
> +               ++(*priv->counter);
> +               flush_dcache_page(virt_to_page(priv->counter));
> +               wake_up_interruptible(&priv->read_wait);
> +               kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
> +       }
> +
> +       spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static void ummunotify_invalidate_page(struct mmu_notifier *mn,
> +                                      struct mm_struct *mm,
> +                                      unsigned long addr)
> +{
> +       ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
> +}
> +
> +static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
> +                                             struct mm_struct *mm,
> +                                             unsigned long start,
> +                                             unsigned long end)
> +{
> +       ummunotify_handle_notify(mn, start, end);
> +}
> +
> +static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
> +       .invalidate_page        = ummunotify_invalidate_page,
> +       .invalidate_range_start = ummunotify_invalidate_range_start,
> +};
> +
> +static int ummunotify_open(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv;
> +       int ret;
> +
> +       if (filp->f_mode & FMODE_WRITE)
> +               return -EINVAL;
> +
> +       priv = kmalloc(sizeof *priv, GFP_KERNEL);
> +       if (!priv)
> +               return -ENOMEM;
> +
> +       priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!priv->counter) {
> +               ret = -ENOMEM;
> +               goto err;
> +       }
> +
> +       priv->reg_tree = RB_ROOT;
> +       INIT_LIST_HEAD(&priv->invalid_list);
> +       spin_lock_init(&priv->lock);
> +       init_waitqueue_head(&priv->read_wait);
> +       priv->async_queue = NULL;
> +       priv->need_empty  = 0;
> +       priv->used        = 0;
> +
> +       priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
> +       /*
> +        * Register notifier last, since notifications can occur as
> +        * soon as we register....
> +        */
> +       ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
> +       if (ret)
> +               goto err_page;
> +
> +       priv->mm = current->mm;
> +       atomic_inc(&priv->mm->mm_count);
> +
> +       filp->private_data = priv;
> +
> +       return 0;
> +
> +err_page:
> +       free_page((unsigned long) priv->counter);
> +
> +err:
> +       kfree(priv);
> +       return ret;
> +}
> +
> +static int ummunotify_close(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +
> +       mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
> +       mmdrop(priv->mm);
> +       free_page((unsigned long) priv->counter);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +               kfree(reg);
> +       }
> +
> +       kfree(priv);
> +
> +       return 0;
> +}
> +
> +static bool ummunotify_readable(struct ummunotify_file *priv)
> +{
> +       return priv->need_empty || !list_empty(&priv->invalid_list);
> +}
> +
> +static ssize_t ummunotify_read(struct file *filp, char __user *buf,
> +                              size_t count, loff_t *pos)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct ummunotify_reg *reg;
> +       ssize_t ret;
> +       struct ummunotify_event *events;
> +       int max;
> +       int n;
> +
> +       priv->used = 1;
> +
> +       events = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!events) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       while (!ummunotify_readable(priv)) {
> +               spin_unlock_irq(&priv->lock);
> +
> +               if (filp->f_flags & O_NONBLOCK) {
> +                       ret = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               if (wait_event_interruptible(priv->read_wait,
> +                                            ummunotify_readable(priv))) {
> +                       ret = -ERESTARTSYS;
> +                       goto out;
> +               }
> +
> +               spin_lock_irq(&priv->lock);
> +       }
> +
> +       max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
> +
> +       for (n = 0; n < max; ++n) {
> +               if (list_empty(&priv->invalid_list)) {
> +                       events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST;
> +                       events[n].user_cookie_counter = *priv->counter;
> +                       ++n;
> +                       priv->need_empty = 0;
> +                       break;
> +               }
> +
> +               reg = list_first_entry(&priv->invalid_list,
> +                                      struct ummunotify_reg, list);
> +
> +               events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
> +               if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
> +                       events[n].flags      = UMMUNOTIFY_EVENT_FLAG_HINT;
> +                       events[n].hint_start = max(reg->start, reg->hint_start);
> +                       events[n].hint_end   = min(reg->end, reg->hint_end);
> +               } else {
> +                       events[n].hint_start = reg->start;
> +                       events[n].hint_end   = reg->end;
> +               }
> +               events[n].user_cookie_counter = reg->user_cookie;
> +
> +               list_del(&reg->list);
> +               reg->flags = 0;
> +               priv->need_empty = 1;
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       if (copy_to_user(buf, events, n * sizeof *events))
> +               ret = -EFAULT;
> +       else
> +               ret = n * sizeof *events;
> +
> +out:
> +       free_page((unsigned long) events);
> +       return ret;
> +}
> +
> +static unsigned int ummunotify_poll(struct file *filp,
> +                                   struct poll_table_struct *wait)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       poll_wait(filp, &priv->read_wait, wait);
> +
> +       return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
> +}
> +
> +static long ummunotify_exchange_features(struct ummunotify_file *priv,
> +                                        __u32 __user *arg)
> +{
> +       u32 feature_mask;
> +
> +       if (priv->used)
> +               return -EINVAL;
> +
> +       priv->used = 1;
> +
> +       if (copy_from_user(&feature_mask, arg, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       /* No extensions defined at present. */
> +       feature_mask = 0;
> +
> +       if (copy_to_user(arg, &feature_mask, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +static long ummunotify_register_region(struct ummunotify_file *priv,
> +                                      void __user *arg)
> +{
> +       struct ummunotify_register_ioctl parm;
> +       struct ummunotify_reg *reg, *treg;
> +       struct rb_node **n = &priv->reg_tree.rb_node;
> +       struct rb_node *pn;
> +       int ret = 0;
> +
> +       if (copy_from_user(&parm, arg, sizeof parm))
> +               return -EFAULT;
> +
> +       priv->used = 1;
> +
> +       reg = kmalloc(sizeof *reg, GFP_KERNEL);
> +       if (!reg)
> +               return -ENOMEM;
> +
> +       reg->user_cookie        = parm.user_cookie;
> +       reg->start              = parm.start;
> +       reg->end                = parm.end;
> +       reg->flags              = 0;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (treg->user_cookie == parm.user_cookie) {
> +                       kfree(reg);
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
> +       pn = NULL;
> +       while (*n) {
> +               pn = *n;
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (reg->start <= treg->start)
> +                       n = &pn->rb_left;
> +               else
> +                       n = &pn->rb_right;
> +       }
> +
> +       rb_link_node(&reg->node, pn, n);
> +       rb_insert_color(&reg->node, &priv->reg_tree);
> +
> +out:
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_unregister_region(struct ummunotify_file *priv,
> +                                        __u64 __user *arg)
> +{
> +       u64 user_cookie;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       int ret = -EINVAL;
> +
> +       if (copy_from_user(&user_cookie, arg, sizeof(user_cookie)))
> +               return -EFAULT;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               if (reg->user_cookie == user_cookie) {
> +                       rb_erase(n, &priv->reg_tree);
> +                       if (test_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
> +                               list_del(&reg->list);
> +                       kfree(reg);
> +                       ret = 0;
> +                       break;
> +               }
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
> +                            unsigned long arg)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       void __user *argp = (void __user *) arg;
> +
> +       switch (cmd) {
> +       case UMMUNOTIFY_EXCHANGE_FEATURES:
> +               return ummunotify_exchange_features(priv, argp);
> +       case UMMUNOTIFY_REGISTER_REGION:
> +               return ummunotify_register_region(priv, argp);
> +       case UMMUNOTIFY_UNREGISTER_REGION:
> +               return ummunotify_unregister_region(priv, argp);
> +       default:
> +               return -ENOIOCTLCMD;
> +       }
> +}
> +
> +static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct ummunotify_file *priv = vma->vm_private_data;
> +
> +       if (vmf->pgoff != 0)
> +               return VM_FAULT_SIGBUS;
> +
> +       vmf->page = virt_to_page(priv->counter);
> +       get_page(vmf->page);
> +
> +       return 0;
> +
> +}
> +
> +static struct vm_operations_struct ummunotify_vm_ops = {
> +       .fault          = ummunotify_fault,
> +};
> +
> +static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0)
> +               return -EINVAL;
> +
> +       vma->vm_ops             = &ummunotify_vm_ops;
> +       vma->vm_private_data    = priv;
> +
> +       return 0;
> +}
> +
> +static int ummunotify_fasync(int fd, struct file *filp, int on)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       return fasync_helper(fd, filp, on, &priv->async_queue);
> +}
> +
> +static const struct file_operations ummunotify_fops = {
> +       .owner          = THIS_MODULE,
> +       .open           = ummunotify_open,
> +       .release        = ummunotify_close,
> +       .read           = ummunotify_read,
> +       .poll           = ummunotify_poll,
> +       .unlocked_ioctl = ummunotify_ioctl,
> +#ifdef CONFIG_COMPAT
> +       .compat_ioctl   = ummunotify_ioctl,
> +#endif
> +       .mmap           = ummunotify_mmap,
> +       .fasync         = ummunotify_fasync,
> +};
> +
> +static struct miscdevice ummunotify_misc = {
> +       .minor  = MISC_DYNAMIC_MINOR,
> +       .name   = "ummunotify",
> +       .fops   = &ummunotify_fops,
> +};
> +
> +static int __init ummunotify_init(void)
> +{
> +       return misc_register(&ummunotify_misc);
> +}
> +
> +static void __exit ummunotify_cleanup(void)
> +{
> +       misc_deregister(&ummunotify_misc);
> +}
> +
> +module_init(ummunotify_init);
> +module_exit(ummunotify_cleanup);
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index e2ea0b2..e086b39 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -163,6 +163,7 @@ header-y += tipc_config.h
>  header-y += toshiba.h
>  header-y += udf_fs_i.h
>  header-y += ultrasound.h
> +header-y += ummunotify.h
>  header-y += un.h
>  header-y += utime.h
>  header-y += veth.h
> diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
> new file mode 100644
> index 0000000..21b0d03
> --- /dev/null
> +++ b/include/linux/ummunotify.h
> @@ -0,0 +1,121 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _LINUX_UMMUNOTIFY_H
> +#define _LINUX_UMMUNOTIFY_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/*
> + * Ummunotify relays MMU notifier events to userspace.  A userspace
> + * process uses it by opening /dev/ummunotify, which returns a file
> + * descriptor.  Interest in address ranges is registered using ioctl()
> + * and MMU notifier events are retrieved using read(), as described in
> + * more detail below.
> + *
> + * Userspace can also mmap() a single read-only page at offset 0 on
> + * this file descriptor.  This page contains (at offest 0) a single
> + * 64-bit generation counter that the kernel increments each time an
> + * MMU notifier event occurs.  Userspace can use this to very quickly
> + * check if there are any events to retrieve without needing to do a
> + * system call.
> + */
> +
> +/*
> + * struct ummunotify_register_ioctl describes an address range from
> + * start to end (including start but not including end) to be
> + * monitored.  user_cookie is an opaque handle that userspace assigns,
> + * and which is used to unregister.  flags and reserved are currently
> + * unused and should be set to 0 for forward compatibility.
> + */
> +struct ummunotify_register_ioctl {
> +       __u64   start;
> +       __u64   end;
> +       __u64   user_cookie;
> +       __u32   flags;
> +       __u32   reserved;
> +};
> +
> +#define UMMUNOTIFY_MAGIC               'U'
> +
> +/*
> + * Forward compatibility: Userspace passes in a 32-bit feature mask
> + * with feature flags set indicating which extensions it wishes to
> + * use.  The kernel will return a feature mask with the bits of
> + * userspace's mask that the kernel implements; from that point on
> + * both userspace and the kernel should behave as described by the
> + * kernel's feature mask.
> + *
> + * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
> + * then the kernel will use a feature mask of 0.
> + *
> + * No feature flags are currently defined, so the kernel will always
> + * return a feature mask of 0 at present.
> + */
> +#define UMMUNOTIFY_EXCHANGE_FEATURES   _IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
> +
> +/*
> + * Register interest in an address range; userspace should pass in a
> + * struct ummunotify_register_ioctl describing the region.
> + */
> +#define UMMUNOTIFY_REGISTER_REGION     _IOW(UMMUNOTIFY_MAGIC, 2, \
> +                                            struct ummunotify_register_ioctl)
> +/*
> + * Unregister interest in an address range; userspace should pass in
> + * the user_cookie value that was used to register the address range.
> + * No events for the address range will be reported once it is
> + * unregistered.
> + */
> +#define UMMUNOTIFY_UNREGISTER_REGION   _IOW(UMMUNOTIFY_MAGIC, 3, __u64)
> +
> +/*
> + * Invalidation events are returned whenever the kernel changes the
> + * mapping for a monitored address.  These events are retrieved by
> + * read() on the ummunotify file descriptor, which will fill the
> + * read() buffer with struct ummunotify_event.
> + *
> + * If type field is INVAL, then user_cookie_counter holds the
> + * user_cookie for the region being reported; if the HINT flag is set
> + * then hint_start/hint_end hold the start and end of the mapping that
> + * was invalidated.  (If HINT is not set, then multiple events
> + * invalidated parts of the registered range and hint_start/hint_end
> + * and set to the start/end of the whole registered range)
> + *
> + * If type is LAST, then the read operation has emptied the list of
> + * invalidated regions, and user_cookie_counter holds the value of the
> + * kernel's generation counter when the empty list occurred.  The
> + * other fields are not filled in for this event.
> + */
> +enum {
> +       UMMUNOTIFY_EVENT_TYPE_INVAL     = 0,
> +       UMMUNOTIFY_EVENT_TYPE_LAST      = 1,
> +};
> +
> +enum {
> +       UMMUNOTIFY_EVENT_FLAG_HINT      = 1 << 0,
> +};
> +
> +struct ummunotify_event {
> +       __u32   type;
> +       __u32   flags;
> +       __u64   hint_start;
> +       __u64   hint_end;
> +       __u64   user_cookie_counter;
> +};
> +
> +#endif /* _LINUX_UMMUNOTIFY_H */
> --
> 1.6.3.3
> 
> 


-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 1/2] libibnetdisc: Convert to a multi-smp algorithm
From: Sasha Khapyorsky @ 2010-05-07 20:50 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Hal Rosenstock
In-Reply-To: <20100422191231.bf6021fb.weiny2-i2BcT+NCU+M@public.gmane.org>

Hi Ira,

On 19:12 Thu 22 Apr     , Ira Weiny wrote:
> On Tue, 13 Apr 2010 16:25:31 +0300
> Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
> 
> > On 13:46 Tue 13 Apr     , Sasha Khapyorsky wrote:
> > > 
> > > However see some comments and questions below.
> > 
> > Another thought. What about API like:
> > 
> > 	ibnd_discover_fabric(cosnt char *ca_name, unsigned port_num,
> > 			     struct ibnd_config *cfg);
> > 
> > So libibnetdisc will be responsible for opening and closing its own port
> > and in this way will be fully reenterable?
> 
> I think we need this...  But this alone will not fix ibnetdisc...
> 
> I found out why "iblinkinfo -S" is hanging for me.  The new algorithm has the
> library using both libibmad and libibumad calls simultaneously.
> 
> These libraries were not designed to be used this way.  Therefore, I don't
> think there is any direct bug in those layers.  However, this is why I thought
> it was better for ibnetdisc to sit on top of ibmad and not use the umad layer.
> 
> Anyway, I think it is a mistake to pass an ibmad_port construct directly to
> this library now.  Here is why.
> 
> umad_recv in query_smp.c and ib_resolve_self_via in ibnetdisc.c conflicted.
> The underlying call to _do_madrpc was discarding the response MAD which
> query_smp.c:umad_recv was expecting.

This makes sense for me.

> If we change the API to specify the ca name and port then the library can open
> 2 ports (or as many as it wants) and use them appropriately.  I think this is
> the only solution which does not involve fixing libibmad.
> 
> So what about something like this:
> 
>    int ibnd_discover_fabric(ibnd_fabric_t **fabric,
> 			    cosnt char *ca_name,  <== could we even default this?

I would think about ca_name and port_number. And this is of course may
have default (NULL, 0).

> 			    struct ibnd_config *cfg);

What is wrong with current ibdn_fabric_t *ibnd_discover_fabric(...)? Why
do we need an extra parameter?

> 
> I don't mind the ibnd_config_t struct but I don't think it should be visible
> to the user.  Make it opaque and use "set" functions.  Something like.
> 
> ibnd_fabric_t *fabric;
> ibnd_config_t cfg;
> ib_portid_t * from;
> 
> ibnd_set_hops(&cfg, hops);         <== default -1
> ibnd_set_port_num(&cfg, port_num); <== default 1
> ibnd_set_max_smps(&cfg, max_smps); <== default 2
> ibnd_set_from_node(&cfg, from);    <== default NULL

I would prefer to not complicate API with ibnd_set_this() helpers. It
would be necessary to add new ones in the future which will lead to API
changes.

> if (ibnd_discover_fabric(&fabric, "foo", &cfg)) {  <== anything not in cfg is
>                                                        defaulted here
>    fprintf(stderr, "Wow it failed\n");
> }
> 
> This allows us to change ibnd_config structure any time we want without
> affecting the API.  I don't think the "pad" you used is a good idea.

Without padding we will break ABI each time when new field is added to
the config structure.

> Also since we are breaking the API we might as well return the fabric as a
> parameter and have an error code.  But I could go either way on this one.
> 
> Ira
> 
> 
> [*] query_smp.c probably should have it's own timeout here but we can discuss
> later.
> 
> [#] What sucks about this is that libibmad already has the functionality to
> open the umad port and configure it (50 line function).  Now we will be
> duplicating this functionality.

I think you can use mad_rpc_open_port(ca_name, port_number, ...) just
fine (and so the rest of libibmad stuff) - it will open separate fd.

Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-08  7:28 UTC (permalink / raw)
  To: Roland Dreier
  Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Tziporet Koren, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	panda-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP,
	ishai-VPRAkNaXOzVS1MOuV/RT9w
In-Reply-To: <adaeihqas5y.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

Dr. Panda, Jeff, and Ishai,

We are trying to get XRC integrated into the next mainstream kernel.

For the kernel submission, I added a destroy_xrc_rcv_qp method (to be
used if the application did not require persistence of the xrc_rcv qp
after the creating process terminated -- per Diego Copernicoff's request).
This did not affect the core API of create/modify/unreg that you have
been using until now.

However, even without the new destroy method (as I suggest below),
having the creating process call unreg is still a bit counterintuitive,
since it calls create, and registration is a side-effect.

Roland is now intensively reviewing the XRC patches, and a made suggestion
to simplify the API which Tziporet and I agree with (see Roland's comments below).

Please comment on this suggestion (which is to have reg_xrc_rcv_qp do create
as well).
This is a minor change, that would require two changes in your current calls:
1. Instead of calling create_xrc_rcv_qp(), as is done currently, MPI would call
   u32 qp_num = 0xFFFFFFFF;
	err = reg_xrc_rcv_qp(xrcd, &qp_num);
   and would have the created qp number returned in qp_num;
   (the qp_init attributes in the old create_xrc_rcv_qp are all ignored except for
   the xrc domain handle anyway)

2. instead of calling reg_xrc_rcv_qp(xrcd, qp_num), you would need to set the
   qp number in a u32 variable, and call reg_xrc_rcv_qp(xrcd, &qp_num).

The other xrc_rcv_qp verbs would work as they work now.

Regarding OFED, this change would not affect OFED 1.5.x ; it would only enter
OFED at 1.6.x.

Please comment.

-Jack

P.S. You can see the submission/discussion of XRC starting at:
	http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg02792.html
On Thursday 06 May 2010 01:40, Roland Dreier wrote:
>  > > I don't really understand the semantics here.  What is supposed to
>  > > happen if I do create/reg/destroy?> What happens if one process does 
>  > > destroy while another process is still registered?
> 
>  > Maybe we can simply assert that the unreg IS the destroy method of the
>  > IB_SPEC, and get rid of the destroy method.
>  > 
>  > The xrc target qp section of the spec was not written with QP persistence
>  > (after the creating process exited) in mind.  That requirement surfaced
>  > at the last minute as a result of testing by the MPI community during the
>  > implementation phase (as far as I know).  Unfortunately, this created
>  > a semantic problem.
> 
> Yes, I think we should try to simplify things here.
> 
> It's very unfortunate to diverge from the API that's been shipped for a
> while now, but I really think we don't want all these different ways of
> saying the same thing, with little difference between create and reg,
> and between destroy and unreg.
> 
> In fact the smallest possible API would be just
> 
>   register_xrc_rcv_qp(xrcd, *qp_num)
> 
> where the user can pass in an invalid qp_num (say, -1 aka ffffffff) and
> have a new QP created, or a valid one to take a new ref on the existing
> rcv QP, and
> 
>   unregister_xrc_rcv_qp(xrcd, qp_num).
> 
> (along these lines, the structure in these patches:
> 
> +struct ib_uverbs_create_xrc_rcv_qp {
> +	__u64 response;
> +	__u64 user_handle;
> +	__u32 xrcd_handle;
> +	__u32 max_send_wr;
> +	__u32 max_recv_wr;
> +	__u32 max_send_sge;
> +	__u32 max_recv_sge;
> +	__u32 max_inline_data;
> +	__u8  sq_sig_all;
> +	__u8  qp_type;
> +	__u8  reserved[6];
> +	__u64 driver_data[0];
> +};
> 
> has many fields we don't need.  Pretty much all the fields after
> xrcd_handle are ignored, except sq_sig_all is used -- and that is highly
> dubious since the rcv QP has no SQ!  So I would propose something like
> just having:
> 
> +struct ib_uverbs_reg_xrc_rcv_qp {
> +	__u64 response;
> +	__u32 xrcd_handle;
> +	__u32 qp_num;
> +	__u64 driver_data[0];
> +};
> 
> where response is used to pass back the qp_num in the create case.
> 
> And then we just have unreg_xrc_rcv_qp and no destroy method (since they
> are synonymous anyway).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Dhabaleswar Panda @ 2010-05-08 13:13 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Roland Dreier, rolandd-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Tziporet Koren,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w, ishai-VPRAkNaXOzVS1MOuV/RT9w,
	sayantan sur, mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP
In-Reply-To: <201005081028.20327.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

Hi Jack,

Thanks for your note and the suggested changes. I will discuss this with
my team members and get back to you with our thoughts next week.

Thanks,

DK

On Sat, 8 May 2010, Jack Morgenstein wrote:

> Dr. Panda, Jeff, and Ishai,
>
> We are trying to get XRC integrated into the next mainstream kernel.
>
> For the kernel submission, I added a destroy_xrc_rcv_qp method (to be
> used if the application did not require persistence of the xrc_rcv qp
> after the creating process terminated -- per Diego Copernicoff's request).
> This did not affect the core API of create/modify/unreg that you have
> been using until now.
>
> However, even without the new destroy method (as I suggest below),
> having the creating process call unreg is still a bit counterintuitive,
> since it calls create, and registration is a side-effect.
>
> Roland is now intensively reviewing the XRC patches, and a made suggestion
> to simplify the API which Tziporet and I agree with (see Roland's comments below).
>
> Please comment on this suggestion (which is to have reg_xrc_rcv_qp do create
> as well).
> This is a minor change, that would require two changes in your current calls:
> 1. Instead of calling create_xrc_rcv_qp(), as is done currently, MPI would call
>    u32 qp_num = 0xFFFFFFFF;
> 	err = reg_xrc_rcv_qp(xrcd, &qp_num);
>    and would have the created qp number returned in qp_num;
>    (the qp_init attributes in the old create_xrc_rcv_qp are all ignored except for
>    the xrc domain handle anyway)
>
> 2. instead of calling reg_xrc_rcv_qp(xrcd, qp_num), you would need to set the
>    qp number in a u32 variable, and call reg_xrc_rcv_qp(xrcd, &qp_num).
>
> The other xrc_rcv_qp verbs would work as they work now.
>
> Regarding OFED, this change would not affect OFED 1.5.x ; it would only enter
> OFED at 1.6.x.
>
> Please comment.
>
> -Jack
>
> P.S. You can see the submission/discussion of XRC starting at:
> 	http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg02792.html
> On Thursday 06 May 2010 01:40, Roland Dreier wrote:
> >  > > I don't really understand the semantics here.  What is supposed to
> >  > > happen if I do create/reg/destroy?> What happens if one process does
> >  > > destroy while another process is still registered?
> >
> >  > Maybe we can simply assert that the unreg IS the destroy method of the
> >  > IB_SPEC, and get rid of the destroy method.
> >  >
> >  > The xrc target qp section of the spec was not written with QP persistence
> >  > (after the creating process exited) in mind.  That requirement surfaced
> >  > at the last minute as a result of testing by the MPI community during the
> >  > implementation phase (as far as I know).  Unfortunately, this created
> >  > a semantic problem.
> >
> > Yes, I think we should try to simplify things here.
> >
> > It's very unfortunate to diverge from the API that's been shipped for a
> > while now, but I really think we don't want all these different ways of
> > saying the same thing, with little difference between create and reg,
> > and between destroy and unreg.
> >
> > In fact the smallest possible API would be just
> >
> >   register_xrc_rcv_qp(xrcd, *qp_num)
> >
> > where the user can pass in an invalid qp_num (say, -1 aka ffffffff) and
> > have a new QP created, or a valid one to take a new ref on the existing
> > rcv QP, and
> >
> >   unregister_xrc_rcv_qp(xrcd, qp_num).
> >
> > (along these lines, the structure in these patches:
> >
> > +struct ib_uverbs_create_xrc_rcv_qp {
> > +	__u64 response;
> > +	__u64 user_handle;
> > +	__u32 xrcd_handle;
> > +	__u32 max_send_wr;
> > +	__u32 max_recv_wr;
> > +	__u32 max_send_sge;
> > +	__u32 max_recv_sge;
> > +	__u32 max_inline_data;
> > +	__u8  sq_sig_all;
> > +	__u8  qp_type;
> > +	__u8  reserved[6];
> > +	__u64 driver_data[0];
> > +};
> >
> > has many fields we don't need.  Pretty much all the fields after
> > xrcd_handle are ignored, except sq_sig_all is used -- and that is highly
> > dubious since the rcv QP has no SQ!  So I would propose something like
> > just having:
> >
> > +struct ib_uverbs_reg_xrc_rcv_qp {
> > +	__u64 response;
> > +	__u32 xrcd_handle;
> > +	__u32 qp_num;
> > +	__u64 driver_data[0];
> > +};
> >
> > where response is used to pass back the qp_num in the create case.
> >
> > And then we just have unreg_xrc_rcv_qp and no destroy method (since they
> > are synonymous anyway).
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Or Gerlitz @ 2010-05-08 15:43 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <adapr199foo.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> Thanks [...] Pretty impressive eagle eyes to notice that...

yes... one's eager for patches to be reviewed/merged, eagle eyes come
to the stage...

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Ishai Rabinovitz @ 2010-05-09  8:10 UTC (permalink / raw)
  To: Jack Morgenstein, Roland Dreier
  Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Tziporet Koren, jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
	panda-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP@public.gmane.org,
	Pavel Shamis
In-Reply-To: <201005081028.20327.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

It should not be a problem in open MPI.
Is there going to be a change in the library version number?

Ishai

-----Original Message-----
From: Jack Morgenstein [mailto:jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org] 
Sent: Saturday, May 08, 2010 10:28 AM
To: Roland Dreier
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Tziporet Koren; jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; panda-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP@public.gmane.org; Ishai Rabinovitz
Subject: Re: [PATCH 2/4] ib_core: implement XRC RCV qp's

Dr. Panda, Jeff, and Ishai,

We are trying to get XRC integrated into the next mainstream kernel.

For the kernel submission, I added a destroy_xrc_rcv_qp method (to be
used if the application did not require persistence of the xrc_rcv qp
after the creating process terminated -- per Diego Copernicoff's request).
This did not affect the core API of create/modify/unreg that you have
been using until now.

However, even without the new destroy method (as I suggest below),
having the creating process call unreg is still a bit counterintuitive,
since it calls create, and registration is a side-effect.

Roland is now intensively reviewing the XRC patches, and a made suggestion
to simplify the API which Tziporet and I agree with (see Roland's comments below).

Please comment on this suggestion (which is to have reg_xrc_rcv_qp do create
as well).
This is a minor change, that would require two changes in your current calls:
1. Instead of calling create_xrc_rcv_qp(), as is done currently, MPI would call
   u32 qp_num = 0xFFFFFFFF;
	err = reg_xrc_rcv_qp(xrcd, &qp_num);
   and would have the created qp number returned in qp_num;
   (the qp_init attributes in the old create_xrc_rcv_qp are all ignored except for
   the xrc domain handle anyway)

2. instead of calling reg_xrc_rcv_qp(xrcd, qp_num), you would need to set the
   qp number in a u32 variable, and call reg_xrc_rcv_qp(xrcd, &qp_num).

The other xrc_rcv_qp verbs would work as they work now.

Regarding OFED, this change would not affect OFED 1.5.x ; it would only enter
OFED at 1.6.x.

Please comment.

-Jack

P.S. You can see the submission/discussion of XRC starting at:
	http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg02792.html
On Thursday 06 May 2010 01:40, Roland Dreier wrote:
>  > > I don't really understand the semantics here.  What is supposed to
>  > > happen if I do create/reg/destroy?> What happens if one process does 
>  > > destroy while another process is still registered?
> 
>  > Maybe we can simply assert that the unreg IS the destroy method of the
>  > IB_SPEC, and get rid of the destroy method.
>  > 
>  > The xrc target qp section of the spec was not written with QP persistence
>  > (after the creating process exited) in mind.  That requirement surfaced
>  > at the last minute as a result of testing by the MPI community during the
>  > implementation phase (as far as I know).  Unfortunately, this created
>  > a semantic problem.
> 
> Yes, I think we should try to simplify things here.
> 
> It's very unfortunate to diverge from the API that's been shipped for a
> while now, but I really think we don't want all these different ways of
> saying the same thing, with little difference between create and reg,
> and between destroy and unreg.
> 
> In fact the smallest possible API would be just
> 
>   register_xrc_rcv_qp(xrcd, *qp_num)
> 
> where the user can pass in an invalid qp_num (say, -1 aka ffffffff) and
> have a new QP created, or a valid one to take a new ref on the existing
> rcv QP, and
> 
>   unregister_xrc_rcv_qp(xrcd, qp_num).
> 
> (along these lines, the structure in these patches:
> 
> +struct ib_uverbs_create_xrc_rcv_qp {
> +	__u64 response;
> +	__u64 user_handle;
> +	__u32 xrcd_handle;
> +	__u32 max_send_wr;
> +	__u32 max_recv_wr;
> +	__u32 max_send_sge;
> +	__u32 max_recv_sge;
> +	__u32 max_inline_data;
> +	__u8  sq_sig_all;
> +	__u8  qp_type;
> +	__u8  reserved[6];
> +	__u64 driver_data[0];
> +};
> 
> has many fields we don't need.  Pretty much all the fields after
> xrcd_handle are ignored, except sq_sig_all is used -- and that is highly
> dubious since the rcv QP has no SQ!  So I would propose something like
> just having:
> 
> +struct ib_uverbs_reg_xrc_rcv_qp {
> +	__u64 response;
> +	__u32 xrcd_handle;
> +	__u32 qp_num;
> +	__u64 driver_data[0];
> +};
> 
> where response is used to pass back the qp_num in the create case.
> 
> And then we just have unreg_xrc_rcv_qp and no destroy method (since they
> are synonymous anyway).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Mellanox implementation for atomic operations
From: lihaidong @ 2010-05-10  6:42 UTC (permalink / raw)
  To: linux-rdma; +Cc: Steve Wise, Tziporet Koren, shipr
In-Reply-To: <201005101435267348378@inspur.com>

Hi,
   I have a question about atomic operations.
   According to IB specification o10-48, all atomic operation request made to the same HCA, referencing the same physical memory are serialized with respect to each other.  I know this should be complied with if HCA supports atomic operations, right?
   According to  IB specification o10-49, all atomic operations requests that referencing the same physical memory are serialized with respect to each other. This means that atomic operations performed by processors should serialized with atomic operations performed by HCAs, too, if they were referencing the same physical memory.
  I want to know whether Mellanox implementation for atomic operations comply with o10-49 or not.
  if not ,to what extent it comply with the rule? 
  I also was intrested in how this rule is complied with by others vendors?

2010-05-10 
lihaidong 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-10 10:01 UTC (permalink / raw)
  To: Roland Dreier
  Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Tziporet Koren
In-Reply-To: <adaaasearf5.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Thursday 06 May 2010 01:56, Roland Dreier wrote:
>  > In my original implementation, the low-level driver was responsible for generating
>  > the events for all the processes.  To move this mechanism to the core would require
>  > a fairly extensive re-write.  I would need to introduce ib_core methods for create,
>  > reg, unreg, and destroy, since the uverbs layer operates per user process and does
>  > not track across multiple processes.  I was concerned that modifications of this
>  > magnitude would introduce instability in XRC, and would require a significant QA cycle
> 
> It seems to me that it should, if anything, be easier to track all these
> lists at the uverbs layer.  We already have a global xrcd structure for
> each xrcd that we could easily stick a list of rcv QPs in (and keep the
> list of rcv QP registrations per rcv QP). 
> 
Roland,
While examining your proposed solution, I ran into the following difficulty.
In order to avoid the reg/unreg verbs in the low-level driver, I need to move
event handling for XRC RCV qp's to the core layer.

Currently, the low-level driver maintains, for each XRC RCV qp a list of user contexts
which are registered to that QP -- and generates an event for each of those contexts.
(this is so that each of the contexts may unregister, thus allowing the QP to be
destroyed).

To move this to the core requires that I maintain a global list of all xrc_rcv QPs,
and per QP, a list of contexts registered to it -- regardless of which xrc domain the
xrc rcv QP is attached to (I do not have xrc domain information in the XRC RCV qp event).

Thus, I do not see how we can reasonably use the global xrcd structure for this purpose.
(I don't see searching all xrc domains for the QP as a reasonable solution).

I am therefore back at my proposal above to introduce ib_core methods: ib_reg_xrc_rcv_qp
and ib_unreg_xrc_rcv_qp (using the reg for create as well, and the unreg for destroy as well,
as you suggested in a subsequent mail).  That way, I can track xrc rcv QPs, and keep a list
per QP of process files to signal an event to.

I have an initial implementation of this which I will clean up and send you for review (actually, an
implementation which has the ib_create_xrc_rcv_qp/ib_destroy_xrc_rcv_qp, and which does
not need the low-level driver implementations of reg/unreg.  I started off with this
implementation, and discontinued it when I noticed how different it was from the original
XRC RCV implementation... but saved it just in case).

Comments?
-Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-10 10:34 UTC (permalink / raw)
  To: Roland Dreier
  Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Tziporet Koren
In-Reply-To: <201005101301.43113.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

On Monday 10 May 2010 13:01, Jack Morgenstein wrote:
> I have an initial implementation of this which I will clean up and send you for review (actually, an
> implementation which has the ib_create_xrc_rcv_qp/ib_destroy_xrc_rcv_qp, and which does
> not need the low-level driver implementations of reg/unreg.  I started off with this
> implementation, and discontinued it when I noticed how different it was from the original
> XRC RCV implementation... but saved it just in case).
> 
I remember now why I discontinued developing this implementation.  There is no provision for core-layer-only
verbs in the core driver.

To implement ib_reg_xrc_rcv_qp/ib_unreg_xrc_rcv_qp (when I still had the create and destroy verbs),
I needed to declare these in the uverbs_cmd_mask in the low-level driver.

(from procedure mlx4_ib_add, in file hw/mlx4/main.c:)
       	ibdev->ib_dev.uverbs_cmd_mask |=
			(1ull << IB_USER_VERBS_CMD_CREATE_XRC_SRQ)	|
			(1ull << IB_USER_VERBS_CMD_OPEN_XRCD)	|
			(1ull << IB_USER_VERBS_CMD_CLOSE_XRCD)	|
			(1ull << IB_USER_VERBS_CMD_CREATE_XRC_RCV_QP)	|
			(1ull << IB_USER_VERBS_CMD_MODIFY_XRC_RCV_QP)	|
			(1ull << IB_USER_VERBS_CMD_QUERY_XRC_RCV_QP)	|
			(1ull << IB_USER_VERBS_CMD_DESTROY_XRC_RCV_QP)  |
==>			(1ull << IB_USER_VERBS_CMD_REG_XRC_RCV_QP)	|
==>			(1ull << IB_USER_VERBS_CMD_UNREG_XRC_RCV_QP);

with use of this mask in ib_uverbs_write:
	if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
		return -ENOSYS;

In order to fix this layering violation,I would have needed to do fairly ugly things in
ib_register_device -- to test if, say, the IB_USER_VERBS_CMD_CREATE_XRC_RCV_QP bit is
set -- and if yes, to also add bits at the core layer for IB_USER_VERBS_CMD_REG_XRC_RCV_QP and
IB_USER_VERBS_CMD_UNREG_XRC_RCV_QP. Ugh.

However, with your suggestion of combining the create/reg and the destroy/unreg into a single
verb, this difficulty goes away, and the low-level driver can still declare those verbs for its task of
creating and destroying the XRC RCV QPs.

I will continue with the reg/unreg implementation.

-Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 7/7] CQ overflow detection giving false positives
From: Roland Dreier @ 2010-05-10 15:42 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100506230652.25362.18848.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>

All look fine... I rolled them up into the patch adding iw_cxgb4 (might
as well have the initial driver merge have all know fixes).
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] RDMA/ucma: Copy iWARP route information.
From: Steve Wise @ 2010-05-10 15:46 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

For iWARP rdma_cm ids, the "route" information is the L2 src and
next hop addresses.

Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---

 drivers/infiniband/core/ucma.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index ac7edc2..0498383 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -583,6 +583,16 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
+static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
+			       struct rdma_route *route)
+{
+	struct rdma_dev_addr *dev_addr;
+
+	dev_addr = &route->addr.dev_addr;
+	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
+	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);
+}
+
 static ssize_t ucma_query_route(struct ucma_file *file,
 				const char __user *inbuf,
 				int in_len, int out_len)
@@ -621,6 +631,9 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 	case RDMA_TRANSPORT_IB:
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ucma_copy_iw_route(&resp, &ctx->cm_id->route);
+		break;
 	default:
 		break;
 	}

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 1/2] RDMA/core: add support for iWarp multicast acceleration over IB_QPT_RAW_ETY QP type
From: Steve Wise @ 2010-05-10 15:51 UTC (permalink / raw)
  To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
  Cc: miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100430165348.1386.77503.stgit-dAdtdUp2yJRU7keBU/FxOFDQ4js95KgL@public.gmane.org>

Hey Roland,

Do you have thoughts on this change?  It is useful along with the 
IBV_QPT_RAW_ETY type added to libibverbs for cxgb devices too.

Steve.

miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> The patch enables usage of attach_mcast()/detach_mcast driver
> specific verbs for iWARP devices.
>
> IB path is unchanged.
> iWARP path adds checking for raw ethernet QP type
>
> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
>
>
> ---
>
>  drivers/infiniband/core/verbs.c |   24 ++++++++++++++++++++----
>  1 files changed, 20 insertions(+), 4 deletions(-)
>
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index a7da9be..b5023f2 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -887,9 +887,17 @@ int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
>  {
>  	if (!qp->device->attach_mcast)
>  		return -ENOSYS;
> -	if (gid->raw[0] != 0xff || qp->qp_type != IB_QPT_UD)
> -		return -EINVAL;
>  
> +	switch (qp->device->node_type) {
> +	case RDMA_TRANSPORT_IB:
> +		if (gid->raw[0] != 0xff || qp->qp_type != IB_QPT_UD)
> +			return -EINVAL;
> +		break;
> +	case RDMA_TRANSPORT_IWARP:
> +		if (qp->qp_type != IB_QPT_RAW_ETY)
> +			return -EINVAL;
> +		break;
> +	}
>  	return qp->device->attach_mcast(qp, gid, lid);
>  }
>  EXPORT_SYMBOL(ib_attach_mcast);
> @@ -898,9 +906,17 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid)
>  {
>  	if (!qp->device->detach_mcast)
>  		return -ENOSYS;
> -	if (gid->raw[0] != 0xff || qp->qp_type != IB_QPT_UD)
> -		return -EINVAL;
>  
> +	switch (qp->device->node_type) {
> +	case RDMA_TRANSPORT_IB:
> +		if (gid->raw[0] != 0xff || qp->qp_type != IB_QPT_UD)
> +			return -EINVAL;
> +		break;
> +	case RDMA_TRANSPORT_IWARP:
> +		if (qp->qp_type != IB_QPT_RAW_ETY)
> +			return -EINVAL;
> +		break;
> +	}
>  	return qp->device->detach_mcast(qp, gid, lid);
>  }
>  EXPORT_SYMBOL(ib_detach_mcast);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH] ummunotify: Userspace support for MMU notifications V2
From: Sean Hefty @ 2010-05-10 17:12 UTC (permalink / raw)
  To: 'Jeff Squyres', Eric B Munson
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, rolandd-FYB4Gu1CFyUAvxtiuMwx3w,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-X9Un+BFzKDI,
	pavel-+ZI9xUNit7I, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA
In-Reply-To: <01C0A6AB-E4B6-4B56-AFAE-52952D152110-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>

>> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
>> and follow-up messages, libraries using RDMA would like to track
>> precisely when application code changes memory mapping via free(),
>> munmap(), etc.  Current pure-userspace solutions using malloc hooks
>> and other tricks are not robust, and the feeling among experts is that
>> the issue is unfixable without kernel help.
>
>Sorry for not replying earlier -- just to throw in my $0.02 here: the MPI
>community is *very interested* in having this stuff in upstream kernels.  It
>solves a fairly major problem for us.
>
>Open MPI (www.open-mpi.org) is ready to pretty much immediately take advantage
>of these capabilities.  The code to use ummunotify is in a Mercurial branch;
>we're only waiting for ummunotify to go upstream before committing our support
>for it to our main SVN development trunk.

Intel's MPI team has examined this proposal as well and would also like to see
this merged upstream.  It is helpful implementing MPI over RDMA devices.

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Sean Hefty @ 2010-05-10 17:36 UTC (permalink / raw)
  To: 'Steve Wise', linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100510154656.22415.47740.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>

>For iWARP rdma_cm ids, the "route" information is the L2 src and
>next hop addresses.
>
>Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>---
>
> drivers/infiniband/core/ucma.c |   13 +++++++++++++
> 1 files changed, 13 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
>index ac7edc2..0498383 100644
>--- a/drivers/infiniband/core/ucma.c
>+++ b/drivers/infiniband/core/ucma.c
>@@ -583,6 +583,16 @@ static void ucma_copy_ib_route(struct
>rdma_ucm_query_route_resp *resp,
> 	}
> }
>
>+static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
>+			       struct rdma_route *route)
>+{
>+	struct rdma_dev_addr *dev_addr;
>+
>+	dev_addr = &route->addr.dev_addr;
>+	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
>+	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);

This looks good for the current code.

For 2.6.35/36, I've submitted changes to the query operation to support AF_IB,
essentially breaking the query_route call into query_addr, query_path, and
query_gid.  I'm guessing that this functionality should be part of the query_gid
call.  I'll adjust those patches, but if you get a chance to review the query
changes, I'd appreciate it.

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [ANNOUNCE] libcxgb4 1.0.1 Release
From: Steve Wise @ 2010-05-10 17:42 UTC (permalink / raw)
  To: linux-rdma

The libcxgb4 package is a userspace driver for the new Chelsio T4 iWARP 
RNICs.   It is a plug-in module for libibverbs that allows programs to 
use Chelsio RDMA T4 hardware directly from userspace.

Release 1.0.1 is now available at:

http://www.openfabrics.org/downloads/cxgb4/libcxgb4-1.0.1.tar.gz

with md5sum:

04fc3ba283c7f73b4fba3774f2c4a648 libcxgb4-1.0.1.tar.gz

The development git tree is at:

git://www.openfabrics.org/~swise/libcxgb4

Thanks,

Steve.






--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 5/37] librdmacm: replace query_route call with separate queries
From: Steve Wise @ 2010-05-10 18:07 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <D81E70C032A3452B9FD00D7EB445F6CF-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

Hey Sean,

You define global int af_ib_support.  It never gets set to non-zero.  
Was this debug code or does it get set somewhere else that I'm missing?


Steve.


Sean Hefty wrote:
> To support other address families and multiple path records,
> replace the query_route call with specific query calls to obtain
> only the desired information.
>
> Signed-off-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>
>  src/cma.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 files changed, 60 insertions(+), 9 deletions(-)
>
> diff --git a/src/cma.c b/src/cma.c
> index 2a70d20..c57d166 100644
> --- a/src/cma.c
> +++ b/src/cma.c
> @@ -161,6 +161,7 @@ static struct cma_device *cma_dev_array;
>  static int cma_dev_cnt;
>  static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
>  static int abi_ver = RDMA_USER_CM_MAX_ABI_VERSION;
> +int af_ib_support;
>  
>  #define container_of(ptr, type, field) \
>  	((type *) ((void *)ptr - offsetof(type, field)))
> @@ -627,7 +628,7 @@ static int ucma_query_route(struct rdma_cm_id *id)
>  	struct cma_id_private *id_priv;
>  	void *msg;
>  	int ret, size, i;
> -	
> +
>  	CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY_ROUTE, size);
>  	id_priv = container_of(id, struct cma_id_private, id);
>  	cmd->id = id_priv->handle;
> @@ -1060,7 +1061,10 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
>  	if (ret != size)
>  		return (ret >= 0) ? ERR(ENODATA) : -1;
>  
> -	return ucma_query_route(id);
> +	if (af_ib_support)
> +		return ucma_query_addr(id);
> +	else
> +		return ucma_query_route(id);
>  }
>  
>  int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
> @@ -1326,6 +1330,57 @@ int rdma_ack_cm_event(struct rdma_cm_event *event)
>  	return 0;
>  }
>  
> +static void ucma_process_addr_resolved(struct cma_event *evt)
> +{
> +	if (af_ib_support) {
> +		evt->event.status = ucma_query_addr(&evt->id_priv->id);
> +		if (!evt->event.status &&
> +		    evt->id_priv->id.verbs->device->transport_type == IBV_TRANSPORT_IB)
> +			evt->event.status = ucma_query_gid(&evt->id_priv->id);
> +	} else {
> +		evt->event.status = ucma_query_route(&evt->id_priv->id);
> +	}
> +
> +	if (evt->event.status)
> +		evt->event.event = RDMA_CM_EVENT_ADDR_ERROR;
> +}
> +
> +static void ucma_process_route_resolved(struct cma_event *evt)
> +{
> +	if (evt->id_priv->id.verbs->device->transport_type != IBV_TRANSPORT_IB)
> +		return;
> +
> +	if (af_ib_support)
> +		evt->event.status = ucma_query_path(&evt->id_priv->id);
> +	else
> +		evt->event.status = ucma_query_route(&evt->id_priv->id);
> +
> +	if (evt->event.status)
> +		evt->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
> +}
> +
> +static int ucma_query_req_info(struct rdma_cm_id *id)
> +{
> +	int ret;
> +
> +	if (!af_ib_support)
> +		return ucma_query_route(id);
> +
> +	ret = ucma_query_addr(id);
> +	if (ret)
> +		return ret;
> +
> +	ret = ucma_query_gid(id);
> +	if (ret)
> +		return ret;
> +
> +	ret = ucma_query_path(id);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
>  static int ucma_process_conn_req(struct cma_event *evt,
>  				 uint32_t handle)
>  {
> @@ -1344,7 +1399,7 @@ static int ucma_process_conn_req(struct cma_event *evt,
>  	evt->event.id = &id_priv->id;
>  	id_priv->handle = handle;
>  
> -	ret = ucma_query_route(&id_priv->id);
> +	ret = ucma_query_req_info(&id_priv->id);
>  	if (ret) {
>  		rdma_destroy_id(&id_priv->id);
>  		goto err;
> @@ -1473,14 +1528,10 @@ retry:
>  
>  	switch (resp->event) {
>  	case RDMA_CM_EVENT_ADDR_RESOLVED:
> -		evt->event.status = ucma_query_route(&evt->id_priv->id);
> -		if (evt->event.status)
> -			evt->event.event = RDMA_CM_EVENT_ADDR_ERROR;
> +		ucma_process_addr_resolved(evt);
>  		break;
>  	case RDMA_CM_EVENT_ROUTE_RESOLVED:
> -		evt->event.status = ucma_query_route(&evt->id_priv->id);
> -		if (evt->event.status)
> -			evt->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
> +		ucma_process_route_resolved(evt);
>  		break;
>  	case RDMA_CM_EVENT_CONNECT_REQUEST:
>  		evt->id_priv = (void *) (uintptr_t) resp->uid;
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH 5/37] librdmacm: replace query_route call with separate queries
From: Sean Hefty @ 2010-05-10 18:33 UTC (permalink / raw)
  To: 'Steve Wise'; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BE84B7D.40703-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

>You define global int af_ib_support.  It never gets set to non-zero.
>Was this debug code or does it get set somewhere else that I'm missing?

It doesn't get set until a later patch in the series, once all the needed
infrastructure is in place.

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 1/2] libibnetdisc: Convert to a multi-smp algorithm
From: Ira Weiny @ 2010-05-10 20:53 UTC (permalink / raw)
  To: Sasha Khapyorsky
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Hal Rosenstock
In-Reply-To: <20100507205052.GW7099-o14lFNPAa+WKTadZzrrH2Q@public.gmane.org>

On Fri, 7 May 2010 13:50:53 -0700
Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:

> Hi Ira,
> 
> On 19:12 Thu 22 Apr     , Ira Weiny wrote:
> > On Tue, 13 Apr 2010 16:25:31 +0300
> > Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
> > 
> > > On 13:46 Tue 13 Apr     , Sasha Khapyorsky wrote:
> > > > 
> > > > However see some comments and questions below.
> > > 
> > > Another thought. What about API like:
> > > 
> > > 	ibnd_discover_fabric(cosnt char *ca_name, unsigned port_num,
> > > 			     struct ibnd_config *cfg);
> > > 
> > > So libibnetdisc will be responsible for opening and closing its own port
> > > and in this way will be fully reenterable?
> > 
> > I think we need this...  But this alone will not fix ibnetdisc...
> > 
> > I found out why "iblinkinfo -S" is hanging for me.  The new algorithm has the
> > library using both libibmad and libibumad calls simultaneously.
> > 
> > These libraries were not designed to be used this way.  Therefore, I don't
> > think there is any direct bug in those layers.  However, this is why I thought
> > it was better for ibnetdisc to sit on top of ibmad and not use the umad layer.
> > 
> > Anyway, I think it is a mistake to pass an ibmad_port construct directly to
> > this library now.  Here is why.
> > 
> > umad_recv in query_smp.c and ib_resolve_self_via in ibnetdisc.c conflicted.
> > The underlying call to _do_madrpc was discarding the response MAD which
> > query_smp.c:umad_recv was expecting.
> 
> This makes sense for me.
> 
> > If we change the API to specify the ca name and port then the library can open
> > 2 ports (or as many as it wants) and use them appropriately.  I think this is
> > the only solution which does not involve fixing libibmad.
> > 
> > So what about something like this:
> > 
> >    int ibnd_discover_fabric(ibnd_fabric_t **fabric,
> > 			    cosnt char *ca_name,  <== could we even default this?
> 
> I would think about ca_name and port_number. And this is of course may
> have default (NULL, 0).

Ok, ca_name and ca_port will be explicit.

> 
> > 			    struct ibnd_config *cfg);
> 
> What is wrong with current ibdn_fabric_t *ibnd_discover_fabric(...)? Why
> do we need an extra parameter?

Well we are breaking the interface again so I figure we might as well clean some things up.  Returning an int allows us to have a reason for the failure returned to the caller rather than just "NULL".  We have cleaned up most of the internals of the library to allow for this.

> 
> > 
> > I don't mind the ibnd_config_t struct but I don't think it should be visible
> > to the user.  Make it opaque and use "set" functions.  Something like.
> > 
> > ibnd_fabric_t *fabric;
> > ibnd_config_t cfg;
> > ib_portid_t * from;
> > 
> > ibnd_set_hops(&cfg, hops);         <== default -1
> > ibnd_set_port_num(&cfg, port_num); <== default 1
> > ibnd_set_max_smps(&cfg, max_smps); <== default 2
> > ibnd_set_from_node(&cfg, from);    <== default NULL
> 
> I would prefer to not complicate API with ibnd_set_this() helpers. It
> would be necessary to add new ones in the future which will lead to API
> changes.

See below.

> 
> > if (ibnd_discover_fabric(&fabric, "foo", &cfg)) {  <== anything not in cfg is
> >                                                        defaulted here
> >    fprintf(stderr, "Wow it failed\n");
> > }
> > 
> > This allows us to change ibnd_config structure any time we want without
> > affecting the API.  I don't think the "pad" you used is a good idea.
> 
> Without padding we will break ABI each time when new field is added to
> the config structure.

No it does not iff you use the ibnd_set_this() helpers and make the config private.

> 
> > Also since we are breaking the API we might as well return the fabric as a
> > parameter and have an error code.  But I could go either way on this one.
> > 
> > Ira
> > 
> > 
> > [*] query_smp.c probably should have it's own timeout here but we can discuss
> > later.
> > 
> > [#] What sucks about this is that libibmad already has the functionality to
> > open the umad port and configure it (50 line function).  Now we will be
> > duplicating this functionality.
> 
> I think you can use mad_rpc_open_port(ca_name, port_number, ...) just
> fine (and so the rest of libibmad stuff) - it will open separate fd.

Yes for the libibmad functionality we can do that.  I was speaking of the use of the umad layer.  To use that layer we have to duplicate the functionality of mad_rpc_open_port in every tool which requires umad layer access, correct?  Right now we are mixing the 2 layers (mad and umad) in saquery (get_bind_handle) as well as libibnetdisc.

I think we need to be careful we don't do this again!

Ira

> 
> Sasha
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://*vger.kernel.org/majordomo-info.html
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 5/37] librdmacm: replace query_route call with separate queries
From: Steve Wise @ 2010-05-10 21:07 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <053733AA9B7B4EBFB94A54DEF34DC73A-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

Sean Hefty wrote:
>> You define global int af_ib_support.  It never gets set to non-zero.
>> Was this debug code or does it get set somewhere else that I'm missing?
>>     
>
> It doesn't get set until a later patch in the series, once all the needed
> infrastructure is in place.
>
>   

Ok I see it.  It allows you to only enable this if the kernel CMA 
modules support it.

I reviewed the whole series, but only at a high level.  There are LOTS 
of changes.  :)

The query gid stuff looks ok to me though. 

If you update the series with the iwarp gid changes, I'll review that too.

Thanks,

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-10 21:08 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I see that some IB nics can do multiqueu in ethernet mode.

Is there any work on multiqueue support for IPoIB going on?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: IPoIB multiqueue support?
From: Roland Dreier @ 2010-05-10 21:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1005101607580.17916-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

 > Is there any work on multiqueue support for IPoIB going on?

No, although one could view connected mode as an even better place to
start, since you already get perfect classification by remote peer for
free.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 33/52] IB/qib: Add qib_sd7220.c
From: Roland Dreier @ 2010-05-10 22:50 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100507000145.3441.59808.stgit-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>

I don't think this is really a merge-blocker, but this code seems to
have faith in the talismanic power of atomic_t to avoid races in nearby
code.  The following anti-pattern appears in a few other variants in the
driver as well (just search for atomic_inc_return).

 > +		if (atomic_inc_return(&irp->relock_timer_active) == 1) {
 > +			init_timer(&irp->relock_timer);
 > +			irp->relock_timer.function = qib_run_relock;
 > +			irp->relock_timer.data = (unsigned long) dd;
 > +			irp->relock_interval = timeout;
 > +			irp->relock_timer.expires = jiffies + timeout;
 > +			add_timer(&irp->relock_timer);
 > +		} else {
 > +			irp->relock_interval = timeout;
 > +			mod_timer(&irp->relock_timer, jiffies + timeout);
 > +			atomic_dec(&irp->relock_timer_active);
 > +		}

Think about if two threads hit this code at the same time:

	if (atomic_inc_return(&active) == 1) {
		/* yep, first time through */

					if (atomic_inc_return(&active) == 1) {
						/* nope, go to else */
					} else {
						irp->relock_interval = timeout;
						/* err, who set up this timer? */
						mod_timer(&irp->relock_timer, jiffies + timeout);
						atomic_dec(&irp->relock_timer_active);
					}


		/* a bit too late, the timer might already have run */
		init_timer(&irp->relock_timer);
		irp->relock_timer.function = qib_run_relock;
		irp->relock_timer.data = (unsigned long) dd;
		irp->relock_interval = timeout;
		irp->relock_timer.expires = jiffies + timeout;
		add_timer(&irp->relock_timer);

Now, maybe this is safe because the code is actually never running
concurrently, but using this atomic_t relock_timer_active makes it
really hard to see that this code is safe.  Why not just use a proper
lock for this, since it doesn't look particularly performance critical
(and an atomic_t is really the same cost as a spinlock anyway)?

As I said, not a merge-blocker but it would be nice to get this fixed
someday before everyone loses interest in this driver.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 18/52] IB/qib: Add qib_iba7322.c
From: Roland Dreier @ 2010-05-10 22:55 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100507000026.3441.87902.stgit-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>

 > +/*
 > + * Setup QMH7342 receive and transmit parameters, necessary because
 > + * each bay, Mez connector, and IB port need different tuning, beyond
 > + * what the switch and HCA can do automatically.
 > + * It's expected to be done by cat'ing files to the modules file,
 > + * rather than setting up as a module parameter.
 > + * It's a "write-only" file, returns 0 when read back.
 > + * The unit, port, bay (if given), and values MUST be done as a single write.
 > + * The unit, port, and bay must precede the values to be effective.
 > + */
 > +static int setup_qmh_params(const char *, struct kernel_param *);
 > +static unsigned dummy_qmh_params;
 > +module_param_call(qmh_serdes_setup, setup_qmh_params, param_get_uint,
 > +		  &dummy_qmh_params, S_IWUSR | S_IRUGO);

This seems like a really bogus user interface.  You create a module
parameter you expect people not to use just to create a file under
/sys/module that people can write to?  And then it's a global module
setting so you need some way of specifying which port to apply it to?

We need a more supportable way of setting this.  Why can't you put
some more attributes in the per-port driver-specific stuff you're
already creating?  If you need to pass in multiple values atomically
then just create files like

  value1
  value2
  value3
  apply_values

and have the values not take effect until the user writes a 1 to the
"apply" file (or call it "commit" if you like).

You could even make the value files readable to help debug what settings
are getting used.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] opensm/osm_log.h: osm_log_is_active should return true for syslog
From: Yevgeny Kliteynik @ 2010-05-11  9:06 UTC (permalink / raw)
  To: Sasha Khapyorsky; +Cc: Linux RDMA


osm_log() always logs messages that came with OSM_LOG_SYS level,
so osm_log_is_active() should concur with this.
As a by-product of this fix, OSM_LOG_SYS messages can now be
printed with OSM_LOG macro, instead of using osm_log() directly.

Signed-off-by: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
---
 opensm/include/opensm/osm_log.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_log.h b/opensm/include/opensm/osm_log.h
index b2f105a..a494bc3 100644
--- a/opensm/include/opensm/osm_log.h
+++ b/opensm/include/opensm/osm_log.h
@@ -355,7 +355,7 @@ static inline void osm_log_set_level(IN osm_log_t * p_log,
 static inline boolean_t osm_log_is_active(IN const osm_log_t * p_log,
 					  IN osm_log_level_t level)
 {
-	return ((p_log->level & level) != 0);
+	return (((OSM_LOG_SYS | p_log->level) & level) != 0);
 }

 /*
-- 
1.5.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-11 13:53 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <adawrvbxxyd.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Mon, 10 May 2010, Roland Dreier wrote:

>  > Is there any work on multiqueue support for IPoIB going on?
>
> No, although one could view connected mode as an even better place to
> start, since you already get perfect classification by remote peer for
> free.

I am mostly interested in multicast traffic. Connected mode is not
relevant to that usage scenario.




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox