Linux RDMA and InfiniBand development


* Re: IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-11 13:53 UTC
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <adawrvbxxyd.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Mon, 10 May 2010, Roland Dreier wrote:

>  > Is there any work on multiqueue support for IPoIB going on?
>
> No, although one could view connected mode as an even better place to
> start, since you already get perfect classification by remote peer for
> free.

I am mostly interested in multicast traffic. Connected mode is not
relevant to that usage scenario.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3 18/52] IB/qib: Add qib_iba7322.c (serdes parameters)
From: Dave Olson @ 2010-05-11 15:49 UTC
  To: Roland Dreier
  Cc: Ralph Campbell,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <adad3x3xt6q.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Mon, 10 May 2010, Roland Dreier wrote:

|  > +/*
|  > + * Setup QMH7342 receive and transmit parameters, necessary because
|  > + * each bay, Mez connector, and IB port need different tuning, beyond
|  > + * what the switch and HCA can do automatically.
|  > + * It's expected to be done by cat'ing files to the modules file,
|  > + * rather than setting up as a module parameter.
|  > + * It's a "write-only" file, returns 0 when read back.
|  > + * The unit, port, bay (if given), and values MUST be done as a single write.
|  > + * The unit, port, and bay must precede the values to be effective.
|  > + */
|  > +static int setup_qmh_params(const char *, struct kernel_param *);
|  > +static unsigned dummy_qmh_params;
|  > +module_param_call(qmh_serdes_setup, setup_qmh_params, param_get_uint,
|  > +		  &dummy_qmh_params, S_IWUSR | S_IRUGO);
| 
| This seems like a really bogus user interface.  You create a module
| parameter you expect people not to use just to create a file under
| /sys/module that people can write to?  And then it's a global module
| setting so you need some way of specifying which port to apply it to?
| 
| We need a more supportable way of setting this.  Why can't you put
| some more attributes in the per-port driver-specific stuff you're
| already creating?  If you need to pass in multiple values atomically
| then just create files like

Yeah.  The interface is ugly.  Mea culpa. It was time constrained, and had
to be shipped to customers before we really knew all the variables, so
it was overly general.

I've implemented a newer interface (it's in the same set of patches),
but we've not yet converted over the userland.  The new interface is unit
and port specific.  It's not separate files per serdes setting, though.

It takes a string with a default (global) index, followed by optional
unit and port-specific tuples, like this:

	10 0,1=8 1,2=7 ...

The newer interface has the values readable as well as writable.
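
For illustration only, a string in the format shown above could be parsed roughly as follows. This is a sketch, not the qib driver's parser: it assumes each tuple reads as unit,port=index, and the names parse_atten and struct atten_override are made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* One "unit,port=index" override; field names are illustrative only. */
struct atten_override {
	unsigned unit;
	unsigned port;
	unsigned index;
};

/*
 * Parse "DEFAULT [UNIT,PORT=INDEX]..." into *def and out[].
 * Returns the number of overrides stored, or -1 if the input is
 * malformed or holds more than max overrides.
 */
static int parse_atten(const char *s, unsigned *def,
		       struct atten_override *out, int max)
{
	char *end;
	int n = 0;

	*def = (unsigned) strtoul(s, &end, 10);
	if (end == s)
		return -1;		/* no default index present */
	s = end;

	for (;;) {
		struct atten_override o;
		int used = 0;

		while (*s == ' ' || *s == '\t' || *s == '\n')
			s++;
		if (*s == '\0')
			return n;
		/* %n records how many characters the tuple consumed */
		if (sscanf(s, "%u,%u=%u%n", &o.unit, &o.port, &o.index, &used) != 3)
			return -1;
		if (n == max)
			return -1;
		out[n++] = o;
		s += used;
	}
}
```

With the example string above, this would yield a default index of 10 and two overrides (unit 0 port 1 to index 8, unit 1 port 2 to index 7).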

When we had stuff like this in the port-specific directories, people
dinged us on it.  We also had people who wanted to be able to set
it as a module parameter to modprobe.  The newer interface is the
cable_atten module parameter, and it just selects an index into a table of
parameters in the driver.

The new interface needs to have a table extended a bit more to replace
the setup_qme and setup_qmh functions (once again, time constraints for
our internal release cycles caused the incomplete implementation).

Sorry for exposing all the ugliness.  If you see it as a serious issue,
we can try to accelerate the cleanup effort.

Dave Olson
dave.olson-h88ZbnxC6KDQT0dZR+AlfA@public.gmane.org

* Re: IPoIB multiqueue support?
From: Roland Dreier @ 2010-05-11 15:56 UTC
  To: Christoph Lameter; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1005110851530.1500-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

 > I am mostly interested in multicast traffic. Connected mode is not
 > relevant to that usage scenario.

As I said, I don't think anyone is working on it.  However it wouldn't
be that hard to get something pretty good for multicast, since the
InfiniBand multicast join mechanism would let you have essentially a
perfect filter for steering individual multicast groups to whichever QP
(ring) you wanted to.

Of course you could also implement the equivalent thing in userspace and
probably get even better performance.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

* Re: [PATCH] ummunotify: Userspace support for MMU notifications V2
From: Sayantan Sur @ 2010-05-11 16:17 UTC
  To: Eric B Munson
  Cc: akpm, linux-kernel, linux-rdma, linux-mm, rolandd, peterz, mingo,
	pavel, Jeff Squyres (jsquyres), randy.dunlap
In-Reply-To: <1271943493-12120-1-git-send-email-ebmunson@us.ibm.com>

Hi,

I understand that this patch went through to the -mm tree.
MVAPICH/MVAPICH2 MPI stacks intend to utilize this feature as well.

Thanks.

On Thu, Apr 22, 2010 at 6:38 AM, Eric B Munson <ebmunson@us.ibm.com> wrote:
> From: Roland Dreier <rolandd@cisco.com>
>
> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
>
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
>
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
>
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
>
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
>
> Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres <at> cisco.com> for prototyping support for this in Open MPI,
> which helped find several bugs during development.
>
> Signed-off-by: Roland Dreier <rolandd@cisco.com>
> Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
>
> ---
>
> Changes from V1:
> - Update Kbuild to handle test program build properly
> - Update documentation to cover questions not addressed in previous
>   thread
> ---
>  Documentation/Makefile                  |    3 +-
>  Documentation/ummunotify/Makefile       |    7 +
>  Documentation/ummunotify/ummunotify.txt |  162 +++++++++
>  Documentation/ummunotify/umn-test.c     |  200 +++++++++++
>  drivers/char/Kconfig                    |   12 +
>  drivers/char/Makefile                   |    1 +
>  drivers/char/ummunotify.c               |  567 +++++++++++++++++++++++++++++++
>  include/linux/Kbuild                    |    1 +
>  include/linux/ummunotify.h              |  121 +++++++
>  9 files changed, 1073 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>         filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -       pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +       pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +       watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 0000000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
> diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
> new file mode 100644
> index 0000000..d6c2ccc
> --- /dev/null
> +++ b/Documentation/ummunotify/ummunotify.txt
> @@ -0,0 +1,162 @@
> +UMMUNOTIFY
> +
> +  Ummunotify relays MMU notifier events to userspace.  This is useful
> +  for libraries that need to track the memory mapping of applications;
> +  for example, MPI implementations using RDMA want to cache memory
> +  registrations for performance, but tracking all possible crazy cases
> +  such as when, say, the FORTRAN runtime frees memory is impossible
> +  without kernel help.
> +
> +Basic Model
> +
> +  A userspace process uses it by opening /dev/ummunotify, which
> +  returns a file descriptor.  Interest in address ranges is registered
> +  using ioctl() and MMU notifier events are retrieved using read(), as
> +  described in more detail below.  Userspace can register multiple
> +  address ranges to watch, and can unregister individual ranges.
> +
> +  Userspace can also mmap() a single read-only page at offset 0 on
> +  this file descriptor.  This page contains (at offset 0) a single
> +  64-bit generation counter that the kernel increments each time an
> +  MMU notifier event occurs.  Userspace can use this to very quickly
> +  check if there are any events to retrieve without needing to do a
> +  system call.
> +
> +Control
> +
> +  To start using ummunotify, a process opens /dev/ummunotify in
> +  read-only mode.  This will attach to current->mm because the current
> +  consumers of this functionality do all monitoring in the process
> +  being monitored.  It is currently not possible to use this device to
> +  monitor other processes.  Control from userspace is done via ioctl().
> +  An ioctl was chosen because the number of files required to register
> +  a new address range in sysfs would be unwieldy and new procfs entries
> +  are discouraged.  The defined ioctls are:
> +
> +    UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit
> +      word of feature flags as input, and the kernel updates the
> +      feature flags word to contain only features requested by
> +      userspace and also supported by the kernel.
> +
> +      This ioctl is only included for forward compatibility; no
> +      feature flags are currently defined, and the kernel will simply
> +      update any requested feature mask to 0.  The kernel will always
> +      default to a feature mask of 0 if this ioctl is not used, so
> +      current userspace does not need to perform this ioctl.
> +
> +    UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the
> +      kernel to start delivering events for an address range.  The
> +      range is described using struct ummunotify_register_ioctl:
> +
> +       struct ummunotify_register_ioctl {
> +               __u64   start;
> +               __u64   end;
> +               __u64   user_cookie;
> +               __u32   flags;
> +               __u32   reserved;
> +       };
> +
> +      start and end give the range of userspace virtual addresses;
> +      start is included in the range and end is not, so an example of
> +      a 4 KB range would be start=0x1000, end=0x2000.
> +
> +      user_cookie is an opaque 64-bit quantity that is returned by the
> +      kernel in events involving the range, and used by userspace to
> +      stop watching the range.  Each registered address range must
> +      have a distinct user_cookie.
> +
> +      It is fine with the kernel if userspace registers multiple
> +      overlapping or even duplicate address ranges, as long as a
> +      different cookie is used for each registration.
> +
> +      flags and reserved are included for forward compatibility;
> +      userspace should simply set them to 0 for the current interface.
> +
> +    UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit
> +      user_cookie used to register a range to tell the kernel to stop
> +      watching an address range.  Once this ioctl completes, the
> +      kernel will not deliver any further events for the range that is
> +      unregistered.
> +
> +Events
> +
> +  When an event occurs that invalidates some of a process's memory
> +  mapping in an address range being watched, ummunotify queues an
> +  event report for that address range.  If more than one event
> +  invalidates parts of the same address range before userspace
> +  retrieves the queued report, then further reports for the same range
> +  will not be queued -- when userspace does read the queue, only a
> +  single report for a given range will be returned.
> +
> +  If multiple ranges being watched are invalidated by a single event
> +  (which is especially likely if userspace registers overlapping
> +  ranges), then an event report structure will be queued for each
> +  address range registration.
> +
> +  It is possible, if a large enough number of overlapping ranges are
> +  registered and the list of invalidated events is busy enough and
> +  ignored long enough, to cause the kernel to run out of memory.
> +  Because this situation is unlikely to occur, the event queue size
> +  is not bounded in order to avoid dropping events if the queue grows
> +  beyond set bounds.
> +
> +  Userspace retrieves queued events via read() on the ummunotify file
> +  descriptor; a buffer that is at least as big as struct
> +  ummunotify_event should be used to retrieve event reports, and if a
> +  larger buffer is passed to read(), multiple reports will be returned
> +  (if available).
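
As a rough sketch of the batched read described above: the helper below issues one read() with room for several reports and returns how many arrived. The struct is a local mirror of the event layout this document gives further down (a real build would take it from <linux/ummunotify.h>), and the name umn_read_batch is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the documented event layout, for illustration only. */
struct umn_event {
	uint32_t type;
	uint32_t flags;
	uint64_t hint_start;
	uint64_t hint_end;
	uint64_t user_cookie_counter;
};

/*
 * Read up to max event reports from fd into evs[] with a single read().
 * Returns the number of complete reports, or -1 on a read error or if
 * the byte count is not a whole number of reports.
 */
static int umn_read_batch(int fd, struct umn_event *evs, size_t max)
{
	ssize_t len = read(fd, evs, max * sizeof *evs);

	if (len < 0)
		return -1;		/* errno was set by read() */
	if ((size_t) len % sizeof *evs)
		return -1;		/* partial report, not expected */
	return (int) ((size_t) len / sizeof *evs);
}
```

A caller would loop on this until it sees a report of type LAST, or until a non-blocking descriptor reports EAGAIN.
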
> +
> +  If the ummunotify file descriptor is in blocking mode, a read() call
> +  will wait for an event report to be available.  Userspace may also
> +  set the ummunotify file descriptor to non-blocking mode and use all
> +  standard ways of waiting for data to be available on the ummunotify
> +  file descriptor, including epoll/poll()/select() and SIGIO.
> +
> +  The format of event reports is:
> +
> +       struct ummunotify_event {
> +               __u32   type;
> +               __u32   flags;
> +               __u64   hint_start;
> +               __u64   hint_end;
> +               __u64   user_cookie_counter;
> +       };
> +
> +  where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or
> +  UMMUNOTIFY_EVENT_TYPE_LAST.  Events of type INVAL describe
> +  invalidation events as follows: user_cookie_counter contains the
> +  cookie passed in when userspace registered the range that the event
> +  is for.  hint_start and hint_end contain the start address and end
> +  address that were invalidated.
> +
> +  The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT
> +  defined at the moment.  If HINT is set, then the invalidation event
> +  invalidated less than the full address range and the kernel returns
> +  the exact range invalidated; if HINT is not set then hint_start and
> +  hint_end are set to the original range registered by userspace.
> +  (HINT will not be set if, for example, multiple events invalidated
> +  disjoint parts of the range and so a single start/end pair cannot
> +  represent the parts of the range that were invalidated)
> +
> +  If the event type is LAST, then the read operation has emptied the
> +  list of invalidated regions, and the flags, hint_start and hint_end
> +  fields are not used.  user_cookie_counter holds the value of the
> +  kernel's generation counter (see below for more details) when the
> +  empty list occurred.
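
To make the coalescing and HINT rules above concrete, here is a small userspace model. It is only a sketch of the documented behaviour, not the driver code; the flag values and the names reg_model, reg_invalidate and reg_report are invented for the example.

```c
#include <stdint.h>

/* Sketch-only flag values; the real ones live in the driver and header. */
enum { REG_INVALID = 1u << 0, REG_HINT = 1u << 1 };

struct reg_model {
	uint64_t start, end;		/* registered range, end exclusive */
	uint64_t hint_start, hint_end;	/* valid only while REG_HINT is set */
	unsigned flags;
};

/* Record one invalidation of [start, end) that hit registration r. */
static void reg_invalidate(struct reg_model *r, uint64_t start, uint64_t end)
{
	if (r->flags & REG_INVALID) {
		/* A report is already queued: keep one report, drop the hint. */
		r->flags &= ~REG_HINT;
		return;
	}
	r->flags |= REG_INVALID | REG_HINT;
	r->hint_start = start;
	r->hint_end = end;
}

/* Range a report would carry: the hint if valid, else the whole range. */
static void reg_report(const struct reg_model *r, uint64_t *s, uint64_t *e)
{
	if (r->flags & REG_HINT) {
		*s = r->hint_start;
		*e = r->hint_end;
	} else {
		*s = r->start;
		*e = r->end;
	}
}
```

Two invalidations of disjoint pages in the same registration therefore produce a single report covering the full registered range, with HINT clear.
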
> +
> +Generation Count
> +
> +  Userspace may mmap() a page on a ummunotify file descriptor via
> +
> +       mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0);
> +
> +  to get a read-only mapping of the kernel's 64-bit generation
> +  counter.  The kernel will increment this generation counter each
> +  time an event report is queued.
> +
> +  Userspace can use the generation counter as a quick check to avoid
> +  system calls; if the value read from the mapped kernel counter is
> +  still equal to the value returned in user_cookie_counter for the
> +  most recent LAST event retrieved, then no further events have been
> +  queued and there is no need to try a read() on the ummunotify file
> +  descriptor.
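
The fast path described above might look something like this in a library. It is a sketch under assumptions: struct umn_state and both helpers are hypothetical names, and counter is assumed to point at the page obtained with the mmap() call shown earlier.

```c
#include <stdint.h>

struct umn_state {
	const volatile uint64_t *counter;	/* the mmap()ed generation counter */
	uint64_t last_seen;	/* user_cookie_counter of the latest LAST event */
};

/* Nonzero if the counter moved since the last LAST event, so a read() is due. */
static int umn_need_read(const struct umn_state *st)
{
	return *st->counter != st->last_seen;
}

/* Call after read() returns a LAST event carrying counter value c. */
static void umn_note_last(struct umn_state *st, uint64_t c)
{
	st->last_seen = c;
}
```

A registration cache would check umn_need_read() before each lookup and only fall back to read() when it returns nonzero.
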
> diff --git a/Documentation/ummunotify/umn-test.c b/Documentation/ummunotify/umn-test.c
> new file mode 100644
> index 0000000..143db2c
> --- /dev/null
> +++ b/Documentation/ummunotify/umn-test.c
> @@ -0,0 +1,200 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <stdint.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <linux/ummunotify.h>
> +
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <sys/ioctl.h>
> +
> +#define UMN_TEST_COOKIE 123
> +
> +static int             umn_fd;
> +static volatile __u64  *umn_counter;
> +
> +static int umn_init(void)
> +{
> +       __u32 flags = 0;        /* input to the ioctl: request no features */
> +
> +       umn_fd = open("/dev/ummunotify", O_RDONLY);
> +       if (umn_fd < 0) {
> +               perror("open");
> +               return 1;
> +       }
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) {
> +               perror("exchange ioctl");
> +               return 1;
> +       }
> +
> +       printf("kernel feature flags: 0x%08x\n", flags);
> +
> +       umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
> +                          MAP_SHARED, umn_fd, 0);
> +       if (umn_counter == MAP_FAILED) {
> +               perror("mmap");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_register(void *buf, size_t size, __u64 cookie)
> +{
> +       struct ummunotify_register_ioctl r = {
> +               .start          = (unsigned long) buf,
> +               .end            = (unsigned long) buf + size,
> +               .user_cookie    = cookie,
> +       };
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) {
> +               perror("register ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_unregister(__u64 cookie)
> +{
> +       if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) {
> +               perror("unregister ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int                     page_size;
> +       __u64                   old_counter;
> +       void                   *t;
> +       int                     got_it;
> +
> +       if (umn_init())
> +               return 1;
> +
> +       printf("\n");
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 0) {
> +               fprintf(stderr, "counter = %lld (expected 0)\n", old_counter);
> +               return 1;
> +       }
> +
> +       page_size = sysconf(_SC_PAGESIZE);
> +       t = mmap(NULL, 3 * page_size, PROT_READ,
> +                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> +
> +       if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE))
> +               return 1;
> +
> +       munmap(t + page_size, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
> +               return 1;
> +       }
> +
> +       got_it = 0;
> +       while (1) {
> +               struct ummunotify_event ev;
> +               int                     len;
> +
> +               len = read(umn_fd, &ev, sizeof ev);
> +               if (len < 0) {
> +                       perror("read event");
> +                       return 1;
> +               }
> +               if (len != sizeof ev) {
> +                       fprintf(stderr, "Read gave %d bytes (!= event size %zd)\n",
> +                               len, sizeof ev);
> +                       return 1;
> +               }
> +
> +               switch (ev.type) {
> +               case UMMUNOTIFY_EVENT_TYPE_INVAL:
> +                       if (got_it) {
> +                               fprintf(stderr, "Extra invalidate event\n");
> +                               return 1;
> +                       }
> +                       if (ev.user_cookie_counter != UMN_TEST_COOKIE) {
> +                               fprintf(stderr, "Invalidate event for cookie %lld (expected %d)\n",
> +                                       ev.user_cookie_counter,
> +                                       UMN_TEST_COOKIE);
> +                               return 1;
> +                       }
> +
> +                       printf("Invalidate event:\tcookie %lld\n",
> +                              ev.user_cookie_counter);
> +
> +                       if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) {
> +                               fprintf(stderr, "Hint flag not set\n");
> +                               return 1;
> +                       }
> +
> +                       if (ev.hint_start != (uintptr_t) t + page_size ||
> +                           ev.hint_end != (uintptr_t) t + page_size * 2) {
> +                               fprintf(stderr, "Got hint %llx..%llx, expected %p..%p\n",
> +                                       ev.hint_start, ev.hint_end,
> +                                       t + page_size, t + page_size * 2);
> +                               return 1;
> +                       }
> +
> +                       printf("\t\t\thint %llx...%llx\n",
> +                              ev.hint_start, ev.hint_end);
> +
> +                       got_it = 1;
> +                       break;
> +
> +               case UMMUNOTIFY_EVENT_TYPE_LAST:
> +                       if (!got_it) {
> +                               fprintf(stderr, "Last event without invalidate event\n");
> +                               return 1;
> +                       }
> +
> +                       printf("Empty event:\t\tcounter %lld\n",
> +                              ev.user_cookie_counter);
> +                       goto done;
> +
> +               default:
> +                       fprintf(stderr, "unknown event type %d\n",
> +                               ev.type);
> +                       return 1;
> +               }
> +       }
> +
> +done:
> +       umn_unregister(UMN_TEST_COOKIE);
> +       munmap(t, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index 3141dd3..cf26019 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -1111,6 +1111,18 @@ config DEVPORT
>         depends on ISA || PCI
>         default y
>
> +config UMMUNOTIFY
> +       tristate "Userspace MMU notifications"
> +       select MMU_NOTIFIER
> +       help
> +         The ummunotify (userspace MMU notification) driver creates a
> +         character device that can be used by userspace libraries to
> +         get notifications when an application's memory mapping
> +         changed.  This is used, for example, by RDMA libraries to
> +         improve the reliability of memory registration caching, since
> +         the kernel's MMU notifications can be used to know precisely
> +         when to shoot down a cached registration.
> +
>  source "drivers/s390/char/Kconfig"
>
>  endmenu
> diff --git a/drivers/char/Makefile b/drivers/char/Makefile
> index f957edf..521e5de 100644
> --- a/drivers/char/Makefile
> +++ b/drivers/char/Makefile
> @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)                += nsc_gpio.o
>  obj-$(CONFIG_CS5535_GPIO)      += cs5535_gpio.o
>  obj-$(CONFIG_GPIO_TB0219)      += tb0219.o
>  obj-$(CONFIG_TELCLOCK)         += tlclk.o
> +obj-$(CONFIG_UMMUNOTIFY)       += ummunotify.o
>
>  obj-$(CONFIG_MWAVE)            += mwave/
>  obj-$(CONFIG_AGP)              += agp/
> diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
> new file mode 100644
> index 0000000..c14df3f
> --- /dev/null
> +++ b/drivers/char/ummunotify.c
> @@ -0,0 +1,567 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/rbtree.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <linux/ummunotify.h>
> +
> +#include <asm/cacheflush.h>
> +
> +MODULE_AUTHOR("Roland Dreier");
> +MODULE_DESCRIPTION("Userspace MMU notifiers");
> +MODULE_LICENSE("GPL v2");
> +
> +/*
> + * Information about an address range userspace has asked us to watch.
> + *
> + * user_cookie: Opaque cookie given to us when userspace registers the
> + *   address range.
> + *
> + * start, end: Address range; start is inclusive, end is exclusive.
> + *
> + * hint_start, hint_end: If a single MMU notification event
> + *   invalidates the address range, we hold the actual range of
> + *   addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
> + *   If another event hits this range before userspace reads the
> + *   event, we give up and don't try to keep track of which subsets
> + *   got invalidated.
> + *
> + * flags: Holds the INVALID flag for ranges that are on the invalid
> + *   list and/or the HINT flag for ranges where the hint range holds
> + *   good information.
> + *
> + * node: Used to put the range into an rbtree we use to be able to
> + *   scan address ranges in order.
> + *
> + * list: Used to put the range on the invalid list when an MMU
> + *   notification event hits the range.
> + */
> +enum {
> +       UMMUNOTIFY_FLAG_INVALID = 1,
> +       UMMUNOTIFY_FLAG_HINT    = 2,
> +};
> +
> +struct ummunotify_reg {
> +       u64                     user_cookie;
> +       unsigned long           start;
> +       unsigned long           end;
> +       unsigned long           hint_start;
> +       unsigned long           hint_end;
> +       unsigned long           flags;
> +       struct rb_node          node;
> +       struct list_head        list;
> +};
> +
> +/*
> + * Context attached to each file that userspace opens.
> + *
> + * mmu_notifier: MMU notifier registered for this context.
> + *
> + * mm: mm_struct for process that created the context; we use this to
> + *   hold a reference to the mm to make sure it doesn't go away until
> + *   we're done with it.
> + *
> + * reg_tree: RB tree of address ranges being watched, sorted by start
> + *   address.
> + *
> + * invalid_list: List of address ranges that have been invalidated by
> + *   MMU notification events; as userspace reads events, the address
> + *   range corresponding to the event is removed from the list.
> + *
> + * counter: Page that can be mapped read-only by userspace, which
> + *   holds a generation count that is incremented each time an event
> + *   occurs.
> + *
> + * lock: Spinlock used to protect all context.
> + *
> + * read_wait: Wait queue used to wait for data to become available in
> + *   blocking read()s.
> + *
> + * async_queue: Used to implement fasync().
> + *
> + * need_empty: Set when userspace reads an invalidation event, so that
> + *   read() knows it must generate an "empty" event when userspace
> + *   drains the invalid_list.
> + *
> + * used: Set after userspace does anything with the file, so that the
> + *   "exchange flags" ioctl() knows it's too late to change anything.
> + */
> +struct ummunotify_file {
> +       struct mmu_notifier     mmu_notifier;
> +       struct mm_struct       *mm;
> +       struct rb_root          reg_tree;
> +       struct list_head        invalid_list;
> +       u64                    *counter;
> +       spinlock_t              lock;
> +       wait_queue_head_t       read_wait;
> +       struct fasync_struct   *async_queue;
> +       int                     need_empty;
> +       int                     used;
> +};
> +
> +static void ummunotify_handle_notify(struct mmu_notifier *mn,
> +                                    unsigned long start, unsigned long end)
> +{
> +       struct ummunotify_file *priv =
> +               container_of(mn, struct ummunotify_file, mmu_notifier);
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       unsigned long flags;
> +       int hit = 0;
> +
> +       spin_lock_irqsave(&priv->lock, flags);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               /*
> +                * Ranges overlap if they're not disjoint; and they're
> +                * disjoint if the end of one is before the start of
> +                * the other one.  So if both disjointness comparisons
> +                * fail then the ranges overlap.
> +                *
> +                * Since we keep the tree of regions we're watching
> +                * sorted by start address, we can end this loop as
> +                * soon as we hit a region that starts past the end of
> +                * the range for the event we're handling.
> +                */
> +               if (reg->start >= end)
> +                       break;
> +
> +               /*
> +                * Just go to the next region if the start of the
> +                * range is after the end of the region -- there
> +                * might still be more overlapping ranges that have a
> +                * greater start.
> +                */
> +               if (start >= reg->end)
> +                       continue;
> +
> +               hit = 1;
> +
> +               if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags)) {
> +                       /* Already on invalid list */
> +                       clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +               } else {
> +                       list_add_tail(&reg->list, &priv->invalid_list);
> +                       set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +                       reg->hint_start = start;
> +                       reg->hint_end   = end;
> +               }
> +       }
> +
> +       if (hit) {
> +               ++(*priv->counter);
> +               flush_dcache_page(virt_to_page(priv->counter));
> +               wake_up_interruptible(&priv->read_wait);
> +               kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
> +       }
> +
> +       spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static void ummunotify_invalidate_page(struct mmu_notifier *mn,
> +                                      struct mm_struct *mm,
> +                                      unsigned long addr)
> +{
> +       ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
> +}
> +
> +static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
> +                                             struct mm_struct *mm,
> +                                             unsigned long start,
> +                                             unsigned long end)
> +{
> +       ummunotify_handle_notify(mn, start, end);
> +}
> +
> +static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
> +       .invalidate_page        = ummunotify_invalidate_page,
> +       .invalidate_range_start = ummunotify_invalidate_range_start,
> +};
> +
> +static int ummunotify_open(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv;
> +       int ret;
> +
> +       if (filp->f_mode & FMODE_WRITE)
> +               return -EINVAL;
> +
> +       priv = kmalloc(sizeof *priv, GFP_KERNEL);
> +       if (!priv)
> +               return -ENOMEM;
> +
> +       priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!priv->counter) {
> +               ret = -ENOMEM;
> +               goto err;
> +       }
> +
> +       priv->reg_tree = RB_ROOT;
> +       INIT_LIST_HEAD(&priv->invalid_list);
> +       spin_lock_init(&priv->lock);
> +       init_waitqueue_head(&priv->read_wait);
> +       priv->async_queue = NULL;
> +       priv->need_empty  = 0;
> +       priv->used        = 0;
> +
> +       priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
> +       /*
> +        * Register notifier last, since notifications can occur as
> +        * soon as we register....
> +        */
> +       ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
> +       if (ret)
> +               goto err_page;
> +
> +       priv->mm = current->mm;
> +       atomic_inc(&priv->mm->mm_count);
> +
> +       filp->private_data = priv;
> +
> +       return 0;
> +
> +err_page:
> +       free_page((unsigned long) priv->counter);
> +
> +err:
> +       kfree(priv);
> +       return ret;
> +}
> +
> +static int ummunotify_close(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +
> +       mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
> +       mmdrop(priv->mm);
> +       free_page((unsigned long) priv->counter);
> +
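> +       /*
> +        * Unlink each node before freeing it; calling rb_next() on a
> +        * node that was just freed would be a use-after-free.
> +        */
> +       while ((n = rb_first(&priv->reg_tree)) != NULL) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +               rb_erase(n, &priv->reg_tree);
> +               kfree(reg);
> +       }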
> +
> +       kfree(priv);
> +
> +       return 0;
> +}
> +
> +static bool ummunotify_readable(struct ummunotify_file *priv)
> +{
> +       return priv->need_empty || !list_empty(&priv->invalid_list);
> +}
> +
> +static ssize_t ummunotify_read(struct file *filp, char __user *buf,
> +                              size_t count, loff_t *pos)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct ummunotify_reg *reg;
> +       ssize_t ret;
> +       struct ummunotify_event *events;
> +       int max;
> +       int n;
> +
> +       priv->used = 1;
> +
> +       events = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!events) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       while (!ummunotify_readable(priv)) {
> +               spin_unlock_irq(&priv->lock);
> +
> +               if (filp->f_flags & O_NONBLOCK) {
> +                       ret = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               if (wait_event_interruptible(priv->read_wait,
> +                                            ummunotify_readable(priv))) {
> +                       ret = -ERESTARTSYS;
> +                       goto out;
> +               }
> +
> +               spin_lock_irq(&priv->lock);
> +       }
> +
> +       max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
> +
> +       for (n = 0; n < max; ++n) {
> +               if (list_empty(&priv->invalid_list)) {
> +                       events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST;
> +                       events[n].user_cookie_counter = *priv->counter;
> +                       ++n;
> +                       priv->need_empty = 0;
> +                       break;
> +               }
> +
> +               reg = list_first_entry(&priv->invalid_list,
> +                                      struct ummunotify_reg, list);
> +
> +               events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
> +               if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
> +                       events[n].flags      = UMMUNOTIFY_EVENT_FLAG_HINT;
> +                       events[n].hint_start = max(reg->start, reg->hint_start);
> +                       events[n].hint_end   = min(reg->end, reg->hint_end);
> +               } else {
> +                       events[n].hint_start = reg->start;
> +                       events[n].hint_end   = reg->end;
> +               }
> +               events[n].user_cookie_counter = reg->user_cookie;
> +
> +               list_del(&reg->list);
> +               reg->flags = 0;
> +               priv->need_empty = 1;
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       if (copy_to_user(buf, events, n * sizeof *events))
> +               ret = -EFAULT;
> +       else
> +               ret = n * sizeof *events;
> +
> +out:
> +       free_page((unsigned long) events);
> +       return ret;
> +}
> +
> +static unsigned int ummunotify_poll(struct file *filp,
> +                                   struct poll_table_struct *wait)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       poll_wait(filp, &priv->read_wait, wait);
> +
> +       return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
> +}
> +
> +static long ummunotify_exchange_features(struct ummunotify_file *priv,
> +                                        __u32 __user *arg)
> +{
> +       u32 feature_mask;
> +
> +       if (priv->used)
> +               return -EINVAL;
> +
> +       priv->used = 1;
> +
> +       if (copy_from_user(&feature_mask, arg, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       /* No extensions defined at present. */
> +       feature_mask = 0;
> +
> +       if (copy_to_user(arg, &feature_mask, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +static long ummunotify_register_region(struct ummunotify_file *priv,
> +                                      void __user *arg)
> +{
> +       struct ummunotify_register_ioctl parm;
> +       struct ummunotify_reg *reg, *treg;
> +       struct rb_node **n = &priv->reg_tree.rb_node;
> +       struct rb_node *pn;
> +       int ret = 0;
> +
> +       if (copy_from_user(&parm, arg, sizeof parm))
> +               return -EFAULT;
> +
> +       priv->used = 1;
> +
> +       reg = kmalloc(sizeof *reg, GFP_KERNEL);
> +       if (!reg)
> +               return -ENOMEM;
> +
> +       reg->user_cookie        = parm.user_cookie;
> +       reg->start              = parm.start;
> +       reg->end                = parm.end;
> +       reg->flags              = 0;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (treg->user_cookie == parm.user_cookie) {
> +                       kfree(reg);
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
> +       pn = NULL;
> +       while (*n) {
> +               pn = *n;
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (reg->start <= treg->start)
> +                       n = &pn->rb_left;
> +               else
> +                       n = &pn->rb_right;
> +       }
> +
> +       rb_link_node(&reg->node, pn, n);
> +       rb_insert_color(&reg->node, &priv->reg_tree);
> +
> +out:
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_unregister_region(struct ummunotify_file *priv,
> +                                        __u64 __user *arg)
> +{
> +       u64 user_cookie;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       int ret = -EINVAL;
> +
> +       if (copy_from_user(&user_cookie, arg, sizeof(user_cookie)))
> +               return -EFAULT;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               if (reg->user_cookie == user_cookie) {
> +                       rb_erase(n, &priv->reg_tree);
> +                       if (test_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
> +                               list_del(&reg->list);
> +                       kfree(reg);
> +                       ret = 0;
> +                       break;
> +               }
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
> +                            unsigned long arg)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       void __user *argp = (void __user *) arg;
> +
> +       switch (cmd) {
> +       case UMMUNOTIFY_EXCHANGE_FEATURES:
> +               return ummunotify_exchange_features(priv, argp);
> +       case UMMUNOTIFY_REGISTER_REGION:
> +               return ummunotify_register_region(priv, argp);
> +       case UMMUNOTIFY_UNREGISTER_REGION:
> +               return ummunotify_unregister_region(priv, argp);
> +       default:
> +               return -ENOIOCTLCMD;
> +       }
> +}
> +
> +static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct ummunotify_file *priv = vma->vm_private_data;
> +
> +       if (vmf->pgoff != 0)
> +               return VM_FAULT_SIGBUS;
> +
> +       vmf->page = virt_to_page(priv->counter);
> +       get_page(vmf->page);
> +
> +       return 0;
> +
> +}
> +
> +static struct vm_operations_struct ummunotify_vm_ops = {
> +       .fault          = ummunotify_fault,
> +};
> +
> +static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0)
> +               return -EINVAL;
> +
> +       vma->vm_ops             = &ummunotify_vm_ops;
> +       vma->vm_private_data    = priv;
> +
> +       return 0;
> +}
> +
> +static int ummunotify_fasync(int fd, struct file *filp, int on)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       return fasync_helper(fd, filp, on, &priv->async_queue);
> +}
> +
> +static const struct file_operations ummunotify_fops = {
> +       .owner          = THIS_MODULE,
> +       .open           = ummunotify_open,
> +       .release        = ummunotify_close,
> +       .read           = ummunotify_read,
> +       .poll           = ummunotify_poll,
> +       .unlocked_ioctl = ummunotify_ioctl,
> +#ifdef CONFIG_COMPAT
> +       .compat_ioctl   = ummunotify_ioctl,
> +#endif
> +       .mmap           = ummunotify_mmap,
> +       .fasync         = ummunotify_fasync,
> +};
> +
> +static struct miscdevice ummunotify_misc = {
> +       .minor  = MISC_DYNAMIC_MINOR,
> +       .name   = "ummunotify",
> +       .fops   = &ummunotify_fops,
> +};
> +
> +static int __init ummunotify_init(void)
> +{
> +       return misc_register(&ummunotify_misc);
> +}
> +
> +static void __exit ummunotify_cleanup(void)
> +{
> +       misc_deregister(&ummunotify_misc);
> +}
> +
> +module_init(ummunotify_init);
> +module_exit(ummunotify_cleanup);
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index e2ea0b2..e086b39 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -163,6 +163,7 @@ header-y += tipc_config.h
>  header-y += toshiba.h
>  header-y += udf_fs_i.h
>  header-y += ultrasound.h
> +header-y += ummunotify.h
>  header-y += un.h
>  header-y += utime.h
>  header-y += veth.h
> diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
> new file mode 100644
> index 0000000..21b0d03
> --- /dev/null
> +++ b/include/linux/ummunotify.h
> @@ -0,0 +1,121 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _LINUX_UMMUNOTIFY_H
> +#define _LINUX_UMMUNOTIFY_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/*
> + * Ummunotify relays MMU notifier events to userspace.  A userspace
> + * process uses it by opening /dev/ummunotify, which returns a file
> + * descriptor.  Interest in address ranges is registered using ioctl()
> + * and MMU notifier events are retrieved using read(), as described in
> + * more detail below.
> + *
> + * Userspace can also mmap() a single read-only page at offset 0 on
> + * this file descriptor.  This page contains (at offset 0) a single
> + * 64-bit generation counter that the kernel increments each time an
> + * MMU notifier event occurs.  Userspace can use this to very quickly
> + * check if there are any events to retrieve without needing to do a
> + * system call.
> + */
> +
> +/*
> + * struct ummunotify_register_ioctl describes an address range from
> + * start to end (including start but not including end) to be
> + * monitored.  user_cookie is an opaque handle that userspace assigns,
> + * and which is used to unregister.  flags and reserved are currently
> + * unused and should be set to 0 for forward compatibility.
> + */
> +struct ummunotify_register_ioctl {
> +       __u64   start;
> +       __u64   end;
> +       __u64   user_cookie;
> +       __u32   flags;
> +       __u32   reserved;
> +};
> +
> +#define UMMUNOTIFY_MAGIC               'U'
> +
> +/*
> + * Forward compatibility: Userspace passes in a 32-bit feature mask
> + * with feature flags set indicating which extensions it wishes to
> + * use.  The kernel will return a feature mask with the bits of
> + * userspace's mask that the kernel implements; from that point on
> + * both userspace and the kernel should behave as described by the
> + * kernel's feature mask.
> + *
> + * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
> + * then the kernel will use a feature mask of 0.
> + *
> + * No feature flags are currently defined, so the kernel will always
> + * return a feature mask of 0 at present.
> + */
> +#define UMMUNOTIFY_EXCHANGE_FEATURES   _IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
> +
> +/*
> + * Register interest in an address range; userspace should pass in a
> + * struct ummunotify_register_ioctl describing the region.
> + */
> +#define UMMUNOTIFY_REGISTER_REGION     _IOW(UMMUNOTIFY_MAGIC, 2, \
> +                                            struct ummunotify_register_ioctl)
> +/*
> + * Unregister interest in an address range; userspace should pass in
> + * the user_cookie value that was used to register the address range.
> + * No events for the address range will be reported once it is
> + * unregistered.
> + */
> +#define UMMUNOTIFY_UNREGISTER_REGION   _IOW(UMMUNOTIFY_MAGIC, 3, __u64)
> +
> +/*
> + * Invalidation events are returned whenever the kernel changes the
> + * mapping for a monitored address.  These events are retrieved by
> + * read() on the ummunotify file descriptor, which will fill the
> + * read() buffer with struct ummunotify_event.
> + *
> + * If type field is INVAL, then user_cookie_counter holds the
> + * user_cookie for the region being reported; if the HINT flag is set
> + * then hint_start/hint_end hold the start and end of the mapping that
> + * was invalidated.  (If HINT is not set, then multiple events
> + * invalidated parts of the registered range and hint_start/hint_end
> + * are set to the start/end of the whole registered range.)
> + *
> + * If type is LAST, then the read operation has emptied the list of
> + * invalidated regions, and user_cookie_counter holds the value of the
> + * kernel's generation counter when the empty list occurred.  The
> + * other fields are not filled in for this event.
> + */
> +enum {
> +       UMMUNOTIFY_EVENT_TYPE_INVAL     = 0,
> +       UMMUNOTIFY_EVENT_TYPE_LAST      = 1,
> +};
> +
> +enum {
> +       UMMUNOTIFY_EVENT_FLAG_HINT      = 1 << 0,
> +};
> +
> +struct ummunotify_event {
> +       __u32   type;
> +       __u32   flags;
> +       __u64   hint_start;
> +       __u64   hint_end;
> +       __u64   user_cookie_counter;
> +};
> +
> +#endif /* _LINUX_UMMUNOTIFY_H */
> --
> 1.6.3.3
>
>
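
The overlap test in ummunotify_handle_notify() and the hint clamping in
ummunotify_read() both treat ranges as half-open [start, end) intervals.
A minimal userspace sketch of that arithmetic follows. The names
struct range, ranges_overlap() and clamp_hint() are made up for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Two half-open ranges are disjoint when one ends at or before the
 * point where the other starts; they overlap when neither holds.
 */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * Clamp an invalidated range to the registered region, mirroring the
 * max()/min() pair used when a hint is reported.  Only meaningful
 * when the two ranges overlap.
 */
static struct range clamp_hint(struct range reg, struct range inval)
{
	struct range hint;

	assert(ranges_overlap(reg, inval));
	hint.start = reg.start > inval.start ? reg.start : inval.start;
	hint.end = reg.end < inval.end ? reg.end : inval.end;
	return hint;
}
```

Because the end address is exclusive, a region ending at 0x2000 and an
invalidation starting at 0x2000 do not overlap, which matches the
"start >= reg->end" check in the patch.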



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* Re: [PATCH] opensm/osm_log.h: osm_log_is_active should return true for syslog
From: Sasha Khapyorsky @ 2010-05-11 16:21 UTC (permalink / raw)
  To: Yevgeny Kliteynik; +Cc: Linux RDMA
In-Reply-To: <4BE91DF9.4020902-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

On 12:06 Tue 11 May     , Yevgeny Kliteynik wrote:
> 
> osm_log() always logs messages that came with OSM_LOG_SYS level,
> so osm_log_is_active() should concur with this.
> As a by-product of this fix, OSM_LOG_SYS messages can now be
> printed with OSM_LOG macro, instead of using osm_log() directly.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
> ---
>  opensm/include/opensm/osm_log.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_log.h b/opensm/include/opensm/osm_log.h
> index b2f105a..a494bc3 100644
> --- a/opensm/include/opensm/osm_log.h
> +++ b/opensm/include/opensm/osm_log.h
> @@ -355,7 +355,7 @@ static inline void osm_log_set_level(IN osm_log_t * p_log,
>  static inline boolean_t osm_log_is_active(IN const osm_log_t * p_log,
>  					  IN osm_log_level_t level)
>  {
> -	return ((p_log->level & level) != 0);
> +	return (((OSM_LOG_SYS | p_log->level) & level) != 0);
>  }

What about setting the OSM_LOG_SYS bit in p_log->level at initialization
and removing all subsequent explicit checks? Like this
(against master):

diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
index 54c2f36..bd4a200 100644
--- a/opensm/opensm/osm_log.c
+++ b/opensm/opensm/osm_log.c
@@ -119,7 +119,7 @@ void osm_log(IN osm_log_t * p_log, IN osm_log_level_t verbosity,
 #endif				/* __WIN__ */
 
 	/* If this is a call to syslog - always print it */
-	if (!(verbosity & (OSM_LOG_SYS | p_log->level)))
+	if (!(verbosity & p_log->level))
 		return;
 
 	va_start(args, p_str);
@@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * p_log, IN boolean_t flush,
 				IN unsigned long max_size,
 				IN boolean_t accum_log_file)
 {
-	p_log->level = log_flags;
+	p_log->level = log_flags | OSM_LOG_SYS;
 	p_log->flush = flush;
 	p_log->count = 0;
 	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
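
As a rough check that the two approaches agree, here is a small
standalone sketch. The bit values and the names struct log,
is_active_checked(), is_active_plain() and log_init() are illustrative
assumptions, not the definitions from osm_log.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real ones live in osm_log.h. */
#define LOG_ERROR 0x01u
#define LOG_DEBUG 0x08u
#define LOG_SYS   0x80u

struct log {
	uint8_t level;
};

/* First approach: OR in LOG_SYS at every check. */
static bool is_active_checked(const struct log *l, uint8_t level)
{
	return ((LOG_SYS | l->level) & level) != 0;
}

/* Second approach: force LOG_SYS on once at init ... */
static void log_init(struct log *l, uint8_t flags)
{
	l->level = flags | LOG_SYS;
}

/* ... and then test the mask with no special case. */
static bool is_active_plain(const struct log *l, uint8_t level)
{
	return (l->level & level) != 0;
}

/* Both approaches give the same answer for every possible level mask. */
static void check_equivalent(uint8_t flags)
{
	struct log raw = { flags };
	struct log inited;
	unsigned level;

	log_init(&inited, flags);
	for (level = 0; level <= 0xff; level++)
		assert(is_active_checked(&raw, (uint8_t)level) ==
		       is_active_plain(&inited, (uint8_t)level));
}
```

The equivalence only holds while nothing overwrites the level after
initialization. The hunk context above shows an osm_log_set_level()
helper; whether it preserves the bit is not visible in this thread.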


Sasha

^ permalink raw reply related

* Re: [PATCH v3 1/2] libibnetdisc: Convert to a multi-smp algorithm
From: Sasha Khapyorsky @ 2010-05-11 16:42 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Hal Rosenstock
In-Reply-To: <20100510135353.257d76c0.weiny2-i2BcT+NCU+M@public.gmane.org>

On 13:53 Mon 10 May     , Ira Weiny wrote:
> > > 
> > >    int ibnd_discover_fabric(ibnd_fabric_t **fabric,
> > > 			    const char *ca_name,  <== could we even default this?
> > 
> > I would think about ca_name and port_number. And these of course may
> > have defaults (NULL, 0).
> 
> Ok, ca_name and ca_port will be explicit.
> 
> > 
> > > 			    struct ibnd_config *cfg);
> > 
> > What is wrong with current ibnd_fabric_t *ibnd_discover_fabric(...)? Why
> > do we need an extra parameter?
> 
> Well we are breaking the interface again so I figure we might as well clean some things up.  Returning an int allows us to have a reason for the failure returned to the caller rather than just "NULL".  We have cleaned up most of the internals of the library to allow for this.

But we want to keep API simple, no?

> 
> > 
> > > 
> > > I don't mind the ibnd_config_t struct but I don't think it should be visible
> > > to the user.  Make it opaque and use "set" functions.  Something like.
> > > 
> > > ibnd_fabric_t *fabric;
> > > ibnd_config_t cfg;
> > > ib_portid_t * from;
> > > 
> > > ibnd_set_hops(&cfg, hops);         <== default -1
> > > ibnd_set_port_num(&cfg, port_num); <== default 1
> > > ibnd_set_max_smps(&cfg, max_smps); <== default 2
> > > ibnd_set_from_node(&cfg, from);    <== default NULL
> > 
> > I would prefer to not complicate API with ibnd_set_this() helpers. It
> > would be necessary to add new ones in the future which will lead to API
> > changes.
> 
> See below.
> 
> > 
> > > if (ibnd_discover_fabric(&fabric, "foo", &cfg)) {  <== anything not in cfg is
> > >                                                        defaulted here
> > >    fprintf(stderr, "Wow it failed\n");
> > > }
> > > 
> > > This allows us to change ibnd_config structure any time we want without
> > > affecting the API.  I don't think the "pad" you used is a good idea.
> > 
> > Without padding we will break ABI each time when new field is added to
> > the config structure.
> 
> No it does not iff you use the ibnd_set_this() helpers and make the config private.

In your example 'ibnd_config_t cfg' is on the stack... :)

I would really suggest keeping the API simple and useful: filling in
a structure described in the header files is much simpler than going
through various helper calls.
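
To make the ABI argument concrete, here is a rough sketch of a
caller-allocated config structure with reserved padding. Everything in
it (struct ibnd_config_sketch, config_init(), discover_sketch(), the
field names and the pad size) is hypothetical; only the defaults of -1
hops and 2 SMPs are taken from the example earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical public config; field names are illustrative only. */
struct ibnd_config_sketch {
	unsigned max_smps;
	int max_hops;       /* -1 means no limit */
	unsigned pad[30];   /* reserved for future fields, must be zero */
};

/*
 * The caller allocates this structure (for example on the stack), so
 * its size is baked into the caller's binary.  A new field has to be
 * carved out of pad[] so that the total size never changes.
 */
static_assert(sizeof(struct ibnd_config_sketch) == 32 * sizeof(unsigned),
	      "config size is part of the ABI");

/* Zero everything, then apply the defaults discussed in the thread. */
static void config_init(struct ibnd_config_sketch *cfg)
{
	memset(cfg, 0, sizeof *cfg);
	cfg->max_hops = -1;
	cfg->max_smps = 2;
}

/* Hypothetical entry point: returns 0 on success, -1 on a bad config. */
static int discover_sketch(const struct ibnd_config_sketch *cfg)
{
	size_t i;

	if (cfg->max_smps == 0)
		return -1;
	/* Reject reserved words this version does not understand. */
	for (i = 0; i < sizeof cfg->pad / sizeof cfg->pad[0]; i++)
		if (cfg->pad[i] != 0)
			return -1;
	return 0;
}
```

An opaque handle with set functions avoids the size dependency only if
the library also allocates the object, which a stack variable of that
type does not do.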

Sasha

> 
> > 
> > > Also since we are breaking the API we might as well return the fabric as a
> > > parameter and have an error code.  But I could go either way on this one.
> > > 
> > > Ira
> > > 
> > > 
> > > [*] query_smp.c probably should have its own timeout here but we can discuss
> > > later.
> > > 
> > > [#] What sucks about this is that libibmad already has the functionality to
> > > open the umad port and configure it (50 line function).  Now we will be
> > > duplicating this functionality.
> > 
> > I think you can use mad_rpc_open_port(ca_name, port_number, ...) just
> > fine (and so the rest of libibmad stuff) - it will open separate fd.
> 
> Yes for the libibmad functionality we can do that.  I was speaking of the use of the umad layer.  To use that layer we have to duplicate the functionality of mad_rpc_open_port in every tool which requires umad layer access, correct?  Right now we are mixing the 2 layers (mad and umad) in saquery (get_bind_handle) as well as libibnetdisc.
> 
> I think we need to be careful we don't do this again!
> 
> Ira
> 
> > 
> > Sasha
> > 
> 
> 
> -- 
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2-i2BcT+NCU+M@public.gmane.org
> 

^ permalink raw reply

* Re: [PATCH] opensm/osm_log.h: osm_log_is_active should return true for syslog
From: Yevgeny Kliteynik @ 2010-05-11 18:20 UTC (permalink / raw)
  To: Sasha Khapyorsky; +Cc: Linux RDMA
In-Reply-To: <20100511162148.GD28549@me>

On 11-May-10 7:21 PM, Sasha Khapyorsky wrote:
> On 12:06 Tue 11 May     , Yevgeny Kliteynik wrote:
>>
>> osm_log() always logs messages that came with OSM_LOG_SYS level,
>> so osm_log_is_active() should concur with this.
>> As a by-product of this fix, OSM_LOG_SYS messages can now be
>> printed with OSM_LOG macro, instead of using osm_log() directly.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
>> ---
>>   opensm/include/opensm/osm_log.h |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/include/opensm/osm_log.h b/opensm/include/opensm/osm_log.h
>> index b2f105a..a494bc3 100644
>> --- a/opensm/include/opensm/osm_log.h
>> +++ b/opensm/include/opensm/osm_log.h
>> @@ -355,7 +355,7 @@ static inline void osm_log_set_level(IN osm_log_t * p_log,
>>   static inline boolean_t osm_log_is_active(IN const osm_log_t * p_log,
>>   					  IN osm_log_level_t level)
>>   {
>> -	return ((p_log->level & level) != 0);
>> +	return (((OSM_LOG_SYS | p_log->level) & level) != 0);
>>   }
>
> What about setting the OSM_LOG_SYS bit in p_log->level at initialization
> and removing all subsequent explicit checks? Like this
> (against master):
>
> diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
> index 54c2f36..bd4a200 100644
> --- a/opensm/opensm/osm_log.c
> +++ b/opensm/opensm/osm_log.c
> @@ -119,7 +119,7 @@ void osm_log(IN osm_log_t * p_log, IN osm_log_level_t verbosity,
>   #endif				/* __WIN__ */
>
>   	/* If this is a call to syslog - always print it */
> -	if (!(verbosity & (OSM_LOG_SYS | p_log->level)))
> +	if (!(verbosity & p_log->level))
>   		return;
>
>   	va_start(args, p_str);
> @@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * p_log, IN boolean_t flush,
>   				IN unsigned long max_size,
>   				IN boolean_t accum_log_file)
>   {
> -	p_log->level = log_flags;
> +	p_log->level = log_flags | OSM_LOG_SYS;

Sure, that should do the trick too.
Want me to send a patch, or will you do it?

-- Yevgeny

>   	p_log->flush = flush;
>   	p_log->count = 0;
>   	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
>
>
> Sasha
>


^ permalink raw reply

* Re: IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-11 20:17 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <adawrvawhyc.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Tue, 11 May 2010, Roland Dreier wrote:

>  > I am mostly interested in multicast traffic. Connected mode is not
>  > relevant to that usage scenario.
>
> As I said, I don't think anyone is working on it.  However it wouldn't
> be that hard to get something pretty good for multicast, since the
> InfiniBand multicast join mechanism would let you have essentially a
> perfect filter for steering individual multicast groups to whichever QP
> (ring) you wanted to.

Right but then would each individual QP need its own IP address?

> Of course you could also implement the equivalent thing in userspace and
> probably get even better performance.

Start a QP listening to IPoIB mc traffic?


^ permalink raw reply

* RE: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Sean Hefty @ 2010-05-11 20:28 UTC (permalink / raw)
  To: Hefty, Sean, 'Steve Wise',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1C56604F650B44A8B00B3CA55D00B95D-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

>>+static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
>>+			       struct rdma_route *route)
>>+{
>>+	struct rdma_dev_addr *dev_addr;
>>+
>>+	dev_addr = &route->addr.dev_addr;
>>+	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
>>+	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);
>
>essentially breaking the query_route call into query_addr, query_path, and
>query_gid.

I'm unsure where exactly this change would fit into the newer query model.  Are
these values available after rdma_resolve_addr completes?  Are the sgid and dgid
the same type of values (e.g. L2 addresses)?  If the device does ARP internally,
is the dgid value still set?

My current implementation for 'query_gid' returns GIDs using an AF_IB address
structure.  I'm trying to figure out what structure we should be using to return
the iwarp values -- some other sockaddr, an iw_path_record, other? 

Here are links to the proposed changes:

http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=b87cccdb27a7e75c0f9e03a9d37593ceab4d4ede

http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=1830722bb5e84451e3b458719cea8c746a2fb6d4

http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=0257cc9298aa70c4bfc1a8c393f970375a897825

- Sean


^ permalink raw reply

* Re: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Steve Wise @ 2010-05-11 20:38 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <B28805226A0F4AD7B53BD430925D3D20-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

Sean Hefty wrote:
>>> +static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
>>> +			       struct rdma_route *route)
>>> +{
>>> +	struct rdma_dev_addr *dev_addr;
>>> +
>>> +	dev_addr = &route->addr.dev_addr;
>>> +	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
>>> +	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);
>>>       
>> essentially breaking the query_route call into query_addr, query_path, and
>> query_gid.
>>     
>
> I'm unsure where exactly this change would fit into the newer query model.  Are
> these values available after rdma_resolve_addr completes? 


Yes.


>  Are the sgid and dgid
> the same type of values (e.g. L2 addresses)?  


Yes.


> If the device does ARP internally,
> is the dgid value still set?
>
>   


No.  But only Amso does this, and it's old and crusty...


> My current implementation for 'query_gid' returns GIDs using an AF_IB address
> structure.  I'm trying to figure out what structure we should be using to return
> the iwarp values -- some other sockaddr, an iw_path_record, other? 
>
>   


I'll have to ponder this.   In the past, we've been using the gid.raw 
areas to hold these mac addresses...

> Here are links to the proposed changes:
>
> http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=b87cccdb27a7e75c0f9e03a9d37593ceab4d4ede
>
> http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=1830722bb5e84451e3b458719cea8c746a2fb6d4
>
> http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=commitdiff;h=0257cc9298aa70c4bfc1a8c393f970375a897825
>
> - Sean
>   


* Re: IPoIB multiqueue support?
From: Jason Gunthorpe @ 2010-05-11 20:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1005111516500.1500-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

On Tue, May 11, 2010 at 03:17:36PM -0500, Christoph Lameter wrote:
> On Tue, 11 May 2010, Roland Dreier wrote:
> 
> >  > I am mostly interested in multicast traffic. Connected mode is not
> >  > relevant to that usage scenario.
> >
> > As I said, I don't think anyone is working on it.  However it wouldn't
> > be that hard to get something pretty good for multicast, since the
> > InfiniBand multicast join mechanism would let you have essentially a
> > perfect filter for steering individual multicast groups to whichever QP
> > (ring) you wanted to.
> 
> Right but then would each individual QP need its own IP address?

I think Roland means that each IP multicast address is mapped into an
IB multicast GID, and you can bind a QP to a set of MGIDs. Right now
the driver binds all MGIDs to the rx QP and basically ignores the
MGID on receive.

To go multi-queue you'd create multiple QPs and spread the MGID binds
amongst them.
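
A rough sketch of the idea above, in C. The MGID layout follows the IPoIB
IPv4 multicast mapping (RFC 4391); the hash-based spreading policy and the
function names ipoib_ipv4_mgid() and mgid_to_queue() are assumptions made
for illustration only and are not taken from the IPoIB driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build the IPoIB MGID for an IPv4 multicast group:
 *   ff1<scope>:401b:<pkey>:0000:0000:0000:<low 28 bits of the group>
 * ipv4_group is in host byte order, e.g. 0xE0000001 for 224.0.0.1.
 */
void ipoib_ipv4_mgid(uint32_t ipv4_group, uint16_t pkey, uint8_t scope,
		     uint8_t mgid[16])
{
	uint32_t low28 = ipv4_group & 0x0FFFFFFFu;

	memset(mgid, 0, 16);
	mgid[0] = 0xff;				/* multicast prefix */
	mgid[1] = 0x10 | (scope & 0x0f);	/* transient flag + scope */
	mgid[2] = 0x40;				/* IPv4 signature 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = (uint8_t)(pkey >> 8);
	mgid[5] = (uint8_t)(pkey & 0xff);
	mgid[12] = (uint8_t)(low28 >> 24);
	mgid[13] = (uint8_t)((low28 >> 16) & 0xff);
	mgid[14] = (uint8_t)((low28 >> 8) & 0xff);
	mgid[15] = (uint8_t)(low28 & 0xff);
}

/*
 * Hypothetical policy: choose the receive QP (ring) for a group by
 * hashing its MGID with FNV-1a.  Any stable mapping would do.
 */
unsigned mgid_to_queue(const uint8_t mgid[16], unsigned n_queues)
{
	uint32_t h = 2166136261u;
	int i;

	assert(n_queues > 0);
	for (i = 0; i < 16; i++) {
		h ^= mgid[i];
		h *= 16777619u;
	}
	return h % n_queues;
}
```

Each group's MGID would then be attached only to the QP selected by
mgid_to_queue(), instead of attaching every MGID to the single rx QP.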

> > Of course you could also implement the equivalent thing in userspace and
> > probably get even better performance.
> 
> Start a QP listening to IPoIB mc traffic?

Yes, but the downside is that if you rely on the kernel to do the group
join then the HCA will send the packet to both user space and the kernel QP.

Some of the weird features in the RDMA CM seem to be for supporting
this..

Jason

* Re: IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-11 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100511204358.GQ15969-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, 11 May 2010, Jason Gunthorpe wrote:

> > Right but then would each individual QP need its own IP address?
>
> I think Roland means that each IP multicast address is mapped into an
> IB multicast GID, and you can bind a QP to a set of MGIDs. Right now
> the driver binds all MGIDs to the rx QP and basically ignores the
> MGID on receive.

Aha.

> To go multi-queue you'd create multiple QPs and spread the MGID binds
> amongst them.

It would be best to bind them to the QP of the local processor (assuming
that the process continues to run on that processor).

What about unicast traffic? One QP gets all unicast?

> Yes, but the downside is that if you rely on the kernel to do the group
> join then the HCA will send the packet to both user space and the kernel QP.
>
> Some of the weird features in the RDMA CM seem to be for supporting
> this..

The UMCAST flag can stop the kernel from processing the IGMP reply.


* Re: IPoIB multiqueue support?
From: Jason Gunthorpe @ 2010-05-11 20:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1005111548450.8388-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

On Tue, May 11, 2010 at 03:50:35PM -0500, Christoph Lameter wrote:

> > To go multi-queue you'd create multiple QPs and spread the MGID binds
> > amongst them.
> 
> It would be best to bind them to the QP of the local processor (assuming
> that the process continues to run on that processor).

Yes

> What about unicast traffic? One QP gets all unicast?

IIRC, there are some HCA-specific features for spreading traffic
amongst QPs using a hash of the IP/TCP/UDP headers. I don't know
anything about them though.

Within the standard IB functionality the best you could do is to
create a wack of QPs and then return different QPNs in your ARP
replies. Though this is very limited and probably not worth doing.
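
To make the ARP trick concrete: an IPoIB link-layer address is 20 bytes
(one reserved/flags byte, a 24-bit QPN, then the 16-byte port GID), so the
QPN a peer sends to is whatever the ARP reply carried. The sketch below
only illustrates that layout; pick_qpn_for_peer() and its modulo policy are
hypothetical and are not part of any driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20

/* Fill in a 20-byte IPoIB hardware address: flags, QPN, GID. */
void ipoib_build_hwaddr(uint8_t hw[IPOIB_HWADDR_LEN], uint32_t qpn,
			const uint8_t gid[16])
{
	assert(qpn <= 0xFFFFFFu);	/* QPNs are 24 bits wide */
	hw[0] = 0;			/* reserved/flags byte */
	hw[1] = (uint8_t)((qpn >> 16) & 0xff);
	hw[2] = (uint8_t)((qpn >> 8) & 0xff);
	hw[3] = (uint8_t)(qpn & 0xff);
	memcpy(&hw[4], gid, 16);
}

/*
 * Hypothetical policy: answer each peer's ARP request with one of n
 * local QPNs, selected by the peer's IPv4 address.
 */
uint32_t pick_qpn_for_peer(const uint32_t *qpns, unsigned n,
			   uint32_t peer_ip)
{
	assert(n > 0);
	return qpns[peer_ip % n];
}
```

The limitation is visible here: the spreading is per remote peer, fixed at
ARP time, so a single busy peer still lands on one QP.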

> > Yes, but the downside is that if you rely on the kernel to do the group
> > join then the HCA will send the packet to both user space and the kernel QP.
> >
> > Some of the weird features in the RDMA CM seem to be for supporting
> > this..
> 
> The UMCAST flag can stop the kernel from processing the IGMP reply.

I'm not talking about IGMP, but the IB version of IGMP, the kernel
joins the group in IB land and also attaches the IPOIB QP. This can
all be faked out in userspace, but it isn't entirely straightforward.

Jason

* RE: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Sean Hefty @ 2010-05-11 21:22 UTC (permalink / raw)
  To: 'Steve Wise'; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BE9C04E.4040907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

>I'll have to ponder this.   In the past, we've been using the gid.raw
>areas to hold these mac addresses...

There aren't too many options with the existing ABI, but the newer query
interfaces will give us more flexibility, like variable length addresses, which
may make this cleaner.  Maybe we can return the L2 address through an AF_UNSPEC
sockaddr?
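
As a minimal sketch of the AF_UNSPEC idea, assuming a 6-byte Ethernet MAC
as the L2 address: the helper names l2_to_sockaddr() and sockaddr_to_l2()
are made up for illustration and are not part of the rdma_cm ABI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define MAC_LEN 6

/* Pack a MAC address into a generic sockaddr tagged AF_UNSPEC. */
void l2_to_sockaddr(const uint8_t mac[MAC_LEN], struct sockaddr *sa)
{
	memset(sa, 0, sizeof(*sa));
	sa->sa_family = AF_UNSPEC;
	memcpy(sa->sa_data, mac, MAC_LEN);	/* sa_data holds 14 bytes */
}

/* Recover the MAC address; the caller must have checked the family. */
void sockaddr_to_l2(const struct sockaddr *sa, uint8_t mac[MAC_LEN])
{
	assert(sa->sa_family == AF_UNSPEC);
	memcpy(mac, sa->sa_data, MAC_LEN);
}
```

A consumer would have to know from context (the transport being iWARP) how
many bytes of sa_data are meaningful, since AF_UNSPEC carries no length.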


* Re: IPoIB multiqueue support?
From: Roland Dreier @ 2010-05-11 21:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1005111516500.1500-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>

 > > As I said, I don't think anyone is working on it.  However it wouldn't
 > > be that hard to get something pretty good for multicast, since the
 > > InfiniBand multicast join mechanism would let you have essentially a
 > > perfect filter for steering individual multicast groups to whichever QP
 > > (ring) you wanted to.

 > Right but then would each individual QP need its own IP address?

No, that's the beauty of multicast -- you just join the multicast group
for a given IP address, and you get the traffic.  Doing multiqueue for
unicast traffic would require some form of flow steering (RSS) in the
adapter (which some have).

 > > Of course you could also implement the equivalent thing in userspace and
 > > probably get even better performance.

 > Start a QP listening to IPoIB mc traffic?

Yes, I believe there have been several proprietary userspace libraries
that essentially do UDP IPoIB multicast in userspace.  There are a few
funky hooks in the IPoIB driver related to that IIRC.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

* Re: [PATCH v3 18/52] IB/qib: Add qib_iba7322.c (serdes parameters)
From: Roland Dreier @ 2010-05-11 21:55 UTC (permalink / raw)
  To: Dave Olson; +Cc: Ralph Campbell, linux-rdma@vger.kernel.org
In-Reply-To: <alpine.LFD.1.10.1005110837180.7212-vxnkQ4oxbxUi9g6yJnKVd0EOCMrvLtNR@public.gmane.org>

 > I've implemented a newer interface (it's in the same set of patches),
 > but we've not yet converted over the userland.  The new interface is unit
 > and port specific.  It's not separate files per serdes setting, though.
 > 
 > It takes a string with a default (global) index, followed by optional
 > unit and port-specific tuples, like this:
 > 
 > 	10 0,1=8 1,2=7 ...
 > 
 > The newer interface has the values readable as well as writable.
 > 
 > When we had stuff like this in the port-specific directories, people
 > dinged us on it.   We also had people who wanted to be able to set
 > it as a module parameter to modprobe.  The newer interface is the
 > cable_atten module parameter, and it just selects an index into a table of
 > parameters in the driver.
 > 
 > The new interface needs to have a table extended a bit more to replace
 > the setup_qme and setup_qmh functions (once again, time constraints for
 > our internal release cycles caused the incomplete implementation).
 > 
 > Sorry for exposing all the ugliness.  If you see it as a serious issue,
 > we can try to accelerate the cleanup effort.

OK.  Yes, I do see this as a serious issue -- getting rid of the bogus
interface which is exposed as a user-visible interface is very painful
once we've merged it.  So I would really really prefer to just have the
good interface upstream.

We're probably ~1 week away from 2.6.34 final, and the merge window will
be two weeks after that.  So if you could get some form of this finished
in the next 3 weeks, that would be the best way to merge qib.
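
For reference, the string format quoted above ("10 0,1=8 1,2=7 ...") could
be parsed along these lines. This is only a sketch under assumptions: the
table bounds, the struct atten_table type and the parse_atten() name are
invented here, and the real driver code may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_UNITS 4	/* hypothetical bound on HCA units */
#define MAX_PORTS 4	/* hypothetical bound; ports indexed 0..MAX_PORTS */

struct atten_table {
	int def;				/* global default index */
	int val[MAX_UNITS][MAX_PORTS + 1];	/* per unit,port index */
};

/* Parse "<default> [unit,port=index]..."; returns 0, or -1 on error. */
int parse_atten(const char *s, struct atten_table *t)
{
	unsigned unit, port, idx;
	char *end;
	long v;
	int u, p, n;

	v = strtol(s, &end, 10);
	if (end == s)
		return -1;		/* no leading default value */
	t->def = (int)v;
	for (u = 0; u < MAX_UNITS; u++)
		for (p = 0; p <= MAX_PORTS; p++)
			t->val[u][p] = t->def;

	s = end;
	for (;;) {
		while (*s == ' ' || *s == '\t' || *s == '\n')
			s++;
		if (!*s)
			return 0;
		/* %n records how many characters the tuple consumed */
		if (sscanf(s, "%u,%u=%u%n", &unit, &port, &idx, &n) != 3)
			return -1;
		if (unit >= MAX_UNITS || port > MAX_PORTS)
			return -1;
		t->val[unit][port] = (int)idx;
		s += n;
	}
}
```

Entries without an explicit tuple keep the global default, which matches
the "default (global) index, followed by optional tuples" description.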
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

* Re: IPoIB multiqueue support?
From: Christoph Lameter @ 2010-05-11 22:11 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100511205808.GS15969-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, 11 May 2010, Jason Gunthorpe wrote:

> > The UMCAST flag can stop the kernel from processing the IGMP reply.
>
> I'm not talking about IGMP, but the IB version of IGMP, the kernel
> joins the group in IB land and also attaches the IPOIB QP. This can
> all be faked out in userspace, but it isn't entirely straightforward.

Yes vendors use the UMCAST flag to avoid this.


* Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
From: Or Gerlitz @ 2010-05-11 22:12 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Mike Christie
In-Reply-To: <Pine.LNX.4.64.1005051730450.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>

Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
>  [...] with this patch, multipath fail-over time is about 30 seconds, which is seen here,
> when a DD over the multi-path device is done before/during/after the fail-over [...] without
> this patch, multipath fail-over time is about 130 seconds

Hi Roland, as we're at -rc7 now, I wanted to check with you whether there's
any issue with merging this patch series for 2.6.35. If you have any
questions or anything needs to be addressed/fixed, I'd like to do that
sooner rather than later.

Or

* Re: [PATCH v3 1/2] libibnetdisc: Convert to a multi-smp algorithm
From: Ira Weiny @ 2010-05-11 23:44 UTC (permalink / raw)
  To: Sasha Khapyorsky
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Hal Rosenstock
In-Reply-To: <20100511164234.GF28549@me>

On Tue, 11 May 2010 09:42:34 -0700
Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:

> On 13:53 Mon 10 May     , Ira Weiny wrote:
> > > > 
> > > >    int ibnd_discover_fabric(ibnd_fabric_t **fabric,
> > > > 			    const char *ca_name,  <== could we even default this?
> > > 
> > > I would think about ca_name and port_number. And of course these may
> > > have defaults (NULL, 0).
> > 
> > Ok, ca_name and ca_port will be explicit.
> > 
> > > 
> > > > 			    struct ibnd_config *cfg);
> > > 
> > > What is wrong with the current ibnd_fabric_t *ibnd_discover_fabric(...)? Why
> > > do we need an extra parameter?
> > 
> > Well we are breaking the interface again so I figure we might as well clean
> > some things up.  Returning an int allows us to have a reason for the failure
> > returned to the caller rather than just "NULL".  We have cleaned up most of
> > the internals of the library to allow for this.
> 
> But we want to keep API simple, no?

Ok, patch to follow.

> 
> > 
> > > 
> > > > 
> > > > I don't mind the ibnd_config_t struct but I don't think it should be visible
> > > > to the user.  Make it opaque and use "set" functions.  Something like.
> > > > 
> > > > ibnd_fabric_t *fabric;
> > > > ibnd_config_t cfg;
> > > > ib_portid_t * from;
> > > > 
> > > > ibnd_set_hops(&cfg, hops);         <== default -1
> > > > ibnd_set_port_num(&cfg, port_num); <== default 1
> > > > ibnd_set_max_smps(&cfg, max_smps); <== default 2
> > > > ibnd_set_from_node(&cfg, from);    <== default NULL
> > > 
> > > I would prefer to not complicate API with ibnd_set_this() helpers. It
> > > would be necessary to add new ones in the future which will lead to API
> > > changes.
> > 
> > See below.
> > 
> > > 
> > > > if (ibnd_discover_fabric(&fabric, "foo", &cfg)) {  <== anything not in cfg is
> > > >                                                        defaulted here
> > > >    fprintf(stderr, "Wow it failed\n");
> > > > }
> > > > 
> > > > This allows us to change ibnd_config structure any time we want without
> > > > affecting the API.  I don't think the "pad" you used is a good idea.
> > > 
> > > Without padding we will break ABI each time when new field is added to
> > > the config structure.
> > 
> > No it does not iff you use the ibnd_set_this() helpers and make the config private.
> 
> In your example 'ibnd_config_t cfg' is on the stack... :)

yes, bad example.
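
A corrected sketch of the opaque approach would keep the structure off the
caller's stack by allocating it inside the library. Everything named
demo_* below is hypothetical and only illustrates the pattern; it is not
the libibnetdisc API. The defaults are the ones listed in the quoted
example.

```c
#include <assert.h>
#include <stdlib.h>

/* A public header would expose only this incomplete type. */
typedef struct demo_config demo_config_t;

/* Private definition; it can grow without changing the caller's ABI. */
struct demo_config {
	int hops;
	int port_num;
	unsigned max_smps;
};

/* Allocate a config with defaults applied; NULL on allocation failure. */
demo_config_t *demo_config_alloc(void)
{
	demo_config_t *cfg = malloc(sizeof(*cfg));

	if (!cfg)
		return NULL;
	cfg->hops = -1;		/* scan the entire fabric */
	cfg->port_num = 1;
	cfg->max_smps = 2;
	return cfg;
}

void demo_config_set_hops(demo_config_t *cfg, int hops)
{
	assert(cfg);
	cfg->hops = hops;
}

int demo_config_get_hops(const demo_config_t *cfg)
{
	assert(cfg);
	return cfg->hops;
}

unsigned demo_config_get_max_smps(const demo_config_t *cfg)
{
	assert(cfg);
	return cfg->max_smps;
}

void demo_config_free(demo_config_t *cfg)
{
	free(cfg);
}
```

The cost is the one Sasha points out: every new option needs a new
accessor, so the exported symbol list grows even though the struct layout
stays private.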

> 
> I would really suggest keeping the API simple and useful - filling in
> a structure described in the header files is much simpler than going
> through various helper calls.

I disagree but since you are looking to branch I think this bug needs to be fixed.
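
For comparison, the padded-structure approach can be sketched as below.
The field list is assumed from the ibnd_config hunk in the patch that
follows; struct cfg_v1, struct cfg_v2, DEMO_DEFAULT_TIMEOUT and
effective_timeout() are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout before the change: reserved space at the end. */
struct cfg_v1 {
	unsigned max_smps;
	unsigned show_progress;
	unsigned max_hops;
	unsigned debug;
	uint8_t pad[64];
};

/* Layout after the change: two new fields carved out of the padding. */
struct cfg_v2 {
	unsigned max_smps;
	unsigned show_progress;
	unsigned max_hops;
	unsigned debug;
	unsigned timeout_ms;
	unsigned retries;
	uint8_t pad[56];
};

/* Holds where two unsigned fields occupy 8 bytes (the usual case). */
static_assert(sizeof(struct cfg_v1) == sizeof(struct cfg_v2),
	      "adding fields must not change the structure size");

#define DEMO_DEFAULT_TIMEOUT 1000

/* A zeroed field means "not set", so fall back to the default. */
unsigned effective_timeout(const struct cfg_v2 *cfg)
{
	return cfg->timeout_ms ? cfg->timeout_ms : DEMO_DEFAULT_TIMEOUT;
}
```

This only stays compatible if old callers zero the whole structure,
padding included, before filling it in; otherwise the new fields would
read stack garbage.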

A patch is following this email,
Ira

> 
> Sasha
> 
> > 
> > > 
> > > > Also since we are breaking the API we might as well return the fabric as a
> > > > parameter and have an error code.  But I could go either way on this one.
> > > > 
> > > > Ira
> > > > 
> > > > 
> > > > [*] query_smp.c probably should have its own timeout here but we can discuss
> > > > later.
> > > > 
> > > > [#] What sucks about this is that libibmad already has the functionality to
> > > > open the umad port and configure it (50 line function).  Now we will be
> > > > duplicating this functionality.
> > > 
> > > I think you can use mad_rpc_open_port(ca_name, port_number, ...) just
> > > fine (and so the rest of libibmad stuff) - it will open separate fd.
> > 
> > Yes for the libibmad functionality we can do that.  I was speaking of the use
> > of the umad layer.  To use that layer we have to duplicate the functionality
> > of mad_rpc_open_port in every tool which requires umad layer access, correct?
> > Right now we are mixing the 2 layers (mad and umad) in saquery
> > (get_bind_handle) as well as libibnetdisc.
> > 
> > I think we need to be careful we don't do this again!
> > 
> > Ira
> > 
> > > 
> > > Sasha
> > > 
> > 
> > 
> > -- 
> > Ira Weiny
> > Math Programmer/Computer Scientist
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2-i2BcT+NCU+M@public.gmane.org
> > 
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org

* [PATCH] ibnetdisc: Separate calls to umad and mad layer to avoid race condition on response MAD's
From: Ira Weiny @ 2010-05-11 23:48 UTC (permalink / raw)
  To: Sasha Khapyorsky; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org


From: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
Date: Tue, 11 May 2010 15:36:08 -0700
Subject: [PATCH] ibnetdisc: Separate calls to umad and mad layer to avoid race condition on response MAD's

	Specify the CA/port to use, which allows scanning in parallel with other operations.

Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
---
 .../libibnetdisc/include/infiniband/ibnetdisc.h    |   15 ++--
 infiniband-diags/libibnetdisc/src/ibnetdisc.c      |   52 +++++++-----
 infiniband-diags/libibnetdisc/src/internal.h       |   11 ++-
 infiniband-diags/libibnetdisc/src/query_smp.c      |   83 ++++++++++++++++----
 infiniband-diags/libibnetdisc/test/testleaks.c     |   16 +---
 infiniband-diags/src/iblinkinfo.c                  |    8 +-
 infiniband-diags/src/ibnetdiscover.c               |   14 +---
 infiniband-diags/src/ibqueryerrors.c               |   11 ++-
 8 files changed, 134 insertions(+), 76 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index 2735224..83d0ba7 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -134,7 +134,9 @@ typedef struct ibnd_config {
 	unsigned show_progress;
 	unsigned max_hops;
 	unsigned debug;
-	uint8_t pad[64];
+	unsigned timeout_ms;
+	unsigned retries;
+	uint8_t pad[56];
 } ibnd_config_t;
 
 /** =========================================================================
@@ -166,15 +168,16 @@ typedef struct ibnd_fabric {
  * Initialization (fabric operations)
  */
 
-MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
+MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(char * ca_name,
+					       int ca_port,
 					       ib_portid_t * from,
 					       struct ibnd_config *config);
 	/**
-	 * open: (required) ibmad_port object from libibmad
+	 * ca_name: (optional) name of the CA to use
+	 * ca_port: (optional) CA port to use
 	 * from: (optional) specify the node to start scanning from.
-	 *       If NULL start from the node we are running on.
-	 * hops: (optional) Specify how much of the fabric to traverse.
-	 *       negative value == scan entire fabric
+	 *       If NULL start from the CA/CA port specified
+	 * config: (optional) additional config options for the scan
 	 */
 MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t * fabric);
 
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 98801de..3c374c7 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -380,21 +380,6 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid)
 	return NULL;
 }
 
-static int _check_ibmad_port(struct ibmad_port *ibmad_port)
-{
-	if (!ibmad_port) {
-		IBND_DEBUG("ibmad_port must be specified\n");
-		return -1;
-	}
-	if (mad_rpc_class_agent(ibmad_port, IB_SMI_CLASS) == -1
-	    || mad_rpc_class_agent(ibmad_port, IB_SMI_DIRECT_CLASS) == -1) {
-		IBND_DEBUG("ibmad_port must be opened with "
-			   "IB_SMI_CLASS && IB_SMI_DIRECT_CLASS\n");
-		return -1;
-	}
-	return 0;
-}
-
 ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str)
 {
 	int i = 0;
@@ -462,17 +447,38 @@ void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric)
 	}
 }
 
-ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
-				    ib_portid_t * from, struct ibnd_config *cfg)
+static int set_config(struct ibnd_config *config, struct ibnd_config *cfg)
 {
-	struct ibnd_config default_config = { 0 };
+	if (!config)
+		return (-EINVAL);
+
+	if (cfg)
+		memcpy(config, cfg, sizeof(*config));
+
+	if (!config->max_smps)
+		config->max_smps = DEFAULT_MAX_SMP_ON_WIRE;
+	if (!config->timeout_ms)
+		config->timeout_ms = DEFAULT_TIMEOUT;
+	if (!config->retries)
+		config->retries = DEFAULT_RETRIES;
+
+	return (0);
+}
+
+ibnd_fabric_t *ibnd_discover_fabric(char * ca_name, int ca_port,
+				    ib_portid_t * from,
+				    struct ibnd_config *cfg)
+{
+	struct ibnd_config config = { 0 };
 	ibnd_fabric_t *fabric = NULL;
 	ib_portid_t my_portid = { 0 };
 	smp_engine_t engine;
 	ibnd_scan_t scan;
 
-	if (_check_ibmad_port(ibmad_port) < 0)
+	if (set_config(&config, cfg)) {
+		IBND_ERROR("Invalid ibnd_config\n");
 		return NULL;
+	}
 
 	/* If not specified start from "my" port */
 	if (!from)
@@ -488,10 +494,12 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 
 	memset(&scan.selfportid, 0, sizeof(scan.selfportid));
 	scan.fabric = fabric;
-	scan.cfg = cfg ? cfg : &default_config;
+	scan.cfg = &config;
 
-	smp_engine_init(&engine, ibmad_port, &scan, cfg->max_smps ?
-			cfg->max_smps : DEFAULT_MAX_SMP_ON_WIRE);
+	if (smp_engine_init(&engine, ca_name, ca_port, &scan, &config)) {
+		free(fabric);
+		return (NULL);
+	}
 
 	IBND_DEBUG("from %s\n", portid2str(from));
 
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 2cfde02..d037a60 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -54,6 +54,8 @@
 #define MAXHOPS         63
 
 #define DEFAULT_MAX_SMP_ON_WIRE 2
+#define DEFAULT_TIMEOUT 1000
+#define DEFAULT_RETRIES 3
 
 typedef struct ibnd_scan {
 	ib_portid_t selfportid;
@@ -76,16 +78,19 @@ struct ibnd_smp {
 
 struct smp_engine {
 	struct ibmad_port *ibmad_port;
+	int umad_fd;
+	int smi_agent;
+	int smi_dir_agent;
 	ibnd_smp_t *smp_queue_head;
 	ibnd_smp_t *smp_queue_tail;
 	void *user_data;
 	cl_qmap_t smps_on_wire;
-	int max_smps_on_wire;
+	struct ibnd_config *cfg;
 	unsigned total_smps;
 };
 
-void smp_engine_init(smp_engine_t * engine, struct ibmad_port *ibmad_port,
-		     void *user_data, int max_smps_on_wire);
+int smp_engine_init(smp_engine_t * engine, char * ca_name, int ca_port,
+		    void *user_data, ibnd_config_t *cfg);
 int issue_smp(smp_engine_t * engine, ib_portid_t * portid,
 	      unsigned attrid, unsigned mod, smp_comp_cb_t cb, void *cb_data);
 int process_mads(smp_engine_t * engine);
diff --git a/infiniband-diags/libibnetdisc/src/query_smp.c b/infiniband-diags/libibnetdisc/src/query_smp.c
index 7234844..4dbfa0d 100644
--- a/infiniband-diags/libibnetdisc/src/query_smp.c
+++ b/infiniband-diags/libibnetdisc/src/query_smp.c
@@ -61,25 +61,32 @@ static ibnd_smp_t *get_smp(smp_engine_t * engine)
 	return rc;
 }
 
-static int send_smp(ibnd_smp_t * smp, struct ibmad_port *srcport)
+static int send_smp(ibnd_smp_t * smp, smp_engine_t * engine)
 {
 	int rc = 0;
 	uint8_t umad[1024];
 	ib_rpc_t *rpc = &smp->rpc;
+	int agent = 0;
 
 	memset(umad, 0, umad_size() + IB_MAD_SIZE);
 
+	if (rpc->mgtclass == IB_SMI_CLASS) {
+		agent = engine->smi_agent;
+	} else if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) {
+		agent = engine->smi_dir_agent;
+	} else {
+		IBND_ERROR("Invalid class for RPC\n");
+		return (-EIO);
+	}
+
 	if ((rc = mad_build_pkt(umad, &smp->rpc, &smp->path, NULL, NULL))
 	    < 0) {
 		IBND_ERROR("mad_build_pkt failed; %d\n", rc);
 		return rc;
 	}
 
-	if ((rc = umad_send(mad_rpc_portid(srcport),
-			    mad_rpc_class_agent(srcport, rpc->mgtclass),
-			    umad, IB_MAD_SIZE,
-			    mad_get_timeout(srcport, rpc->timeout),
-			    mad_get_retries(srcport))) < 0) {
+	if ((rc = umad_send(engine->umad_fd, agent, umad, IB_MAD_SIZE,
+			    engine->cfg->timeout_ms, engine->cfg->retries)) < 0) {
 		IBND_ERROR("send failed; %d\n", rc);
 		return rc;
 	}
@@ -91,12 +98,13 @@ static int process_smp_queue(smp_engine_t * engine)
 {
 	int rc = 0;
 	ibnd_smp_t *smp;
-	while (cl_qmap_count(&engine->smps_on_wire) < engine->max_smps_on_wire) {
+	while (cl_qmap_count(&engine->smps_on_wire)
+	       < engine->cfg->max_smps) {
 		smp = get_smp(engine);
 		if (!smp)
 			return 0;
 
-		if ((rc = send_smp(smp, engine->ibmad_port)) != 0) {
+		if ((rc = send_smp(smp, engine)) != 0) {
 			free(smp);
 			return rc;
 		}
@@ -122,7 +130,7 @@ int issue_smp(smp_engine_t * engine, ib_portid_t * portid,
 	smp->rpc.method = IB_MAD_METHOD_GET;
 	smp->rpc.attr.id = attrid;
 	smp->rpc.attr.mod = mod;
-	smp->rpc.timeout = mad_get_timeout(engine->ibmad_port, 0);
+	smp->rpc.timeout = engine->cfg->timeout_ms;
 	smp->rpc.datasz = IB_SMP_DATA_SIZE;
 	smp->rpc.dataoffs = IB_SMP_DATA_OFFS;
 	smp->rpc.trid = mad_trid();
@@ -153,7 +161,7 @@ static int process_one_recv(smp_engine_t * engine)
 	memset(umad, 0, sizeof(umad));
 
 	/* wait for the next message */
-	if ((rc = umad_recv(mad_rpc_portid(engine->ibmad_port), umad, &length,
+	if ((rc = umad_recv(engine->umad_fd, umad, &length,
 			    0)) < 0) {
 		if (rc == -EWOULDBLOCK)
 			return 0;
@@ -190,14 +198,58 @@ error:
 	return rc;
 }
 
-void smp_engine_init(smp_engine_t * engine, struct ibmad_port *ibmad_port,
-		     void *user_data, int max_smps_on_wire)
+int smp_engine_init(smp_engine_t * engine, char * ca_name, int ca_port,
+		    void *user_data, ibnd_config_t *cfg)
 {
+	int nc = 2;
+	int mc[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
+
 	memset(engine, 0, sizeof(*engine));
-	engine->ibmad_port = ibmad_port;
+
+	engine->ibmad_port = mad_rpc_open_port(ca_name, ca_port, mc, nc);
+	if (!engine->ibmad_port) {
+		IBND_ERROR("can't open MAD port (%s:%d)\n", ca_name, ca_port);
+		return -EIO;
+	}
+	mad_rpc_set_timeout(engine->ibmad_port, cfg->timeout_ms);
+	mad_rpc_set_retries(engine->ibmad_port, cfg->retries);
+
+	if (umad_init() < 0) {
+		IBND_ERROR("umad_init failed\n");
+		mad_rpc_close_port(engine->ibmad_port);
+		return -EIO;
+	}
+
+	engine->umad_fd = umad_open_port(ca_name, ca_port);
+	if (engine->umad_fd < 0) {
+		IBND_ERROR("can't open UMAD port (%s:%d)\n", ca_name, ca_port);
+		mad_rpc_close_port(engine->ibmad_port);
+		return -EIO;
+	}
+
+	if ((engine->smi_agent = umad_register(engine->umad_fd,
+	     IB_SMI_CLASS, 1, 0, 0)) < 0) {
+		IBND_ERROR("Failed to register SMI agent on (%s:%d)\n",
+			   ca_name, ca_port);
+		goto eio_close;
+	}
+
+	if ((engine->smi_dir_agent = umad_register(engine->umad_fd,
+	     IB_SMI_DIRECT_CLASS, 1, 0, 0)) < 0) {
+		IBND_ERROR("Failed to register SMI_DIRECT agent on (%s:%d)\n",
+			   ca_name, ca_port);
+		goto eio_close;
+	}
+
 	engine->user_data = user_data;
 	cl_qmap_init(&engine->smps_on_wire);
-	engine->max_smps_on_wire = max_smps_on_wire;
+	engine->cfg = cfg;
+	return (0);
+
+eio_close:
+	mad_rpc_close_port(engine->ibmad_port);
+	umad_close_port(engine->umad_fd);
+	return (-EIO);
 }
 
 void smp_engine_destroy(smp_engine_t * engine)
@@ -221,6 +273,9 @@ void smp_engine_destroy(smp_engine_t * engine)
 		cl_qmap_remove_item(&engine->smps_on_wire, item);
 		free(item);
 	}
+
+	umad_close_port(engine->umad_fd);
+	mad_rpc_close_port(engine->ibmad_port);
 }
 
 int process_mads(smp_engine_t * engine)
diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c
index da2fc0a..9a91f50 100644
--- a/infiniband-diags/libibnetdisc/test/testleaks.c
+++ b/infiniband-diags/libibnetdisc/test/testleaks.c
@@ -54,8 +54,6 @@
 char *argv0 = "iblinkinfotest";
 static FILE *f;
 
-static int timeout_ms = 500;
-
 void usage(void)
 {
 	fprintf(stderr,
@@ -88,9 +86,6 @@ int main(int argc, char **argv)
 	ib_portid_t port_id;
 	int iters = -1;
 
-	struct ibmad_port *ibmad_port;
-	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
-
 	static char const str_opts[] = "S:D:n:C:P:t:shuf:i:";
 	static const struct option long_opts[] = {
 		{"S", 1, 0, 'S'},
@@ -139,7 +134,7 @@ int main(int argc, char **argv)
 			iters = (int)strtol(optarg, NULL, 0);
 			break;
 		case 't':
-			timeout_ms = strtoul(optarg, 0, 0);
+			config.timeout_ms = strtoul(optarg, 0, 0);
 			break;
 		case 'S':
 			guid = (uint64_t) strtoull(optarg, 0, 0);
@@ -152,15 +147,11 @@ int main(int argc, char **argv)
 	argc -= optind;
 	argv += optind;
 
-	ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2);
-
-	mad_rpc_set_timeout(ibmad_port, timeout_ms);
-
 	while (iters == -1 || iters-- > 0) {
 		if (from) {
 			/* only scan part of the fabric */
 			str2drpath(&(port_id.drpath), from, 0, 0);
-			if ((fabric = ibnd_discover_fabric(ibmad_port,
+			if ((fabric = ibnd_discover_fabric(ca, ca_port,
 							   &port_id, &config))
 			    == NULL) {
 				fprintf(stderr, "discover failed\n");
@@ -168,7 +159,7 @@ int main(int argc, char **argv)
 				goto close_port;
 			}
 			guid = 0;
-		} else if ((fabric = ibnd_discover_fabric(ibmad_port, NULL,
+		} else if ((fabric = ibnd_discover_fabric(ca, ca_port, NULL,
 							  &config)) == NULL) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
@@ -179,6 +170,5 @@ int main(int argc, char **argv)
 	}
 
 close_port:
-	mad_rpc_close_port(ibmad_port);
 	exit(rc);
 }
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 029573f..d0c9b13 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -337,8 +337,10 @@ int main(int argc, char **argv)
 		exit(1);
 	}
 
-	if (ibd_timeout)
+	if (ibd_timeout) {
 		mad_rpc_set_timeout(ibmad_port, ibd_timeout);
+		config.timeout_ms = ibd_timeout;
+	}
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
@@ -371,12 +373,12 @@ int main(int argc, char **argv)
 	} else {
 		if (resolved >= 0 &&
 		    !(fabric =
-		      ibnd_discover_fabric(ibmad_port, &port_id, &config)))
+		      ibnd_discover_fabric(ibd_ca, ibd_ca_port, &port_id, &config)))
 			IBWARN("Single node discover failed;"
 			       " attempting full scan\n");
 
 		if (!fabric &&
-		    !(fabric = ibnd_discover_fabric(ibmad_port, NULL, &config))) {
+		    !(fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port, NULL, &config))) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
 			goto close_port;
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 57f9625..8f08f06 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -67,8 +67,6 @@
 #define DIFF_FLAG_DEFAULT (DIFF_FLAG_SWITCH | DIFF_FLAG_CA | DIFF_FLAG_ROUTER \
 			   | DIFF_FLAG_PORT_CONNECTION)
 
-struct ibmad_port *srcport;
-
 static FILE *f;
 
 static char *node_name_map_file = NULL;
@@ -938,9 +936,6 @@ int main(int argc, char **argv)
 	ibnd_fabric_t *fabric = NULL;
 	ibnd_fabric_t *diff_fabric = NULL;
 
-	struct ibmad_port *ibmad_port;
-	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
-
 	const struct ibdiag_opt opts[] = {
 		{"show", 's', 0, NULL, "show more information"},
 		{"list", 'l', 0, NULL, "list of connected nodes"},
@@ -975,12 +970,8 @@ int main(int argc, char **argv)
 	argc -= optind;
 	argv += optind;
 
-	ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2);
-	if (!ibmad_port)
-		IBERROR("Failed to open %s port %d", ibd_ca, ibd_ca_port);
-
 	if (ibd_timeout)
-		mad_rpc_set_timeout(ibmad_port, ibd_timeout);
+		config.timeout_ms = ibd_timeout;
 
 	if (argc && !(f = fopen(argv[0], "w")))
 		IBERROR("can't open file %s for writing", argv[0]);
@@ -996,7 +987,7 @@ int main(int argc, char **argv)
 			IBERROR("loading cached fabric failed\n");
 	} else {
 		if ((fabric =
-		     ibnd_discover_fabric(ibmad_port, NULL, &config)) == NULL)
+		     ibnd_discover_fabric(ibd_ca, ibd_ca_port, NULL, &config)) == NULL)
 			IBERROR("discover failed\n");
 	}
 
@@ -1017,6 +1008,5 @@ int main(int argc, char **argv)
 	if (diff_fabric)
 		ibnd_destroy_fabric(diff_fabric);
 	close_node_name_map(node_name_map);
-	mad_rpc_close_port(ibmad_port);
 	exit(0);
 }
diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c
index e896254..f04e47f 100644
--- a/infiniband-diags/src/ibqueryerrors.c
+++ b/infiniband-diags/src/ibqueryerrors.c
@@ -600,8 +600,10 @@ int main(int argc, char **argv)
 	if (!ibmad_port)
 		IBERROR("Failed to open port; %s:%d\n", ibd_ca, ibd_ca_port);
 
-	if (ibd_timeout)
+	if (ibd_timeout) {
 		mad_rpc_set_timeout(ibmad_port, ibd_timeout);
+		config.timeout_ms = ibd_timeout;
+	}
 
 	node_name_map = open_node_name_map(node_name_map_file);
 
@@ -633,11 +635,14 @@ int main(int argc, char **argv)
 		}
 	} else {
 		if (resolved >= 0 &&
-		    !(fabric = ibnd_discover_fabric(ibmad_port, &portid, 0)))
+		    !(fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port,
+						    &portid, &config)))
 			IBWARN("Single node discover failed;"
 			       " attempting full scan");
 
-		if (!fabric && !(fabric = ibnd_discover_fabric(ibmad_port, NULL,
+		if (!fabric && !(fabric = ibnd_discover_fabric(ibd_ca,
+							       ibd_ca_port,
+							       NULL,
 							       &config))) {
 			fprintf(stderr, "discover failed\n");
 			rc = 1;
-- 
1.5.4.5


^ permalink raw reply related

* Re: [PATCH 3/3] ib/iser: enhance disconnection logic for  multi-pathing
From: Roland Dreier @ 2010-05-12 16:32 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma, Mike Christie
In-Reply-To: <AANLkTin2z7MkpMtDqUwfcGA0RgJ7Nrsn9FSh6pXxWyGF-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

 > Hi Roland, as we're @ -rc7 now, I wanted to check with you if there's
 > any issue merging this patch series for 2.6.35. If you have any
 > question or anything needs to be addressed/fixed, I'd like to do that
 > sooner rather than later.

No, just needed to get to it.  I have these 3 + Dan Carpenter's fix
applied now.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs
From: Roland Dreier @ 2010-05-12 16:40 UTC (permalink / raw)
  To: Alexander Schmidt
  Cc: of-ewg, Linux RDMA, Hoang-Nam Nguyen, Stefan Roscher,
	Joachim Fenkes, Christoph Raisch, Alex Vainman
In-Reply-To: <20100507121936.283a18c6@alex-laptop>

 >  * added get_huge_page_size() to read the huge page size from
 >    /proc/meminfo. This is done at ibv_fork_init() time.

That doesn't work on systems that have multiple huge page sizes (eg
powerpc).  You can compare the code to get the size in libhugetlbfs.
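
As a rough illustration of handling more than one huge page size, the
sketch below enumerates sizes from the sysfs directory
/sys/kernel/mm/hugepages, where each supported size appears as an entry
named hugepages-<N>kB.  This assumes a kernel that exposes that
directory; the helper names hugepage_dir_size_kb() and
list_hugepage_sizes_kb() are made up for this example and are not part
of libibverbs or libhugetlbfs.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse a directory entry name such as "hugepages-2048kB".
 * Returns the size in kB, or 0 if the name does not have that form.
 * (Hypothetical helper, for illustration only.)
 */
static unsigned long hugepage_dir_size_kb(const char *name)
{
	unsigned long kb = 0;
	char unit[4] = "";

	assert(name != NULL);
	if (sscanf(name, "hugepages-%lu%3s", &kb, unit) != 2)
		return 0;
	if (strcmp(unit, "kB") != 0)
		return 0;
	return kb;
}

/*
 * Collect up to max huge page sizes (in kB) into sizes[].
 * Returns the number of sizes found, or -1 if the sysfs directory
 * cannot be opened (for example on a kernel without this interface).
 */
static int list_hugepage_sizes_kb(unsigned long *sizes, int max)
{
	DIR *dir = opendir("/sys/kernel/mm/hugepages");
	struct dirent *ent;
	int n = 0;

	if (!dir)
		return -1;
	while (n < max && (ent = readdir(dir)) != NULL) {
		unsigned long kb = hugepage_dir_size_kb(ent->d_name);

		if (kb)
			sizes[n++] = kb;
	}
	closedir(dir);
	return n;
}
```

A caller would still need to decide which of the returned sizes applies
to a given memory range; the single Hugepagesize line in /proc/meminfo
only reports the default size.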

Also I think the munging through /proc/pid/maps doesn't really work.
First of all, essentially grepping for libhugetlbfs is not robust as I
mentioned -- this will break in the same way for apps using huge pages
directly.  And while it is nice to be able to tell if a range came from
libhugetlbfs, I think there may be some bad performance impact from
reading /proc/pid/maps line-by-line.  (And by the way, as a trivial
optimization, it would make sense to me to check the address of each map
before doing the strstr).
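
The ordering suggested above could look roughly like the sketch below:
parse the start-end range at the front of each maps line, reject lines
that do not cover the address, and only then run strstr().  The function
names and the 512-byte line buffer are assumptions for this example, not
code from the patch, and the sketch does not address the robustness
problem with matching on a name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the /proc/<pid>/maps line covers addr and also contains
 * needle, 0 otherwise.  The cheap range comparison runs first so that
 * strstr() is only called for the mapping that contains addr.
 * (Hypothetical helper, for illustration only.)
 */
static int maps_line_matches(const char *line, unsigned long long addr,
			     const char *needle)
{
	unsigned long long start, end;

	assert(line != NULL && needle != NULL);
	if (sscanf(line, "%llx-%llx", &start, &end) != 2)
		return 0;
	if (addr < start || addr >= end)
		return 0;
	return strstr(line, needle) != NULL;
}

/*
 * Scan /proc/self/maps for a mapping that covers addr and whose line
 * contains needle.  Returns 1 if found, 0 if not, -1 if the file cannot
 * be opened.  Lines longer than the buffer are read in pieces; the
 * later pieces normally fail the range parse and are skipped.
 */
static int addr_in_named_mapping(unsigned long long addr, const char *needle)
{
	char line[512];
	FILE *fp = fopen("/proc/self/maps", "r");
	int found = 0;

	if (!fp)
		return -1;
	while (!found && fgets(line, sizeof(line), fp))
		found = maps_line_matches(line, addr, needle);
	fclose(fp);
	return found;
}
```

This only reduces the per-line cost; the file is still read line by
line on every lookup.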

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* Re: [PATCH v3 18/52] IB/qib: Add qib_iba7322.c (serdes parameters)
From: Dave Olson @ 2010-05-12 17:22 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ralph Campbell,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <adad3x2w1c3.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Tue, 11 May 2010, Roland Dreier wrote:
|  > Sorry for exposing all the ugliness.  If you see it as a serious issue,
|  > we can try to accelerate the cleanup effort.
| 
| OK.  Yes, I do see this as a serious issue -- getting rid of the bogus
| interface which is exposed as a user-visible interface is very painful
| once we've merged it.  So I would really really prefer to just have the
| good interface upstream.
| 
| We're probably ~1 week away from 2.6.34 final, and the merge window will
| be two weeks after that.  So if you could get some form of this finished
| in the next 3 weeks, that would be the best way to merge qib.

OK.   I'll be working on the redone version, and at this point, expect
to have it ready by this Friday, so either Ralph or I will send it
at that time.

Dave Olson
dave.olson-h88ZbnxC6KDQT0dZR+AlfA@public.gmane.org

^ permalink raw reply

* Re: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Steve Wise @ 2010-05-12 19:33 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <23BD55BBBEC444B18E7B48D98CC1EBA5-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>


Sean Hefty wrote:
>> I'll have to ponder this.   In the past, we've been using the gid.raw
>> areas to hold these mac addresses...
>>     
>
> There aren't too many options with the existing ABI, but the newer query
> interfaces will give us more flexibility, like variable length addresses, which
> may make this cleaner.  Maybe we can return the L2 address through an AF_UNSPEC
> sockaddr?
>
>   
What does AF_UNSPEC imply about the format of the sockaddr?





^ permalink raw reply

* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Roland Dreier @ 2010-05-12 19:56 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172344.GC12286@mtls03>

 > @@ -1017,9 +1020,12 @@ static void ib_sa_add_one(struct ib_device *device)
 >  	sa_dev->end_port   = e;
 >  
 >  	for (i = 0; i <= e - s; ++i) {
 > +		spin_lock_init(&sa_dev->port[i].ah_lock);
 > +		if (rdma_port_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND)
 > +			continue;

Not sure I understand why you move the initialization of the spinlock up
here?  It seems we ignore everything that might have to do with spinlock
if this is an IBoE port.

But the larger issue is what if someone calls one of the ib_sa_XXX_query
functions on an IBoE port?  Seems we just crash on uninitialized
structures.  I guess we're assuming that the kernel is smart enough not
to do that?

Also I'm wondering why you did the "count" stuff to ignore IBoE-only
devices in multicast.c but not sa_query.c?

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply
