linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing
@ 2014-05-18  9:38 Or Gerlitz
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Hi Roland,

This series adds support for Ethernet L2 address resolution for RoCE UD QPs,
whose L2 address-handles, unlike RC/UC/XRC/etc QPs are set from user space
without going through uverbs and the kernel IB core. The code is also
compatible both with old kernels that don't run IP based addressing.

Matan and Or.

changes from V3 -- addressed feedback from Jason:
  - replaced the char* that used to hold the MAC address provided to
    the provider library (e.g libmlx4) with sockaddr_storage and use ARPHRD
    for the af_family (similar to the practice used in the kernel)

  - Dropped mask1 from ibv_query_port_ex and put all fields in comp_mask
  - Added a man page for ibv_query_port_ex
  - Made the code c99 friendly by dropping the anonymous union and struct
  - Splitted the patches to patch that adds the interface and patch that uses it
  - Removed the usage of drv_ and lib_ schemas
  - Removed the VERBS_CONTEXT_XXXXXXX
 
changes from V2:
  - when using libnl3, link libibverbs itself with libnl3 and not the utilities
  - fix memory leak which took place when nl_addr_cmp_prefix_msb returns non zero.
  - rebased against libibverbs-1.1.8

changes from V1, addressed feedback from Yann Droneaud:
  - ported the code to libnl3 with automatic fallback option to libnl1
  - better error handling (more checks to erroneous flows)
  - fixed typos and removed blank lines
  - set close-on-exec open flag for the relevant fds
  - remove wrong comment from the change-log of patch #1

changes from V0:
 - don't change struct ibv_port_attr flags field from uint32_t
   to enum (size could change)
 - ibv_ah_attr_ex is extendable and contains ibv_ah_attr

Matan Barak (5):
  Add ibv_port_cap_flags
  Extend create_ah in order to pass L2 parameters to the provider
  Use neighbour lookup for RoCE UD QPs Eth L2 resolution
  Add ibv_query_port_ex support
  Optimize ibv_create_ah link query

 Makefile.am                |   27 +-
 configure.ac               |   31 ++
 include/infiniband/verbs.h |  140 ++++++
 man/ibv_query_port_ex.3    |   97 +++++
 src/neigh.c                | 1003 ++++++++++++++++++++++++++++++++++++++++++++
 src/neigh.h                |   53 +++
 src/verbs.c                |  164 +++++++-
 7 files changed, 1502 insertions(+), 13 deletions(-)
 create mode 100644 man/ibv_query_port_ex.3
 create mode 100644 src/neigh.c
 create mode 100644 src/neigh.h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH libibverbs V4 1/5] Add ibv_port_cap_flags
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
@ 2014-05-18  9:38   ` Or Gerlitz
       [not found]     ` <1400405929-14313-2-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider Or Gerlitz
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add an enum that describes ibv_port_cap_flags that complies
with the respective kernel enum.

This value could be fetched when using ibv_query_port in
port_cap_flags.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 9826b72..71adf2a 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -191,6 +191,32 @@ enum {
 	IBV_LINK_LAYER_ETHERNET,
 };
 
+enum ibv_port_cap_flags {
+	IBV_PORT_SM				= 1 <<  1,
+	IBV_PORT_NOTICE_SUP			= 1 <<  2,
+	IBV_PORT_TRAP_SUP			= 1 <<  3,
+	IBV_PORT_OPT_IPD_SUP			= 1 <<  4,
+	IBV_PORT_AUTO_MIGR_SUP			= 1 <<  5,
+	IBV_PORT_SL_MAP_SUP			= 1 <<  6,
+	IBV_PORT_MKEY_NVRAM			= 1 <<  7,
+	IBV_PORT_PKEY_NVRAM			= 1 <<  8,
+	IBV_PORT_LED_INFO_SUP			= 1 <<  9,
+	IBV_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
+	IBV_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
+	IBV_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
+	IBV_PORT_CM_SUP				= 1 << 16,
+	IBV_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
+	IBV_PORT_REINIT_SUP			= 1 << 18,
+	IBV_PORT_DEVICE_MGMT_SUP		= 1 << 19,
+	IB_PORT_VENDOR_CLASS_SUP		= 1 << 20,
+	IB_PORT_DR_NOTICE_SUP			= 1 << 21,
+	IB_PORT_CAP_MASK_NOTICE_SUP		= 1 << 22,
+	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
+	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
+	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
+	IBV_PORT_IP_BASED_GIDS			= 1 << 26
+};
+
 struct ibv_port_attr {
 	enum ibv_port_state	state;
 	enum ibv_mtu		max_mtu;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 1/5] Add ibv_port_cap_flags Or Gerlitz
@ 2014-05-18  9:38   ` Or Gerlitz
       [not found]     ` <1400405929-14313-3-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

In order to support IP based addressing, one needs to pass the L2
parameters to the provider drive. This is done through a new extendable
interface between libibverbs and the provider library.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 71adf2a..ab497b1 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -40,6 +40,7 @@
 #include <pthread.h>
 #include <stddef.h>
 #include <errno.h>
+#include <sys/socket.h>
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -467,6 +468,28 @@ struct ibv_ah_attr {
 	uint8_t			port_num;
 };
 
+enum ibv_ah_attr_ex_attr_mask {
+	IBV_AH_ATTR_EX_LL = 1UL << 0,
+	IBV_AH_ATTR_EX_VID = 1UL << 1,
+};
+
+struct ibv_ah_attr_ex {
+	struct ibv_global_route	grh;
+	uint16_t		dlid;
+	uint8_t			sl;
+	uint8_t			src_path_bits;
+	uint8_t			static_rate;
+	uint8_t			is_global;
+	uint8_t			port_num;
+	uint32_t		comp_mask;
+	union {
+		struct sockaddr		sa;
+		struct sockaddr_storage _s;
+	} ll;
+	uint16_t		vid;
+};
+
+
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
 	IBV_SRQ_LIMIT	= 1 << 1
@@ -975,6 +998,8 @@ enum verbs_context_mask {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	struct ibv_ah * (*create_ah_ex)(struct ibv_pd *pd,
+					struct ibv_ah_attr_ex *attr);
 	void (*_reserved_2)(void);
 	int (*destroy_flow)(struct ibv_flow *flow);
 	void (*_reserved_1)(void);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 1/5] Add ibv_port_cap_flags Or Gerlitz
  2014-05-18  9:38   ` [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider Or Gerlitz
@ 2014-05-18  9:38   ` Or Gerlitz
       [not found]     ` <1400405929-14313-4-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support Or Gerlitz
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

In order to implement RoCE IP based addressing for UD QPs, without introducing
uverbs changes, we need a way to resolve the L2 Ethernet addresses from user-space.

This is done with netlink through libnl, and in libibverbs such that multiple
vendor provider libraries can use the code.

The Ethernet L2 params are passed to the provider driver using an extension
verb create_ah_ex call. This resolution is done only for providers
that support this extension verb, otherwise, there's not real reason to do that.

The steps for resolution:

1. get sgid

2. from sgid, get the interface

3. query route from interface to the destination

4. query the neigh table, if the MAC for the destination is already known, done.

5. if not, loop until timeout:
    a. send a UDP packet to that IP targeted to the "discard" port
    b. listen to events from the neigh table
    c. query neigh table in order to know if the neigh has shown up between
       query until we started monitoring it

6. query vlan id from the interface

This solution depends on libnl-3 with backports to libnl-1.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am  |   24 +-
 configure.ac |   31 ++
 src/neigh.c  | 1003 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/neigh.h  |   53 +++
 src/verbs.c  |  162 ++++++++++-
 5 files changed, 1261 insertions(+), 12 deletions(-)
 create mode 100644 src/neigh.c
 create mode 100644 src/neigh.h

diff --git a/Makefile.am b/Makefile.am
index 3a40f3e..3104d5a 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -5,13 +5,17 @@ lib_LTLIBRARIES = src/libibverbs.la
 ACLOCAL_AMFLAGS = -I config
 AM_CFLAGS = -g -Wall
 
-src_libibverbs_la_CFLAGS = $(AM_CFLAGS) -DIBV_CONFIG_DIR=\"$(sysconfdir)/libibverbs.d\"
-
+src_libibverbs_la_CFLAGS = $(AM_CFLAGS) -DIBV_CONFIG_DIR=\"$(sysconfdir)/libibverbs.d\" \
+			   $(LIBNL_CFLAGS)
 libibverbs_version_script = @LIBIBVERBS_VERSION_SCRIPT@
+src_libibverbs_la_LIBADD = $(LIBNL_LIBS)
 
 src_libibverbs_la_SOURCES = src/cmd.c src/compat-1_0.c src/device.c src/init.c \
 			    src/marshall.c src/memory.c src/sysfs.c src/verbs.c \
 			    src/enum_strs.c
+if ! NO_RESOLVE_NEIGH
+src_libibverbs_la_SOURCES += src/neigh.c
+endif
 src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic \
     $(libibverbs_version_script)
 src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map
@@ -20,21 +24,21 @@ bin_PROGRAMS = examples/ibv_devices examples/ibv_devinfo \
     examples/ibv_asyncwatch examples/ibv_rc_pingpong examples/ibv_uc_pingpong \
     examples/ibv_ud_pingpong examples/ibv_srq_pingpong examples/ibv_xsrq_pingpong
 examples_ibv_devices_SOURCES = examples/device_list.c
-examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_devinfo_SOURCES = examples/devinfo.c
-examples_ibv_devinfo_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_devinfo_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_rc_pingpong_SOURCES = examples/rc_pingpong.c examples/pingpong.c
-examples_ibv_rc_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_rc_pingpong_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_uc_pingpong_SOURCES = examples/uc_pingpong.c examples/pingpong.c
-examples_ibv_uc_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_uc_pingpong_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_ud_pingpong_SOURCES = examples/ud_pingpong.c examples/pingpong.c
-examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_srq_pingpong_SOURCES = examples/srq_pingpong.c examples/pingpong.c
-examples_ibv_srq_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_srq_pingpong_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_xsrq_pingpong_SOURCES = examples/xsrq_pingpong.c examples/pingpong.c
-examples_ibv_xsrq_pingpong_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_xsrq_pingpong_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c
-examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la
+examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la $(LIBNL_LIBS)
 
 libibverbsincludedir = $(includedir)/infiniband
 
diff --git a/configure.ac b/configure.ac
index b5f6dfc..bef0e35 100644
--- a/configure.ac
+++ b/configure.ac
@@ -28,6 +28,37 @@ else
     fi
 fi
 
+AC_ARG_WITH([resolve-neigh],
+    AC_HELP_STRING([--with-resolve-neigh],
+        [Enable neighbour resolution in Ethernet (default YES)]))
+have_libnl=no
+if  test x$with_resolve_neigh = x || test x$with_resolve_neigh = xyes; then
+	PKG_CHECK_MODULES([LIBNL],[libnl-3.0],[
+			   have_libnl=yes
+			   AC_DEFINE([HAVE_LIBNL3], [1], [Use libnl-3.0])
+			   AC_DEFINE([HAVE_LIBNL], [1],  [Use libnl])
+			   PKG_CHECK_MODULES([LIBNL_ROUTE3], [libnl-route-3.0])
+			   LIBNL_CFLAGS="$LIBNL_CFLAGS $LIBNL_ROUTE3_CFLAGS"
+			   LIBNL_LIBS="$LIBNL_LIBS $LIBNL_ROUTE3_LIBS"], [:]
+			  );
+	if test "$have_libnl" = no; then
+		PKG_CHECK_MODULES([LIBNL], [libnl-1], [have_libnl=yes
+				  AC_DEFINE([HAVE_LIBNL1], [1], [Use libnl-1])
+				  AC_DEFINE([HAVE_LIBNL], [1],  [Use libnl])
+				  AC_CHECK_LIB(nl, rtnl_link_vlan_get_id, [],
+					AC_MSG_ERROR([rtnl_link_vlan_get_id not found.  libibverbs requires libnl.]))
+				 ],[
+					AC_MSG_ERROR([libibverbs requires libnl.])
+				 ])
+	fi
+else
+    AC_DEFINE([NRESOLVE_NEIGH], 1, [Define to 1 to disable resovle neigh annotations.])
+fi
+AM_CONDITIONAL([HAVE_LIBNL], [test "$have_libnl" = "yes"])
+AC_SUBST([LIBNL_CFLAGS])
+AC_SUBST([LIBNL_LIBS])
+AM_CONDITIONAL(NO_RESOLVE_NEIGH, test x$with_resolve_neigh = xno)
+
 dnl Checks for libraries
 AC_CHECK_LIB(dl, dlsym, [],
     AC_MSG_ERROR([dlsym() not found.  libibverbs requires libdl.]))
diff --git a/src/neigh.c b/src/neigh.c
new file mode 100644
index 0000000..afa2281
--- /dev/null
+++ b/src/neigh.c
@@ -0,0 +1,1003 @@
+
+#include "config.h"
+#include <net/if_packet.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <netlink/route/rtnl.h>
+#include <netlink/route/link.h>
+#include <netlink/route/route.h>
+#include <netlink/route/neighbour.h>
+#ifndef HAVE_LIBNL1
+#include <netlink/route/link/vlan.h>
+#endif
+
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/timerfd.h>
+#include <netinet/in.h>
+#include <errno.h>
+#include <unistd.h>
+#include <ifaddrs.h>
+#include <netdb.h>
+#ifndef _LINUX_IF_H
+#include <net/if.h>
+#else
+/*Workaround when there's a collision between the includes */
+extern unsigned int if_nametoindex(__const char *__ifname) __THROW;
+#endif
+
+/* for PFX */
+#include "ibverbs.h"
+
+#include "neigh.h"
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+#define print_hdr PFX "resolver: "
+#define print_err(...) fprintf(stderr, print_hdr __VA_ARGS__)
+#ifdef _DEBUG_
+#define print_dbg_s(stream, ...) fprintf((stream), __VA_ARGS__)
+#define print_address(stream, msg, addr) \
+	{							      \
+		char buff[100];					      \
+		print_dbg_s((stream), (msg), nl_addr2str((addr), buff,\
+			    sizeof(buff)));			      \
+	}
+#else
+#define print_dbg_s(stream, args...)
+#define print_address(stream, msg, ...)
+#endif
+#define print_dbg(...) print_dbg_s(stderr, print_hdr __VA_ARGS__)
+
+#ifdef HAVE_LIBNL1
+#define nl_geterror(x) nl_geterror()
+#endif
+
+/* Workaround - declaration missing */
+extern int		rtnl_link_vlan_get_id(struct rtnl_link *);
+
+static pthread_once_t device_neigh_alloc = PTHREAD_ONCE_INIT;
+#ifdef HAVE_LIBNL1
+static struct nl_handle *zero_socket;
+#else
+static struct nl_sock *zero_socket;
+#endif
+
+union sktaddr {
+	struct sockaddr s;
+	struct sockaddr_in s4;
+	struct sockaddr_in6 s6;
+};
+
+struct skt {
+	union sktaddr sktaddr;
+	socklen_t len;
+};
+
+
+static int set_link_port(union sktaddr *s, int port, int oif)
+{
+	switch (s->s.sa_family) {
+	case AF_INET:
+		s->s4.sin_port = port;
+		break;
+	case AF_INET6:
+		s->s6.sin6_port = port;
+		s->s6.sin6_scope_id = oif;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int cmp_address(const struct sockaddr *s1,
+		       const struct sockaddr *s2) {
+	if (s1->sa_family != s2->sa_family)
+		return s1->sa_family ^ s2->sa_family;
+
+	switch (s1->sa_family) {
+	case AF_INET:
+		return ((struct sockaddr_in *)s1)->sin_addr.s_addr ^
+		       ((struct sockaddr_in *)s2)->sin_addr.s_addr;
+	case AF_INET6:
+		return memcmp(
+			((struct sockaddr_in6 *)s1)->sin6_addr.s6_addr,
+			((struct sockaddr_in6 *)s2)->sin6_addr.s6_addr,
+			sizeof(((struct sockaddr_in6 *)s1)->sin6_addr.s6_addr));
+	default:
+		return -ENOTSUP;
+	}
+}
+
+
+static int get_ifindex(const struct sockaddr *s)
+{
+	struct ifaddrs *ifaddr, *ifa;
+	int name2index = -ENODEV;
+
+	if (-1 == getifaddrs(&ifaddr))
+		return errno;
+
+	for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
+		if (NULL == ifa->ifa_addr)
+			continue;
+
+		if (!cmp_address(ifa->ifa_addr, s)) {
+			name2index = if_nametoindex(ifa->ifa_name);
+			break;
+		}
+	}
+
+	freeifaddrs(ifaddr);
+
+	return name2index;
+}
+
+static struct nl_addr *get_neigh_mac(struct get_neigh_handler *neigh_handler)
+{
+	struct rtnl_neigh *neigh;
+	struct nl_addr *ll_addr = NULL;
+
+	/* future optimization - if link local address - parse address and
+	 * return mac */
+	neigh = rtnl_neigh_get(neigh_handler->neigh_cache,
+			       neigh_handler->oif,
+			       neigh_handler->dst);
+	if (NULL == neigh) {
+		print_dbg("Neigh isn't at cache\n");
+		return NULL;
+	}
+
+	ll_addr = rtnl_neigh_get_lladdr(neigh);
+	if (NULL == ll_addr)
+		print_err("Failed to get hw address from neighbour\n");
+	else
+		ll_addr = nl_addr_clone(ll_addr);
+
+	rtnl_neigh_put(neigh);
+	return ll_addr;
+}
+
+static void get_neigh_cb_event(struct nl_object *obj, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+	/* assumed serilized callback (no parallel execution of function) */
+	if (nl_object_match_filter(
+		obj,
+		(struct nl_object *)neigh_handler->filter_neigh)) {
+		struct rtnl_neigh *neigh = (struct rtnl_neigh *)obj;
+		print_dbg("Found a match for neighbour\n");
+		/* check that we didn't set it already */
+		if (NULL == neigh_handler->found_ll_addr) {
+			if (NULL == rtnl_neigh_get_lladdr(neigh)) {
+				print_err("Neighbour doesn't have a hw addr\n");
+				return;
+			}
+			neigh_handler->found_ll_addr =
+				nl_addr_clone(rtnl_neigh_get_lladdr(neigh));
+			if (NULL == neigh_handler->found_ll_addr)
+				print_err("Couldn't copy neighbour hw addr\n");
+		}
+	}
+}
+
+static int get_neigh_cb(struct nl_msg *msg, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+
+	if (nl_msg_parse(msg, &get_neigh_cb_event, neigh_handler) < 0)
+		print_err("Unknown event\n");
+
+	return NL_OK;
+}
+
+static void set_neigh_filter(struct get_neigh_handler *neigh_handler,
+			     struct rtnl_neigh *filter) {
+	neigh_handler->filter_neigh = filter;
+}
+
+static struct rtnl_neigh *create_filter_neigh_for_dst(struct nl_addr *dst_addr,
+						      int oif)
+{
+	struct rtnl_neigh *filter_neigh;
+
+	filter_neigh = rtnl_neigh_alloc();
+	if (NULL == filter_neigh) {
+		print_err("Couldn't allocate filter neigh\n");
+		return NULL;
+	}
+	rtnl_neigh_set_ifindex(filter_neigh, oif);
+	rtnl_neigh_set_dst(filter_neigh, dst_addr);
+
+	return filter_neigh;
+}
+
+#define PORT_DISCARD htons(9)
+#define SEND_PAYLOAD "H"
+
+static int create_socket(struct get_neigh_handler *neigh_handler,
+			 struct skt *addr_dst, int *psock_fd)
+{
+	int err;
+	struct skt addr_src;
+	int sock_fd;
+
+	memset(addr_dst, 0, sizeof(*addr_dst));
+	memset(&addr_src, 0, sizeof(addr_src));
+	addr_src.len = sizeof(addr_src.sktaddr);
+
+	err = nl_addr_fill_sockaddr(neigh_handler->src,
+				    &addr_src.sktaddr.s,
+				    &addr_src.len);
+	if (err) {
+		print_err("couldn't convert src to sockaddr\n");
+		return err;
+	}
+
+	addr_dst->len = sizeof(addr_dst->sktaddr);
+	err = nl_addr_fill_sockaddr(neigh_handler->dst,
+				    &addr_dst->sktaddr.s,
+				    &addr_dst->len);
+	if (err) {
+		print_err("couldn't convert dst to sockaddr\n");
+		return err;
+	}
+
+	err = set_link_port(&addr_dst->sktaddr, PORT_DISCARD,
+			    neigh_handler->oif);
+	if (err)
+		return err;
+
+	sock_fd = socket(addr_dst->sktaddr.s.sa_family,
+			 SOCK_DGRAM | SOCK_CLOEXEC, 0);
+	if (sock_fd == -1)
+		return errno ? -errno : -1;
+	err = bind(sock_fd, &addr_src.sktaddr.s, addr_src.len);
+	if (err) {
+		int bind_err = -errno;
+		print_err("Couldn't bind socket\n");
+		close(sock_fd);
+		return bind_err ?: err;
+	}
+
+	*psock_fd = sock_fd;
+
+	return 0;
+}
+
+#define NUM_OF_RETRIES 10
+#define NUM_OF_TRIES ((NUM_OF_RETRIES) + 1)
+#if NUM_OF_TRIES < 1
+#error "neigh: invalid value of NUM_OF_RETRIES"
+#endif
+static int create_timer(struct get_neigh_handler *neigh_handler)
+{
+	int user_timeout = neigh_handler->timeout/NUM_OF_TRIES;
+	struct timespec timeout = {
+		.tv_sec = user_timeout / 1000,
+		.tv_nsec = (user_timeout % 1000) * 1000000
+	};
+	struct itimerspec timer_time = {.it_value = timeout};
+	int timer_fd;
+
+	timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
+	if (-1 == timer_fd) {
+		print_err("Couldn't create timer\n");
+		return timer_fd;
+	}
+
+	if (neigh_handler->timeout) {
+		if (NUM_OF_TRIES <= 1)
+			bzero(&timer_time.it_interval,
+			      sizeof(timer_time.it_interval));
+		else
+			timer_time.it_interval = timeout;
+		if (timerfd_settime(timer_fd, 0, &timer_time, NULL)) {
+			print_err("Couldn't set timer\n");
+			return -1;
+		}
+	}
+
+	return timer_fd;
+}
+
+#define UDP_SOCKET_MAX_SENDTO 100000ULL
+
+static struct nl_addr *process_get_neigh_mac(
+		struct get_neigh_handler *neigh_handler)
+{
+	int err;
+	struct nl_addr *ll_addr = get_neigh_mac(neigh_handler);
+	struct rtnl_neigh *neigh_filter;
+	fd_set fdset;
+	int sock_fd;
+	int fd;
+	int nfds;
+	int timer_fd;
+	int ret;
+	struct skt addr_dst;
+	char buff[sizeof(SEND_PAYLOAD)] = SEND_PAYLOAD;
+	int retries = 0;
+	uint64_t max_count = UDP_SOCKET_MAX_SENDTO;
+
+	if (NULL != ll_addr)
+		return ll_addr;
+
+	err = nl_socket_add_membership(neigh_handler->sock,
+				       RTNLGRP_NEIGH);
+	if (err < 0) {
+		print_err("%s\n", nl_geterror(err));
+		return NULL;
+	}
+
+	neigh_filter = create_filter_neigh_for_dst(neigh_handler->dst,
+						   neigh_handler->oif);
+	if (NULL == neigh_filter)
+		return NULL;
+
+	set_neigh_filter(neigh_handler, neigh_filter);
+
+#ifdef HAVE_LIBNL1
+	nl_disable_sequence_check(neigh_handler->sock);
+#else
+	nl_socket_disable_seq_check(neigh_handler->sock);
+#endif
+	nl_socket_modify_cb(neigh_handler->sock, NL_CB_VALID, NL_CB_CUSTOM,
+			    &get_neigh_cb, neigh_handler);
+
+	fd = nl_socket_get_fd(neigh_handler->sock);
+
+	err = create_socket(neigh_handler, &addr_dst, &sock_fd);
+
+	if (err)
+		return NULL;
+
+	do {
+		err = sendto(sock_fd, buff, sizeof(buff), 0,
+			     &addr_dst.sktaddr.s,
+			     addr_dst.len);
+		if (err > 0)
+			err = 0;
+	} while (-1 == err && EADDRNOTAVAIL == errno && --max_count);
+
+	if (err) {
+		print_err("Failed to send packet %d", err);
+		goto close_socket;
+	}
+	timer_fd = create_timer(neigh_handler);
+	if (timer_fd < 0)
+		goto close_socket;
+
+	nfds = MAX(fd, timer_fd) + 1;
+
+
+	while (1) {
+		FD_ZERO(&fdset);
+		FD_SET(fd, &fdset);
+		FD_SET(timer_fd, &fdset);
+
+		/* wait for an incoming message on the netlink socket */
+		ret = select(nfds, &fdset, NULL, NULL, NULL);
+
+		if (ret) {
+			if (FD_ISSET(fd, &fdset)) {
+				nl_recvmsgs_default(neigh_handler->sock);
+				if (neigh_handler->found_ll_addr)
+					break;
+			} else {
+				nl_cache_refill(neigh_handler->sock,
+						neigh_handler->neigh_cache);
+				ll_addr = get_neigh_mac(neigh_handler);
+				if (NULL != ll_addr) {
+					break;
+				} else if (FD_ISSET(timer_fd, &fdset) &&
+					   retries < NUM_OF_RETRIES) {
+					if (sendto(sock_fd, buff, sizeof(buff),
+						   0, &addr_dst.sktaddr.s,
+						   addr_dst.len) < 0)
+						print_err("Failed to send "
+							  "packet while waiting"
+							  " for events\n");
+				}
+			}
+
+			if (FD_ISSET(timer_fd, &fdset)) {
+				uint64_t read_val;
+				if (read(timer_fd, &read_val, sizeof(read_val))
+				    < 0)
+					print_dbg("Read timer_fd failed\n");
+				if (++retries >=  NUM_OF_TRIES) {
+					print_dbg("Timeout while trying to fetch "
+						  "neighbours\n");
+					break;
+				}
+			}
+		}
+	}
+	close(timer_fd);
+close_socket:
+	close(sock_fd);
+	return ll_addr ? ll_addr : neigh_handler->found_ll_addr;
+}
+
+
+static int get_mcast_mac_ipv4(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	char mac_addr[6] = {0x01, 0x00, 0x5E};
+	uint32_t addr = ntohl(*(uint32_t *)nl_addr_get_binary_addr(dst));
+	mac_addr[5] = addr & 0xFF;
+	addr >>= 8;
+	mac_addr[4] = addr & 0xFF;
+	addr >>= 8;
+	mac_addr[3] = addr & 0x7F;
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+static int get_mcast_mac_ipv6(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	char mac_addr[6] = {0x33, 0x33};
+	memcpy(mac_addr + 2, (char *)nl_addr_get_binary_addr(dst) + 12, 4);
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+static int get_link_local_mac_ipv6(struct nl_addr *dst,
+				   struct nl_addr **ll_addr)
+{
+	char mac_addr[6];
+
+	memcpy(mac_addr + 3, (char *)nl_addr_get_binary_addr(dst) + 13, 3);
+	memcpy(mac_addr, (char *)nl_addr_get_binary_addr(dst) + 8, 3);
+	mac_addr[0] ^= 2;
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+const struct encoded_l3_addr {
+	short family;
+	uint8_t prefix_bits;
+	const char data[16];
+	int (*getter)(struct nl_addr *dst, struct nl_addr **ll_addr);
+} encoded_prefixes[] = {
+	{.family = AF_INET,
+	 .prefix_bits = 4,
+	 .data = {0xe0},
+	 .getter = &get_mcast_mac_ipv4},
+	{.family = AF_INET6,
+	 .prefix_bits = 8,
+	 .data = {0xff},
+	 .getter = &get_mcast_mac_ipv6},
+	{.family = AF_INET6,
+	 .prefix_bits = 64,
+	 .data = {0xfe, 0x80},
+	 .getter = get_link_local_mac_ipv6},
+};
+
+int nl_addr_cmp_prefix_msb(void *addr1, int len1, void *addr2, int len2)
+{
+	int len = MIN(len1, len2);
+	int bytes = len / 8;
+	int d = memcmp(addr1, addr2, bytes);
+
+	if (d == 0) {
+		int mask = ((1UL << (len % 8)) - 1UL) << (8 - len);
+
+		d = (((char *)addr1)[bytes] & mask) -
+		    (((char *)addr2)[bytes] & mask);
+	}
+
+	return d;
+}
+static int handle_encoded_mac(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	uint32_t family = nl_addr_get_family(dst);
+	struct nl_addr *prefix = NULL;
+	int i;
+	int ret = 1;
+
+	for (i = 0;
+	     i < sizeof(encoded_prefixes)/sizeof(encoded_prefixes[0]) &&
+	     ret; prefix = NULL, i++) {
+		if (encoded_prefixes[i].family != family)
+			continue;
+
+		prefix = nl_addr_build(
+				family,
+				(void *)encoded_prefixes[i].data,
+				MIN(encoded_prefixes[i].prefix_bits/8 +
+				    !!(encoded_prefixes[i].prefix_bits % 8),
+				    sizeof(encoded_prefixes[i].data)));
+
+		if (NULL == prefix)
+			return -ENOMEM;
+		nl_addr_set_prefixlen(prefix,
+				      encoded_prefixes[i].prefix_bits);
+
+		if (nl_addr_cmp_prefix_msb(nl_addr_get_binary_addr(dst),
+					   nl_addr_get_prefixlen(dst),
+					   nl_addr_get_binary_addr(prefix),
+					   nl_addr_get_prefixlen(prefix)))
+			continue;
+
+		ret = encoded_prefixes[i].getter(dst, ll_addr);
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(prefix);
+#else
+		nl_addr_put(prefix);
+#endif
+	}
+
+	return ret;
+}
+
+static void get_route_cb_parser(struct nl_object *obj, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+
+	struct rtnl_route *route = (struct rtnl_route *)obj;
+	struct nl_addr *gateway;
+	struct nl_addr *src = rtnl_route_get_pref_src(route);
+	int oif;
+	int type = rtnl_route_get_type(route);
+	struct rtnl_link *link;
+
+#ifdef HAVE_LIBNL1
+	gateway = rtnl_route_get_gateway(route);
+	oif = rtnl_route_get_oif(route);
+#else
+	struct rtnl_nexthop *nh = rtnl_route_nexthop_n(route, 0);
+	if (!nh)
+		print_err("Out of memory\n");
+	gateway = rtnl_route_nh_get_gateway(nh);
+	oif = rtnl_route_nh_get_ifindex(nh);
+#endif
+
+	if (gateway) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->dst);
+#else
+		nl_addr_put(neigh_handler->dst);
+#endif
+		neigh_handler->dst = nl_addr_clone(gateway);
+		print_dbg("Found gateway\n");
+	}
+
+	if (RTN_BLACKHOLE == type ||
+	    RTN_UNREACHABLE == type ||
+	    RTN_PROHIBIT == type ||
+	    RTN_THROW == type) {
+		print_err("Destination unrechable (type %d)\n", type);
+		goto err;
+	}
+
+	if (!neigh_handler->src && src)
+		neigh_handler->src = nl_addr_clone(src);
+
+	if (neigh_handler->oif < 0 && oif > 0)
+		neigh_handler->oif = oif;
+
+	/* Link Local */
+	if (RTN_LOCAL == type) {
+		struct nl_addr *lladdr;
+
+		link = rtnl_link_get(neigh_handler->link_cache,
+				     neigh_handler->oif);
+
+		if (NULL == link)
+			goto err;
+
+		lladdr = rtnl_link_get_addr(link);
+
+		if (NULL == lladdr)
+			goto err_link;
+
+		neigh_handler->found_ll_addr = nl_addr_clone(lladdr);
+		rtnl_link_put(link);
+	} else {
+		if (!handle_encoded_mac(
+				neigh_handler->dst,
+				&neigh_handler->found_ll_addr))
+			print_address(stderr, "calculated address %s\n",
+				      neigh_handler->found_ll_addr);
+	}
+
+	print_address(stderr, "Current dst %s\n", neigh_handler->dst);
+	return;
+
+err_link:
+	rtnl_link_put(link);
+err:
+	if (neigh_handler->src) {
+#ifdef HAVE_LIBNL1
+		nl_addr_put(neigh_handler->src);
+#else
+		nl_addr_put(neigh_handler->src);
+#endif
+		neigh_handler->src = NULL;
+	}
+}
+
+static int get_route_cb(struct nl_msg *msg, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+	int err;
+
+	err = nl_msg_parse(msg, &get_route_cb_parser, neigh_handler);
+	if (err < 0) {
+		print_err("Unable to parse object: %s", nl_geterror(err));
+		return err;
+	}
+
+	if (!neigh_handler->dst || !neigh_handler->src ||
+	    neigh_handler->oif <= 0) {
+		print_err("Missing params\n");
+		return -1;
+	}
+
+	if (NULL != neigh_handler->found_ll_addr)
+		goto found;
+
+	neigh_handler->found_ll_addr =
+		process_get_neigh_mac(neigh_handler);
+	if (neigh_handler->found_ll_addr)
+		print_address(stderr, "Neigh is %s\n",
+			      neigh_handler->found_ll_addr);
+
+found:
+	return neigh_handler->found_ll_addr ? 0 : -1;
+}
+
+int neigh_get_oif_from_src(struct get_neigh_handler *neigh_handler)
+{
+	int oif = -ENODEV;
+	struct addrinfo *src_info;
+
+#ifdef HAVE_LIBNL1
+	src_info = nl_addr_info(neigh_handler->src);
+	if (NULL == src_info) {
+#else
+	int err;
+	err = nl_addr_info(neigh_handler->src, &src_info);
+	if (err) {
+#endif
+		print_err("Unable to get address info %s\n",
+			  nl_geterror(err));
+		return oif;
+	}
+
+	oif = get_ifindex(src_info->ai_addr);
+	if (oif <= 0)
+		goto free;
+
+	print_dbg("IF index is %d\n", oif);
+
+free:
+	freeaddrinfo(src_info);
+	return oif;
+}
+
+static void destroy_zero_based_socket(void)
+{
+	if (NULL != zero_socket) {
+		print_dbg("destroying zero based socket\n");
+#ifdef HAVE_LIBNL1
+		nl_handle_destroy(zero_socket);
+#else
+		nl_socket_free(zero_socket);
+#endif
+	}
+}
+
+static void alloc_zero_based_socket(void)
+{
+#ifdef HAVE_LIBNL1
+	zero_socket = nl_handle_alloc();
+#else
+	zero_socket = nl_socket_alloc();
+#endif
+	print_dbg("creating zero based socket\n");
+	atexit(&destroy_zero_based_socket);
+}
+
+int neigh_init_resources(struct get_neigh_handler *neigh_handler, int timeout)
+{
+	int err;
+
+	pthread_once(&device_neigh_alloc, &alloc_zero_based_socket);
+#ifdef HAVE_LIBNL1
+	neigh_handler->sock = nl_handle_alloc();
+#else
+	neigh_handler->sock = nl_socket_alloc();
+#endif
+	if (NULL == neigh_handler->sock) {
+		print_err("Unable to allocate netlink socket\n");
+		return -ENOMEM;
+	}
+
+	err = nl_connect(neigh_handler->sock, NETLINK_ROUTE);
+	if (err < 0) {
+		print_err("Unable to connect netlink socket: %s\n",
+			  nl_geterror(err));
+		goto free_socket;
+	}
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->link_cache =
+		rtnl_link_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->link_cache) {
+		err = -ENOMEM;
+#else
+
+	err = rtnl_link_alloc_cache(neigh_handler->sock, AF_UNSPEC,
+				    &neigh_handler->link_cache);
+	if (err) {
+#endif
+		print_err("Unable to allocate link cache: %s\n",
+			  nl_geterror(err));
+		err = -ENOMEM;
+		goto close_connection;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->link_cache);
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->route_cache =
+		rtnl_route_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->route_cache) {
+		err = -ENOMEM;
+#else
+	err = rtnl_route_alloc_cache(neigh_handler->sock, AF_UNSPEC, 0,
+				     &neigh_handler->route_cache);
+	if (err) {
+#endif
+		print_err("Unable to allocate route cache: %s\n",
+			  nl_geterror(err));
+		goto free_link_cache;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->route_cache);
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->neigh_cache =
+		rtnl_neigh_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->neigh_cache) {
+		err = -ENOMEM;
+#else
+	err = rtnl_neigh_alloc_cache(neigh_handler->sock,
+				     &neigh_handler->neigh_cache);
+	if (err) {
+#endif
+		print_err("Couldn't allocate neigh cache %s\n",
+			  nl_geterror(err));
+		goto free_route_cache;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->neigh_cache);
+
+	/* init structure */
+	neigh_handler->timeout = timeout;
+	neigh_handler->oif = -1;
+	neigh_handler->filter_neigh = NULL;
+	neigh_handler->found_ll_addr = NULL;
+	neigh_handler->dst = NULL;
+	neigh_handler->src = NULL;
+	neigh_handler->vid = -1;
+	neigh_handler->neigh_status = GET_NEIGH_STATUS_OK;
+
+	return 0;
+
+free_route_cache:
+	nl_cache_mngt_unprovide(neigh_handler->route_cache);
+	nl_cache_free(neigh_handler->route_cache);
+	neigh_handler->route_cache = NULL;
+free_link_cache:
+	nl_cache_mngt_unprovide(neigh_handler->link_cache);
+	nl_cache_free(neigh_handler->link_cache);
+	neigh_handler->link_cache = NULL;
+close_connection:
+	nl_close(neigh_handler->sock);
+free_socket:
+#ifdef HAVE_LIBNL1
+	nl_handle_destroy(neigh_handler->sock);
+		nl_handle_destroy(zero_socket);
+#else
+	nl_socket_free(neigh_handler->sock);
+#endif
+	neigh_handler->sock = NULL;
+	return err;
+}
+
+uint16_t neigh_get_vlan_id_from_dev(struct get_neigh_handler *neigh_handler)
+{
+	struct rtnl_link *link;
+	int vid = 0xffff;
+
+	link = rtnl_link_get(neigh_handler->link_cache, neigh_handler->oif);
+	if (NULL == link) {
+		print_err("Can't find link in cache\n");
+		return -EINVAL;
+	}
+
+#ifndef HAVE_LIBNL1
+	if (rtnl_link_is_vlan(link))
+#endif
+		vid = rtnl_link_vlan_get_id(link);
+	rtnl_link_put(link);
+	return vid >= 0 && vid <= 0xfff ? vid : 0xffff;
+}
+
+void neigh_set_vlan_id(struct get_neigh_handler *neigh_handler, uint16_t vid)
+{
+	if (vid >= 0 && vid <= 0xfff)
+		neigh_handler->vid = vid;
+}
+
+int neigh_set_dst(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size)
+{
+	neigh_handler->dst = nl_addr_build(family, buf, size);
+	return NULL == neigh_handler->dst;
+}
+
+int neigh_set_src(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size)
+{
+	neigh_handler->src = nl_addr_build(family, buf, size);
+	return NULL == neigh_handler->src;
+}
+
+void neigh_set_oif(struct get_neigh_handler *neigh_handler, int oif)
+{
+	neigh_handler->oif = oif;
+}
+
+int neigh_get_ll(struct get_neigh_handler *neigh_handler, void *addr_buff,
+		 int addr_size) {
+	int neigh_len;
+
+	if (NULL == neigh_handler->found_ll_addr)
+		return -EINVAL;
+
+	 neigh_len = nl_addr_get_len(neigh_handler->found_ll_addr);
+
+	if (neigh_len > addr_size)
+		return -EINVAL;
+
+	memcpy(addr_buff, nl_addr_get_binary_addr(neigh_handler->found_ll_addr),
+	       neigh_len);
+
+	return neigh_len;
+}
+
+void neigh_free_resources(struct get_neigh_handler *neigh_handler)
+{
+	/* Should be released first because it's holding a reference to dst */
+	if (NULL != neigh_handler->filter_neigh) {
+		rtnl_neigh_put(neigh_handler->filter_neigh);
+		neigh_handler->filter_neigh = NULL;
+	}
+
+	if (NULL != neigh_handler->src) {
+#ifdef HAVE_LIBNL1
+		nl_addr_put(neigh_handler->src);
+#else
+		nl_addr_put(neigh_handler->src);
+#endif
+		neigh_handler->src = NULL;
+	}
+
+	if (NULL != neigh_handler->dst) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->dst);
+#else
+		nl_addr_put(neigh_handler->dst);
+#endif
+		neigh_handler->dst = NULL;
+	}
+
+	if (NULL != neigh_handler->found_ll_addr) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->found_ll_addr);
+#else
+		nl_addr_put(neigh_handler->found_ll_addr);
+#endif
+		neigh_handler->found_ll_addr = NULL;
+	}
+
+	if (NULL != neigh_handler->neigh_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->neigh_cache);
+		nl_cache_free(neigh_handler->neigh_cache);
+		neigh_handler->neigh_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->route_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->route_cache);
+		nl_cache_free(neigh_handler->route_cache);
+		neigh_handler->route_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->link_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->link_cache);
+		nl_cache_free(neigh_handler->link_cache);
+		neigh_handler->link_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->sock) {
+		nl_close(neigh_handler->sock);
+#ifdef HAVE_LIBNL1
+		nl_handle_destroy(neigh_handler->sock);
+#else
+		nl_socket_free(neigh_handler->sock);
+#endif
+		neigh_handler->sock = NULL;
+	}
+}
+
+int process_get_neigh(struct get_neigh_handler *neigh_handler)
+{
+	struct nl_msg *m;
+	struct rtmsg rmsg = {
+		.rtm_family = nl_addr_get_family(neigh_handler->dst),
+		.rtm_dst_len = nl_addr_get_prefixlen(neigh_handler->dst),
+	};
+	int err;
+	int last_status;
+
+	last_status = __sync_fetch_and_or(
+			&neigh_handler->neigh_status,
+			GET_NEIGH_STATUS_IN_PROCESS);
+
+	if (last_status != GET_NEIGH_STATUS_OK)
+		return -EINPROGRESS;
+
+	m = nlmsg_alloc_simple(RTM_GETROUTE, 0);
+
+	if (NULL == m)
+		return -ENOMEM;
+
+	nlmsg_append(m, &rmsg, sizeof(rmsg), NLMSG_ALIGNTO);
+
+	nla_put_addr(m, RTA_DST, neigh_handler->dst);
+
+	if (neigh_handler->oif > 0)
+		nla_put_u32(m, RTA_OIF, neigh_handler->oif);
+
+	err = nl_send_auto_complete(neigh_handler->sock, m);
+	nlmsg_free(m);
+	if (err < 0) {
+		print_err("%s", nl_geterror(err));
+		return err;
+	}
+
+	nl_socket_modify_cb(neigh_handler->sock, NL_CB_VALID,
+			    NL_CB_CUSTOM, &get_route_cb, neigh_handler);
+
+	err = nl_recvmsgs_default(neigh_handler->sock);
+	if (err < 0) {
+		print_err("%s", nl_geterror(err));
+		__sync_fetch_and_or(
+			&neigh_handler->neigh_status,
+			GET_NEIGH_STATUS_ERR);
+	}
+
+	__sync_fetch_and_and(
+		&neigh_handler->neigh_status,
+		~GET_NEIGH_STATUS_IN_PROCESS);
+
+	return err;
+}
diff --git a/src/neigh.h b/src/neigh.h
new file mode 100644
index 0000000..09e01e1
--- /dev/null
+++ b/src/neigh.h
@@ -0,0 +1,53 @@
+#ifndef _GET_NEIGH_
+#define _GET_NEIGH_
+
+#include <stddef.h>
+#include <stdint.h>
+#include "config.h"
+#ifdef HAVE_LIBNL1
+#include <netlink/object.h>
+#else
+#include <netlink/object-api.h>
+#endif
+
+enum get_neigh_status {
+	GET_NEIGH_STATUS_OK = 0,
+	GET_NEIGH_STATUS_IN_PROCESS = 1 << 0,
+	GET_NEIGH_STATUS_ERR = 1 << 1,
+};
+
+struct get_neigh_handler {
+#ifdef HAVE_LIBNL1
+	struct nl_handle *sock;
+#else
+	struct nl_sock *sock;
+#endif
+	struct nl_cache *link_cache;
+	struct nl_cache	*neigh_cache;
+	struct nl_cache *route_cache;
+	int32_t oif;
+	int vid;
+	struct rtnl_neigh *filter_neigh;
+	struct nl_addr *found_ll_addr;
+	struct nl_addr *dst;
+	struct nl_addr *src;
+	enum get_neigh_status neigh_status;
+	uint64_t timeout;
+};
+
+int process_get_neigh(struct get_neigh_handler *neigh_handler);
+void neigh_free_resources(struct get_neigh_handler *neigh_handler);
+void neigh_set_vlan_id(struct get_neigh_handler *neigh_handler, uint16_t vid);
+uint16_t neigh_get_vlan_id_from_dev(struct get_neigh_handler *neigh_handler);
+int neigh_init_resources(struct get_neigh_handler *neigh_handler, int timeout);
+
+int neigh_set_src(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size);
+void neigh_set_oif(struct get_neigh_handler *neigh_handler, int oif);
+int neigh_set_dst(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size);
+int neigh_get_oif_from_src(struct get_neigh_handler *neigh_handler);
+int neigh_get_ll(struct get_neigh_handler *neigh_handler, void *addr_buf,
+		 int addr_size);
+
+#endif
diff --git a/src/verbs.c b/src/verbs.c
index a6aae70..6db79e8 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -43,6 +43,11 @@
 #include <string.h>
 
 #include "ibverbs.h"
+#ifndef NRESOLVE_NEIGH
+#include <net/if.h>
+#include <net/if_arp.h>
+#include "neigh.h"
+#endif
 
 int ibv_rate_to_mult(enum ibv_rate rate)
 {
@@ -505,15 +510,168 @@ int __ibv_destroy_qp(struct ibv_qp *qp)
 }
 default_symver(__ibv_destroy_qp, ibv_destroy_qp);
 
+static inline int ipv6_addr_v4mapped(const struct in6_addr *a)
+{
+	return ((a->s6_addr32[0] | a->s6_addr32[1]) |
+		(a->s6_addr32[2] ^ htonl(0x0000ffff))) == 0UL ||
+		/* IPv4 encoded multicast addresses */
+	       (a->s6_addr32[0]  == htonl(0xff0e0000) &&
+		((a->s6_addr32[1] |
+		  (a->s6_addr32[2] ^ htonl(0x0000ffff))) == 0UL));
+}
+
+
+struct peer_address {
+	void *address;
+	uint32_t size;
+};
+
+static inline int create_peer_from_gid(int family, void *raw_gid,
+				       struct peer_address *peer_address)
+{
+	switch (family) {
+	case AF_INET:
+		peer_address->address = raw_gid + 12;
+		peer_address->size = 4;
+		break;
+	case AF_INET6:
+		peer_address->address = raw_gid;
+		peer_address->size = 16;
+		break;
+	default:
+		return -1;
+	}
+
+	return 0;
+}
+
+#define ETHERNET_LL_SIZE 6
+#define NEIGH_GET_DEFAULT_TIMEOUT_MS 3000
 struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 {
-	struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr);
+	int err;
+	struct ibv_ah *ah = NULL;
+#ifndef NRESOLVE_NEIGH
+	struct ibv_port_attr port_attr;
+	int dst_family;
+	int src_family;
+	int oif;
+	struct get_neigh_handler neigh_handler;
+	union ibv_gid sgid;
+	struct ibv_ah_attr_ex attr_ex;
+	int ether_len;
+	struct verbs_context *vctx = verbs_get_ctx_op(pd->context,
+						      create_ah_ex);
+	struct peer_address src;
+	struct peer_address dst;
+
+	if (!vctx) {
+#endif
+		ah = pd->context->ops.create_ah(pd, attr);
+#ifndef NRESOLVE_NEIGH
+		goto return_ah;
+	}
+
+	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
+
+	if (err) {
+		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");
+		return NULL;
+	}
 
+	if (IBV_LINK_LAYER_ETHERNET != port_attr.link_layer ||
+	    !(port_attr.port_cap_flags & IBV_PORT_IP_BASED_GIDS)) {
+		ah = pd->context->ops.create_ah(pd, attr);
+		goto return_ah;
+	}
+
+	memcpy(&attr_ex, attr, sizeof(*attr));
+	memset((void *)&attr_ex + sizeof(*attr), 0,
+	       sizeof(attr_ex) - sizeof(*attr));
+
+	err = ibv_query_gid(pd->context, attr->port_num,
+			    attr->grh.sgid_index, &sgid);
+
+	if (err) {
+		fprintf(stderr, PFX "ibv_create_ah failed to query sgid.\n");
+		return NULL;
+	}
+
+	if (neigh_init_resources(&neigh_handler, NEIGH_GET_DEFAULT_TIMEOUT_MS))
+		return NULL;
+
+	dst_family = ipv6_addr_v4mapped((struct in6_addr *)attr->grh.dgid.raw) ?
+			AF_INET : AF_INET6;
+	src_family = ipv6_addr_v4mapped((struct in6_addr *)sgid.raw) ?
+			AF_INET : AF_INET6;
+
+	if (create_peer_from_gid(dst_family, attr->grh.dgid.raw, &dst)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create dst peer\n");
+		goto free_resources;
+	}
+	if (create_peer_from_gid(src_family, &sgid.raw, &src)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create src peer\n");
+		goto free_resources;
+	}
+	if (neigh_set_dst(&neigh_handler, dst_family, dst.address,
+			  dst.size)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create dst addr\n");
+		goto free_resources;
+	}
+
+	if (neigh_set_src(&neigh_handler, src_family, src.address,
+			  src.size)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create src addr\n");
+		goto free_resources;
+	}
+
+	oif = neigh_get_oif_from_src(&neigh_handler);
+
+	if (oif > 0) {
+		neigh_set_oif(&neigh_handler, oif);
+	} else {
+		fprintf(stderr, PFX "ibv_create_ah failed to get output IF\n");
+		goto free_resources;
+	}
+
+	/* blocking call */
+	if (process_get_neigh(&neigh_handler)) {
+		fprintf(stderr, PFX "Neigh resolution process failed\n");
+		goto free_resources;
+	}
+
+	attr_ex.vid = neigh_get_vlan_id_from_dev(&neigh_handler);
+
+	if (attr_ex.vid <= 0xfff) {
+		neigh_set_vlan_id(&neigh_handler, attr_ex.vid);
+		attr_ex.comp_mask |= IBV_AH_ATTR_EX_VID;
+	}
+	/* We are using only Ethernet here */
+	ether_len = neigh_get_ll(&neigh_handler,
+				 &attr_ex.ll.sa.sa_data,
+				 IFHWADDRLEN);
+
+	if (ether_len <= 0)
+		goto free_resources;
+
+	attr_ex.comp_mask |= IBV_AH_ATTR_EX_LL;
+	attr_ex.ll.sa.sa_family = ARPHRD_ETHER;
+
+	ah = vctx->create_ah_ex(pd, &attr_ex);
+
+free_resources:
+	neigh_free_resources(&neigh_handler);
+
+return_ah:
+#endif
 	if (ah) {
 		ah->context = pd->context;
 		ah->pd      = pd;
 	}
-
 	return ah;
 }
 default_symver(__ibv_create_ah, ibv_create_ah);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
                     ` (2 preceding siblings ...)
  2014-05-18  9:38   ` [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
@ 2014-05-18  9:38   ` Or Gerlitz
       [not found]     ` <1400405929-14313-5-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
  2014-05-18  9:38   ` [PATCH libibverbs V4 5/5] Optimize ibv_create_ah link query Or Gerlitz
  2014-05-20  6:58   ` [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing Or Gerlitz
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds extended support for ibv_query_port.

This allows to request fields that aren't availible by the current
ibv_query_port API and avoid fetching from vendor library fields that
the user doesn't need, which gives more room for optimizations.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                |    3 +-
 include/infiniband/verbs.h |   89 ++++++++++++++++++++++++++++++++++++++++
 man/ibv_query_port_ex.3    |   97 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 188 insertions(+), 1 deletions(-)
 create mode 100644 man/ibv_query_port_ex.3

diff --git a/Makefile.am b/Makefile.am
index 3104d5a..c992a36 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -62,7 +62,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1	\
     man/ibv_query_srq.3 man/ibv_rate_to_mult.3 man/ibv_reg_mr.3		\
     man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
     man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
-    man/ibv_get_srq_num.3 man/ibv_open_qp.3 man/ibv_create_flow.3
+    man/ibv_get_srq_num.3 man/ibv_open_qp.3 man/ibv_create_flow.3 	\
+    man/ibv_query_port_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
     debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index ab497b1..1dc2a8d 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -489,6 +489,58 @@ struct ibv_ah_attr_ex {
 	uint16_t		vid;
 };
 
+enum ibv_query_port_ex_attr_mask {
+	IBV_QUERY_PORT_EX_STATE			= 1 << 0,
+	IBV_QUERY_PORT_EX_MAX_MTU		= 1 << 1,
+	IBV_QUERY_PORT_EX_ACTIVE_MTU		= 1 << 2,
+	IBV_QUERY_PORT_EX_GID_TBL_LEN		= 1 << 3,
+	IBV_QUERY_PORT_EX_CAP_FLAGS		= 1 << 4,
+	IBV_QUERY_PORT_EX_MAX_MSG_SZ		= 1 << 5,
+	IBV_QUERY_PORT_EX_BAD_PKEY_CNTR		= 1 << 6,
+	IBV_QUERY_PORT_EX_QKEY_VIOL_CNTR	= 1 << 7,
+	IBV_QUERY_PORT_EX_PKEY_TBL_LEN		= 1 << 8,
+	IBV_QUERY_PORT_EX_LID			= 1 << 9,
+	IBV_QUERY_PORT_EX_SM_LID		= 1 << 10,
+	IBV_QUERY_PORT_EX_LMC			= 1 << 11,
+	IBV_QUERY_PORT_EX_MAX_VL_NUM		= 1 << 12,
+	IBV_QUERY_PORT_EX_SM_SL			= 1 << 13,
+	IBV_QUERY_PORT_EX_SUBNET_TIMEOUT	= 1 << 14,
+	IBV_QUERY_PORT_EX_INIT_TYPE_REPLY	= 1 << 15,
+	IBV_QUERY_PORT_EX_ACTIVE_WIDTH		= 1 << 16,
+	IBV_QUERY_PORT_EX_ACTIVE_SPEED		= 1 << 17,
+	IBV_QUERY_PORT_EX_PHYS_STATE		= 1 << 18,
+	IBV_QUERY_PORT_EX_LINK_LAYER		= 1 << 19,
+	/* mask of the fields that exists in the standard query_port_command */
+	IBV_QUERY_PORT_EX_STD_MASK		= (1 << 20) - 1,
+	/* mask of all supported fields */
+	IBV_QUERY_PORT_EX_MASK			= IBV_QUERY_PORT_EX_STD_MASK,
+};
+
+struct ibv_port_attr_ex {
+	enum ibv_port_state	state;
+	enum ibv_mtu		max_mtu;
+	enum ibv_mtu		active_mtu;
+	int			gid_tbl_len;
+	uint32_t		port_cap_flags;
+	uint32_t		max_msg_sz;
+	uint32_t		bad_pkey_cntr;
+	uint32_t		qkey_viol_cntr;
+	uint16_t		pkey_tbl_len;
+	uint16_t		lid;
+	uint16_t		sm_lid;
+	uint8_t			lmc;
+	uint8_t			max_vl_num;
+	uint8_t			sm_sl;
+	uint8_t			subnet_timeout;
+	uint8_t			init_type_reply;
+	uint8_t			active_width;
+	uint8_t			active_speed;
+	uint8_t			phys_state;
+	uint8_t			link_layer;
+	uint8_t			reserved;
+	uint32_t		comp_mask;
+};
+
 
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
@@ -998,6 +1050,8 @@ enum verbs_context_mask {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*query_port_ex)(struct ibv_context *context, uint8_t port_num,
+			     struct ibv_port_attr_ex *port_attr);
 	struct ibv_ah * (*create_ah_ex)(struct ibv_pd *pd,
 					struct ibv_ah_attr_ex *attr);
 	void (*_reserved_2)(void);
@@ -1128,6 +1182,41 @@ static inline int ___ibv_query_port(struct ibv_context *context,
 	return ibv_query_port(context, port_num, port_attr);
 }
 
+static inline int ibv_query_port_ex(struct ibv_context *context,
+				    uint8_t port_num,
+				    struct ibv_port_attr_ex *port_attr)
+{
+	struct verbs_context *vctx;
+
+	port_attr->link_layer = IBV_LINK_LAYER_UNSPECIFIED;
+	port_attr->reserved   = 0;
+
+	if (0 == port_attr->comp_mask)
+		return ibv_query_port(context, port_num,
+				      (struct ibv_port_attr *)port_attr);
+
+	/* Check that only valid flags were given */
+	if (port_attr->comp_mask & ~IBV_QUERY_PORT_EX_MASK) {
+		errno = EINVAL;
+		return -errno;
+	}
+
+	vctx = verbs_get_ctx_op(context, query_port_ex);
+
+	if (!vctx) {
+		/* Fallback to legacy mode */
+		if (!(port_attr->comp_mask & ~IBV_QUERY_PORT_EX_STD_MASK))
+			return ibv_query_port(context, port_num,
+					      (struct ibv_port_attr *)port_attr);
+
+		/* Unsupported field was requested */
+		errno = ENOSYS;
+		return -errno;
+	}
+
+	return vctx->query_port_ex(context, port_num, port_attr);
+}
+
 #define ibv_query_port(context, port_num, port_attr) \
 	___ibv_query_port(context, port_num, port_attr)
 
diff --git a/man/ibv_query_port_ex.3 b/man/ibv_query_port_ex.3
new file mode 100644
index 0000000..07073fd
--- /dev/null
+++ b/man/ibv_query_port_ex.3
@@ -0,0 +1,97 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_QUERY_PORT_EX 3 2006-10-31 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_query_port_ex \- query an RDMA port's attributes
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_query_port_ex(struct ibv_context " "*context" ", uint8_t " "port_num" ,
+.BI "                      struct ibv_port_attr_ex " "*port_attr" ");
+.fi
+.SH "DESCRIPTION"
+.B ibv_query_port_ex()
+returns the attributes of port
+.I port_num
+for device context
+.I context
+through the pointer
+.I port_attr\fR.
+The argument
+.I port_attr
+is an ibv_port_attr struct, as defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_port_attr_ex {
+.in +8
+enum ibv_port_state     state;          /* Logical port state */
+enum ibv_mtu            max_mtu;        /* Max MTU supported by port */
+enum ibv_mtu            active_mtu;     /* Actual MTU */
+int                     gid_tbl_len;    /* Length of source GID table */
+uint32_t                port_cap_flags; /* Port capabilities */
+uint32_t                max_msg_sz;     /* Maximum message size */
+uint32_t                bad_pkey_cntr;  /* Bad P_Key counter */
+uint32_t                qkey_viol_cntr; /* Q_Key violation counter */
+uint16_t                pkey_tbl_len;   /* Length of partition table */
+uint16_t                lid;            /* Base port LID */
+uint16_t                sm_lid;         /* SM LID */
+uint8_t                 lmc;            /* LMC of LID */
+uint8_t                 max_vl_num;     /* Maximum number of VLs */
+uint8_t                 sm_sl;          /* SM service level */
+uint8_t                 subnet_timeout; /* Subnet propagation delay */
+uint8_t                 init_type_reply;/* Type of initialization performed by SM */
+uint8_t                 active_width;   /* Currently active link width */
+uint8_t                 active_speed;   /* Currently active link speed */
+uint8_t                 phys_state;     /* Physical port state */
+uint8_t                 link_layer;     /* link layer protocol of the port */
+uint8_t                 reserved;       /* reserved field */
+uint32_t                comp_mask;      /* relevant fields for command */
+.in -8
+};
+.sp
+The comp_mask value describes which values the user is intrested in.
+Only those fields have a meaningful value when the command returns successfully.
+The available fields are described by ibv_query_port_ex_attr_mask, which is defined
+in <infiniband/verbs.h>. Every bit in comp_mask corresponds to a field in
+struct ibv_port_attr_ex, by the same order.
+.PP
+.nf
+enum ibv_query_port_ex_attr_mask {
+        IBV_QUERY_PORT_EX_STATE                 = 1 << 0,
+        IBV_QUERY_PORT_EX_MAX_MTU               = 1 << 1,
+        IBV_QUERY_PORT_EX_ACTIVE_MTU            = 1 << 2,
+        IBV_QUERY_PORT_EX_GID_TBL_LEN           = 1 << 3,
+        IBV_QUERY_PORT_EX_CAP_FLAGS             = 1 << 4,
+        IBV_QUERY_PORT_EX_MAX_MSG_SZ            = 1 << 5,
+        IBV_QUERY_PORT_EX_BAD_PKEY_CNTR         = 1 << 6,
+        IBV_QUERY_PORT_EX_QKEY_VIOL_CNTR        = 1 << 7,
+        IBV_QUERY_PORT_EX_PKEY_TBL_LEN          = 1 << 8,
+        IBV_QUERY_PORT_EX_LID                   = 1 << 9,
+        IBV_QUERY_PORT_EX_SM_LID                = 1 << 10,
+        IBV_QUERY_PORT_EX_LMC                   = 1 << 11,
+        IBV_QUERY_PORT_EX_MAX_VL_NUM            = 1 << 12,
+        IBV_QUERY_PORT_EX_SM_SL                 = 1 << 13,
+        IBV_QUERY_PORT_EX_SUBNET_TIMEOUT        = 1 << 14,
+        IBV_QUERY_PORT_EX_INIT_TYPE_REPLY       = 1 << 15,
+        IBV_QUERY_PORT_EX_ACTIVE_WIDTH          = 1 << 16,
+        IBV_QUERY_PORT_EX_ACTIVE_SPEED          = 1 << 17,
+        IBV_QUERY_PORT_EX_PHYS_STATE            = 1 << 18,
+        IBV_QUERY_PORT_EX_LINK_LAYER            = 1 << 19
+}
+.sp
+possible values for the link layer field are IBV_LINK_LAYER_INFINIBAND,
+IBV_LINK_LAYER_ETHERNET, or IBV_LINK_LAYER_UNSPECIFIED.
+.sp
+.fi
+.SH "RETURN VALUE"
+.B ibv_query_port_ex()
+returns 0 on success, or the value of errno on failure (which indicates the failure reason).
+.SH "SEE ALSO"
+.BR ibv_create_qp (3),
+.BR ibv_destroy_qp (3),
+.BR ibv_query_qp (3),
+.BR ibv_create_ah (3)
+.SH "AUTHORS"
+.TP
+Dotan Barak <dotanba-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>        Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH libibverbs V4 5/5] Optimize ibv_create_ah link query
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
                     ` (3 preceding siblings ...)
  2014-05-18  9:38   ` [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support Or Gerlitz
@ 2014-05-18  9:38   ` Or Gerlitz
  2014-05-20  6:58   ` [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing Or Gerlitz
  5 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2014-05-18  9:38 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Since creating AH depends on resolving the MAC address for
Ethernet link layer, ibv_query_port is used.
This verb gives some information that is irrelevant for creating
AH, but in some vendors it forces us to call kernel uverb.
Therefore, we prefer using the extended verb and query only the
relevant fields and by thus give the vendor more chance to optimizations.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 src/verbs.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/verbs.c b/src/verbs.c
index 6db79e8..e022cd8 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -552,7 +552,7 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 	int err;
 	struct ibv_ah *ah = NULL;
 #ifndef NRESOLVE_NEIGH
-	struct ibv_port_attr port_attr;
+	struct ibv_port_attr_ex port_attr;
 	int dst_family;
 	int src_family;
 	int oif;
@@ -572,7 +572,9 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 		goto return_ah;
 	}
 
-	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
+	port_attr.comp_mask = IBV_QUERY_PORT_EX_LINK_LAYER |
+			      IBV_QUERY_PORT_EX_CAP_FLAGS;
+	err = ibv_query_port_ex(pd->context, attr->port_num, &port_attr);
 
 	if (err) {
 		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 1/5] Add ibv_port_cap_flags
       [not found]     ` <1400405929-14313-2-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
@ 2014-05-20  6:41       ` Matan Barak
  0 siblings, 0 replies; 18+ messages in thread
From: Matan Barak @ 2014-05-20  6:41 UTC (permalink / raw)
  To: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 18/5/2014 12:38 PM, Or Gerlitz wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Add an enum that describes ibv_port_cap_flags that complies
> with the respective kernel enum.
>
> This value could be fetched when using ibv_query_port in
> port_cap_flags.
>
> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>   include/infiniband/verbs.h |   26 ++++++++++++++++++++++++++
>   1 files changed, 26 insertions(+), 0 deletions(-)
>
> diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
> index 9826b72..71adf2a 100644
> --- a/include/infiniband/verbs.h
> +++ b/include/infiniband/verbs.h
> @@ -191,6 +191,32 @@ enum {
>   	IBV_LINK_LAYER_ETHERNET,
>   };
>
> +enum ibv_port_cap_flags {
> +	IBV_PORT_SM				= 1 <<  1,
> +	IBV_PORT_NOTICE_SUP			= 1 <<  2,
> +	IBV_PORT_TRAP_SUP			= 1 <<  3,
> +	IBV_PORT_OPT_IPD_SUP			= 1 <<  4,
> +	IBV_PORT_AUTO_MIGR_SUP			= 1 <<  5,
> +	IBV_PORT_SL_MAP_SUP			= 1 <<  6,
> +	IBV_PORT_MKEY_NVRAM			= 1 <<  7,
> +	IBV_PORT_PKEY_NVRAM			= 1 <<  8,
> +	IBV_PORT_LED_INFO_SUP			= 1 <<  9,
> +	IBV_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
> +	IBV_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
> +	IBV_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
> +	IBV_PORT_CM_SUP				= 1 << 16,
> +	IBV_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
> +	IBV_PORT_REINIT_SUP			= 1 << 18,
> +	IBV_PORT_DEVICE_MGMT_SUP		= 1 << 19,
> +	IB_PORT_VENDOR_CLASS_SUP		= 1 << 20,
> +	IB_PORT_DR_NOTICE_SUP			= 1 << 21,
> +	IB_PORT_CAP_MASK_NOTICE_SUP		= 1 << 22,
> +	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
> +	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
> +	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
> +	IBV_PORT_IP_BASED_GIDS			= 1 << 26
> +};
> +

The last few values miss the IBV_ prefix.
We'll issue a V5 to fix that soon.

Matan

>   struct ibv_port_attr {
>   	enum ibv_port_state	state;
>   	enum ibv_mtu		max_mtu;
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing
       [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
                     ` (4 preceding siblings ...)
  2014-05-18  9:38   ` [PATCH libibverbs V4 5/5] Optimize ibv_create_ah link query Or Gerlitz
@ 2014-05-20  6:58   ` Or Gerlitz
       [not found]     ` <537AFD31.3000401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-05-20  6:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w

On 18/05/2014 12:38, Or Gerlitz wrote:
[...]
> changes from V3 -- addressed feedback from Jason:
>    - replaced the char* that used to hold the MAC address provided to
>      the provider library (e.g libmlx4) with sockaddr_storage and use ARPHRD
>      for the af_family (similar to the practice used in the kernel)
>
>    - Dropped mask1 from ibv_query_port_ex and put all fields in comp_mask
>    - Added a man page for ibv_query_port_ex
>    - Made the code c99 friendly by dropping the anonymous union and struct
>    - Splitted the patches to patch that adds the interface and patch that uses it
>    - Removed the usage of drv_ and lib_ schemas
>    - Removed the VERBS_CONTEXT_XXXXXXX

Hi Jason,  so Matan has addressed you comments from V3, could you please 
take a look and ack? we find tiny little bug in patch #1 but want to get 
your feedback before re-spinning  V5.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider
       [not found]     ` <1400405929-14313-3-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
@ 2014-05-21 19:55       ` Jason Gunthorpe
       [not found]         ` <20140521195549.GA26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2014-05-21 19:55 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w

On Sun, May 18, 2014 at 12:38:46PM +0300, Or Gerlitz wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> In order to support IP based addressing, one needs to pass the L2
> parameters to the provider drive. This is done through a new extendable
                               ^^^^
'driver'

Please provide a man page. In this case you probably want to provide a
patch to the existing man/ibv_create_ah.3 to discuss the 2nd entry
point and new fields.

> +enum ibv_ah_attr_ex_attr_mask {
> +	IBV_AH_ATTR_EX_LL = 1UL << 0,
> +	IBV_AH_ATTR_EX_VID = 1UL << 1,
> +};

OK

> +struct ibv_ah_attr_ex {
> +	struct ibv_global_route	grh;
> +	uint16_t		dlid;
> +	uint8_t			sl;
> +	uint8_t			src_path_bits;
> +	uint8_t			static_rate;
> +	uint8_t			is_global;
> +	uint8_t			port_num;
> +	uint32_t		comp_mask;

OK

> +	union {
> +		struct sockaddr		sa;
> +		struct sockaddr_storage _s;
> +	} ll;
> +	uint16_t		vid;
> +};

Hard to know exactly what is going on here without a man page, but I
thought one of the points was to provide an ethernet L2 MAC address?
Shouldn't there be a mechanism for that?

I see this:

+       attr_ex.ll.sa.sa_family = ARPHRD_ETHER;

Which makes no sense, sa_family should be an AF_* constant.

So, I think this bit needs some kind of work. If you want to setup
ethernet, then setup ethernet:

uint64_t eth_dmac
uint16_t eth_vid;

and what about all the trendy new ethernet stuff, overlay networks,
etc? Can that be expressed in there?

> @@ -975,6 +998,8 @@ enum verbs_context_mask {
>  
>  struct verbs_context {
>  	/*  "grows up" - new fields go here */
> +	struct ibv_ah * (*create_ah_ex)(struct ibv_pd *pd,
> +					struct ibv_ah_attr_ex *attr);

Can you check if 'attr' should be const? I see the existing API isn't
const, but I am suspecting that is a mistake?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support
       [not found]     ` <1400405929-14313-5-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
@ 2014-05-21 20:10       ` Jason Gunthorpe
       [not found]         ` <20140521201059.GC26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2014-05-21 20:10 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w

On Sun, May 18, 2014 at 12:38:48PM +0300, Or Gerlitz wrote:
> +enum ibv_query_port_ex_attr_mask {
> +	IBV_QUERY_PORT_EX_STATE			= 1 << 0,
> +	IBV_QUERY_PORT_EX_MAX_MTU		= 1 << 1,
> +	IBV_QUERY_PORT_EX_ACTIVE_MTU		= 1 << 2,
> +	IBV_QUERY_PORT_EX_GID_TBL_LEN		= 1 << 3,
> +	IBV_QUERY_PORT_EX_CAP_FLAGS		= 1 << 4,
> +	IBV_QUERY_PORT_EX_MAX_MSG_SZ		= 1 << 5,
> +	IBV_QUERY_PORT_EX_BAD_PKEY_CNTR		= 1 << 6,
> +	IBV_QUERY_PORT_EX_QKEY_VIOL_CNTR	= 1 << 7,
> +	IBV_QUERY_PORT_EX_PKEY_TBL_LEN		= 1 << 8,
> +	IBV_QUERY_PORT_EX_LID			= 1 << 9,
> +	IBV_QUERY_PORT_EX_SM_LID		= 1 << 10,
> +	IBV_QUERY_PORT_EX_LMC			= 1 << 11,
> +	IBV_QUERY_PORT_EX_MAX_VL_NUM		= 1 << 12,
> +	IBV_QUERY_PORT_EX_SM_SL			= 1 << 13,
> +	IBV_QUERY_PORT_EX_SUBNET_TIMEOUT	= 1 << 14,
> +	IBV_QUERY_PORT_EX_INIT_TYPE_REPLY	= 1 << 15,
> +	IBV_QUERY_PORT_EX_ACTIVE_WIDTH		= 1 << 16,
> +	IBV_QUERY_PORT_EX_ACTIVE_SPEED		= 1 << 17,
> +	IBV_QUERY_PORT_EX_PHYS_STATE		= 1 << 18,
> +	IBV_QUERY_PORT_EX_LINK_LAYER		= 1 << 19,
> +	/* mask of the fields that exists in the standard query_port_command */
> +	IBV_QUERY_PORT_EX_STD_MASK		= (1 << 20) - 1,
> +	/* mask of all supported fields */
> +	IBV_QUERY_PORT_EX_MASK			= IBV_QUERY_PORT_EX_STD_MASK,
> +};

OK

> +struct ibv_port_attr_ex {
> +	enum ibv_port_state	state;
> +	enum ibv_mtu		max_mtu;
> +	enum ibv_mtu		active_mtu;
> +	int			gid_tbl_len;
> +	uint32_t		port_cap_flags;
> +	uint32_t		max_msg_sz;
> +	uint32_t		bad_pkey_cntr;
> +	uint32_t		qkey_viol_cntr;
> +	uint16_t		pkey_tbl_len;
> +	uint16_t		lid;
> +	uint16_t		sm_lid;
> +	uint8_t			lmc;
> +	uint8_t			max_vl_num;
> +	uint8_t			sm_sl;
> +	uint8_t			subnet_timeout;
> +	uint8_t			init_type_reply;
> +	uint8_t			active_width;
> +	uint8_t			active_speed;
> +	uint8_t			phys_state;
> +	uint8_t			link_layer;
> +	uint8_t			reserved;

OK

> +	uint32_t		comp_mask;

This uses the first 20 bits already, should comp_mask just be 64 bits
off the bat?

> @@ -998,6 +1050,8 @@ enum verbs_context_mask {
>  
>  struct verbs_context {
>  	/*  "grows up" - new fields go here */
> +	int (*query_port_ex)(struct ibv_context *context, uint8_t port_num,
> +			     struct ibv_port_attr_ex *port_attr);

OK

> +static inline int ibv_query_port_ex(struct ibv_context *context,
> +				    uint8_t port_num,
> +				    struct ibv_port_attr_ex *port_attr)
> +{
> +	struct verbs_context *vctx;
> +
> +	port_attr->link_layer = IBV_LINK_LAYER_UNSPECIFIED;
> +	port_attr->reserved   = 0;

We don't need this. All the calls to ibv_query_port already set these
values and we can simply require that all implementations of
ibv_query_port_ex set them too.

> +	if (0 == port_attr->comp_mask)
> +		return ibv_query_port(context, port_num,
> +				      (struct ibv_port_attr *)port_attr);

I'm confused what this is doing? Why is 0 ever a valid comp_mask?

> +	/* Check that only valid flags were given */
> +	if (port_attr->comp_mask & ~IBV_QUERY_PORT_EX_MASK) {
> +		errno = EINVAL;
> +		return -errno;
> +	}

And this doesn't seem to make sense either.

> +	vctx = verbs_get_ctx_op(context, query_port_ex);
> +
> +	if (!vctx) {
> +		/* Fallback to legacy mode */
> +		if (!(port_attr->comp_mask & ~IBV_QUERY_PORT_EX_STD_MASK))
> +			return ibv_query_port(context, port_num,
> +					      (struct ibv_port_attr *)port_attr);
> +
> +		/* Unsupported field was requested */
> +		errno = ENOSYS;
> +		return -errno;
> +	}
> +
> +	return vctx->query_port_ex(context, port_num, port_attr);
> +}
> +
>  #define ibv_query_port(context, port_num, port_attr) \
>  	___ibv_query_port(context, port_num, port_attr)
>  
> diff --git a/man/ibv_query_port_ex.3 b/man/ibv_query_port_ex.3
> new file mode 100644
> index 0000000..07073fd
> +++ b/man/ibv_query_port_ex.3
> @@ -0,0 +1,97 @@
> +.\" -*- nroff -*-
> +.\"
> +.TH IBV_QUERY_PORT_EX 3 2006-10-31 libibverbs "Libibverbs Programmer's Manual"
> +.SH "NAME"
> +ibv_query_port_ex \- query an RDMA port's attributes
> +.SH "SYNOPSIS"
> +.nf
> +.B #include <infiniband/verbs.h>
> +.sp
> +.BI "int ibv_query_port_ex(struct ibv_context " "*context" ", uint8_t " "port_num" ,
> +.BI "                      struct ibv_port_attr_ex " "*port_attr" ");
> +.fi
> +.SH "DESCRIPTION"
> +.B ibv_query_port_ex()

I feel it would be nicer to just patch the existing ibv_query_port man
page to have the new call and a paragraph about the extra field.

Similar to how 'man accept' works with accept and accept4


> +returns the attributes of port
> +.I port_num
> +for device context
> +.I context
> +through the pointer
> +.I port_attr\fR.
> +The argument
> +.I port_attr
> +is an ibv_port_attr struct, as defined in <infiniband/verbs.h>.
                 ^^^^^^^

No it isn't. Please proof-read everything.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found]     ` <1400405929-14313-4-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
@ 2014-05-21 20:31       ` Jason Gunthorpe
       [not found]         ` <20140521203107.GD26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2014-05-21 20:31 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, matanb-VPRAkNaXOzVWk0Htik3J/w

On Sun, May 18, 2014 at 12:38:47PM +0300, Or Gerlitz wrote:
>  struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>  {
> -	struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr);
> +	int err;
> +	struct ibv_ah *ah = NULL;
> +#ifndef NRESOLVE_NEIGH
> +	struct ibv_port_attr port_attr;
> +	int dst_family;
> +	int src_family;
> +	int oif;
> +	struct get_neigh_handler neigh_handler;
> +	union ibv_gid sgid;
> +	struct ibv_ah_attr_ex attr_ex;
> +	int ether_len;
> +	struct verbs_context *vctx = verbs_get_ctx_op(pd->context,
> +						      create_ah_ex);
> +	struct peer_address src;
> +	struct peer_address dst;
> +
> +	if (!vctx) {
> +#endif
> +		ah = pd->context->ops.create_ah(pd, attr);
> +#ifndef NRESOLVE_NEIGH
> +		goto return_ah;
> +	}
> +
> +	err = ibv_query_port(pd->context, attr->port_num, &port_attr);

It feels like a regression to force this overhead. Many HCAs have no
possibility to support anything but IB, or Ethernet and don't need
this.

This whole arrangement seems strange. create_ah_ex should be a full
fledged user callable function, not a buried driver entry point.
It is also very unusual for verbs to have all this generic code in
a driver wrapper.

I suspect the answer here is to have the driver call into helper
functions from verbs to do this addressing
work. 'get_ethernet_l2_from_ah' or something.

If you do that then we don't even need create_ah_ex and query_port_ex
- those function seem to only be required to support this wrapper
technique.

So please rethink how this flows.. Maybe wrapping is not the best
choice??

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider
       [not found]         ` <20140521195549.GA26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-05-22  8:52           ` Matan Barak
       [not found]             ` <537DBAC3.9090008-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Matan Barak @ 2014-05-22  8:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 21/5/2014 10:55 PM, Jason Gunthorpe wrote:
> On Sun, May 18, 2014 at 12:38:46PM +0300, Or Gerlitz wrote:
>> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> In order to support IP based addressing, one needs to pass the L2
>> parameters to the provider drive. This is done through a new extendable
>                                 ^^^^
> 'driver'
>
> Please provide a man page. In this case you probably want to provide a
> patch to the existing man/ibv_create_ah.3 to discuss the 2nd entry
> point and new fields.
>

Ok, I'll add this entry to the ibv_create_ah man page.

>> +enum ibv_ah_attr_ex_attr_mask {
>> +	IBV_AH_ATTR_EX_LL = 1UL << 0,
>> +	IBV_AH_ATTR_EX_VID = 1UL << 1,
>> +};
>
> OK
>
>> +struct ibv_ah_attr_ex {
>> +	struct ibv_global_route	grh;
>> +	uint16_t		dlid;
>> +	uint8_t			sl;
>> +	uint8_t			src_path_bits;
>> +	uint8_t			static_rate;
>> +	uint8_t			is_global;
>> +	uint8_t			port_num;
>> +	uint32_t		comp_mask;
>
> OK
>
>> +	union {
>> +		struct sockaddr		sa;
>> +		struct sockaddr_storage _s;
>> +	} ll;
>> +	uint16_t		vid;
>> +};
>
> Hard to know exactly what is going on here without a man page, but I
> thought one of the points was to provide an ethernet L2 MAC address?
> Shouldn't there be a mechanism for that?
>
> I see this:
>
> +       attr_ex.ll.sa.sa_family = ARPHRD_ETHER;
>
> Which makes no sense, sa_family should be an AF_* constant.
>

Kernel code *sometimes* use sa_family as ARPHRD_ETHER.

> So, I think this bit needs some kind of work. If you want to setup
> ethernet, then setup ethernet:
>
> uint64_t eth_dmac
> uint16_t eth_vid;
>
> and what about all the trendy new ethernet stuff, overlay networks,
> etc? Can that be expressed in there?
>

I don't want to make this Ethernet specific. That's why the previous 
pointer used a pointer. I don't mind creating a generic interface for
link layer if the existing solutions aren't good enough. Any suggestions?

>> @@ -975,6 +998,8 @@ enum verbs_context_mask {
>>
>>   struct verbs_context {
>>   	/*  "grows up" - new fields go here */
>> +	struct ibv_ah * (*create_ah_ex)(struct ibv_pd *pd,
>> +					struct ibv_ah_attr_ex *attr);
>
> Can you check if 'attr' should be const? I see the existing API isn't
> const, but I am suspecting that is a mistake?

You're probably right, but wouldn't we want to be aligned with the 
non-extended verb:
struct ibv_ah *         (*create_ah)(struct ibv_pd *pd, struct 
ibv_ah_attr *attr);
Since the provider driver usually go through a common path for both the 
non-extended and the extended verb, this could cause an ugly const cast.

>
> Jason
>

Matan

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support
       [not found]         ` <20140521201059.GC26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-05-22  9:07           ` Matan Barak
  0 siblings, 0 replies; 18+ messages in thread
From: Matan Barak @ 2014-05-22  9:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 21/5/2014 11:10 PM, Jason Gunthorpe wrote:
> On Sun, May 18, 2014 at 12:38:48PM +0300, Or Gerlitz wrote:
>> +enum ibv_query_port_ex_attr_mask {
>> +	IBV_QUERY_PORT_EX_STATE			= 1 << 0,
>> +	IBV_QUERY_PORT_EX_MAX_MTU		= 1 << 1,
>> +	IBV_QUERY_PORT_EX_ACTIVE_MTU		= 1 << 2,
>> +	IBV_QUERY_PORT_EX_GID_TBL_LEN		= 1 << 3,
>> +	IBV_QUERY_PORT_EX_CAP_FLAGS		= 1 << 4,
>> +	IBV_QUERY_PORT_EX_MAX_MSG_SZ		= 1 << 5,
>> +	IBV_QUERY_PORT_EX_BAD_PKEY_CNTR		= 1 << 6,
>> +	IBV_QUERY_PORT_EX_QKEY_VIOL_CNTR	= 1 << 7,
>> +	IBV_QUERY_PORT_EX_PKEY_TBL_LEN		= 1 << 8,
>> +	IBV_QUERY_PORT_EX_LID			= 1 << 9,
>> +	IBV_QUERY_PORT_EX_SM_LID		= 1 << 10,
>> +	IBV_QUERY_PORT_EX_LMC			= 1 << 11,
>> +	IBV_QUERY_PORT_EX_MAX_VL_NUM		= 1 << 12,
>> +	IBV_QUERY_PORT_EX_SM_SL			= 1 << 13,
>> +	IBV_QUERY_PORT_EX_SUBNET_TIMEOUT	= 1 << 14,
>> +	IBV_QUERY_PORT_EX_INIT_TYPE_REPLY	= 1 << 15,
>> +	IBV_QUERY_PORT_EX_ACTIVE_WIDTH		= 1 << 16,
>> +	IBV_QUERY_PORT_EX_ACTIVE_SPEED		= 1 << 17,
>> +	IBV_QUERY_PORT_EX_PHYS_STATE		= 1 << 18,
>> +	IBV_QUERY_PORT_EX_LINK_LAYER		= 1 << 19,
>> +	/* mask of the fields that exists in the standard query_port_command */
>> +	IBV_QUERY_PORT_EX_STD_MASK		= (1 << 20) - 1,
>> +	/* mask of all supported fields */
>> +	IBV_QUERY_PORT_EX_MASK			= IBV_QUERY_PORT_EX_STD_MASK,
>> +};
>
> OK
>
>> +struct ibv_port_attr_ex {
>> +	enum ibv_port_state	state;
>> +	enum ibv_mtu		max_mtu;
>> +	enum ibv_mtu		active_mtu;
>> +	int			gid_tbl_len;
>> +	uint32_t		port_cap_flags;
>> +	uint32_t		max_msg_sz;
>> +	uint32_t		bad_pkey_cntr;
>> +	uint32_t		qkey_viol_cntr;
>> +	uint16_t		pkey_tbl_len;
>> +	uint16_t		lid;
>> +	uint16_t		sm_lid;
>> +	uint8_t			lmc;
>> +	uint8_t			max_vl_num;
>> +	uint8_t			sm_sl;
>> +	uint8_t			subnet_timeout;
>> +	uint8_t			init_type_reply;
>> +	uint8_t			active_width;
>> +	uint8_t			active_speed;
>> +	uint8_t			phys_state;
>> +	uint8_t			link_layer;
>> +	uint8_t			reserved;
>
> OK
>
>> +	uint32_t		comp_mask;
>
> This uses the first 20 bits already, should comp_mask just be 64 bits
> off the bat?
>

First of all, I think that comp_mask should be the same type for all 
extension verbs and since ibv_flow_attr already uses a 32bit comp_mask, 
so should this verb.
Moreover, I don't think we expect to reach 32 values anytime soon.
In term of future scalability, I understand that the last bit is 
reserved for comp_mask2 field.

>> @@ -998,6 +1050,8 @@ enum verbs_context_mask {
>>
>>   struct verbs_context {
>>   	/*  "grows up" - new fields go here */
>> +	int (*query_port_ex)(struct ibv_context *context, uint8_t port_num,
>> +			     struct ibv_port_attr_ex *port_attr);
>
> OK
>
>> +static inline int ibv_query_port_ex(struct ibv_context *context,
>> +				    uint8_t port_num,
>> +				    struct ibv_port_attr_ex *port_attr)
>> +{
>> +	struct verbs_context *vctx;
>> +
>> +	port_attr->link_layer = IBV_LINK_LAYER_UNSPECIFIED;
>> +	port_attr->reserved   = 0;
>
> We don't need this. All the calls to ibv_query_port already set these
> values and we can simply require that all implementations of
> ibv_query_port_ex set them too.

Correct

>
>> +	if (0 == port_attr->comp_mask)
>> +		return ibv_query_port(context, port_num,
>> +				      (struct ibv_port_attr *)port_attr);
>
> I'm confused what this is doing? Why is 0 ever a valid comp_mask?
>

I'll remove this.

>> +	/* Check that only valid flags were given */
>> +	if (port_attr->comp_mask & ~IBV_QUERY_PORT_EX_MASK) {
>> +		errno = EINVAL;
>> +		return -errno;
>> +	}
>
> And this doesn't seem to make sense either.
>

Sanity check - the user should provide a combination of 
ibv_query_port_ex_attr_mask flags.

>> +	vctx = verbs_get_ctx_op(context, query_port_ex);
>> +
>> +	if (!vctx) {
>> +		/* Fallback to legacy mode */
>> +		if (!(port_attr->comp_mask & ~IBV_QUERY_PORT_EX_STD_MASK))
>> +			return ibv_query_port(context, port_num,
>> +					      (struct ibv_port_attr *)port_attr);
>> +
>> +		/* Unsupported field was requested */
>> +		errno = ENOSYS;
>> +		return -errno;
>> +	}
>> +
>> +	return vctx->query_port_ex(context, port_num, port_attr);
>> +}
>> +
>>   #define ibv_query_port(context, port_num, port_attr) \
>>   	___ibv_query_port(context, port_num, port_attr)
>>
>> diff --git a/man/ibv_query_port_ex.3 b/man/ibv_query_port_ex.3
>> new file mode 100644
>> index 0000000..07073fd
>> +++ b/man/ibv_query_port_ex.3
>> @@ -0,0 +1,97 @@
>> +.\" -*- nroff -*-
>> +.\"
>> +.TH IBV_QUERY_PORT_EX 3 2006-10-31 libibverbs "Libibverbs Programmer's Manual"
>> +.SH "NAME"
>> +ibv_query_port_ex \- query an RDMA port's attributes
>> +.SH "SYNOPSIS"
>> +.nf
>> +.B #include <infiniband/verbs.h>
>> +.sp
>> +.BI "int ibv_query_port_ex(struct ibv_context " "*context" ", uint8_t " "port_num" ,
>> +.BI "                      struct ibv_port_attr_ex " "*port_attr" ");
>> +.fi
>> +.SH "DESCRIPTION"
>> +.B ibv_query_port_ex()
>
> I feel it would be nicer to just patch the existing ibv_query_port man
> page to have the new call and a paragraph about the extra field.
>
> Similar to how 'man accept' works with accept and accept4
>
>

Ok

>> +returns the attributes of port
>> +.I port_num
>> +for device context
>> +.I context
>> +through the pointer
>> +.I port_attr\fR.
>> +The argument
>> +.I port_attr
>> +is an ibv_port_attr struct, as defined in <infiniband/verbs.h>.
>                   ^^^^^^^
>
> No it isn't. Please proof-read everything.
>
> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider
       [not found]             ` <537DBAC3.9090008-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-06-11 10:54               ` Or Gerlitz
       [not found]                 ` <53983580.1080002-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2014-06-11 10:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, roland-DgEjT+Ai2ygdnm+yROfE0A,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 22/05/2014 11:52, Matan Barak wrote:
> On 21/5/2014 10:55 PM, Jason Gunthorpe wrote:
>> On Sun, May 18, 2014 at 12:38:46PM +0300, Or Gerlitz wrote:
>>> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>
>>> In order to support IP based addressing, one needs to pass the L2
>>> parameters to the provider drive. This is done through a new extendable
>>                                 ^^^^
>> 'driver'
>>
>> Please provide a man page. In this case you probably want to provide a
>> patch to the existing man/ibv_create_ah.3 to discuss the 2nd entry
>> point and new fields.
>>
>
> Ok, I'll add this entry to the ibv_create_ah man page.
>
>>> +enum ibv_ah_attr_ex_attr_mask {
>>> +    IBV_AH_ATTR_EX_LL = 1UL << 0,
>>> +    IBV_AH_ATTR_EX_VID = 1UL << 1,
>>> +};
>>
>> OK
>>
>>> +struct ibv_ah_attr_ex {
>>> +    struct ibv_global_route    grh;
>>> +    uint16_t        dlid;
>>> +    uint8_t            sl;
>>> +    uint8_t            src_path_bits;
>>> +    uint8_t            static_rate;
>>> +    uint8_t            is_global;
>>> +    uint8_t            port_num;
>>> +    uint32_t        comp_mask;
>>
>> OK
>>
>>> +    union {
>>> +        struct sockaddr        sa;
>>> +        struct sockaddr_storage _s;
>>> +    } ll;
>>> +    uint16_t        vid;
>>> +};
>>
>> Hard to know exactly what is going on here without a man page, but I
>> thought one of the points was to provide an ethernet L2 MAC address?
>> Shouldn't there be a mechanism for that?
>>
>> I see this:
>>
>> +       attr_ex.ll.sa.sa_family = ARPHRD_ETHER;
>>
>> Which makes no sense, sa_family should be an AF_* constant.
>>
>
> Kernel code *sometimes* use sa_family as ARPHRD_ETHER.
>
>> So, I think this bit needs some kind of work. If you want to setup
>> ethernet, then setup ethernet:
>>
>> uint64_t eth_dmac
>> uint16_t eth_vid;
>>
>> and what about all the trendy new ethernet stuff, overlay networks,
>> etc? Can that be expressed in there?
>>
>
> I don't want to make this Ethernet specific. That's why the previous 
> pointer used a pointer. I don't mind creating a generic interface for 
> link layer if the existing solutions aren't good enough. Any suggestions?

Hi Jason,

Can you please address  Matan's comments? we'd like this thread to 
converge...

Or.

>
>>> @@ -975,6 +998,8 @@ enum verbs_context_mask {
>>>
>>>   struct verbs_context {
>>>       /*  "grows up" - new fields go here */
>>> +    struct ibv_ah * (*create_ah_ex)(struct ibv_pd *pd,
>>> +                    struct ibv_ah_attr_ex *attr);
>>
>> Can you check if 'attr' should be const? I see the existing API isn't
>> const, but I am suspecting that is a mistake?
>
> You're probably right, but wouldn't we want to be aligned with the 
> non-extended verb:
> struct ibv_ah *         (*create_ah)(struct ibv_pd *pd, struct 
> ibv_ah_attr *attr);
> Since the provider driver usually go through a common path for both 
> the non-extended and the extended verb, this could cause an ugly const 
> cast.
>
>>
>> Jason
>>
>
> Matan
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider
       [not found]                 ` <53983580.1080002-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-06-11 15:05                   ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2014-06-11 15:05 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Matan Barak, roland-DgEjT+Ai2ygdnm+yROfE0A,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jun 11, 2014 at 01:54:56PM +0300, Or Gerlitz wrote:

> Can you please address  Matan's comments? we'd like this thread to
> converge...

I am waiting for your analysis as I mentioned in my last message to
this thread:

https://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg19925.html

That is rather a larger question, and the answer may well be we don't
need two new extended verbs to do what you want...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found]         ` <20140521203107.GD26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-06-12 12:44           ` Matan Barak
       [not found]             ` <5399A093.9070104-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Matan Barak @ 2014-06-12 12:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 21/5/2014 11:31 PM, Jason Gunthorpe wrote:
> On Sun, May 18, 2014 at 12:38:47PM +0300, Or Gerlitz wrote:
>>   struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>>   {
>> -	struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr);
>> +	int err;
>> +	struct ibv_ah *ah = NULL;
>> +#ifndef NRESOLVE_NEIGH
>> +	struct ibv_port_attr port_attr;
>> +	int dst_family;
>> +	int src_family;
>> +	int oif;
>> +	struct get_neigh_handler neigh_handler;
>> +	union ibv_gid sgid;
>> +	struct ibv_ah_attr_ex attr_ex;
>> +	int ether_len;
>> +	struct verbs_context *vctx = verbs_get_ctx_op(pd->context,
>> +						      create_ah_ex);
>> +	struct peer_address src;
>> +	struct peer_address dst;
>> +
>> +	if (!vctx) {
>> +#endif
>> +		ah = pd->context->ops.create_ah(pd, attr);
>> +#ifndef NRESOLVE_NEIGH
>> +		goto return_ah;
>> +	}
>> +
>> +	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
>
> It feels like a regression to force this overhead. Many HCAs have no
> possibility to support anything but IB, or Ethernet and don't need
> this.
>
> This whole arrangement seems strange. create_ah_ex should be a full
> fledged user callable function, not a buried driver entry point.
> It is also very unusual for verbs to have all this generic code in
> a driver wrapper.
>
> I suspect the answer here is to have the driver call into helper
> functions from verbs to do this addressing
> work. 'get_ethernet_l2_from_ah' or something.
>
> If you do that then we don't even need create_ah_ex and query_port_ex
> - those function seem to only be required to support this wrapper
> technique.
>
> So please rethink how this flows.. Maybe wrapping is not the best
> choice??
>
> Jason
>

We could use a libibverbs API call in order to resolve the IP based GID 
into a MAC, but I think it could cause multiple vendors to have some 
code duplication. We hope that in the future, more products will use 
RoCE with IP based GIDs. All those providers will have to supply similar 
code that checks if the link layer is Ethernet and IP based GID is used, 
they'll have to use the libibverbs utility function.

A possible future create_ah_ex *libibverbs* could be a full fledged verb 
that uses the new provider's create_ah_ex. Currently, it's hidden inside 
the good old create_ah call, but we shouldn't limit ourselves to keep it 
that way.

Anyway, I have no objection to use a utility function instead of this 
method, but I do think that the current code has some advantages.

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found]             ` <5399A093.9070104-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-06-12 17:40               ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2014-06-12 17:40 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Thu, Jun 12, 2014 at 03:44:03PM +0300, Matan Barak wrote:

> We could use a libibverbs API call in order to resolve the IP based
> GID into a MAC, but I think it could cause multiple vendors to have
> some code duplication.

If that is the only objection, I would prefer to see this techique. A
little provider code duplication is a lessor evil than introducing and
vetting new verbs APIs.

I think the patch will be very small and there will be very little to
talk about from an API perspective.

> We hope that in the future, more products will use RoCE with IP
> based GIDs. All those providers will have to supply similar code
> that checks if the link layer is Ethernet and IP based GID is used,
> they'll have to use the libibverbs utility function.

AFAIK all the other RoCEE implementations don't do InfiniBand link
layer, so their providers don't even need the test.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing
       [not found]     ` <537AFD31.3000401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-07-24 11:34       ` Devesh Sharma
  0 siblings, 0 replies; 18+ messages in thread
From: Devesh Sharma @ 2014-07-24 11:34 UTC (permalink / raw)
  To: Or Gerlitz, Jason Gunthorpe
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org

Hi Or,

Is V5 on the cards in the near future?

-Regards
 Devesh

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Or Gerlitz
> Sent: Tuesday, May 20, 2014 12:29 PM
> To: Jason Gunthorpe
> Cc: roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
> Subject: Re: [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP
> addressing
> 
> On 18/05/2014 12:38, Or Gerlitz wrote:
> [...]
> > changes from V3 -- addressed feedback from Jason:
> >    - replaced the char* that used to hold the MAC address provided to
> >      the provider library (e.g libmlx4) with sockaddr_storage and use ARPHRD
> >      for the af_family (similar to the practice used in the kernel)
> >
> >    - Dropped mask1 from ibv_query_port_ex and put all fields in
> comp_mask
> >    - Added a man page for ibv_query_port_ex
> >    - Made the code c99 friendly by dropping the anonymous union and
> struct
> >    - Splitted the patches to patch that adds the interface and patch that uses
> it
> >    - Removed the usage of drv_ and lib_ schemas
> >    - Removed the VERBS_CONTEXT_XXXXXXX
> 
> Hi Jason,  so Matan has addressed you comments from V3, could you please
> take a look and ack? we find tiny little bug in patch #1 but want to get your
> feedback before re-spinning  V5.
> 
> Or.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-07-24 11:34 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-18  9:38 [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing Or Gerlitz
     [not found] ` <1400405929-14313-1-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
2014-05-18  9:38   ` [PATCH libibverbs V4 1/5] Add ibv_port_cap_flags Or Gerlitz
     [not found]     ` <1400405929-14313-2-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
2014-05-20  6:41       ` Matan Barak
2014-05-18  9:38   ` [PATCH libibverbs V4 2/5] Extend create_ah in order to pass L2 parameters to the provider Or Gerlitz
     [not found]     ` <1400405929-14313-3-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
2014-05-21 19:55       ` Jason Gunthorpe
     [not found]         ` <20140521195549.GA26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-05-22  8:52           ` Matan Barak
     [not found]             ` <537DBAC3.9090008-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-06-11 10:54               ` Or Gerlitz
     [not found]                 ` <53983580.1080002-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-06-11 15:05                   ` Jason Gunthorpe
2014-05-18  9:38   ` [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
     [not found]     ` <1400405929-14313-4-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
2014-05-21 20:31       ` Jason Gunthorpe
     [not found]         ` <20140521203107.GD26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-06-12 12:44           ` Matan Barak
     [not found]             ` <5399A093.9070104-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-06-12 17:40               ` Jason Gunthorpe
2014-05-18  9:38   ` [PATCH libibverbs V4 4/5] Add ibv_query_port_ex support Or Gerlitz
     [not found]     ` <1400405929-14313-5-git-send-email-ogerlitz-lZu6o6FoHf8DyJZ7l8Lk7nI+JuX82XLG@public.gmane.org>
2014-05-21 20:10       ` Jason Gunthorpe
     [not found]         ` <20140521201059.GC26909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-05-22  9:07           ` Matan Barak
2014-05-18  9:38   ` [PATCH libibverbs V4 5/5] Optimize ibv_create_ah link query Or Gerlitz
2014-05-20  6:58   ` [PATCH libibverbs V4 0/5] Add support for UD QPs under RoCE IP addressing Or Gerlitz
     [not found]     ` <537AFD31.3000401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-07-24 11:34       ` Devesh Sharma

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).