public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
@ 2015-07-30 15:33 Matan Barak
  2015-07-30 15:33 ` [PATCH for-next V7 01/10] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe, linux-rdma, Sean Hefty,
	Somnath Kotur, Moni Shoua, talal, haggaie, netdev

This series has been running in linux-rdma for a while. We added here
CC to netdev for the three pre-patches which come first. They allow
the IB core to access some helpers (e.g generating default Eth IPv6
link local address), gain more info on bonding changes, etc.

Previously, every vendor implemented its net device notifiers in its own
driver. This introduces a huge code duplication as figuring
whether an event is related to the vendor's net device in the
various cases (bonding, vlan or any other upper device) is
similar for all vendors. In the future, when multiple GID types will
be supported, this code duplication would have gotten even worse.

Therefore, we decided moving this into a common core core.
roce_gid_table and roce_gid_mgmt were created in order to store and
manage the new GID table, by filling it when getting the related events.
Vendors now only have to implement modify_gid and get_netdev IB
device calls, which are truly unique for each vendor.
roce_gid_table is implemented as IB client that manages the GID
table of the IB device. Each GID is associated with a GID type and a
network device (which is mandatory for management of the GID table).
The GID table is populated by using roce_gid_mgmt. roce_gid_mgmt
registers to net device/inet/inet events and calls roce_gid_table
in order to populate the GID table accordingly.

Patch 0005 is the core patch in this series. It creates a new infrastructure
for storing GIDs and their attributes in IB/core. This infrastructure support
reading and writing GIDs alongside with their meta-data. The new infrastructure
is used for both manageing RoCE ports and IB ports. The core difference is that
in IB ports, this infrastructure is used souly as a cache, while in RoCE we
actually manage the vendor's GID table by calling add_gid and del_gid callbacks.
In RoCE, we always enable default gids for an active device (an active device
is defined here as a device that doesn't have a bonding master or is the current
active slave). This is done in order to allow loopback traffic.

Patch 0004 replaces the locking schema for IB devices. Previously, device_mutex
was used in order to lock the devices/clients list against every modification.
However, downstream patches add new functions which iterate over the device
list. Those functions could be executed for a workqueue contexts on behalf
of IB clients. Thus, when a client is removed, we need to wait for all works
to be finished. Since a client removal was done in device_mutex lock, we'll
be in fact waiting for a work which requires to lock the device_mutex itself
(=DEADLOCK). In order to mitigate this problem, we use rw semaphore to allow
multiple readers. We use a mutex in order to solve races between adding
(or removing) a client and a device simultaneously, which could have resulted
in calling client->add (or client->remove) twice for the same device and client.
This patch was sent as part of "Add network namespace support in the RDMA-CM"
series.

Patch 0006 adds population of this table for the bonding case based on net
device events. Only the active slaves retain their master's IP based gids and
default gids.

Patch 0001 exports addrconf_ifid_eui48 in order to generate the default GID.
Patch 0002 adds information for NETDEV_CHANGEUPPER which is used in order to
understand the nature of change - link/unlink and which master net-device is
related to this change.
Patch 0003 exports bond_option_active_slave_get_rcu which is necassary in
order to assign the GIDs only to the active slave.

The rest of the patches add support for ocrdma and mlx4 devices.

This series is rebased over Doug's to-be-rebase/for-4.3.

Thanks,
Devesh, Somnath, Moni and Matan

Changes from V6:
(1) Addressed Jason's comments:
	(a) Cache is no longer a client but part of IB infrastructure
	(b) No need for READ_ONCE and flush_workqueue when tearing down
	    the cache

Changes from V5:
(1) Incoporate the changes to cache.c so we use the same infrastructure
    to manage both IB and RoCE (per Doug's request)
(2) Replace the locking mechanism in the IB core GID cache from seqcount +
    rcu to rwlock (addressing comments from Jason)
(3) get_netdev returns a helded (dev_hold) device
(4) Squashed the RocE GID table, RoCE GID management and default GID handling
    code into one patch (per Doug's request).
(5) Change modify_gid to add_gid and del_gid.
(6) set the netdev related changes into three dedicated patches and make
    them be 1st in the series.

Changes from V4:
(1) Remove any API changes.
(2) Fixed a bug regarding bonding upper devices.
(3) Rebased ontop of Doug's k.o/for-4.2.

Changes from V3:
(1) Remove RoCE V2 functionality (it will be sent at later patchset).
(2) Instead of removing qp_attr_mask flags, reserve them.
(3) Remove the kref from IB devices in favor of rwsem.
(4) Change the name of roce_gid_cache to roce_gid_table.
(5) Fix a race when roce_gid_table is free'd while getting events.
(6) Remove the roce_gid_cache active/inactive flag/API.

Changes from V2:
(1) When creating multiple vlans over an interface,
    only the last created vlan's GID was populated in the table
    (regression from V2).
(2) Inactive slave of bonding sometimes lost GIDs related to IPs
    that were directly applied to it.
(3) Memory leak in mlx4
(4) roce_gid_cache now calls modify_gid with zgid in order to cause
    the provider to delete all the information it allocated for those
    GIDs.
(4) A mlx4 patch didn't compile and a downstream patch fixed it.
(5) cma_configfs should depend on both address translation and configfs.
(6) ocrdma driver redefined zgid.
(7) Added event information for NETDEV_CHANGEUPPER event.

Changes from V1:
(1) Addressed Shachar and Haggai's comments
(2) Fixed multicast support
(3) Generalized bonding support
(4) Added default GID after the IB device's net device was removed from bonding
(5) Fixed bugs in mlx4 implementation regarding multicast
(6) Fixed bugs in mlx4 when using XRC QPs after this patchset was applied
(7) Fixed bug when the RoCE gid cache didn't exist
(8) Moved the bonding's DRV macros to a private header
(9) Support non-configfs configurations

Haggai Eran (1):
  IB/core: Add rwsem to allow reading device list or client list

Matan Barak (5):
  net/ipv6: Export addrconf_ifid_eui48
  net: Add info for NETDEV_CHANGEUPPER event
  net/bonding: Export bond_option_active_slave_get_rcu
  IB/core: Add RoCE GID table management
  IB/core: Add RoCE table bonding support

Moni Shoua (3):
  net/mlx4: Postpone the registration of net_device
  IB/mlx4: Implement ib_device callbacks
  IB/mlx4: Replace mechanism for RoCE GID management

Somnath Kotur (1):
  RDMA/ocrdma: Incorporate the moving of GID Table mgmt to IB/Core

 drivers/infiniband/core/Makefile             |   3 +-
 drivers/infiniband/core/cache.c              | 704 ++++++++++++++++++++++---
 drivers/infiniband/core/core_priv.h          |  50 +-
 drivers/infiniband/core/device.c             | 134 ++++-
 drivers/infiniband/core/roce_gid_mgmt.c      | 730 ++++++++++++++++++++++++++
 drivers/infiniband/core/sysfs.c              |   1 +
 drivers/infiniband/hw/mlx4/ah.c              |   2 +-
 drivers/infiniband/hw/mlx4/main.c            | 749 ++++++++++-----------------
 drivers/infiniband/hw/mlx4/mlx4_ib.h         |  21 +-
 drivers/infiniband/hw/mlx4/qp.c              |  10 +-
 drivers/infiniband/hw/ocrdma/ocrdma.h        |   1 -
 drivers/infiniband/hw/ocrdma/ocrdma_main.c   | 234 +--------
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h    |   2 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |  45 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h  |  11 +
 drivers/net/bonding/bond_options.c           |  13 -
 drivers/net/ethernet/mellanox/mlx4/en_main.c |  36 +-
 drivers/net/ethernet/mellanox/mlx4/intf.c    |   3 +
 include/linux/mlx4/device.h                  |   3 +-
 include/linux/mlx4/driver.h                  |   1 +
 include/linux/netdevice.h                    |  14 +
 include/net/addrconf.h                       |  31 ++
 include/net/bonding.h                        |   7 +
 include/rdma/ib_verbs.h                      |  68 ++-
 net/core/dev.c                               |  12 +-
 net/ipv6/addrconf.c                          |  31 --
 26 files changed, 2017 insertions(+), 899 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_mgmt.c

-- 
2.1.0

Cc: netdev@vger.kernel.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 01/10] net/ipv6: Export addrconf_ifid_eui48
  2015-07-30 15:33 [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core Matan Barak
@ 2015-07-30 15:33 ` Matan Barak
  2015-07-30 15:33 ` [PATCH for-next V7 02/10] net: Add info for NETDEV_CHANGEUPPER event Matan Barak
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe, linux-rdma, Sean Hefty,
	Somnath Kotur, Moni Shoua, talal, haggaie, netdev

For loopback purposes, RoCE devices should have a default GID in the
port GID table, even when the interface is down. In order to do so,
we use the IPv6 link local address which would have been genenrated
for the related Ethernet netdevice when it goes up as a default GID.

addrconf_ifid_eui48 is used to gernerate this address, export it.

Signed-off-by: Matan Barak <matanb@mellanox.com>
---
 include/net/addrconf.h | 31 +++++++++++++++++++++++++++++++
 net/ipv6/addrconf.c    | 31 -------------------------------
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index def59d3..431fdfa 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -91,6 +91,37 @@ int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
 void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr);
 
+static inline int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+{
+	if (dev->addr_len != ETH_ALEN)
+		return -1;
+	memcpy(eui, dev->dev_addr, 3);
+	memcpy(eui + 5, dev->dev_addr + 3, 3);
+
+	/*
+	 * The zSeries OSA network cards can be shared among various
+	 * OS instances, but the OSA cards have only one MAC address.
+	 * This leads to duplicate address conflicts in conjunction
+	 * with IPv6 if more than one instance uses the same card.
+	 *
+	 * The driver for these cards can deliver a unique 16-bit
+	 * identifier for each instance sharing the same card.  It is
+	 * placed instead of 0xFFFE in the interface identifier.  The
+	 * "u" bit of the interface identifier is not inverted in this
+	 * case.  Hence the resulting interface identifier has local
+	 * scope according to RFC2373.
+	 */
+	if (dev->dev_id) {
+		eui[3] = (dev->dev_id >> 8) & 0xFF;
+		eui[4] = dev->dev_id & 0xFF;
+	} else {
+		eui[3] = 0xFF;
+		eui[4] = 0xFE;
+		eui[0] ^= 2;
+	}
+	return 0;
+}
+
 static inline unsigned long addrconf_timeout_fixup(u32 timeout,
 						   unsigned int unit)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..5b0c041 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1845,37 +1845,6 @@ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
 	__ipv6_dev_ac_dec(ifp->idev, &addr);
 }
 
-static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
-{
-	if (dev->addr_len != ETH_ALEN)
-		return -1;
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-
-	/*
-	 * The zSeries OSA network cards can be shared among various
-	 * OS instances, but the OSA cards have only one MAC address.
-	 * This leads to duplicate address conflicts in conjunction
-	 * with IPv6 if more than one instance uses the same card.
-	 *
-	 * The driver for these cards can deliver a unique 16-bit
-	 * identifier for each instance sharing the same card.  It is
-	 * placed instead of 0xFFFE in the interface identifier.  The
-	 * "u" bit of the interface identifier is not inverted in this
-	 * case.  Hence the resulting interface identifier has local
-	 * scope according to RFC2373.
-	 */
-	if (dev->dev_id) {
-		eui[3] = (dev->dev_id >> 8) & 0xFF;
-		eui[4] = dev->dev_id & 0xFF;
-	} else {
-		eui[3] = 0xFF;
-		eui[4] = 0xFE;
-		eui[0] ^= 2;
-	}
-	return 0;
-}
-
 static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
 {
 	if (dev->addr_len != IEEE802154_ADDR_LEN)
-- 
2.1.0

Cc: netdev@vger.kernel.org

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 02/10] net: Add info for NETDEV_CHANGEUPPER event
  2015-07-30 15:33 [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core Matan Barak
  2015-07-30 15:33 ` [PATCH for-next V7 01/10] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
@ 2015-07-30 15:33 ` Matan Barak
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-07-31  9:40 ` [PATCH for-next V7 00/10] Move RoCE GID management " Or Gerlitz
  3 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe, linux-rdma, Sean Hefty,
	Somnath Kotur, Moni Shoua, talal, haggaie, netdev

Some consumers of NETDEV_CHANGEUPPER event would like to know which
upper device was linked/unlinked and what operation was carried.

Add information in the notifier info block for that purpose.

Signed-off-by: Matan Barak <matanb@mellanox.com>
---
 include/linux/netdevice.h | 14 ++++++++++++++
 net/core/dev.c            | 12 ++++++++++--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e20979d..2b7fe4e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3556,6 +3556,20 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
 				    netdev_features_t features);
 
+enum netdev_changeupper_event {
+	NETDEV_CHANGEUPPER_LINK,
+	NETDEV_CHANGEUPPER_UNLINK,
+};
+
+struct netdev_changeupper_info {
+	struct netdev_notifier_info	info; /* must be first */
+	enum netdev_changeupper_event	event;
+	struct net_device		*upper;
+};
+
+void netdev_changeupper_info_change(struct net_device *dev,
+				    struct netdev_changeupper_info *info);
+
 struct netdev_bonding_info {
 	ifslave	slave;
 	ifbond	master;
diff --git a/net/core/dev.c b/net/core/dev.c
index a8e4dd4..6e6f14e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5302,6 +5302,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
 				   void *private)
 {
 	struct netdev_adjacent *i, *j, *to_i, *to_j;
+	struct netdev_changeupper_info changeupper_info;
 	int ret = 0;
 
 	ASSERT_RTNL();
@@ -5357,7 +5358,10 @@ static int __netdev_upper_dev_link(struct net_device *dev,
 			goto rollback_lower_mesh;
 	}
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_LINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 	return 0;
 
 rollback_lower_mesh:
@@ -5453,6 +5457,7 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 			     struct net_device *upper_dev)
 {
 	struct netdev_adjacent *i, *j;
+	struct netdev_changeupper_info changeupper_info;
 	ASSERT_RTNL();
 
 	__netdev_adjacent_dev_unlink_neighbour(dev, upper_dev);
@@ -5474,7 +5479,10 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 	list_for_each_entry(i, &upper_dev->all_adj_list.upper, list)
 		__netdev_adjacent_dev_unlink(dev, i->dev);
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_UNLINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 }
 EXPORT_SYMBOL(netdev_upper_dev_unlink);
 
-- 
2.1.0

Cc: netdev@vger.kernel.org

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 03/10] net/bonding: Export bond_option_active_slave_get_rcu
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 04/10] IB/core: Add rwsem to allow reading device list or client list Matan Barak
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA

Some consumers of the netdev events API would like to know who is the
active slave when a NETDEV_CHANGEUPPER or NETDEV_BONDING_FAILOVER
events occur. For example, when managing RoCE GIDs, GIDs based on the
bond's ips should only be set on the port which corresponds to active
slave netdevice.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/net/bonding/bond_options.c | 13 -------------
 include/net/bonding.h              |  7 +++++++
 2 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index e9c624d..28bd005 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -730,19 +730,6 @@ static int bond_option_mode_set(struct bonding *bond,
 	return 0;
 }
 
-static struct net_device *__bond_option_active_slave_get(struct bonding *bond,
-							 struct slave *slave)
-{
-	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
-}
-
-struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
-{
-	struct slave *slave = rcu_dereference(bond->curr_active_slave);
-
-	return __bond_option_active_slave_get(bond, slave);
-}
-
 static int bond_option_active_slave_set(struct bonding *bond,
 					const struct bond_opt_value *newval)
 {
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 20defc0..c1740a2 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -310,6 +310,13 @@ static inline bool bond_uses_primary(struct bonding *bond)
 	return bond_mode_uses_primary(BOND_MODE(bond));
 }
 
+static inline struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
+{
+	struct slave *slave = rcu_dereference(bond->curr_active_slave);
+
+	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
+}
+
 static inline bool bond_slave_is_up(struct slave *slave)
 {
 	return netif_running(slave->dev) && netif_carrier_ok(slave->dev);
-- 
2.1.0

Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 04/10] IB/core: Add rwsem to allow reading device list or client list
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-07-30 15:33   ` [PATCH for-next V7 03/10] net/bonding: Export bond_option_active_slave_get_rcu Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 05/10] IB/core: Add RoCE GID table management Matan Barak
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

From: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Currently the RDMA subsystem's device list and client list are
protected by a single mutex. This prevents adding user-facing APIs
that iterate these lists, since using them may cause a deadlock.
The patch attempts to solve this problem by adding a read-write
semaphore to protect the lists. Readers now don't need the mutex,
and are safe just by read-locking the semaphore.

The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for
write during their respective list modification. Also, in order to
make sure client callbacks are called only between add() and remove()
calls, the code is changed to only add items to the lists after the
add() calls and remove from the lists before the remove() calls.

Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/device.c | 39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 9567756..f08d438 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -55,17 +55,24 @@ struct ib_client_data {
 struct workqueue_struct *ib_wq;
 EXPORT_SYMBOL_GPL(ib_wq);
 
+/* The device_list and client_list contain devices and clients after their
+ * registration has completed, and the devices and clients are removed
+ * during unregistration. */
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list.  In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex and lists_rwsem protect access to both device_list and
+ * client_list.  device_mutex protects writer access by device and client
+ * registration / de-registration.  lists_rwsem protects reader access to
+ * these lists.  Iterators of these lists must lock it for read, while updates
+ * to the lists must be done with a write lock. A special case is when the
+ * device_mutex is locked. In this case locking the lists for read access is
+ * not necessary as the device_mutex implies it.
  */
 static DEFINE_MUTEX(device_mutex);
+static DECLARE_RWSEM(lists_rwsem);
+
 
 static int ib_device_check_mandatory(struct ib_device *device)
 {
@@ -305,8 +312,6 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
-	list_add_tail(&device->core_list, &device_list);
-
 	device->reg_state = IB_DEV_REGISTERED;
 
 	{
@@ -317,6 +322,10 @@ int ib_register_device(struct ib_device *device,
 				client->add(device);
 	}
 
+	down_write(&lists_rwsem);
+	list_add_tail(&device->core_list, &device_list);
+	up_write(&lists_rwsem);
+
  out:
 	mutex_unlock(&device_mutex);
 	return ret;
@@ -337,12 +346,14 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&device->core_list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry_reverse(client, &client_list, list)
 		if (client->remove)
 			client->remove(device);
 
-	list_del(&device->core_list);
-
 	mutex_unlock(&device_mutex);
 
 	ib_device_unregister_sysfs(device);
@@ -375,11 +386,14 @@ int ib_register_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
-	list_add_tail(&client->list, &client_list);
 	list_for_each_entry(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
 
+	down_write(&lists_rwsem);
+	list_add_tail(&client->list, &client_list);
+	up_write(&lists_rwsem);
+
 	mutex_unlock(&device_mutex);
 
 	return 0;
@@ -402,6 +416,10 @@ void ib_unregister_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&client->list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry(device, &device_list, core_list) {
 		if (client->remove)
 			client->remove(device);
@@ -414,7 +432,6 @@ void ib_unregister_client(struct ib_client *client)
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
-	list_del(&client->list);
 
 	mutex_unlock(&device_mutex);
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 05/10] IB/core: Add RoCE GID table management
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-07-30 15:33   ` [PATCH for-next V7 03/10] net/bonding: Export bond_option_active_slave_get_rcu Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 04/10] IB/core: Add rwsem to allow reading device list or client list Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 06/10] IB/core: Add RoCE table bonding support Matan Barak
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

RoCE GIDs are based on IP addresses configured on Ethernet net-devices
which relate to the RDMA (RoCE) device port.

Currently, each of the low-level drivers that support RoCE (ocrdma,
mlx4) manages its own RoCE port GID table. As there's nothing which is
essentially vendor specific, we generalize that, and enhance the RDMA
core GID cache to do this job.

In order to populate the GID table, we listen for events:

(a) netdev up/down/change_addr events - if a netdev is built onto
    our RoCE device, we need to add/delete its IPs. This involves
    adding all GIDs related to this ndev, add default GIDs, etc.

(b) inet events - add new GIDs (according to the IP addresses)
    to the table.

For programming the port RoCE GID table, providers must implement
the add_gid and del_gid callbacks.

RoCE GID management requires us to state the associated net_device
alongside the GID. This information is necessary in order to manage
the GID table. For example, when a net_device is removed, its
associated GIDs need to be removed as well.

RoCE mandates generating a default GID for each port, based on the
related net-device's IPv6 link local. In contrast to the GID based on
the regular IPv6 link-local (as we generate GID per IP address),
the default GID is also available when the net device is down (in
order to support loopback).

Locking is done as follows:
The patch modify the GID table code both for new RoCE drivers
implementing the add_gid/del_gid callbacks and for current RoCE and
IB drivers that do not. The flows for updating the table are
different, so the locking requirements are too.

While updating RoCE GID table, protection against multiple writers is
achieved via mutex_lock(&table->lock). Since writing to a table
requires us to find an entry (possible a free entry) in the table and
then modify it, this mutex protects both the find_gid and write_gid
ensuring the atomicity of the action.
Each entry in the GID cache is protected by rwlock. In RoCE, writing
(usually results from netdev notifier) involves invoking the vendor's
add_gid and del_gid callbacks, which could sleep.
Therefore, an invalid flag is added for each entry. Updates for RoCE are
done via a workqueue, thus sleeping is permitted.

In IB, updates are done in write_lock_irq(&device->cache.lock), thus
write_gid isn't allowed to sleep and add_gid/del_gid are not called.

When passing net-device into/out-of the GID cache, the device
is always passed held (dev_hold).

The code uses a single work item for updating all RDMA devices,
following a netdev or inet notifier.

The patch moves the cache from being a client (which was incorrect,
as the cache is part of the IB infrastructure) to being explicitly
initialized/freed when a device is registered/removed.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/Makefile        |   3 +-
 drivers/infiniband/core/cache.c         | 703 ++++++++++++++++++++++++++++----
 drivers/infiniband/core/core_priv.h     |  50 ++-
 drivers/infiniband/core/device.c        |  95 ++++-
 drivers/infiniband/core/roce_gid_mgmt.c | 465 +++++++++++++++++++++
 drivers/infiniband/core/sysfs.c         |   1 +
 include/rdma/ib_verbs.h                 |  66 ++-
 7 files changed, 1286 insertions(+), 97 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_mgmt.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index acf7367..d43a899 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 871da83..9b4f16b 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -37,6 +37,8 @@
 #include <linux/errno.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/netdevice.h>
+#include <net/addrconf.h>
 
 #include <rdma/ib_cache.h>
 
@@ -47,77 +49,596 @@ struct ib_pkey_cache {
 	u16             table[0];
 };
 
-struct ib_gid_cache {
-	int             table_len;
-	union ib_gid    table[0];
-};
-
 struct ib_update_work {
 	struct work_struct work;
 	struct ib_device  *device;
 	u8                 port_num;
 };
 
-int ib_get_cached_gid(struct ib_device *device,
-		      u8                port_num,
-		      int               index,
-		      union ib_gid     *gid)
+static union ib_gid zgid;
+
+static const struct ib_gid_attr zattr;
+
+enum gid_attr_find_mask {
+	GID_ATTR_FIND_MASK_GID          = 1UL << 0,
+	GID_ATTR_FIND_MASK_NETDEV	= 1UL << 1,
+	GID_ATTR_FIND_MASK_DEFAULT	= 1UL << 2,
+};
+
+enum gid_table_entry_props {
+	GID_TABLE_ENTRY_INVALID		= 1UL << 0,
+	GID_TABLE_ENTRY_DEFAULT		= 1UL << 1,
+};
+
+enum gid_table_write_action {
+	GID_TABLE_WRITE_ACTION_ADD,
+	GID_TABLE_WRITE_ACTION_DEL,
+	/* MODIFY only updates the GID table. Currently only used by
+	 * ib_cache_update.
+	 */
+	GID_TABLE_WRITE_ACTION_MODIFY
+};
+
+struct ib_gid_table_entry {
+	/* This lock protects an entry from being
+	 * read and written simultaneously.
+	 */
+	rwlock_t	    lock;
+	unsigned long	    props;
+	union ib_gid        gid;
+	struct ib_gid_attr  attr;
+	void		   *context;
+};
+
+struct ib_gid_table {
+	int                  sz;
+	/* In RoCE, adding a GID to the table requires:
+	 * (a) Find if this GID is already exists.
+	 * (b) Find a free space.
+	 * (c) Write the new GID
+	 *
+	 * Delete requires different set of operations:
+	 * (a) Find the GID
+	 * (b) Delete it.
+	 *
+	 * Add/delete should be carried out atomically.
+	 * This is done by locking this mutex from multiple
+	 * writers. We don't need this lock for IB, as the MAD
+	 * layer replaces all entries. All data_vec entries
+	 * are locked by this lock.
+	 **/
+	struct mutex         lock;
+	struct ib_gid_table_entry *data_vec;
+};
+
+static int write_gid(struct ib_device *ib_dev, u8 port,
+		     struct ib_gid_table *table, int ix,
+		     const union ib_gid *gid,
+		     const struct ib_gid_attr *attr,
+		     enum gid_table_write_action action,
+		     bool  default_gid)
 {
-	struct ib_gid_cache *cache;
-	unsigned long flags;
 	int ret = 0;
+	struct net_device *old_net_dev;
+
+	/* in rdma_cap_roce_gid_table, this funciton should be protected by a
+	 * sleep-able lock.
+	 */
+	write_lock(&table->data_vec[ix].lock);
+
+	if (rdma_cap_roce_gid_table(ib_dev, port)) {
+		table->data_vec[ix].props |= GID_TABLE_ENTRY_INVALID;
+		write_unlock(&table->data_vec[ix].lock);
+		/* GID_TABLE_WRITE_ACTION_MODIFY currently isn't supported by
+		 * RoCE providers and thus only updates the cache.
+		 */
+		if (action == GID_TABLE_WRITE_ACTION_ADD)
+			ret = ib_dev->add_gid(ib_dev, port, ix, gid, attr,
+					      &table->data_vec[ix].context);
+		else if (action == GID_TABLE_WRITE_ACTION_DEL)
+			ret = ib_dev->del_gid(ib_dev, port, ix,
+					      &table->data_vec[ix].context);
+		write_lock(&table->data_vec[ix].lock);
+	}
 
-	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
-		return -EINVAL;
+	old_net_dev = table->data_vec[ix].attr.ndev;
+	if (old_net_dev && old_net_dev != attr->ndev)
+		dev_put(old_net_dev);
+	/* if modify_gid failed, just delete the old gid */
+	if (ret || action == GID_TABLE_WRITE_ACTION_DEL) {
+		gid = &zgid;
+		attr = &zattr;
+		table->data_vec[ix].context = NULL;
+	}
+	if (default_gid)
+		table->data_vec[ix].props |= GID_TABLE_ENTRY_DEFAULT;
+	memcpy(&table->data_vec[ix].gid, gid, sizeof(*gid));
+	memcpy(&table->data_vec[ix].attr, attr, sizeof(*attr));
+	if (table->data_vec[ix].attr.ndev &&
+	    table->data_vec[ix].attr.ndev != old_net_dev)
+		dev_hold(table->data_vec[ix].attr.ndev);
 
-	read_lock_irqsave(&device->cache.lock, flags);
+	table->data_vec[ix].props &= ~GID_TABLE_ENTRY_INVALID;
 
-	cache = device->cache.gid_cache[port_num - rdma_start_port(device)];
+	write_unlock(&table->data_vec[ix].lock);
 
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*gid = cache->table[index];
+	if (!ret && rdma_cap_roce_gid_table(ib_dev, port)) {
+		struct ib_event event;
 
-	read_unlock_irqrestore(&device->cache.lock, flags);
+		event.device		= ib_dev;
+		event.element.port_num	= port;
+		event.event		= IB_EVENT_GID_CHANGE;
 
+		ib_dispatch_event(&event);
+	}
 	return ret;
 }
-EXPORT_SYMBOL(ib_get_cached_gid);
 
-int ib_find_cached_gid(struct ib_device   *device,
-		       const union ib_gid *gid,
-		       u8                 *port_num,
-		       u16                *index)
+static int add_gid(struct ib_device *ib_dev, u8 port,
+		   struct ib_gid_table *table, int ix,
+		   const union ib_gid *gid,
+		   const struct ib_gid_attr *attr,
+		   bool  default_gid) {
+	return write_gid(ib_dev, port, table, ix, gid, attr,
+			 GID_TABLE_WRITE_ACTION_ADD, default_gid);
+}
+
+static int modify_gid(struct ib_device *ib_dev, u8 port,
+		      struct ib_gid_table *table, int ix,
+		      const union ib_gid *gid,
+		      const struct ib_gid_attr *attr,
+		      bool  default_gid) {
+	return write_gid(ib_dev, port, table, ix, gid, attr,
+			 GID_TABLE_WRITE_ACTION_MODIFY, default_gid);
+}
+
+static int del_gid(struct ib_device *ib_dev, u8 port,
+		   struct ib_gid_table *table, int ix,
+		   bool  default_gid) {
+	return write_gid(ib_dev, port, table, ix, &zgid, &zattr,
+			 GID_TABLE_WRITE_ACTION_DEL, default_gid);
+}
+
+static int find_gid(struct ib_gid_table *table, const union ib_gid *gid,
+		    const struct ib_gid_attr *val, bool default_gid,
+		    unsigned long mask)
 {
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int p, i;
-	int ret = -ENOENT;
+	int i;
 
-	*port_num = -1;
-	if (index)
-		*index = -1;
+	for (i = 0; i < table->sz; i++) {
+		unsigned long flags;
+		struct ib_gid_attr *attr = &table->data_vec[i].attr;
 
-	read_lock_irqsave(&device->cache.lock, flags);
+		read_lock_irqsave(&table->data_vec[i].lock, flags);
 
-	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p) {
-		cache = device->cache.gid_cache[p];
-		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
-				*port_num = p + rdma_start_port(device);
-				if (index)
-					*index = i;
-				ret = 0;
-				goto found;
+		if (table->data_vec[i].props & GID_TABLE_ENTRY_INVALID)
+			goto next;
+
+		if (mask & GID_ATTR_FIND_MASK_GID &&
+		    memcmp(gid, &table->data_vec[i].gid, sizeof(*gid)))
+			goto next;
+
+		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
+		    attr->ndev != val->ndev)
+			goto next;
+
+		if (mask & GID_ATTR_FIND_MASK_DEFAULT &&
+		    !!(table->data_vec[i].props & GID_TABLE_ENTRY_DEFAULT) !=
+		    default_gid)
+			goto next;
+
+		read_unlock_irqrestore(&table->data_vec[i].lock, flags);
+		return i;
+next:
+		read_unlock_irqrestore(&table->data_vec[i].lock, flags);
+	}
+
+	return -1;
+}
+
+static void make_default_gid(struct  net_device *dev, union ib_gid *gid)
+{
+	gid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	addrconf_ifid_eui48(&gid->raw[8], dev);
+}
+
+int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	int ix;
+	int ret = 0;
+	struct net_device *idev;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!memcmp(gid, &zgid, sizeof(*gid)))
+		return -EINVAL;
+
+	if (ib_dev->get_netdev) {
+		idev = ib_dev->get_netdev(ib_dev, port);
+		if (idev && attr->ndev != idev) {
+			union ib_gid default_gid;
+
+			/* Adding default GIDs in not permitted */
+			make_default_gid(idev, &default_gid);
+			if (!memcmp(gid, &default_gid, sizeof(*gid))) {
+				dev_put(idev);
+				return -EPERM;
 			}
 		}
+		if (idev)
+			dev_put(idev);
 	}
-found:
-	read_unlock_irqrestore(&device->cache.lock, flags);
 
+	mutex_lock(&table->lock);
+
+	ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix >= 0)
+		goto out_unlock;
+
+	ix = find_gid(table, &zgid, NULL, false, GID_ATTR_FIND_MASK_GID |
+		      GID_ATTR_FIND_MASK_DEFAULT);
+	if (ix < 0) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	add_gid(ib_dev, port, table, ix, gid, attr, false);
+
+out_unlock:
+	mutex_unlock(&table->lock);
 	return ret;
 }
+
+int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	int ix;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	mutex_lock(&table->lock);
+
+	ix = find_gid(table, gid, attr, false,
+		      GID_ATTR_FIND_MASK_GID	  |
+		      GID_ATTR_FIND_MASK_NETDEV	  |
+		      GID_ATTR_FIND_MASK_DEFAULT);
+	if (ix < 0)
+		goto out_unlock;
+
+	del_gid(ib_dev, port, table, ix, false);
+
+out_unlock:
+	mutex_unlock(&table->lock);
+	return 0;
+}
+
+int ib_cache_gid_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+				     struct net_device *ndev)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	int ix;
+
+	table  = ports_table[port - rdma_start_port(ib_dev)];
+
+	mutex_lock(&table->lock);
+
+	for (ix = 0; ix < table->sz; ix++)
+		if (table->data_vec[ix].attr.ndev == ndev)
+			del_gid(ib_dev, port, table, ix, false);
+
+	mutex_unlock(&table->lock);
+	return 0;
+}
+
+static int __ib_cache_gid_get(struct ib_device *ib_dev, u8 port, int index,
+			      union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	unsigned long flags;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (index < 0 || index >= table->sz)
+		return -EINVAL;
+
+	read_lock_irqsave(&table->data_vec[index].lock, flags);
+	if (table->data_vec[index].props & GID_TABLE_ENTRY_INVALID) {
+		read_unlock_irqrestore(&table->data_vec[index].lock, flags);
+		return -EAGAIN;
+	}
+
+	memcpy(gid, &table->data_vec[index].gid, sizeof(*gid));
+	if (attr) {
+		memcpy(attr, &table->data_vec[index].attr, sizeof(*attr));
+		if (attr->ndev)
+			dev_hold(attr->ndev);
+	}
+
+	read_unlock_irqrestore(&table->data_vec[index].lock, flags);
+	return 0;
+}
+
+static int _ib_cache_gid_table_find(struct ib_device *ib_dev,
+				    const union ib_gid *gid,
+				    const struct ib_gid_attr *val,
+				    unsigned long mask,
+				    u8 *port, u16 *index)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	u8 p;
+	int local_index;
+
+	for (p = 0; p < ib_dev->phys_port_cnt; p++) {
+		table = ports_table[p];
+		local_index = find_gid(table, gid, val, false, mask);
+		if (local_index >= 0) {
+			if (index)
+				*index = local_index;
+			if (port)
+				*port = p + rdma_start_port(ib_dev);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int ib_cache_gid_find(struct ib_device *ib_dev,
+			     const union ib_gid *gid,
+			     struct net_device *ndev, u8 *port,
+			     u16 *index)
+{
+	unsigned long mask = GID_ATTR_FIND_MASK_GID;
+	struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+
+	if (ndev)
+		mask |= GID_ATTR_FIND_MASK_NETDEV;
+
+	return _ib_cache_gid_table_find(ib_dev, gid, &gid_attr_val,
+					mask, port, index);
+}
+
+int ib_cache_gid_find_by_port(struct ib_device *ib_dev,
+			      const union ib_gid *gid,
+			      u8 port, struct net_device *ndev,
+			      u16 *index)
+{
+	int local_index;
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	struct ib_gid_table *table;
+	unsigned long mask = GID_ATTR_FIND_MASK_GID;
+	struct ib_gid_attr val = {.ndev = ndev};
+
+	if (port < rdma_start_port(ib_dev) ||
+	    port > rdma_end_port(ib_dev))
+		return -ENOENT;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (ndev)
+		mask |= GID_ATTR_FIND_MASK_NETDEV;
+
+	local_index = find_gid(table, gid, &val, false, mask);
+	if (local_index >= 0) {
+		if (index)
+			*index = local_index;
+		return 0;
+	}
+
+	return -ENOENT;
+}
+
+static struct ib_gid_table *alloc_gid_table(int sz)
+{
+	unsigned int i;
+	struct ib_gid_table *table =
+		kzalloc(sizeof(struct ib_gid_table), GFP_KERNEL);
+	if (!table)
+		return NULL;
+
+	table->data_vec = kcalloc(sz, sizeof(*table->data_vec), GFP_KERNEL);
+	if (!table->data_vec)
+		goto err_free_table;
+
+	mutex_init(&table->lock);
+
+	table->sz = sz;
+
+	for (i = 0; i < sz; i++)
+		rwlock_init(&table->data_vec[i].lock);
+
+	return table;
+
+err_free_table:
+	kfree(table);
+	return NULL;
+}
+
+static void free_gid_table(struct ib_device *ib_dev, u8 port,
+			   struct ib_gid_table *table)
+{
+	int i;
+
+	if (!table)
+		return;
+
+	for (i = 0; i < table->sz; ++i) {
+		if (memcmp(&table->data_vec[i].gid, &zgid,
+			   sizeof(table->data_vec[i].gid)))
+			del_gid(ib_dev, port, table, i,
+				table->data_vec[i].props &
+				GID_ATTR_FIND_MASK_DEFAULT);
+	}
+	kfree(table->data_vec);
+	kfree(table);
+}
+
+void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
+				  struct net_device *ndev,
+				  enum ib_cache_gid_default_mode mode)
+{
+	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
+	union ib_gid gid;
+	struct ib_gid_attr gid_attr;
+	struct ib_gid_table *table;
+	int ix;
+	union ib_gid current_gid;
+	struct ib_gid_attr current_gid_attr = {};
+
+	table  = ports_table[port - rdma_start_port(ib_dev)];
+
+	make_default_gid(ndev, &gid);
+	memset(&gid_attr, 0, sizeof(gid_attr));
+	gid_attr.ndev = ndev;
+
+	ix = find_gid(table, NULL, NULL, true, GID_ATTR_FIND_MASK_DEFAULT);
+
+	/* Coudn't find default GID location */
+	WARN_ON(ix < 0);
+
+	mutex_lock(&table->lock);
+	if (!__ib_cache_gid_get(ib_dev, port, ix,
+				&current_gid, &current_gid_attr) &&
+	    mode == IB_CACHE_GID_DEFAULT_MODE_SET &&
+	    !memcmp(&gid, &current_gid, sizeof(gid)) &&
+	    !memcmp(&gid_attr, &current_gid_attr, sizeof(gid_attr)))
+		goto unlock;
+
+	if ((memcmp(&current_gid, &zgid, sizeof(current_gid)) ||
+	     memcmp(&current_gid_attr, &zattr,
+		    sizeof(current_gid_attr))) &&
+	    del_gid(ib_dev, port, table, ix, true)) {
+		pr_warn("ib_cache_gid: can't delete index %d for default gid %pI6\n",
+			ix, gid.raw);
+		goto unlock;
+	}
+
+	if (mode == IB_CACHE_GID_DEFAULT_MODE_SET)
+		if (add_gid(ib_dev, port, table, ix, &gid, &gid_attr, true))
+			pr_warn("ib_cache_gid: unable to add default gid %pI6\n",
+				gid.raw);
+
+unlock:
+	if (current_gid_attr.ndev)
+		dev_put(current_gid_attr.ndev);
+	mutex_unlock(&table->lock);
+}
+
+static int gid_table_reserve_default(struct ib_device *ib_dev, u8 port,
+				     struct ib_gid_table *table)
+{
+	if (rdma_protocol_roce(ib_dev, port)) {
+		struct ib_gid_table_entry *entry = &table->data_vec[0];
+
+		entry->props |= GID_TABLE_ENTRY_DEFAULT;
+	}
+
+	return 0;
+}
+
+static int _gid_table_setup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+	struct ib_gid_table **table;
+	int err = 0;
+
+	table = kcalloc(ib_dev->phys_port_cnt, sizeof(*table), GFP_KERNEL);
+
+	if (!table) {
+		pr_warn("failed to allocate ib gid cache for %s\n",
+			ib_dev->name);
+		return -ENOMEM;
+	}
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++) {
+		u8 rdma_port = port + rdma_start_port(ib_dev);
+
+		table[port] =
+			alloc_gid_table(
+				ib_dev->port_immutable[rdma_port].gid_tbl_len);
+		if (!table[port]) {
+			err = -ENOMEM;
+			goto rollback_table_setup;
+		}
+
+		err = gid_table_reserve_default(ib_dev,
+						port + rdma_start_port(ib_dev),
+						table[port]);
+		if (err)
+			goto rollback_table_setup;
+	}
+
+	ib_dev->cache.gid_cache = table;
+	return 0;
+
+rollback_table_setup:
+	for (port = 1; port <= ib_dev->phys_port_cnt; port++)
+		free_gid_table(ib_dev, port, table[port]);
+
+	kfree(table);
+	return err;
+}
+
+static void gid_table_cleanup_one(struct ib_device *ib_dev)
+{
+	struct ib_gid_table **table = ib_dev->cache.gid_cache;
+	u8 port;
+
+	if (!table)
+		return;
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++)
+		free_gid_table(ib_dev, port + rdma_start_port(ib_dev),
+			       table[port]);
+
+	kfree(table);
+}
+
+static int gid_table_setup_one(struct ib_device *ib_dev)
+{
+	int err;
+
+	err = _gid_table_setup_one(ib_dev);
+
+	if (err)
+		return err;
+
+	err = roce_rescan_device(ib_dev);
+
+	if (err)
+		gid_table_cleanup_one(ib_dev);
+
+	return err;
+}
+
+int ib_get_cached_gid(struct ib_device *device,
+		      u8                port_num,
+		      int               index,
+		      union ib_gid     *gid)
+{
+	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
+		return -EINVAL;
+
+	return __ib_cache_gid_get(device, port_num, index, gid, NULL);
+}
+EXPORT_SYMBOL(ib_get_cached_gid);
+
+int ib_find_cached_gid(struct ib_device *device,
+		       const union ib_gid *gid,
+		       u8               *port_num,
+		       u16              *index)
+{
+	return ib_cache_gid_find(device, gid, NULL, port_num, index);
+}
 EXPORT_SYMBOL(ib_find_cached_gid);
 
 int ib_get_cached_pkey(struct ib_device *device,
@@ -243,9 +764,21 @@ static void ib_cache_update(struct ib_device *device,
 {
 	struct ib_port_attr       *tprops = NULL;
 	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
-	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
+	struct ib_gid_cache {
+		int             table_len;
+		union ib_gid    table[0];
+	}			  *gid_cache = NULL;
 	int                        i;
 	int                        ret;
+	struct ib_gid_table	  *table;
+	struct ib_gid_table	 **ports_table = device->cache.gid_cache;
+	bool			   use_roce_gid_table =
+					rdma_cap_roce_gid_table(device, port);
+
+	if (port < rdma_start_port(device) || port > rdma_end_port(device))
+		return;
+
+	table = ports_table[port - rdma_start_port(device)];
 
 	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
 	if (!tprops)
@@ -265,12 +798,14 @@ static void ib_cache_update(struct ib_device *device,
 
 	pkey_cache->table_len = tprops->pkey_tbl_len;
 
-	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
-	if (!gid_cache)
-		goto err;
+	if (!use_roce_gid_table) {
+		gid_cache = kmalloc(sizeof(*gid_cache) + tprops->gid_tbl_len *
+			    sizeof(*gid_cache->table), GFP_KERNEL);
+		if (!gid_cache)
+			goto err;
 
-	gid_cache->table_len = tprops->gid_tbl_len;
+		gid_cache->table_len = tprops->gid_tbl_len;
+	}
 
 	for (i = 0; i < pkey_cache->table_len; ++i) {
 		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
@@ -281,29 +816,36 @@ static void ib_cache_update(struct ib_device *device,
 		}
 	}
 
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
+	if (!use_roce_gid_table) {
+		for (i = 0;  i < gid_cache->table_len; ++i) {
+			ret = ib_query_gid(device, port, i,
+					   gid_cache->table + i);
+			if (ret) {
+				printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
+				       ret, device->name, i);
+				goto err;
+			}
 		}
 	}
 
 	write_lock_irq(&device->cache.lock);
 
 	old_pkey_cache = device->cache.pkey_cache[port - rdma_start_port(device)];
-	old_gid_cache  = device->cache.gid_cache [port - rdma_start_port(device)];
 
 	device->cache.pkey_cache[port - rdma_start_port(device)] = pkey_cache;
-	device->cache.gid_cache [port - rdma_start_port(device)] = gid_cache;
+	if (!use_roce_gid_table) {
+		for (i = 0; i < gid_cache->table_len; i++) {
+			modify_gid(device, port, table, i, gid_cache->table + i,
+				   &zattr, false);
+		}
+	}
 
 	device->cache.lmc_cache[port - rdma_start_port(device)] = tprops->lmc;
 
 	write_unlock_irq(&device->cache.lock);
 
+	kfree(gid_cache);
 	kfree(old_pkey_cache);
-	kfree(old_gid_cache);
 	kfree(tprops);
 	return;
 
@@ -344,85 +886,76 @@ static void ib_cache_event(struct ib_event_handler *handler,
 	}
 }
 
-static void ib_cache_setup_one(struct ib_device *device)
+int ib_cache_setup_one(struct ib_device *device)
 {
 	int p;
+	int err;
 
 	rwlock_init(&device->cache.lock);
 
 	device->cache.pkey_cache =
 		kmalloc(sizeof *device->cache.pkey_cache *
 			(rdma_end_port(device) - rdma_start_port(device) + 1), GFP_KERNEL);
-	device->cache.gid_cache =
-		kmalloc(sizeof *device->cache.gid_cache *
-			(rdma_end_port(device) - rdma_start_port(device) + 1), GFP_KERNEL);
-
 	device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache *
 					  (rdma_end_port(device) -
 					   rdma_start_port(device) + 1),
 					  GFP_KERNEL);
+	err = gid_table_setup_one(device);
 
 	if (!device->cache.pkey_cache || !device->cache.gid_cache ||
 	    !device->cache.lmc_cache) {
 		printk(KERN_WARNING "Couldn't allocate cache "
 		       "for %s\n", device->name);
+		err = -ENOMEM;
 		goto err;
 	}
 
 	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p) {
 		device->cache.pkey_cache[p] = NULL;
-		device->cache.gid_cache [p] = NULL;
 		ib_cache_update(device, p + rdma_start_port(device));
 	}
 
 	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
 			      device, ib_cache_event);
-	if (ib_register_event_handler(&device->cache.event_handler))
+	err = ib_register_event_handler(&device->cache.event_handler);
+	if (err)
 		goto err_cache;
 
-	return;
+	return 0;
 
 err_cache:
-	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p) {
+	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p)
 		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
 
 err:
 	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
+	gid_table_cleanup_one(device);
 	kfree(device->cache.lmc_cache);
+
+	return err;
 }
 
-static void ib_cache_cleanup_one(struct ib_device *device)
+void ib_cache_cleanup_one(struct ib_device *device)
 {
 	int p;
 
 	ib_unregister_event_handler(&device->cache.event_handler);
 	flush_workqueue(ib_wq);
 
-	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p) {
+	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p)
 		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
 
 	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
+	gid_table_cleanup_one(device);
 	kfree(device->cache.lmc_cache);
 }
 
-static struct ib_client cache_client = {
-	.name   = "cache",
-	.add    = ib_cache_setup_one,
-	.remove = ib_cache_cleanup_one
-};
-
-int __init ib_cache_setup(void)
+void __init ib_cache_setup(void)
 {
-	return ib_register_client(&cache_client);
+	roce_gid_mgmt_init();
 }
 
 void __exit ib_cache_cleanup(void)
 {
-	ib_unregister_client(&cache_client);
+	roce_gid_mgmt_cleanup();
 }
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 87d1936..210ddd2 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -46,9 +46,57 @@ void ib_device_unregister_sysfs(struct ib_device *device);
 int  ib_sysfs_setup(void);
 void ib_sysfs_cleanup(void);
 
-int  ib_cache_setup(void);
+void  ib_cache_setup(void);
 void ib_cache_cleanup(void);
 
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
+
+typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
+	      struct net_device *idev, void *cookie);
+
+typedef int (*roce_netdev_filter)(struct ib_device *device, u8 port,
+	     struct net_device *idev, void *cookie);
+
+void ib_enum_roce_netdev(struct ib_device *ib_dev,
+			 roce_netdev_filter filter,
+			 void *filter_cookie,
+			 roce_netdev_callback cb,
+			 void *cookie);
+void ib_enum_all_roce_netdevs(roce_netdev_filter filter,
+			      void *filter_cookie,
+			      roce_netdev_callback cb,
+			      void *cookie);
+
+int ib_cache_gid_find_by_port(struct ib_device *ib_dev,
+			      const union ib_gid *gid,
+			      u8 port, struct net_device *ndev,
+			      u16 *index);
+
+enum ib_cache_gid_default_mode {
+	IB_CACHE_GID_DEFAULT_MODE_SET,
+	IB_CACHE_GID_DEFAULT_MODE_DELETE
+};
+
+void ib_cache_gid_set_default_gid(struct ib_device *ib_dev, u8 port,
+				  struct net_device *ndev,
+				  enum ib_cache_gid_default_mode mode);
+
+int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr);
+
+int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
+		     union ib_gid *gid, struct ib_gid_attr *attr);
+
+int ib_cache_gid_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+				     struct net_device *ndev);
+
+int roce_gid_mgmt_init(void);
+void roce_gid_mgmt_cleanup(void);
+
+int roce_rescan_device(struct ib_device *ib_dev);
+
+int ib_cache_setup_one(struct ib_device *device);
+void ib_cache_cleanup_one(struct ib_device *device);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index f08d438..14ea709 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -39,6 +39,8 @@
 #include <linux/init.h>
 #include <linux/mutex.h>
 #include <rdma/rdma_netlink.h>
+#include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
 
 #include "core_priv.h"
 
@@ -314,6 +316,14 @@ int ib_register_device(struct ib_device *device,
 
 	device->reg_state = IB_DEV_REGISTERED;
 
+	ret = ib_cache_setup_one(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
+		ib_device_unregister_sysfs(device);
+		kfree(device->port_immutable);
+		goto out;
+	}
+
 	{
 		struct ib_client *client;
 
@@ -607,11 +617,80 @@ EXPORT_SYMBOL(ib_query_port);
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid)
 {
+	if (rdma_cap_roce_gid_table(device, port_num))
+		return ib_get_cached_gid(device, port_num, index, gid);
+
 	return device->query_gid(device, port_num, index, gid);
 }
 EXPORT_SYMBOL(ib_query_gid);
 
 /**
+ * ib_enum_roce_netdev - enumerate all RoCE ports
+ * @ib_dev : IB device we want to query
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE ports
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all of the physical RoCE ports of ib_dev
+ * which are related to netdevice and calls callback() on each
+ * device for which filter() function returns non zero.
+ */
+void ib_enum_roce_netdev(struct ib_device *ib_dev,
+			 roce_netdev_filter filter,
+			 void *filter_cookie,
+			 roce_netdev_callback cb,
+			 void *cookie)
+{
+	u8 port;
+
+	for (port = rdma_start_port(ib_dev); port <= rdma_end_port(ib_dev);
+	     port++)
+		if (rdma_protocol_roce(ib_dev, port)) {
+			struct net_device *idev = NULL;
+
+			if (ib_dev->get_netdev)
+				idev = ib_dev->get_netdev(ib_dev, port);
+
+			if (idev &&
+			    idev->reg_state >= NETREG_UNREGISTERED) {
+				dev_put(idev);
+				idev = NULL;
+			}
+
+			if (filter(ib_dev, port, idev, filter_cookie))
+				cb(ib_dev, port, idev, cookie);
+
+			if (idev)
+				dev_put(idev);
+		}
+}
+
+/**
+ * ib_enum_all_roce_netdevs - enumerate all RoCE devices
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE ports
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all RoCE devices' physical ports which are related
+ * to netdevices and calls callback() on each device for which
+ * filter() function returns non zero.
+ */
+void ib_enum_all_roce_netdevs(roce_netdev_filter filter,
+			      void *filter_cookie,
+			      roce_netdev_callback cb,
+			      void *cookie)
+{
+	struct ib_device *dev;
+
+	down_read(&lists_rwsem);
+	list_for_each_entry(dev, &device_list, core_list)
+		ib_enum_roce_netdev(dev, filter, filter_cookie, cb, cookie);
+	up_read(&lists_rwsem);
+}
+
+/**
  * ib_query_pkey - Get P_Key table entry
  * @device:Device to query
  * @port_num:Port number to query
@@ -690,6 +769,13 @@ int ib_find_gid(struct ib_device *device, union ib_gid *gid,
 	int ret, port, i;
 
 	for (port = rdma_start_port(device); port <= rdma_end_port(device); ++port) {
+		if (rdma_cap_roce_gid_table(device, port)) {
+			if (!ib_cache_gid_find_by_port(device, gid, port,
+						       NULL, index))
+				*port_num = port;
+				return 0;
+		}
+
 		for (i = 0; i < device->port_immutable[port].gid_tbl_len; ++i) {
 			ret = ib_query_gid(device, port, i, &tmp_gid);
 			if (ret)
@@ -766,17 +852,10 @@ static int __init ib_core_init(void)
 		goto err_sysfs;
 	}
 
-	ret = ib_cache_setup();
-	if (ret) {
-		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
-		goto err_nl;
-	}
+	ib_cache_setup();
 
 	return 0;
 
-err_nl:
-	ibnl_cleanup();
-
 err_sysfs:
 	ib_sysfs_cleanup();
 
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
new file mode 100644
index 0000000..7bf4798
--- /dev/null
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -0,0 +1,465 @@
+/*
+ * Copyright (c) 2015, Mellanox Technologies inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "core_priv.h"
+
+#include <linux/in.h>
+#include <linux/in6.h>
+
+/* For in6_dev_get/in6_dev_put */
+#include <net/addrconf.h>
+
+#include <rdma/ib_cache.h>
+#include <rdma/ib_addr.h>
+
+enum gid_op_type {
+	GID_DEL = 0,
+	GID_ADD
+};
+
+struct update_gid_event_work {
+	struct work_struct work;
+	union ib_gid       gid;
+	struct ib_gid_attr gid_attr;
+	enum gid_op_type gid_op;
+};
+
+#define ROCE_NETDEV_CALLBACK_SZ		2
+struct netdev_event_work_cmd {
+	roce_netdev_callback	cb;
+	roce_netdev_filter	filter;
+};
+
+struct netdev_event_work {
+	struct work_struct		work;
+	struct netdev_event_work_cmd	cmds[ROCE_NETDEV_CALLBACK_SZ];
+	struct net_device		*ndev;
+};
+
+static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
+		       u8 port, union ib_gid *gid,
+		       struct ib_gid_attr *gid_attr)
+{
+	switch (gid_op) {
+	case GID_ADD:
+		ib_cache_gid_add(ib_dev, port, gid, gid_attr);
+		break;
+	case GID_DEL:
+		ib_cache_gid_del(ib_dev, port, gid, gid_attr);
+		break;
+	}
+}
+
+static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
+				 struct net_device *rdma_ndev, void *cookie)
+{
+	struct net_device *real_dev;
+	struct net_device *master_dev;
+	struct net_device *event_ndev = (struct net_device *)cookie;
+	int res;
+
+	if (!rdma_ndev)
+		return 0;
+
+	rcu_read_lock();
+	master_dev = netdev_master_upper_dev_get_rcu(rdma_ndev);
+	real_dev = rdma_vlan_dev_real_dev(event_ndev);
+	res = (real_dev ? real_dev : event_ndev) ==
+		(master_dev ? master_dev : rdma_ndev);
+	rcu_read_unlock();
+
+	return res;
+}
+
+static int pass_all_filter(struct ib_device *ib_dev, u8 port,
+			   struct net_device *rdma_ndev, void *cookie)
+{
+	return 1;
+}
+
+static void update_gid_ip(enum gid_op_type gid_op,
+			  struct ib_device *ib_dev,
+			  u8 port, struct net_device *ndev,
+			  struct sockaddr *addr)
+{
+	union ib_gid gid;
+	struct ib_gid_attr gid_attr;
+
+	rdma_ip2gid(addr, &gid);
+	memset(&gid_attr, 0, sizeof(gid_attr));
+	gid_attr.ndev = ndev;
+
+	update_gid(gid_op, ib_dev, port, &gid, &gid_attr);
+}
+
+static void enum_netdev_default_gids(struct ib_device *ib_dev,
+				     u8 port, struct net_device *event_ndev,
+				     struct net_device *rdma_ndev)
+{
+	if (rdma_ndev != event_ndev)
+		return;
+
+	ib_cache_gid_set_default_gid(ib_dev, port, rdma_ndev,
+				     IB_CACHE_GID_DEFAULT_MODE_SET);
+}
+
+static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
+				 u8 port, struct net_device *ndev)
+{
+	struct in_device *in_dev;
+
+	if (ndev->reg_state >= NETREG_UNREGISTERING)
+		return;
+
+	in_dev = in_dev_get(ndev);
+	if (!in_dev)
+		return;
+
+	for_ifa(in_dev) {
+		struct sockaddr_in ip;
+
+		ip.sin_family = AF_INET;
+		ip.sin_addr.s_addr = ifa->ifa_address;
+		update_gid_ip(GID_ADD, ib_dev, port, ndev,
+			      (struct sockaddr *)&ip);
+	}
+	endfor_ifa(in_dev);
+
+	in_dev_put(in_dev);
+}
+
+static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
+				 u8 port, struct net_device *ndev)
+{
+	struct inet6_ifaddr *ifp;
+	struct inet6_dev *in6_dev;
+	struct sin6_list {
+		struct list_head	list;
+		struct sockaddr_in6	sin6;
+	};
+	struct sin6_list *sin6_iter;
+	struct sin6_list *sin6_temp;
+	struct ib_gid_attr gid_attr = {.ndev = ndev};
+	LIST_HEAD(sin6_list);
+
+	if (ndev->reg_state >= NETREG_UNREGISTERING)
+		return;
+
+	in6_dev = in6_dev_get(ndev);
+	if (!in6_dev)
+		return;
+
+	read_lock_bh(&in6_dev->lock);
+	list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
+		struct sin6_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+
+		if (!entry) {
+			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv6 update\n");
+			continue;
+		}
+
+		entry->sin6.sin6_family = AF_INET6;
+		entry->sin6.sin6_addr = ifp->addr;
+		list_add_tail(&entry->list, &sin6_list);
+	}
+	read_unlock_bh(&in6_dev->lock);
+
+	in6_dev_put(in6_dev);
+
+	list_for_each_entry_safe(sin6_iter, sin6_temp, &sin6_list, list) {
+		union ib_gid	gid;
+
+		rdma_ip2gid((struct sockaddr *)&sin6_iter->sin6, &gid);
+		update_gid(GID_ADD, ib_dev, port, &gid, &gid_attr);
+		list_del(&sin6_iter->list);
+		kfree(sin6_iter);
+	}
+}
+
+static void add_netdev_ips(struct ib_device *ib_dev, u8 port,
+			   struct net_device *rdma_ndev, void *cookie)
+{
+	struct net_device *event_ndev = (struct net_device *)cookie;
+
+	enum_netdev_default_gids(ib_dev, port, event_ndev, rdma_ndev);
+	enum_netdev_ipv4_ips(ib_dev, port, event_ndev);
+	if (IS_ENABLED(CONFIG_IPV6))
+		enum_netdev_ipv6_ips(ib_dev, port, event_ndev);
+}
+
+static void del_netdev_ips(struct ib_device *ib_dev, u8 port,
+			   struct net_device *rdma_ndev, void *cookie)
+{
+	struct net_device *event_ndev = (struct net_device *)cookie;
+
+	ib_cache_gid_del_all_netdev_gids(ib_dev, port, event_ndev);
+}
+
+static void enum_all_gids_of_dev_cb(struct ib_device *ib_dev,
+				    u8 port,
+				    struct net_device *rdma_ndev,
+				    void *cookie)
+{
+	struct net *net;
+	struct net_device *ndev;
+
+	/* Lock the rtnl to make sure the netdevs does not move under
+	 * our feet
+	 */
+	rtnl_lock();
+	for_each_net(net)
+		for_each_netdev(net, ndev)
+			if (is_eth_port_of_netdev(ib_dev, port, rdma_ndev, ndev))
+				add_netdev_ips(ib_dev, port, rdma_ndev, ndev);
+	rtnl_unlock();
+}
+
+/* This function will rescan all of the network devices in the system
+ * and add their gids, as needed, to the relevant RoCE devices. */
+int roce_rescan_device(struct ib_device *ib_dev)
+{
+	ib_enum_roce_netdev(ib_dev, pass_all_filter, NULL,
+			    enum_all_gids_of_dev_cb, NULL);
+
+	return 0;
+}
+
+static void callback_for_addr_gid_device_scan(struct ib_device *device,
+					      u8 port,
+					      struct net_device *rdma_ndev,
+					      void *cookie)
+{
+	struct update_gid_event_work *parsed = cookie;
+
+	return update_gid(parsed->gid_op, device,
+			  port, &parsed->gid,
+			  &parsed->gid_attr);
+}
+
+/* The following functions operate on all IB devices. netdevice_event and
+ * addr_event execute ib_enum_all_roce_netdevs through a work.
+ * ib_enum_all_roce_netdevs iterates through all IB devices.
+ */
+
+static void netdevice_event_work_handler(struct work_struct *_work)
+{
+	struct netdev_event_work *work =
+		container_of(_work, struct netdev_event_work, work);
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++)
+		ib_enum_all_roce_netdevs(work->cmds[i].filter, work->ndev,
+					 work->cmds[i].cb, work->ndev);
+
+	dev_put(work->ndev);
+	kfree(work);
+}
+
+static int netdevice_queue_work(struct netdev_event_work_cmd *cmds,
+				struct net_device *ndev)
+{
+	struct netdev_event_work *ndev_work =
+		kmalloc(sizeof(*ndev_work), GFP_KERNEL);
+
+	if (!ndev_work) {
+		pr_warn("roce_gid_mgmt: can't allocate work for netdevice_event\n");
+		return NOTIFY_DONE;
+	}
+
+	memcpy(ndev_work->cmds, cmds, sizeof(ndev_work->cmds));
+	ndev_work->ndev = ndev;
+	dev_hold(ndev);
+	INIT_WORK(&ndev_work->work, netdevice_event_work_handler);
+
+	queue_work(ib_wq, &ndev_work->work);
+
+	return NOTIFY_DONE;
+}
+
+static int netdevice_event(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	static const struct netdev_event_work_cmd add_cmd = {
+		.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
+	static const struct netdev_event_work_cmd del_cmd = {
+		.cb = del_netdev_ips, .filter = pass_all_filter};
+	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+	struct netdev_event_work_cmd cmds[ROCE_NETDEV_CALLBACK_SZ] = { {NULL} };
+
+	if (ndev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+	case NETDEV_UP:
+		cmds[0] = add_cmd;
+		break;
+
+	case NETDEV_UNREGISTER:
+		if (ndev->reg_state < NETREG_UNREGISTERED)
+			cmds[0] = del_cmd;
+		else
+			return NOTIFY_DONE;
+		break;
+
+	case NETDEV_CHANGEADDR:
+		cmds[0] = del_cmd;
+		cmds[1] = add_cmd;
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	return netdevice_queue_work(cmds, ndev);
+}
+
+static void update_gid_event_work_handler(struct work_struct *_work)
+{
+	struct update_gid_event_work *work =
+		container_of(_work, struct update_gid_event_work, work);
+
+	ib_enum_all_roce_netdevs(is_eth_port_of_netdev, work->gid_attr.ndev,
+				 callback_for_addr_gid_device_scan, work);
+
+	dev_put(work->gid_attr.ndev);
+	kfree(work);
+}
+
+static int addr_event(struct notifier_block *this, unsigned long event,
+		      struct sockaddr *sa, struct net_device *ndev)
+{
+	struct update_gid_event_work *work;
+	enum gid_op_type gid_op;
+
+	if (ndev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UP:
+		gid_op = GID_ADD;
+		break;
+
+	case NETDEV_DOWN:
+		gid_op = GID_DEL;
+		break;
+
+	default:
+		return NOTIFY_DONE;
+	}
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work) {
+		pr_warn("roce_gid_mgmt: Couldn't allocate work for addr_event\n");
+		return NOTIFY_DONE;
+	}
+
+	INIT_WORK(&work->work, update_gid_event_work_handler);
+
+	rdma_ip2gid(sa, &work->gid);
+	work->gid_op = gid_op;
+
+	memset(&work->gid_attr, 0, sizeof(work->gid_attr));
+	dev_hold(ndev);
+	work->gid_attr.ndev   = ndev;
+
+	queue_work(ib_wq, &work->work);
+
+	return NOTIFY_DONE;
+}
+
+static int inetaddr_event(struct notifier_block *this, unsigned long event,
+			  void *ptr)
+{
+	struct sockaddr_in	in;
+	struct net_device	*ndev;
+	struct in_ifaddr	*ifa = ptr;
+
+	in.sin_family = AF_INET;
+	in.sin_addr.s_addr = ifa->ifa_address;
+	ndev = ifa->ifa_dev->dev;
+
+	return addr_event(this, event, (struct sockaddr *)&in, ndev);
+}
+
+static int inet6addr_event(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	struct sockaddr_in6	in6;
+	struct net_device	*ndev;
+	struct inet6_ifaddr	*ifa6 = ptr;
+
+	in6.sin6_family = AF_INET6;
+	in6.sin6_addr = ifa6->addr;
+	ndev = ifa6->idev->dev;
+
+	return addr_event(this, event, (struct sockaddr *)&in6, ndev);
+}
+
+static struct notifier_block nb_netdevice = {
+	.notifier_call = netdevice_event
+};
+
+static struct notifier_block nb_inetaddr = {
+	.notifier_call = inetaddr_event
+};
+
+static struct notifier_block nb_inet6addr = {
+	.notifier_call = inet6addr_event
+};
+
+int __init roce_gid_mgmt_init(void)
+{
+	register_inetaddr_notifier(&nb_inetaddr);
+	if (IS_ENABLED(CONFIG_IPV6))
+		register_inet6addr_notifier(&nb_inet6addr);
+	/* We relay on the netdevice notifier to enumerate all
+	 * existing devices in the system. Register to this notifier
+	 * last to make sure we will not miss any IP add/del
+	 * callbacks.
+	 */
+	register_netdevice_notifier(&nb_netdevice);
+
+	return 0;
+}
+
+void __exit roce_gid_mgmt_cleanup(void)
+{
+	if (IS_ENABLED(CONFIG_IPV6))
+		unregister_inet6addr_notifier(&nb_inet6addr);
+	unregister_inetaddr_notifier(&nb_inetaddr);
+	unregister_netdevice_notifier(&nb_netdevice);
+	/* Ensure all gid deletion tasks complete before we go down,
+	 * to avoid any reference to free'd memory. By the time
+	 * ib-core is removed, all physical devices have been removed,
+	 * so no issue with remaining hardware contexts.
+	 */
+}
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 0b84a9c..c10c6d3 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -461,6 +461,7 @@ static void ib_device_release(struct device *device)
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
 
+	ib_cache_cleanup_one(dev);
 	kfree(dev->port_immutable);
 	kfree(dev);
 }
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 080a976..74fc94c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -64,6 +64,10 @@ union ib_gid {
 	} global;
 };
 
+struct ib_gid_attr {
+	struct net_device	*ndev;
+};
+
 enum rdma_node_type {
 	/* IB values map to NodeInfo:NodeType. */
 	RDMA_NODE_IB_CA 	= 1,
@@ -284,7 +288,7 @@ enum ib_port_cap_flags {
 	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
 	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
 	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
-	IB_PORT_IP_BASED_GIDS			= 1 << 26
+	IB_PORT_IP_BASED_GIDS			= 1 << 26,
 };
 
 enum ib_port_width {
@@ -1488,7 +1492,7 @@ struct ib_cache {
 	rwlock_t                lock;
 	struct ib_event_handler event_handler;
 	struct ib_pkey_cache  **pkey_cache;
-	struct ib_gid_cache   **gid_cache;
+	struct ib_gid_table   **gid_cache;
 	u8                     *lmc_cache;
 };
 
@@ -1572,9 +1576,47 @@ struct ib_device {
 						 struct ib_port_attr *port_attr);
 	enum rdma_link_layer	   (*get_link_layer)(struct ib_device *device,
 						     u8 port_num);
+	/* When calling get_netdev, the HW vendor's driver should return the
+	 * net device of device @device at port @port_num or NULL if such
+	 * a net device doesn't exist. The vendor driver should call dev_hold
+	 * on this net device. The HW vendor's device driver must guarantee
+	 * that this function returns NULL before the net device reaches
+	 * NETDEV_UNREGISTER_FINAL state.
+	 */
+	struct net_device	  *(*get_netdev)(struct ib_device *device,
+						 u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
+	/* When calling add_gid, the HW vendor's driver should
+	 * add the gid of device @device at gid index @index of
+	 * port @port_num to be @gid. Meta-info of that gid (for example,
+	 * the network device related to this gid is available
+	 * at @attr. @context allows the HW vendor driver to store extra
+	 * information together with a GID entry. The HW vendor may allocate
+	 * memory to contain this information and store it in @context when a
+	 * new GID entry is written to. Params are consistent until the next
+	 * call of add_gid or delete_gid. The function should return 0 on
+	 * success or error otherwise. The function could be called
+	 * concurrently for different ports. This function is only called
+	 * when roce_gid_table is used.
+	 */
+	int		           (*add_gid)(struct ib_device *device,
+					      u8 port_num,
+					      unsigned int index,
+					      const union ib_gid *gid,
+					      const struct ib_gid_attr *attr,
+					      void **context);
+	/* When calling del_gid, the HW vendor's driver should delete the
+	 * gid of device @device at gid index @index of port @port_num.
+	 * Upon the deletion of a GID entry, the HW vendor must free any
+	 * allocated memory. The caller will clear @context afterwards.
+	 * This function is only called when roce_gid_table is used.
+	 */
+	int		           (*del_gid)(struct ib_device *device,
+					      u8 port_num,
+					      unsigned int index,
+					      void **context);
 	int		           (*query_pkey)(struct ib_device *device,
 						 u8 port_num, u16 index, u16 *pkey);
 	int		           (*modify_device)(struct ib_device *device,
@@ -2088,6 +2130,26 @@ static inline size_t rdma_max_mad_size(const struct ib_device *device, u8 port_n
 	return device->port_immutable[port_num].max_mad_size;
 }
 
+/**
+ * rdma_cap_roce_gid_table - Check if the port of device uses roce_gid_table
+ * @device: Device to check
+ * @port_num: Port number to check
+ *
+ * RoCE GID table mechanism manages the various GIDs for a device.
+ *
+ * NOTE: if allocating the port's GID table has failed, this call will still
+ * return true, but any RoCE GID table API will fail.
+ *
+ * Return: true if the port uses RoCE GID table mechanism in order to manage
+ * its GIDs.
+ */
+static inline bool rdma_cap_roce_gid_table(const struct ib_device *device,
+					   u8 port_num)
+{
+	return rdma_protocol_roce(device, port_num) &&
+		device->add_gid && device->del_gid;
+}
+
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 06/10] IB/core: Add RoCE table bonding support
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-07-30 15:33   ` [PATCH for-next V7 05/10] IB/core: Add RoCE GID table management Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 07/10] net/mlx4: Postpone the registration of net_device Matan Barak
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

Handling bonding and other devices require us to all all GIDs of the
net-devices which are upper-devices of the RoCE port related
net-device.

Active-backup configurations imposes even more challenges as the
default GID should only be set on the active devices (this is
necessary as otherwise the same MAC could be used for several
slaves and thus several slaves will have identical GIDs).

Managing these configurations are done by listening to:
(a) NETDEV_CHANGEUPPER event
	(1) if a related net-device is linked, delete all inactive
	    slaves default GIDs and add the upper device GIDs.
	(2) if a related net-device is unlinked, delete all upper GIDs
	    and add the default GIDs.
(b) NETDEV_BONDING_FAILOVER:
	(1) delete the bond GIDs from inactive slaves
	(2) delete the inactive slave's default GIDs
	(3) Add the bond GIDs to the active slave.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/roce_gid_mgmt.c | 305 +++++++++++++++++++++++++++++---
 1 file changed, 285 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
index 7bf4798..6eecdfb 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -37,6 +37,7 @@
 
 /* For in6_dev_get/in6_dev_put */
 #include <net/addrconf.h>
+#include <net/bonding.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_addr.h>
@@ -53,16 +54,17 @@ struct update_gid_event_work {
 	enum gid_op_type gid_op;
 };
 
-#define ROCE_NETDEV_CALLBACK_SZ		2
+#define ROCE_NETDEV_CALLBACK_SZ		3
 struct netdev_event_work_cmd {
 	roce_netdev_callback	cb;
 	roce_netdev_filter	filter;
+	struct net_device	*ndev;
+	struct net_device	*filter_ndev;
 };
 
 struct netdev_event_work {
 	struct work_struct		work;
 	struct netdev_event_work_cmd	cmds[ROCE_NETDEV_CALLBACK_SZ];
-	struct net_device		*ndev;
 };
 
 static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
@@ -79,12 +81,70 @@ static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
 	}
 }
 
+enum bonding_slave_state {
+	BONDING_SLAVE_STATE_ACTIVE	= 1UL << 0,
+	BONDING_SLAVE_STATE_INACTIVE	= 1UL << 1,
+	/* No primary slave or the device isn't a slave in bonding */
+	BONDING_SLAVE_STATE_NA		= 1UL << 2,
+};
+
+static enum bonding_slave_state is_eth_active_slave_of_bonding_rcu(struct net_device *dev,
+								   struct net_device *upper)
+{
+	if (upper && netif_is_bond_master(upper)) {
+		struct net_device *pdev =
+			bond_option_active_slave_get_rcu(netdev_priv(upper));
+
+		if (pdev)
+			return dev == pdev ? BONDING_SLAVE_STATE_ACTIVE :
+				BONDING_SLAVE_STATE_INACTIVE;
+	}
+
+	return BONDING_SLAVE_STATE_NA;
+}
+
+static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
+{
+	struct net_device *_upper = NULL;
+	struct list_head *iter;
+
+	netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
+		if (_upper == upper)
+			break;
+
+	return _upper == upper;
+}
+
+#define REQUIRED_BOND_STATES		(BONDING_SLAVE_STATE_ACTIVE |	\
+					 BONDING_SLAVE_STATE_NA)
 static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
 				 struct net_device *rdma_ndev, void *cookie)
 {
+	struct net_device *event_ndev = (struct net_device *)cookie;
 	struct net_device *real_dev;
+	int res;
+
+	if (!rdma_ndev)
+		return 0;
+
+	rcu_read_lock();
+	real_dev = rdma_vlan_dev_real_dev(event_ndev);
+	if (!real_dev)
+		real_dev = event_ndev;
+
+	res = ((is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+	       (is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) &
+		REQUIRED_BOND_STATES)) ||
+	       real_dev == rdma_ndev);
+
+	rcu_read_unlock();
+	return res;
+}
+
+static int is_eth_port_inactive_slave(struct ib_device *ib_dev, u8 port,
+				      struct net_device *rdma_ndev, void *cookie)
+{
 	struct net_device *master_dev;
-	struct net_device *event_ndev = (struct net_device *)cookie;
 	int res;
 
 	if (!rdma_ndev)
@@ -92,9 +152,8 @@ static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
 
 	rcu_read_lock();
 	master_dev = netdev_master_upper_dev_get_rcu(rdma_ndev);
-	real_dev = rdma_vlan_dev_real_dev(event_ndev);
-	res = (real_dev ? real_dev : event_ndev) ==
-		(master_dev ? master_dev : rdma_ndev);
+	res = is_eth_active_slave_of_bonding_rcu(rdma_ndev, master_dev) ==
+		BONDING_SLAVE_STATE_INACTIVE;
 	rcu_read_unlock();
 
 	return res;
@@ -106,6 +165,25 @@ static int pass_all_filter(struct ib_device *ib_dev, u8 port,
 	return 1;
 }
 
+static int upper_device_filter(struct ib_device *ib_dev, u8 port,
+			       struct net_device *rdma_ndev, void *cookie)
+{
+	struct net_device *event_ndev = (struct net_device *)cookie;
+	int res;
+
+	if (!rdma_ndev)
+		return 0;
+
+	if (rdma_ndev == event_ndev)
+		return 1;
+
+	rcu_read_lock();
+	res = is_upper_dev_rcu(rdma_ndev, event_ndev);
+	rcu_read_unlock();
+
+	return res;
+}
+
 static void update_gid_ip(enum gid_op_type gid_op,
 			  struct ib_device *ib_dev,
 			  u8 port, struct net_device *ndev,
@@ -125,13 +203,49 @@ static void enum_netdev_default_gids(struct ib_device *ib_dev,
 				     u8 port, struct net_device *event_ndev,
 				     struct net_device *rdma_ndev)
 {
-	if (rdma_ndev != event_ndev)
+	rcu_read_lock();
+	if (!rdma_ndev ||
+	    ((rdma_ndev != event_ndev &&
+	      !is_upper_dev_rcu(rdma_ndev, event_ndev)) ||
+	     is_eth_active_slave_of_bonding_rcu(rdma_ndev,
+						netdev_master_upper_dev_get_rcu(rdma_ndev)) ==
+	     BONDING_SLAVE_STATE_INACTIVE)) {
+		rcu_read_unlock();
 		return;
+	}
+	rcu_read_unlock();
 
 	ib_cache_gid_set_default_gid(ib_dev, port, rdma_ndev,
 				     IB_CACHE_GID_DEFAULT_MODE_SET);
 }
 
+static void bond_delete_netdev_default_gids(struct ib_device *ib_dev,
+					    u8 port,
+					    struct net_device *event_ndev,
+					    struct net_device *rdma_ndev)
+{
+	struct net_device *real_dev = rdma_vlan_dev_real_dev(event_ndev);
+
+	if (!rdma_ndev)
+		return;
+
+	if (!real_dev)
+		real_dev = event_ndev;
+
+	rcu_read_lock();
+
+	if (is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+	    is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) ==
+	    BONDING_SLAVE_STATE_INACTIVE) {
+		rcu_read_unlock();
+
+		ib_cache_gid_set_default_gid(ib_dev, port, rdma_ndev,
+					     IB_CACHE_GID_DEFAULT_MODE_DELETE);
+	} else {
+		rcu_read_unlock();
+	}
+}
+
 static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
 				 u8 port, struct net_device *ndev)
 {
@@ -205,15 +319,21 @@ static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
 	}
 }
 
+static void _add_netdev_ips(struct ib_device *ib_dev, u8 port,
+			    struct net_device *ndev)
+{
+	enum_netdev_ipv4_ips(ib_dev, port, ndev);
+	if (IS_ENABLED(CONFIG_IPV6))
+		enum_netdev_ipv6_ips(ib_dev, port, ndev);
+}
+
 static void add_netdev_ips(struct ib_device *ib_dev, u8 port,
 			   struct net_device *rdma_ndev, void *cookie)
 {
 	struct net_device *event_ndev = (struct net_device *)cookie;
 
 	enum_netdev_default_gids(ib_dev, port, event_ndev, rdma_ndev);
-	enum_netdev_ipv4_ips(ib_dev, port, event_ndev);
-	if (IS_ENABLED(CONFIG_IPV6))
-		enum_netdev_ipv6_ips(ib_dev, port, event_ndev);
+	_add_netdev_ips(ib_dev, port, event_ndev);
 }
 
 static void del_netdev_ips(struct ib_device *ib_dev, u8 port,
@@ -265,6 +385,94 @@ static void callback_for_addr_gid_device_scan(struct ib_device *device,
 			  &parsed->gid_attr);
 }
 
+static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
+				void *cookie,
+				void (*handle_netdev)(struct ib_device *ib_dev,
+						      u8 port,
+						      struct net_device *ndev))
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+	struct upper_list {
+		struct list_head list;
+		struct net_device *upper;
+	};
+	struct net_device *upper;
+	struct list_head *iter;
+	struct upper_list *upper_iter;
+	struct upper_list *upper_temp;
+	LIST_HEAD(upper_list);
+
+	rcu_read_lock();
+	netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
+		struct upper_list *entry = kmalloc(sizeof(*entry),
+						   GFP_ATOMIC);
+
+		if (!entry) {
+			pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
+			continue;
+		}
+
+		list_add_tail(&entry->list, &upper_list);
+		dev_hold(upper);
+		entry->upper = upper;
+	}
+	rcu_read_unlock();
+
+	handle_netdev(ib_dev, port, ndev);
+	list_for_each_entry_safe(upper_iter, upper_temp, &upper_list,
+				 list) {
+		handle_netdev(ib_dev, port, upper_iter->upper);
+		dev_put(upper_iter->upper);
+		list_del(&upper_iter->list);
+		kfree(upper_iter);
+	}
+}
+
+static void _roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+				      struct net_device *event_ndev)
+{
+	ib_cache_gid_del_all_netdev_gids(ib_dev, port, event_ndev);
+}
+
+static void del_netdev_upper_ips(struct ib_device *ib_dev, u8 port,
+				 struct net_device *rdma_ndev, void *cookie)
+{
+	handle_netdev_upper(ib_dev, port, cookie, _roce_del_all_netdev_gids);
+}
+
+static void add_netdev_upper_ips(struct ib_device *ib_dev, u8 port,
+				 struct net_device *rdma_ndev, void *cookie)
+{
+	handle_netdev_upper(ib_dev, port, cookie, _add_netdev_ips);
+}
+
+static void del_netdev_default_ips_join(struct ib_device *ib_dev, u8 port,
+					struct net_device *rdma_ndev,
+					void *cookie)
+{
+	struct net_device *master_ndev;
+
+	rcu_read_lock();
+	master_ndev = netdev_master_upper_dev_get_rcu(rdma_ndev);
+	if (master_ndev)
+		dev_hold(master_ndev);
+	rcu_read_unlock();
+
+	if (master_ndev) {
+		bond_delete_netdev_default_gids(ib_dev, port, master_ndev,
+						rdma_ndev);
+		dev_put(master_ndev);
+	}
+}
+
+static void del_netdev_default_ips(struct ib_device *ib_dev, u8 port,
+				   struct net_device *rdma_ndev, void *cookie)
+{
+	struct net_device *event_ndev = (struct net_device *)cookie;
+
+	bond_delete_netdev_default_gids(ib_dev, port, event_ndev, rdma_ndev);
+}
+
 /* The following functions operate on all IB devices. netdevice_event and
  * addr_event execute ib_enum_all_roce_netdevs through a work.
  * ib_enum_all_roce_netdevs iterates through all IB devices.
@@ -276,17 +484,22 @@ static void netdevice_event_work_handler(struct work_struct *_work)
 		container_of(_work, struct netdev_event_work, work);
 	unsigned int i;
 
-	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++)
-		ib_enum_all_roce_netdevs(work->cmds[i].filter, work->ndev,
-					 work->cmds[i].cb, work->ndev);
+	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++) {
+		ib_enum_all_roce_netdevs(work->cmds[i].filter,
+					 work->cmds[i].filter_ndev,
+					 work->cmds[i].cb,
+					 work->cmds[i].ndev);
+		dev_put(work->cmds[i].ndev);
+		dev_put(work->cmds[i].filter_ndev);
+	}
 
-	dev_put(work->ndev);
 	kfree(work);
 }
 
 static int netdevice_queue_work(struct netdev_event_work_cmd *cmds,
 				struct net_device *ndev)
 {
+	unsigned int i;
 	struct netdev_event_work *ndev_work =
 		kmalloc(sizeof(*ndev_work), GFP_KERNEL);
 
@@ -296,8 +509,14 @@ static int netdevice_queue_work(struct netdev_event_work_cmd *cmds,
 	}
 
 	memcpy(ndev_work->cmds, cmds, sizeof(ndev_work->cmds));
-	ndev_work->ndev = ndev;
-	dev_hold(ndev);
+	for (i = 0; i < ARRAY_SIZE(ndev_work->cmds) && ndev_work->cmds[i].cb; i++) {
+		if (!ndev_work->cmds[i].ndev)
+			ndev_work->cmds[i].ndev = ndev;
+		if (!ndev_work->cmds[i].filter_ndev)
+			ndev_work->cmds[i].filter_ndev = ndev;
+		dev_hold(ndev_work->cmds[i].ndev);
+		dev_hold(ndev_work->cmds[i].filter_ndev);
+	}
 	INIT_WORK(&ndev_work->work, netdevice_event_work_handler);
 
 	queue_work(ib_wq, &ndev_work->work);
@@ -305,13 +524,45 @@ static int netdevice_queue_work(struct netdev_event_work_cmd *cmds,
 	return NOTIFY_DONE;
 }
 
+static const struct netdev_event_work_cmd add_cmd = {
+	.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
+static const struct netdev_event_work_cmd add_cmd_upper_ips = {
+	.cb = add_netdev_upper_ips, .filter = is_eth_port_of_netdev};
+
+static void netdevice_event_changeupper(struct netdev_changeupper_info *changeupper_info,
+					struct netdev_event_work_cmd *cmds)
+{
+	static const struct netdev_event_work_cmd upper_ips_del_cmd = {
+		.cb = del_netdev_upper_ips, .filter = upper_device_filter};
+	static const struct netdev_event_work_cmd bonding_default_del_cmd = {
+		.cb = del_netdev_default_ips, .filter = is_eth_port_inactive_slave};
+
+	if (changeupper_info->event ==
+	    NETDEV_CHANGEUPPER_UNLINK) {
+		cmds[0] = upper_ips_del_cmd;
+		cmds[0].ndev = changeupper_info->upper;
+		cmds[1] = add_cmd;
+	} else if (changeupper_info->event ==
+		   NETDEV_CHANGEUPPER_LINK) {
+		cmds[0] = bonding_default_del_cmd;
+		cmds[0].ndev = changeupper_info->upper;
+		cmds[1] = add_cmd_upper_ips;
+		cmds[1].ndev = changeupper_info->upper;
+		cmds[1].filter_ndev = changeupper_info->upper;
+	}
+}
+
 static int netdevice_event(struct notifier_block *this, unsigned long event,
 			   void *ptr)
 {
-	static const struct netdev_event_work_cmd add_cmd = {
-		.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
 	static const struct netdev_event_work_cmd del_cmd = {
 		.cb = del_netdev_ips, .filter = pass_all_filter};
+	static const struct netdev_event_work_cmd bonding_default_del_cmd_join = {
+		.cb = del_netdev_default_ips_join, .filter = is_eth_port_inactive_slave};
+	static const struct netdev_event_work_cmd default_del_cmd = {
+		.cb = del_netdev_default_ips, .filter = pass_all_filter};
+	static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
+		.cb = del_netdev_upper_ips, .filter = upper_device_filter};
 	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
 	struct netdev_event_work_cmd cmds[ROCE_NETDEV_CALLBACK_SZ] = { {NULL} };
 
@@ -321,7 +572,8 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 	switch (event) {
 	case NETDEV_REGISTER:
 	case NETDEV_UP:
-		cmds[0] = add_cmd;
+		cmds[0] = bonding_default_del_cmd_join;
+		cmds[1] = add_cmd;
 		break;
 
 	case NETDEV_UNREGISTER:
@@ -332,9 +584,22 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEADDR:
-		cmds[0] = del_cmd;
+		cmds[0] = default_del_cmd;
 		cmds[1] = add_cmd;
 		break;
+
+	case NETDEV_CHANGEUPPER:
+		netdevice_event_changeupper(
+			container_of(ptr, struct netdev_changeupper_info, info),
+			cmds);
+		break;
+
+	case NETDEV_BONDING_FAILOVER:
+		cmds[0] = bonding_event_ips_del_cmd;
+		cmds[1] = bonding_default_del_cmd_join;
+		cmds[2] = add_cmd_upper_ips;
+		break;
+
 	default:
 		return NOTIFY_DONE;
 	}
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 07/10] net/mlx4: Postpone the registration of net_device
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-07-30 15:33   ` [PATCH for-next V7 06/10] IB/core: Add RoCE table bonding support Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 08/10] IB/mlx4: Implement ib_device callbacks Matan Barak
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The mlx4 network driver was registered in the context of the 'add'
function of the core driver (called when HW should be registered).
This makes the netdev event NETDEV_REGISTER to be sent in a context
where the answer to get_protocol_dev() callback returns NULL. This may
be confusing to listeners of netdev events.
This patch is a preparation to the patch that implements the
get_netdev() callback in the IB/mlx4 driver.

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c | 36 ++++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/intf.c    |  3 +++
 include/linux/mlx4/driver.h                  |  1 +
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 913b716..a946e4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -224,6 +224,26 @@ static void mlx4_en_remove(struct mlx4_dev *dev, void *endev_ptr)
 	kfree(mdev);
 }
 
+static void mlx4_en_activate(struct mlx4_dev *dev, void *ctx)
+{
+	int i;
+	struct mlx4_en_dev *mdev = ctx;
+
+	/* Create a netdev for each port */
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
+		mlx4_info(mdev, "Activating port:%d\n", i);
+		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
+			mdev->pndev[i] = NULL;
+	}
+
+	/* register notifier */
+	mdev->nb.notifier_call = mlx4_en_netdev_event;
+	if (register_netdevice_notifier(&mdev->nb)) {
+		mdev->nb.notifier_call = NULL;
+		mlx4_err(mdev, "Failed to create notifier\n");
+	}
+}
+
 static void *mlx4_en_add(struct mlx4_dev *dev)
 {
 	struct mlx4_en_dev *mdev;
@@ -297,21 +317,6 @@ static void *mlx4_en_add(struct mlx4_dev *dev)
 	mutex_init(&mdev->state_lock);
 	mdev->device_up = true;
 
-	/* Setup ports */
-
-	/* Create a netdev for each port */
-	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
-		mlx4_info(mdev, "Activating port:%d\n", i);
-		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
-			mdev->pndev[i] = NULL;
-	}
-	/* register notifier */
-	mdev->nb.notifier_call = mlx4_en_netdev_event;
-	if (register_netdevice_notifier(&mdev->nb)) {
-		mdev->nb.notifier_call = NULL;
-		mlx4_err(mdev, "Failed to create notifier\n");
-	}
-
 	return mdev;
 
 err_mr:
@@ -335,6 +340,7 @@ static struct mlx4_interface mlx4_en_interface = {
 	.event		= mlx4_en_event,
 	.get_dev	= mlx4_en_get_netdev,
 	.protocol	= MLX4_PROT_ETH,
+	.activate	= mlx4_en_activate,
 };
 
 static void mlx4_en_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 0d80aed..0472941 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -63,8 +63,11 @@ static void mlx4_add_device(struct mlx4_interface *intf, struct mlx4_priv *priv)
 		spin_lock_irq(&priv->ctx_lock);
 		list_add_tail(&dev_ctx->list, &priv->ctx_list);
 		spin_unlock_irq(&priv->ctx_lock);
+		if (intf->activate)
+			intf->activate(&priv->dev, dev_ctx->context);
 	} else
 		kfree(dev_ctx);
+
 }
 
 static void mlx4_remove_device(struct mlx4_interface *intf, struct mlx4_priv *priv)
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 9553a73..5a06d96 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -59,6 +59,7 @@ struct mlx4_interface {
 	void			(*event) (struct mlx4_dev *dev, void *context,
 					  enum mlx4_dev_event event, unsigned long param);
 	void *			(*get_dev)(struct mlx4_dev *dev, void *context, u8 port);
+	void			(*activate)(struct mlx4_dev *dev, void *context);
 	struct list_head	list;
 	enum mlx4_protocol	protocol;
 	int			flags;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 08/10] IB/mlx4: Implement ib_device callbacks
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-07-30 15:33   ` [PATCH for-next V7 07/10] net/mlx4: Postpone the registration of net_device Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 09/10] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 10/10] RDMA/ocrdma: Incorporate the moving of GID Table mgmt to IB/Core Matan Barak
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.

modify_gid: note for a change in the RoCE gid cache. Handle this by writing to
the harsware GID table. It is possible that indexes in cahce and hardware tables
won't match so a translation is required when modifying a QP or creating an
address handle.

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cache.c      |   3 +-
 drivers/infiniband/hw/mlx4/main.c    | 236 ++++++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  17 +++
 include/linux/mlx4/device.h          |   3 +-
 include/rdma/ib_verbs.h              |   2 +
 5 files changed, 257 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 9b4f16b..a9d5c70 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -55,7 +55,8 @@ struct ib_update_work {
 	u8                 port_num;
 };
 
-static union ib_gid zgid;
+union ib_gid zgid;
+EXPORT_SYMBOL(zgid);
 
 static const struct ib_gid_attr zattr;
 
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 2f81723..61df5c9 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -45,6 +45,9 @@
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
+
+#include <net/bonding.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/cmd.h>
@@ -93,8 +96,6 @@ static void init_query_mad(struct ib_smp *mad)
 	mad->method	   = IB_MGMT_METHOD_GET;
 }
 
-static union ib_gid zgid;
-
 static int check_flow_steering_support(struct mlx4_dev *dev)
 {
 	int eth_num_ports = 0;
@@ -131,6 +132,237 @@ static int num_ib_ports(struct mlx4_dev *dev)
 	return ib_ports;
 }
 
+static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 port_num)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct net_device *dev;
+
+	rcu_read_lock();
+	dev = mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port_num);
+
+	if (dev) {
+		if (mlx4_is_bonded(ibdev->dev)) {
+			struct net_device *upper = NULL;
+
+			upper = netdev_master_upper_dev_get_rcu(dev);
+			if (upper) {
+				struct net_device *active;
+
+				active = bond_option_active_slave_get_rcu(netdev_priv(upper));
+				if (active)
+					dev = active;
+			}
+		}
+	}
+	if (dev)
+		dev_hold(dev);
+
+	rcu_read_unlock();
+	return dev;
+}
+
+static int mlx4_ib_update_gids(struct gid_entry *gids,
+			       struct mlx4_ib_dev *ibdev,
+			       u8 port_num)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	int err;
+	struct mlx4_dev *dev = ibdev->dev;
+	int i;
+	union ib_gid *gid_tbl;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return -ENOMEM;
+
+	gid_tbl = mailbox->buf;
+
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
+		memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
+
+	err = mlx4_cmd(dev, mailbox->dma,
+		       MLX4_SET_PORT_GID_TABLE << 8 | port_num,
+		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+		       MLX4_CMD_WRAPPED);
+	if (mlx4_is_bonded(dev))
+		err += mlx4_cmd(dev, mailbox->dma,
+				MLX4_SET_PORT_GID_TABLE << 8 | 2,
+				1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+				MLX4_CMD_WRAPPED);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+
+static int mlx4_ib_add_gid(struct ib_device *device,
+			   u8 port_num,
+			   unsigned int index,
+			   const union ib_gid *gid,
+			   const struct ib_gid_attr *attr,
+			   void **context)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int free = -1, found = -1;
+	int ret = 0;
+	int hw_update = 0;
+	int i;
+	struct gid_entry *gids = NULL;
+
+	if (!rdma_cap_roce_gid_table(device, port_num))
+		return -EINVAL;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	if (!context)
+		return -EINVAL;
+
+	port_gid_table = &iboe->gids[port_num - 1];
+	spin_lock_bh(&iboe->lock);
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i) {
+		if (!memcmp(&port_gid_table->gids[i].gid, gid, sizeof(*gid))) {
+			found = i;
+			break;
+		}
+		if (free < 0 && !memcmp(&port_gid_table->gids[i].gid, &zgid, sizeof(*gid)))
+			free = i; /* HW has space */
+	}
+
+	if (found < 0) {
+		if (free < 0) {
+			ret = -ENOSPC;
+		} else {
+			port_gid_table->gids[free].ctx = kmalloc(sizeof(*port_gid_table->gids[free].ctx), GFP_ATOMIC);
+			if (!port_gid_table->gids[free].ctx) {
+				ret = -ENOMEM;
+			} else {
+				*context = port_gid_table->gids[free].ctx;
+				memcpy(&port_gid_table->gids[free].gid, gid, sizeof(*gid));
+				port_gid_table->gids[free].ctx->real_index = free;
+				port_gid_table->gids[free].ctx->refcount = 1;
+				hw_update = 1;
+			}
+		}
+	} else {
+		struct gid_cache_context *ctx = port_gid_table->gids[found].ctx;
+		*context = ctx;
+		ctx->refcount++;
+	}
+	if (!ret && hw_update) {
+		gids = kmalloc(sizeof(*gids) * MLX4_MAX_PORT_GIDS, GFP_ATOMIC);
+		if (!gids) {
+			ret = -ENOMEM;
+		} else {
+			for (i = 0; i < MLX4_MAX_PORT_GIDS; i++)
+				memcpy(&gids[i].gid, &port_gid_table->gids[i].gid, sizeof(union ib_gid));
+		}
+	}
+	spin_unlock_bh(&iboe->lock);
+
+	if (!ret && hw_update) {
+		ret = mlx4_ib_update_gids(gids, ibdev, port_num);
+		kfree(gids);
+	}
+
+	return ret;
+}
+
+static int mlx4_ib_del_gid(struct ib_device *device,
+			   u8 port_num,
+			   unsigned int index,
+			   void **context)
+{
+	struct gid_cache_context *ctx = *context;
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int ret = 0;
+	int hw_update = 0;
+	struct gid_entry *gids = NULL;
+
+	if (!rdma_cap_roce_gid_table(device, port_num))
+		return -EINVAL;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	port_gid_table = &iboe->gids[port_num - 1];
+	spin_lock_bh(&iboe->lock);
+	if (ctx) {
+		ctx->refcount--;
+		if (!ctx->refcount) {
+			unsigned int real_index = ctx->real_index;
+
+			memcpy(&port_gid_table->gids[real_index].gid, &zgid, sizeof(zgid));
+			kfree(port_gid_table->gids[real_index].ctx);
+			port_gid_table->gids[real_index].ctx = NULL;
+			hw_update = 1;
+		}
+	}
+	if (!ret && hw_update) {
+		int i;
+
+		gids = kmalloc(sizeof(*gids) * MLX4_MAX_PORT_GIDS, GFP_ATOMIC);
+		if (!gids) {
+			ret = -ENOMEM;
+		} else {
+			for (i = 0; i < MLX4_MAX_PORT_GIDS; i++)
+				memcpy(&gids[i].gid, &port_gid_table->gids[i].gid, sizeof(union ib_gid));
+		}
+	}
+	spin_unlock_bh(&iboe->lock);
+
+	if (!ret && hw_update) {
+		ret = mlx4_ib_update_gids(gids, ibdev, port_num);
+		kfree(gids);
+	}
+	return ret;
+}
+
+int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
+				    u8 port_num, int index)
+{
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct gid_cache_context *ctx = NULL;
+	union ib_gid gid;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int real_index = -EINVAL;
+	int i;
+	int ret;
+	unsigned long flags;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	if (mlx4_is_bonded(ibdev->dev))
+		port_num = 1;
+
+	if (!rdma_cap_roce_gid_table(&ibdev->ib_dev, port_num))
+		return index;
+
+	ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid);
+	if (ret)
+		return ret;
+
+	if (!memcmp(&gid, &zgid, sizeof(gid)))
+		return -EINVAL;
+
+	spin_lock_irqsave(&iboe->lock, flags);
+	port_gid_table = &iboe->gids[port_num - 1];
+
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
+		if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid))) {
+			ctx = port_gid_table->gids[i].ctx;
+			break;
+		}
+	if (ctx)
+		real_index = ctx->real_index;
+	spin_unlock_irqrestore(&iboe->lock, flags);
+	return real_index;
+}
+
 static int mlx4_ib_query_device(struct ib_device *ibdev,
 				struct ib_device_attr *props,
 				struct ib_udata *uhw)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 4eb7a85..4e91ca5 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -470,6 +470,20 @@ struct mlx4_ib_sriov {
 	struct idr pv_id_table;
 };
 
+struct gid_cache_context {
+	int real_index;
+	int refcount;
+};
+
+struct gid_entry {
+	union ib_gid	gid;
+	struct gid_cache_context *ctx;
+};
+
+struct mlx4_port_gid_table {
+	struct gid_entry gids[MLX4_MAX_PORT_GIDS];
+};
+
 struct mlx4_ib_iboe {
 	spinlock_t		lock;
 	struct net_device      *netdevs[MLX4_MAX_PORTS];
@@ -479,6 +493,7 @@ struct mlx4_ib_iboe {
 	struct notifier_block	nb_inet;
 	struct notifier_block	nb_inet6;
 	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
+	struct mlx4_port_gid_table gids[MLX4_MAX_PORTS];
 };
 
 struct pkey_mgt {
@@ -851,5 +866,7 @@ int mlx4_ib_rereg_user_mr(struct ib_mr *mr, int flags,
 			  u64 start, u64 length, u64 virt_addr,
 			  int mr_access_flags, struct ib_pd *pd,
 			  struct ib_udata *udata);
+int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
+				    u8 port_num, int index);
 
 #endif /* MLX4_IB_H */
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index fd13c1c..5c7687a 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -79,7 +79,8 @@ enum {
 
 enum {
 	MLX4_MAX_PORTS		= 2,
-	MLX4_MAX_PORT_PKEYS	= 128
+	MLX4_MAX_PORT_PKEYS	= 128,
+	MLX4_MAX_PORT_GIDS	= 128
 };
 
 /* base qkey for use in sriov tunnel-qp/proxy-qp communication.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 74fc94c..c856276 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -64,6 +64,8 @@ union ib_gid {
 	} global;
 };
 
+extern union ib_gid zgid;
+
 struct ib_gid_attr {
 	struct net_device	*ndev;
 };
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 09/10] IB/mlx4: Replace mechanism for RoCE GID management
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-07-30 15:33   ` [PATCH for-next V7 08/10] IB/mlx4: Implement ib_device callbacks Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  2015-07-30 15:33   ` [PATCH for-next V7 10/10] RDMA/ocrdma: Incorporate the moving of GID Table mgmt to IB/Core Matan Barak
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Manage RoCE gid table with logic in IB/core, which is common to all
vendors, and remove the mechanism from the mlx4 IB driver.
Since management of the GID cache may lead to index mismatch with the
hardware GID table, a translation between indexes is required when
modifying a QP or creating an address handle.

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/ah.c      |   2 +-
 drivers/infiniband/hw/mlx4/main.c    | 513 ++---------------------------------
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   4 -
 drivers/infiniband/hw/mlx4/qp.c      |  10 +-
 4 files changed, 35 insertions(+), 494 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index f50a546..7ad6f96 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -89,7 +89,7 @@ static struct ib_ah *create_iboe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr
 	if (vlan_tag < 0x1000)
 		vlan_tag |= (ah_attr->sl & 7) << 13;
 	ah->av.eth.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
-	ah->av.eth.gid_index = ah_attr->grh.sgid_index;
+	ah->av.eth.gid_index = mlx4_ib_gid_index_to_real_index(ibdev, ah_attr->port_num, ah_attr->grh.sgid_index);
 	ah->av.eth.vlan = cpu_to_be16(vlan_tag);
 	if (ah_attr->static_rate) {
 		ah->av.eth.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 61df5c9..35a6f85 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -77,13 +77,6 @@ static const char mlx4_ib_version[] =
 	DRV_NAME ": Mellanox ConnectX InfiniBand driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
-struct update_gid_work {
-	struct work_struct	work;
-	union ib_gid		gids[128];
-	struct mlx4_ib_dev     *dev;
-	int			port;
-};
-
 static void do_slave_init(struct mlx4_ib_dev *ibdev, int slave, int do_init);
 
 static struct workqueue_struct *wq;
@@ -647,12 +640,13 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
 	props->state		= IB_PORT_DOWN;
 	props->phys_state	= state_to_phys_state(props->state);
 	props->active_mtu	= IB_MTU_256;
-	if (is_bonded)
-		rtnl_lock(); /* required to get upper dev */
 	spin_lock_bh(&iboe->lock);
 	ndev = iboe->netdevs[port - 1];
-	if (ndev && is_bonded)
-		ndev = netdev_master_upper_dev_get(ndev);
+	if (ndev && is_bonded) {
+		rcu_read_lock(); /* required to get upper dev */
+		ndev = netdev_master_upper_dev_get_rcu(ndev);
+		rcu_read_unlock();
+	}
 	if (!ndev)
 		goto out_unlock;
 
@@ -664,8 +658,6 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
 	props->phys_state	= state_to_phys_state(props->state);
 out_unlock:
 	spin_unlock_bh(&iboe->lock);
-	if (is_bonded)
-		rtnl_unlock();
 out:
 	mlx4_free_cmd_mailbox(mdev->dev, mailbox);
 	return err;
@@ -748,23 +740,27 @@ out:
 	return err;
 }
 
-static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
-			  union ib_gid *gid)
-{
-	struct mlx4_ib_dev *dev = to_mdev(ibdev);
-
-	*gid = dev->iboe.gid_table[port - 1][index];
-
-	return 0;
-}
-
 static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
 			     union ib_gid *gid)
 {
-	if (rdma_port_get_link_layer(ibdev, port) == IB_LINK_LAYER_INFINIBAND)
+	int ret;
+
+	if (rdma_protocol_ib(ibdev, port))
 		return __mlx4_ib_query_gid(ibdev, port, index, gid, 0);
-	else
-		return iboe_query_gid(ibdev, port, index, gid);
+
+	if (!rdma_protocol_roce(ibdev, port))
+		return -ENODEV;
+
+	if (!rdma_cap_roce_gid_table(ibdev, port))
+		return -ENODEV;
+
+	ret = ib_get_cached_gid(ibdev, port, index, gid);
+	if (ret == -EAGAIN) {
+		memcpy(gid, &zgid, sizeof(*gid));
+		return 0;
+	}
+
+	return ret;
 }
 
 int __mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
@@ -1914,272 +1910,6 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
-static void mlx4_addrconf_ifid_eui48(u8 *eui, u16 vlan_id,
-				     struct net_device *dev)
-{
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-	if (vlan_id < 0x1000) {
-		eui[3] = vlan_id >> 8;
-		eui[4] = vlan_id & 0xff;
-	} else {
-		eui[3] = 0xff;
-		eui[4] = 0xfe;
-	}
-	eui[0] ^= 2;
-}
-
-static void update_gids_task(struct work_struct *work)
-{
-	struct update_gid_work *gw = container_of(work, struct update_gid_work, work);
-	struct mlx4_cmd_mailbox *mailbox;
-	union ib_gid *gids;
-	int err;
-	struct mlx4_dev	*dev = gw->dev->dev;
-	int is_bonded = mlx4_is_bonded(dev);
-
-	if (!gw->dev->ib_active)
-		return;
-
-	mailbox = mlx4_alloc_cmd_mailbox(dev);
-	if (IS_ERR(mailbox)) {
-		pr_warn("update gid table failed %ld\n", PTR_ERR(mailbox));
-		return;
-	}
-
-	gids = mailbox->buf;
-	memcpy(gids, gw->gids, sizeof gw->gids);
-
-	err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
-		       MLX4_SET_PORT_ETH_OPCODE, MLX4_CMD_SET_PORT,
-		       MLX4_CMD_TIME_CLASS_B, MLX4_CMD_WRAPPED);
-	if (err)
-		pr_warn("set port command failed\n");
-	else
-		if ((gw->port == 1) || !is_bonded)
-			mlx4_ib_dispatch_event(gw->dev,
-					       is_bonded ? 1 : gw->port,
-					       IB_EVENT_GID_CHANGE);
-
-	mlx4_free_cmd_mailbox(dev, mailbox);
-	kfree(gw);
-}
-
-static void reset_gids_task(struct work_struct *work)
-{
-	struct update_gid_work *gw =
-			container_of(work, struct update_gid_work, work);
-	struct mlx4_cmd_mailbox *mailbox;
-	union ib_gid *gids;
-	int err;
-	struct mlx4_dev	*dev = gw->dev->dev;
-
-	if (!gw->dev->ib_active)
-		return;
-
-	mailbox = mlx4_alloc_cmd_mailbox(dev);
-	if (IS_ERR(mailbox)) {
-		pr_warn("reset gid table failed\n");
-		goto free;
-	}
-
-	gids = mailbox->buf;
-	memcpy(gids, gw->gids, sizeof(gw->gids));
-
-	if (mlx4_ib_port_link_layer(&gw->dev->ib_dev, gw->port) ==
-				    IB_LINK_LAYER_ETHERNET) {
-		err = mlx4_cmd(dev, mailbox->dma,
-			       MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
-			       MLX4_SET_PORT_ETH_OPCODE, MLX4_CMD_SET_PORT,
-			       MLX4_CMD_TIME_CLASS_B,
-			       MLX4_CMD_WRAPPED);
-		if (err)
-			pr_warn("set port %d command failed\n", gw->port);
-	}
-
-	mlx4_free_cmd_mailbox(dev, mailbox);
-free:
-	kfree(gw);
-}
-
-static int update_gid_table(struct mlx4_ib_dev *dev, int port,
-			    union ib_gid *gid, int clear,
-			    int default_gid)
-{
-	struct update_gid_work *work;
-	int i;
-	int need_update = 0;
-	int free = -1;
-	int found = -1;
-	int max_gids;
-
-	if (default_gid) {
-		free = 0;
-	} else {
-		max_gids = dev->dev->caps.gid_table_len[port];
-		for (i = 1; i < max_gids; ++i) {
-			if (!memcmp(&dev->iboe.gid_table[port - 1][i], gid,
-				    sizeof(*gid)))
-				found = i;
-
-			if (clear) {
-				if (found >= 0) {
-					need_update = 1;
-					dev->iboe.gid_table[port - 1][found] =
-						zgid;
-					break;
-				}
-			} else {
-				if (found >= 0)
-					break;
-
-				if (free < 0 &&
-				    !memcmp(&dev->iboe.gid_table[port - 1][i],
-					    &zgid, sizeof(*gid)))
-					free = i;
-			}
-		}
-	}
-
-	if (found == -1 && !clear && free >= 0) {
-		dev->iboe.gid_table[port - 1][free] = *gid;
-		need_update = 1;
-	}
-
-	if (!need_update)
-		return 0;
-
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
-	if (!work)
-		return -ENOMEM;
-
-	memcpy(work->gids, dev->iboe.gid_table[port - 1], sizeof(work->gids));
-	INIT_WORK(&work->work, update_gids_task);
-	work->port = port;
-	work->dev = dev;
-	queue_work(wq, &work->work);
-
-	return 0;
-}
-
-static void mlx4_make_default_gid(struct  net_device *dev, union ib_gid *gid)
-{
-	gid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	mlx4_addrconf_ifid_eui48(&gid->raw[8], 0xffff, dev);
-}
-
-
-static int reset_gid_table(struct mlx4_ib_dev *dev, u8 port)
-{
-	struct update_gid_work *work;
-
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
-	if (!work)
-		return -ENOMEM;
-
-	memset(dev->iboe.gid_table[port - 1], 0, sizeof(work->gids));
-	memset(work->gids, 0, sizeof(work->gids));
-	INIT_WORK(&work->work, reset_gids_task);
-	work->dev = dev;
-	work->port = port;
-	queue_work(wq, &work->work);
-	return 0;
-}
-
-static int mlx4_ib_addr_event(int event, struct net_device *event_netdev,
-			      struct mlx4_ib_dev *ibdev, union ib_gid *gid)
-{
-	struct mlx4_ib_iboe *iboe;
-	int port = 0;
-	struct net_device *real_dev = rdma_vlan_dev_real_dev(event_netdev) ?
-				rdma_vlan_dev_real_dev(event_netdev) :
-				event_netdev;
-	union ib_gid default_gid;
-
-	mlx4_make_default_gid(real_dev, &default_gid);
-
-	if (!memcmp(gid, &default_gid, sizeof(*gid)))
-		return 0;
-
-	if (event != NETDEV_DOWN && event != NETDEV_UP)
-		return 0;
-
-	if ((real_dev != event_netdev) &&
-	    (event == NETDEV_DOWN) &&
-	    rdma_link_local_addr((struct in6_addr *)gid))
-		return 0;
-
-	iboe = &ibdev->iboe;
-	spin_lock_bh(&iboe->lock);
-
-	for (port = 1; port <= ibdev->dev->caps.num_ports; ++port)
-		if ((netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->masters[port - 1])) ||
-		     (!netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->netdevs[port - 1])))
-			update_gid_table(ibdev, port, gid,
-					 event == NETDEV_DOWN, 0);
-
-	spin_unlock_bh(&iboe->lock);
-	return 0;
-
-}
-
-static u8 mlx4_ib_get_dev_port(struct net_device *dev,
-			       struct mlx4_ib_dev *ibdev)
-{
-	u8 port = 0;
-	struct mlx4_ib_iboe *iboe;
-	struct net_device *real_dev = rdma_vlan_dev_real_dev(dev) ?
-				rdma_vlan_dev_real_dev(dev) : dev;
-
-	iboe = &ibdev->iboe;
-
-	for (port = 1; port <= ibdev->dev->caps.num_ports; ++port)
-		if ((netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->masters[port - 1])) ||
-		     (!netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->netdevs[port - 1])))
-			break;
-
-	if ((port == 0) || (port > ibdev->dev->caps.num_ports))
-		return 0;
-	else
-		return port;
-}
-
-static int mlx4_ib_inet_event(struct notifier_block *this, unsigned long event,
-				void *ptr)
-{
-	struct mlx4_ib_dev *ibdev;
-	struct in_ifaddr *ifa = ptr;
-	union ib_gid gid;
-	struct net_device *event_netdev = ifa->ifa_dev->dev;
-
-	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
-
-	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet);
-
-	mlx4_ib_addr_event(event, event_netdev, ibdev, &gid);
-	return NOTIFY_DONE;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-static int mlx4_ib_inet6_event(struct notifier_block *this, unsigned long event,
-				void *ptr)
-{
-	struct mlx4_ib_dev *ibdev;
-	struct inet6_ifaddr *ifa = ptr;
-	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
-	struct net_device *event_netdev = ifa->idev->dev;
-
-	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet6);
-
-	mlx4_ib_addr_event(event, event_netdev, ibdev, gid);
-	return NOTIFY_DONE;
-}
-#endif
-
 #define MLX4_IB_INVALID_MAC	((u64)-1)
 static void mlx4_ib_update_qps(struct mlx4_ib_dev *ibdev,
 			       struct net_device *dev,
@@ -2238,94 +1968,6 @@ unlock:
 	mutex_unlock(&ibdev->qp1_proxy_lock[port - 1]);
 }
 
-static void mlx4_ib_get_dev_addr(struct net_device *dev,
-				 struct mlx4_ib_dev *ibdev, u8 port)
-{
-	struct in_device *in_dev;
-#if IS_ENABLED(CONFIG_IPV6)
-	struct inet6_dev *in6_dev;
-	union ib_gid  *pgid;
-	struct inet6_ifaddr *ifp;
-	union ib_gid default_gid;
-#endif
-	union ib_gid gid;
-
-
-	if ((port == 0) || (port > ibdev->dev->caps.num_ports))
-		return;
-
-	/* IPv4 gids */
-	in_dev = in_dev_get(dev);
-	if (in_dev) {
-		for_ifa(in_dev) {
-			/*ifa->ifa_address;*/
-			ipv6_addr_set_v4mapped(ifa->ifa_address,
-					       (struct in6_addr *)&gid);
-			update_gid_table(ibdev, port, &gid, 0, 0);
-		}
-		endfor_ifa(in_dev);
-		in_dev_put(in_dev);
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	mlx4_make_default_gid(dev, &default_gid);
-	/* IPv6 gids */
-	in6_dev = in6_dev_get(dev);
-	if (in6_dev) {
-		read_lock_bh(&in6_dev->lock);
-		list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
-			pgid = (union ib_gid *)&ifp->addr;
-			if (!memcmp(pgid, &default_gid, sizeof(*pgid)))
-				continue;
-			update_gid_table(ibdev, port, pgid, 0, 0);
-		}
-		read_unlock_bh(&in6_dev->lock);
-		in6_dev_put(in6_dev);
-	}
-#endif
-}
-
-static void mlx4_ib_set_default_gid(struct mlx4_ib_dev *ibdev,
-				 struct  net_device *dev, u8 port)
-{
-	union ib_gid gid;
-	mlx4_make_default_gid(dev, &gid);
-	update_gid_table(ibdev, port, &gid, 0, 1);
-}
-
-static int mlx4_ib_init_gid_table(struct mlx4_ib_dev *ibdev)
-{
-	struct	net_device *dev;
-	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
-	int i;
-	int err = 0;
-
-	for (i = 1; i <= ibdev->num_ports; ++i) {
-		if (rdma_port_get_link_layer(&ibdev->ib_dev, i) ==
-		    IB_LINK_LAYER_ETHERNET) {
-			err = reset_gid_table(ibdev, i);
-			if (err)
-				goto out;
-		}
-	}
-
-	read_lock(&dev_base_lock);
-	spin_lock_bh(&iboe->lock);
-
-	for_each_netdev(&init_net, dev) {
-		u8 port = mlx4_ib_get_dev_port(dev, ibdev);
-		/* port will be non-zero only for ETH ports */
-		if (port) {
-			mlx4_ib_set_default_gid(ibdev, dev, port);
-			mlx4_ib_get_dev_addr(dev, ibdev, port);
-		}
-	}
-
-	spin_unlock_bh(&iboe->lock);
-	read_unlock(&dev_base_lock);
-out:
-	return err;
-}
-
 static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
 				 struct net_device *dev,
 				 unsigned long event)
@@ -2335,81 +1977,22 @@ static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
 	int update_qps_port = -1;
 	int port;
 
+	ASSERT_RTNL();
+
 	iboe = &ibdev->iboe;
 
 	spin_lock_bh(&iboe->lock);
 	mlx4_foreach_ib_transport_port(port, ibdev->dev) {
-		enum ib_port_state	port_state = IB_PORT_NOP;
-		struct net_device *old_master = iboe->masters[port - 1];
-		struct net_device *curr_netdev;
-		struct net_device *curr_master;
 
 		iboe->netdevs[port - 1] =
 			mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port);
-		if (iboe->netdevs[port - 1])
-			mlx4_ib_set_default_gid(ibdev,
-						iboe->netdevs[port - 1], port);
-		curr_netdev = iboe->netdevs[port - 1];
-
-		if (iboe->netdevs[port - 1] &&
-		    netif_is_bond_slave(iboe->netdevs[port - 1])) {
-			iboe->masters[port - 1] = netdev_master_upper_dev_get(
-				iboe->netdevs[port - 1]);
-		} else {
-			iboe->masters[port - 1] = NULL;
-		}
-		curr_master = iboe->masters[port - 1];
 
 		if (dev == iboe->netdevs[port - 1] &&
 		    (event == NETDEV_CHANGEADDR || event == NETDEV_REGISTER ||
 		     event == NETDEV_UP || event == NETDEV_CHANGE))
 			update_qps_port = port;
 
-		if (curr_netdev) {
-			port_state = (netif_running(curr_netdev) && netif_carrier_ok(curr_netdev)) ?
-						IB_PORT_ACTIVE : IB_PORT_DOWN;
-			mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
-			if (curr_master) {
-				/* if using bonding/team and a slave port is down, we
-				 * don't want the bond IP based gids in the table since
-				 * flows that select port by gid may get the down port.
-				*/
-				if (port_state == IB_PORT_DOWN &&
-				    !mlx4_is_bonded(ibdev->dev)) {
-					reset_gid_table(ibdev, port);
-					mlx4_ib_set_default_gid(ibdev,
-								curr_netdev,
-								port);
-				} else {
-					/* gids from the upper dev (bond/team)
-					 * should appear in port's gid table
-					*/
-					mlx4_ib_get_dev_addr(curr_master,
-							     ibdev, port);
-				}
-			}
-			/* if bonding is used it is possible that we add it to
-			 * masters only after IP address is assigned to the
-			 * net bonding interface.
-			*/
-			if (curr_master && (old_master != curr_master)) {
-				reset_gid_table(ibdev, port);
-				mlx4_ib_set_default_gid(ibdev,
-							curr_netdev, port);
-				mlx4_ib_get_dev_addr(curr_master, ibdev, port);
-			}
-
-			if (!curr_master && (old_master != curr_master)) {
-				reset_gid_table(ibdev, port);
-				mlx4_ib_set_default_gid(ibdev,
-							curr_netdev, port);
-				mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
-			}
-		} else {
-			reset_gid_table(ibdev, port);
-		}
 	}
-
 	spin_unlock_bh(&iboe->lock);
 
 	if (update_qps_port > 0)
@@ -2592,6 +2175,9 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 						1 : ibdev->num_ports;
 	ibdev->ib_dev.num_comp_vectors	= dev->caps.num_comp_vectors;
 	ibdev->ib_dev.dma_device	= &dev->persist->pdev->dev;
+	ibdev->ib_dev.get_netdev	= mlx4_ib_get_netdev;
+	ibdev->ib_dev.add_gid		= mlx4_ib_add_gid;
+	ibdev->ib_dev.del_gid		= mlx4_ib_del_gid;
 
 	if (dev->caps.userspace_caps)
 		ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION;
@@ -2803,26 +2389,6 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 				goto err_notif;
 			}
 		}
-		if (!iboe->nb_inet.notifier_call) {
-			iboe->nb_inet.notifier_call = mlx4_ib_inet_event;
-			err = register_inetaddr_notifier(&iboe->nb_inet);
-			if (err) {
-				iboe->nb_inet.notifier_call = NULL;
-				goto err_notif;
-			}
-		}
-#if IS_ENABLED(CONFIG_IPV6)
-		if (!iboe->nb_inet6.notifier_call) {
-			iboe->nb_inet6.notifier_call = mlx4_ib_inet6_event;
-			err = register_inet6addr_notifier(&iboe->nb_inet6);
-			if (err) {
-				iboe->nb_inet6.notifier_call = NULL;
-				goto err_notif;
-			}
-		}
-#endif
-		if (mlx4_ib_init_gid_table(ibdev))
-			goto err_notif;
 	}
 
 	for (j = 0; j < ARRAY_SIZE(mlx4_class_attributes); ++j) {
@@ -2853,18 +2419,6 @@ err_notif:
 			pr_warn("failure unregistering notifier\n");
 		ibdev->iboe.nb.notifier_call = NULL;
 	}
-	if (ibdev->iboe.nb_inet.notifier_call) {
-		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet.notifier_call = NULL;
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	if (ibdev->iboe.nb_inet6.notifier_call) {
-		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet6.notifier_call = NULL;
-	}
-#endif
 	flush_workqueue(wq);
 
 	mlx4_ib_close_sriov(ibdev);
@@ -2990,19 +2544,6 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 		kfree(ibdev->ib_uc_qpns_bitmap);
 	}
 
-	if (ibdev->iboe.nb_inet.notifier_call) {
-		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet.notifier_call = NULL;
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	if (ibdev->iboe.nb_inet6.notifier_call) {
-		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet6.notifier_call = NULL;
-	}
-#endif
-
 	iounmap(ibdev->uar_map);
 	for (p = 0; p < ibdev->num_ports; ++p)
 		if (ibdev->counters[p].index != -1 &&
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 4e91ca5..1ee14e9 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -487,12 +487,8 @@ struct mlx4_port_gid_table {
 struct mlx4_ib_iboe {
 	spinlock_t		lock;
 	struct net_device      *netdevs[MLX4_MAX_PORTS];
-	struct net_device      *masters[MLX4_MAX_PORTS];
 	atomic64_t		mac[MLX4_MAX_PORTS];
 	struct notifier_block 	nb;
-	struct notifier_block	nb_inet;
-	struct notifier_block	nb_inet6;
-	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
 	struct mlx4_port_gid_table gids[MLX4_MAX_PORTS];
 };
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index c5a3a5f..4ad9be3 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1292,14 +1292,18 @@ static int _mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 		path->static_rate = 0;
 
 	if (ah->ah_flags & IB_AH_GRH) {
-		if (ah->grh.sgid_index >= dev->dev->caps.gid_table_len[port]) {
+		int real_sgid_index = mlx4_ib_gid_index_to_real_index(dev,
+								      port,
+								      ah->grh.sgid_index);
+
+		if (real_sgid_index >= dev->dev->caps.gid_table_len[port]) {
 			pr_err("sgid_index (%u) too large. max is %d\n",
-			       ah->grh.sgid_index, dev->dev->caps.gid_table_len[port] - 1);
+			       real_sgid_index, dev->dev->caps.gid_table_len[port] - 1);
 			return -1;
 		}
 
 		path->grh_mylmc |= 1 << 7;
-		path->mgid_index = ah->grh.sgid_index;
+		path->mgid_index = real_sgid_index;
 		path->hop_limit  = ah->grh.hop_limit;
 		path->tclass_flowlabel =
 			cpu_to_be32((ah->grh.traffic_class << 20) |
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next V7 10/10] RDMA/ocrdma: Incorporate the moving of GID Table mgmt to IB/Core
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-07-30 15:33   ` [PATCH for-next V7 09/10] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
@ 2015-07-30 15:33   ` Matan Barak
  7 siblings, 0 replies; 23+ messages in thread
From: Matan Barak @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sean Hefty, Somnath Kotur,
	Moni Shoua, talal-VPRAkNaXOzVWk0Htik3J/w,
	haggaie-VPRAkNaXOzVWk0Htik3J/w, Somnath Kotur, Devesh Sharma

From: Somnath Kotur <somnath.kotur-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org>

1.Change query_gid hook to return value from IB/Core GID
  management APIs.
2.Get rid of all the netdev notifier chain subscription code as well
  as maintenance of SGID Table in memory.
3.Implement get_netdev hook in driver.

Signed-off-by: Somnath Kotur <somnath.kotur-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org>
Signed-off-by: Devesh Sharma <devesh.sharma-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org>
---
 drivers/infiniband/hw/ocrdma/ocrdma.h       |   1 -
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  | 234 +---------------------------
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |   2 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |  45 +++++-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |  11 ++
 5 files changed, 60 insertions(+), 233 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h
index b396344..10ae1b6 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -231,7 +231,6 @@ struct ocrdma_dev {
 	u16 base_eqid;
 	u16 max_eq;
 
-	union ib_gid *sgid_tbl;
 	/* provided synchronization to sgid table for
 	 * updating gid entries triggered by notifier.
 	 */
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index d98a707..5e59f81 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -52,8 +52,6 @@ static LIST_HEAD(ocrdma_dev_list);
 static DEFINE_SPINLOCK(ocrdma_devlist_lock);
 static DEFINE_IDR(ocrdma_dev_id);
 
-static union ib_gid ocrdma_zero_sgid;
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -68,135 +66,6 @@ void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 	guid[6] = mac_addr[4];
 	guid[7] = mac_addr[5];
 }
-
-static bool ocrdma_add_sgid(struct ocrdma_dev *dev, union ib_gid *new_sgid)
-{
-	int i;
-	unsigned long flags;
-
-	memset(&ocrdma_zero_sgid, 0, sizeof(union ib_gid));
-
-
-	spin_lock_irqsave(&dev->sgid_lock, flags);
-	for (i = 0; i < OCRDMA_MAX_SGID; i++) {
-		if (!memcmp(&dev->sgid_tbl[i], &ocrdma_zero_sgid,
-			    sizeof(union ib_gid))) {
-			/* found free entry */
-			memcpy(&dev->sgid_tbl[i], new_sgid,
-			       sizeof(union ib_gid));
-			spin_unlock_irqrestore(&dev->sgid_lock, flags);
-			return true;
-		} else if (!memcmp(&dev->sgid_tbl[i], new_sgid,
-				   sizeof(union ib_gid))) {
-			/* entry already present, no addition is required. */
-			spin_unlock_irqrestore(&dev->sgid_lock, flags);
-			return false;
-		}
-	}
-	spin_unlock_irqrestore(&dev->sgid_lock, flags);
-	return false;
-}
-
-static bool ocrdma_del_sgid(struct ocrdma_dev *dev, union ib_gid *sgid)
-{
-	int found = false;
-	int i;
-	unsigned long flags;
-
-
-	spin_lock_irqsave(&dev->sgid_lock, flags);
-	/* first is default sgid, which cannot be deleted. */
-	for (i = 1; i < OCRDMA_MAX_SGID; i++) {
-		if (!memcmp(&dev->sgid_tbl[i], sgid, sizeof(union ib_gid))) {
-			/* found matching entry */
-			memset(&dev->sgid_tbl[i], 0, sizeof(union ib_gid));
-			found = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&dev->sgid_lock, flags);
-	return found;
-}
-
-static int ocrdma_addr_event(unsigned long event, struct net_device *netdev,
-			     union ib_gid *gid)
-{
-	struct ib_event gid_event;
-	struct ocrdma_dev *dev;
-	bool found = false;
-	bool updated = false;
-	bool is_vlan = false;
-
-	is_vlan = netdev->priv_flags & IFF_802_1Q_VLAN;
-	if (is_vlan)
-		netdev = rdma_vlan_dev_real_dev(netdev);
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(dev, &ocrdma_dev_list, entry) {
-		if (dev->nic_info.netdev == netdev) {
-			found = true;
-			break;
-		}
-	}
-	rcu_read_unlock();
-
-	if (!found)
-		return NOTIFY_DONE;
-
-	mutex_lock(&dev->dev_lock);
-	switch (event) {
-	case NETDEV_UP:
-		updated = ocrdma_add_sgid(dev, gid);
-		break;
-	case NETDEV_DOWN:
-		updated = ocrdma_del_sgid(dev, gid);
-		break;
-	default:
-		break;
-	}
-	if (updated) {
-		/* GID table updated, notify the consumers about it */
-		gid_event.device = &dev->ibdev;
-		gid_event.element.port_num = 1;
-		gid_event.event = IB_EVENT_GID_CHANGE;
-		ib_dispatch_event(&gid_event);
-	}
-	mutex_unlock(&dev->dev_lock);
-	return NOTIFY_OK;
-}
-
-static int ocrdma_inetaddr_event(struct notifier_block *notifier,
-				  unsigned long event, void *ptr)
-{
-	struct in_ifaddr *ifa = ptr;
-	union ib_gid gid;
-	struct net_device *netdev = ifa->ifa_dev->dev;
-
-	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
-	return ocrdma_addr_event(event, netdev, &gid);
-}
-
-static struct notifier_block ocrdma_inetaddr_notifier = {
-	.notifier_call = ocrdma_inetaddr_event
-};
-
-#if IS_ENABLED(CONFIG_IPV6)
-
-static int ocrdma_inet6addr_event(struct notifier_block *notifier,
-				  unsigned long event, void *ptr)
-{
-	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
-	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
-	struct net_device *netdev = ifa->idev->dev;
-	return ocrdma_addr_event(event, netdev, gid);
-}
-
-static struct notifier_block ocrdma_inet6addr_notifier = {
-	.notifier_call = ocrdma_inet6addr_event
-};
-
-#endif /* IPV6 and VLAN */
-
 static enum rdma_link_layer ocrdma_link_layer(struct ib_device *device,
 					      u8 port_num)
 {
@@ -265,6 +134,9 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
 	dev->ibdev.query_port = ocrdma_query_port;
 	dev->ibdev.modify_port = ocrdma_modify_port;
 	dev->ibdev.query_gid = ocrdma_query_gid;
+	dev->ibdev.get_netdev = ocrdma_get_netdev;
+	dev->ibdev.add_gid = ocrdma_add_gid;
+	dev->ibdev.del_gid = ocrdma_del_gid;
 	dev->ibdev.get_link_layer = ocrdma_link_layer;
 	dev->ibdev.alloc_pd = ocrdma_alloc_pd;
 	dev->ibdev.dealloc_pd = ocrdma_dealloc_pd;
@@ -327,12 +199,6 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
 static int ocrdma_alloc_resources(struct ocrdma_dev *dev)
 {
 	mutex_init(&dev->dev_lock);
-	dev->sgid_tbl = kzalloc(sizeof(union ib_gid) *
-				OCRDMA_MAX_SGID, GFP_KERNEL);
-	if (!dev->sgid_tbl)
-		goto alloc_err;
-	spin_lock_init(&dev->sgid_lock);
-
 	dev->cq_tbl = kzalloc(sizeof(struct ocrdma_cq *) *
 			      OCRDMA_MAX_CQ, GFP_KERNEL);
 	if (!dev->cq_tbl)
@@ -364,7 +230,6 @@ static void ocrdma_free_resources(struct ocrdma_dev *dev)
 	kfree(dev->stag_arr);
 	kfree(dev->qp_tbl);
 	kfree(dev->cq_tbl);
-	kfree(dev->sgid_tbl);
 }
 
 /* OCRDMA sysfs interface */
@@ -410,68 +275,6 @@ static void ocrdma_remove_sysfiles(struct ocrdma_dev *dev)
 		device_remove_file(&dev->ibdev.dev, ocrdma_attributes[i]);
 }
 
-static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
-{
-	/* GID Index 0 - Invariant manufacturer-assigned EUI-64 */
-	union ib_gid *sgid = &dev->sgid_tbl[0];
-
-	sgid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	ocrdma_get_guid(dev, &sgid->raw[8]);
-}
-
-static void ocrdma_init_ipv4_gids(struct ocrdma_dev *dev,
-				  struct net_device *net)
-{
-	struct in_device *in_dev;
-	union ib_gid gid;
-	in_dev = in_dev_get(net);
-	if (in_dev) {
-		for_ifa(in_dev) {
-			ipv6_addr_set_v4mapped(ifa->ifa_address,
-					       (struct in6_addr *)&gid);
-			ocrdma_add_sgid(dev, &gid);
-		}
-		endfor_ifa(in_dev);
-		in_dev_put(in_dev);
-	}
-}
-
-static void ocrdma_init_ipv6_gids(struct ocrdma_dev *dev,
-				  struct net_device *net)
-{
-#if IS_ENABLED(CONFIG_IPV6)
-	struct inet6_dev *in6_dev;
-	union ib_gid  *pgid;
-	struct inet6_ifaddr *ifp;
-	in6_dev = in6_dev_get(net);
-	if (in6_dev) {
-		read_lock_bh(&in6_dev->lock);
-		list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
-			pgid = (union ib_gid *)&ifp->addr;
-			ocrdma_add_sgid(dev, pgid);
-		}
-		read_unlock_bh(&in6_dev->lock);
-		in6_dev_put(in6_dev);
-	}
-#endif
-}
-
-static void ocrdma_init_gid_table(struct ocrdma_dev *dev)
-{
-	struct  net_device *net_dev;
-
-	for_each_netdev(&init_net, net_dev) {
-		struct net_device *real_dev = rdma_vlan_dev_real_dev(net_dev) ?
-				rdma_vlan_dev_real_dev(net_dev) : net_dev;
-
-		if (real_dev == dev->nic_info.netdev) {
-			ocrdma_add_default_sgid(dev);
-			ocrdma_init_ipv4_gids(dev, net_dev);
-			ocrdma_init_ipv6_gids(dev, net_dev);
-		}
-	}
-}
-
 static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 {
 	int status = 0, i;
@@ -500,7 +303,6 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto alloc_err;
 
 	ocrdma_init_service_level(dev);
-	ocrdma_init_gid_table(dev);
 	status = ocrdma_register_device(dev);
 	if (status)
 		goto alloc_err;
@@ -647,34 +449,12 @@ static struct ocrdma_driver ocrdma_drv = {
 	.be_abi_version		= OCRDMA_BE_ROCE_ABI_VERSION,
 };
 
-static void ocrdma_unregister_inet6addr_notifier(void)
-{
-#if IS_ENABLED(CONFIG_IPV6)
-	unregister_inet6addr_notifier(&ocrdma_inet6addr_notifier);
-#endif
-}
-
-static void ocrdma_unregister_inetaddr_notifier(void)
-{
-	unregister_inetaddr_notifier(&ocrdma_inetaddr_notifier);
-}
-
 static int __init ocrdma_init_module(void)
 {
 	int status;
 
 	ocrdma_init_debugfs();
 
-	status = register_inetaddr_notifier(&ocrdma_inetaddr_notifier);
-	if (status)
-		return status;
-
-#if IS_ENABLED(CONFIG_IPV6)
-	status = register_inet6addr_notifier(&ocrdma_inet6addr_notifier);
-	if (status)
-		goto err_notifier6;
-#endif
-
 	status = be_roce_register_driver(&ocrdma_drv);
 	if (status)
 		goto err_be_reg;
@@ -682,19 +462,13 @@ static int __init ocrdma_init_module(void)
 	return 0;
 
 err_be_reg:
-#if IS_ENABLED(CONFIG_IPV6)
-	ocrdma_unregister_inet6addr_notifier();
-err_notifier6:
-#endif
-	ocrdma_unregister_inetaddr_notifier();
+
 	return status;
 }
 
 static void __exit ocrdma_exit_module(void)
 {
 	be_roce_unregister_driver(&ocrdma_drv);
-	ocrdma_unregister_inet6addr_notifier();
-	ocrdma_unregister_inetaddr_notifier();
 	ocrdma_rem_debugfs();
 	idr_destroy(&ocrdma_dev_id);
 }
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
index 02ad0ae..25a3bd4 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -125,6 +125,8 @@ enum {
 	OCRDMA_DB_RQ_SHIFT		= 24
 };
 
+#define OCRDMA_ROUDP_FLAGS_SHIFT	0x03
+
 #define OCRDMA_DB_CQ_RING_ID_MASK       0x3FF	/* bits 0 - 9 */
 #define OCRDMA_DB_CQ_RING_ID_EXT_MASK  0x0C00	/* bits 10-11 of qid at 12-11 */
 /* qid #2 msbits at 12-11 */
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 5bb61eb..4b05333 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -31,6 +31,7 @@
 #include <rdma/iw_cm.h>
 #include <rdma/ib_umem.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
 
 #include "ocrdma.h"
 #include "ocrdma_hw.h"
@@ -49,6 +50,7 @@ int ocrdma_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey)
 int ocrdma_query_gid(struct ib_device *ibdev, u8 port,
 		     int index, union ib_gid *sgid)
 {
+	int ret;
 	struct ocrdma_dev *dev;
 
 	dev = get_ocrdma_dev(ibdev);
@@ -56,8 +58,28 @@ int ocrdma_query_gid(struct ib_device *ibdev, u8 port,
 	if (index >= OCRDMA_MAX_SGID)
 		return -EINVAL;
 
-	memcpy(sgid, &dev->sgid_tbl[index], sizeof(*sgid));
+	ret = ib_get_cached_gid(ibdev, port, index, sgid);
+	if (ret == -EAGAIN) {
+		memcpy(sgid, &zgid, sizeof(*sgid));
+		return 0;
+	}
+
+	return ret;
+}
 
+int ocrdma_add_gid(struct ib_device *device,
+		   u8 port_num,
+		   unsigned int index,
+		   const union ib_gid *gid,
+		   const struct ib_gid_attr *attr,
+		   void **context) {
+	return  0;
+}
+
+int  ocrdma_del_gid(struct ib_device *device,
+		    u8 port_num,
+		    unsigned int index,
+		    void **context) {
 	return 0;
 }
 
@@ -110,6 +132,24 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
 	return 0;
 }
 
+struct net_device *ocrdma_get_netdev(struct ib_device *ibdev, u8 port_num)
+{
+	struct ocrdma_dev *dev;
+	struct net_device *ndev = NULL;
+
+	rcu_read_lock();
+
+	dev = get_ocrdma_dev(ibdev);
+	if (dev)
+		ndev = dev->nic_info.netdev;
+	if (ndev)
+		dev_hold(ndev);
+
+	rcu_read_unlock();
+
+	return ndev;
+}
+
 static inline void get_link_speed_and_width(struct ocrdma_dev *dev,
 					    u8 *ib_speed, u8 *ib_width)
 {
@@ -179,7 +219,8 @@ int ocrdma_query_port(struct ib_device *ibdev,
 	props->port_cap_flags =
 	    IB_PORT_CM_SUP |
 	    IB_PORT_REINIT_SUP |
-	    IB_PORT_DEVICE_MGMT_SUP | IB_PORT_VENDOR_CLASS_SUP | IB_PORT_IP_BASED_GIDS;
+	    IB_PORT_DEVICE_MGMT_SUP | IB_PORT_VENDOR_CLASS_SUP |
+	    IB_PORT_IP_BASED_GIDS;
 	props->gid_tbl_len = OCRDMA_MAX_SGID;
 	props->pkey_tbl_len = 1;
 	props->bad_pkey_cntr = 0;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index b15c608..b2c30f9 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -48,6 +48,17 @@ ocrdma_query_protocol(struct ib_device *device, u8 port_num);
 void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
 int ocrdma_query_gid(struct ib_device *, u8 port,
 		     int index, union ib_gid *gid);
+struct net_device *ocrdma_get_netdev(struct ib_device *device, u8 port_num);
+int ocrdma_add_gid(struct ib_device *device,
+		   u8 port_num,
+		   unsigned int index,
+		   const union ib_gid *gid,
+		   const struct ib_gid_attr *attr,
+		   void **context);
+int  ocrdma_del_gid(struct ib_device *device,
+		    u8 port_num,
+		    unsigned int index,
+		    void **context);
 int ocrdma_query_pkey(struct ib_device *, u8 port, u16 index, u16 *pkey);
 
 struct ib_ucontext *ocrdma_alloc_ucontext(struct ib_device *,
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
  2015-07-30 15:33 [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core Matan Barak
                   ` (2 preceding siblings ...)
       [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-07-31  9:40 ` Or Gerlitz
       [not found]   ` <CAJ3xEMh9z=LG7dW0nBP8Zdrbf0SkZ_GCxjCoowWXuw2dueN+xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  3 siblings, 1 reply; 23+ messages in thread
From: Or Gerlitz @ 2015-07-31  9:40 UTC (permalink / raw)
  To: Matan Barak, Doug Ledford, Jason Gunthorpe
  Cc: linux-rdma@vger.kernel.org, Sean Hefty, Somnath Kotur, Moni Shoua,
	talal@mellanox.com, Haggai Eran, Linux Netdev List

On Thu, Jul 30, 2015 at 6:33 PM, Matan Barak <matanb@mellanox.com> wrote:

[...]

> Changes from V6:
> (1) Addressed Jason's comments:
>         (a) Cache is no longer a client but part of IB infrastructure
>         (b) No need for READ_ONCE and flush_workqueue when tearing down
>             the cache

Doug

So... are we ready to go with V7 upstream?

Or.


> Changes from V5:
> (1) Incoporate the changes to cache.c so we use the same infrastructure
>     to manage both IB and RoCE (per Doug's request)
> (2) Replace the locking mechanism in the IB core GID cache from seqcount +
>     rcu to rwlock (addressing comments from Jason)
> (3) get_netdev returns a helded (dev_hold) device
> (4) Squashed the RocE GID table, RoCE GID management and default GID handling
>     code into one patch (per Doug's request).
> (5) Change modify_gid to add_gid and del_gid.
> (6) set the netdev related changes into three dedicated patches and make
>     them be 1st in the series.
>
> Changes from V4:
> (1) Remove any API changes.
> (2) Fixed a bug regarding bonding upper devices.
> (3) Rebased ontop of Doug's k.o/for-4.2.
>
> Changes from V3:
> (1) Remove RoCE V2 functionality (it will be sent at later patchset).
> (2) Instead of removing qp_attr_mask flags, reserve them.
> (3) Remove the kref from IB devices in favor of rwsem.
> (4) Change the name of roce_gid_cache to roce_gid_table.
> (5) Fix a race when roce_gid_table is free'd while getting events.
> (6) Remove the roce_gid_cache active/inactive flag/API.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]   ` <CAJ3xEMh9z=LG7dW0nBP8Zdrbf0SkZ_GCxjCoowWXuw2dueN+xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-31 12:50     ` Doug Ledford
       [not found]       ` <55BB6F10.5030408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2015-07-31 12:50 UTC (permalink / raw)
  To: Or Gerlitz, Matan Barak, Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]

On 07/31/2015 05:40 AM, Or Gerlitz wrote:
> On Thu, Jul 30, 2015 at 6:33 PM, Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> 
> [...]
> 
>> Changes from V6:
>> (1) Addressed Jason's comments:
>>         (a) Cache is no longer a client but part of IB infrastructure
>>         (b) No need for READ_ONCE and flush_workqueue when tearing down
>>             the cache
> 
> Doug
> 
> So... are we ready to go with V7 upstream?

Yes.  I've already pulled it in.

> Or.
> 
> 
>> Changes from V5:
>> (1) Incoporate the changes to cache.c so we use the same infrastructure
>>     to manage both IB and RoCE (per Doug's request)
>> (2) Replace the locking mechanism in the IB core GID cache from seqcount +
>>     rcu to rwlock (addressing comments from Jason)
>> (3) get_netdev returns a helded (dev_hold) device
>> (4) Squashed the RocE GID table, RoCE GID management and default GID handling
>>     code into one patch (per Doug's request).
>> (5) Change modify_gid to add_gid and del_gid.
>> (6) set the netdev related changes into three dedicated patches and make
>>     them be 1st in the series.
>>
>> Changes from V4:
>> (1) Remove any API changes.
>> (2) Fixed a bug regarding bonding upper devices.
>> (3) Rebased ontop of Doug's k.o/for-4.2.
>>
>> Changes from V3:
>> (1) Remove RoCE V2 functionality (it will be sent at later patchset).
>> (2) Instead of removing qp_attr_mask flags, reserve them.
>> (3) Remove the kref from IB devices in favor of rwsem.
>> (4) Change the name of roce_gid_cache to roce_gid_table.
>> (5) Fix a race when roce_gid_table is free'd while getting events.
>> (6) Remove the roce_gid_cache active/inactive flag/API.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]       ` <55BB6F10.5030408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-07-31 16:32         ` Jason Gunthorpe
       [not found]           ` <20150731163201.GA6027-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-07-31 16:32 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, Matan Barak,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

On Fri, Jul 31, 2015 at 08:50:24AM -0400, Doug Ledford wrote:

> > So... are we ready to go with V7 upstream?
> 
> Yes.  I've already pulled it in.

You are taking the netdev stuff without an ack from netdev??

I've been too busy too look at v7, but a quick check of the 'move the
cache into core code stuff' looks wrong to
me. ib_unregister_event_handler and flush_workqueue should not/can not be
called from a kref free'r.

Looking at your for-rebase branch.. please make sure you get all the
attributions this cycle :|

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]           ` <20150731163201.GA6027-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-31 17:41             ` Doug Ledford
       [not found]               ` <55BBB353.7020806-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-07-31 21:24               ` Or Gerlitz
  0 siblings, 2 replies; 23+ messages in thread
From: Doug Ledford @ 2015-07-31 17:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Matan Barak,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 3279 bytes --]

On 07/31/2015 12:32 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2015 at 08:50:24AM -0400, Doug Ledford wrote:
> 
>>> So... are we ready to go with V7 upstream?
>>
>> Yes.  I've already pulled it in.
> 
> You are taking the netdev stuff without an ack from netdev??

I've pulled it in, yes.  Dave Miller/netdev needs to ack it still.  I
really doubt they will object to the 1st or 3rd patch, but they might
have comments on the second.  Since they weren't copied on the original
submission, I'll be pinging them separately.  Dave may prefer if they go
through his tree, and that's fine, but I still need them in my tree as
of now for testing purposes.

> I've been too busy too look at v7, but a quick check of the 'move the
> cache into core code stuff' looks wrong to
> me. ib_unregister_event_handler and flush_workqueue should not/can not be
> called from a kref free'r.

Please be more specific here.  What are you objecting to?  Are you
objecting to a flush_workqueue from a release() context?  Because I
don't see anything in the kref documentation or the kref implementation
that prevents or prohibits this.  Or are you referring to the fact that
they aren't unregistering their event handler until well after the
parent device is unregistered?  That's obviously wrong, but it's also
easy to fix (the obvious fix is that they should be calling
ib_cache_cleanup_one from the top of ib_unregister_device versus waiting
for a kref put).

Regardless though, the reason I'm taking this (and a number of other
things) into my tree is that waiting for things to be absolutely perfect
on a submission is an interminable waiting game.  We have about 3 weeks
or maybe a little more until the next merge window opens.  In that time,
doing v8 of this patchset, and v9 of that patchset, and v5 of another
patchset all gets to be a huge bog on everyone's time.  These various
patchsets have reached the point where they are close and incremental
patches would be better than resubmitting the entire patchset over and
over again.  Not to mention that some of these fixes are quick and easy
to do and I'd rather quit waiting 24+ hours for a respin turnaround when
I could just hop in and make the fix myself in 15 minutes and test it
immediately.

> Looking at your for-rebase branch.. please make sure you get all the
> attributions this cycle :|

Yet *another* reason why these v6, v7, v8 patchsets are a huge time
drain :-/.  For a 10 patch v7 patchset, I need to look through 70 patch
listings in patchworks and try to see which reviewers didn't get their
attribution carried forward, and on top of that I have to make a
judgment call about which reviewed-bys should or shouldn't get carried
forward based upon how much changed in the next patch and whether or not
a previous review is still relevant or has the patch changed enough that
it needs a new review?  Really, this is my only complaint about
patchworks.  Aside from attribution loss on resubmit, it works great.

Anyway, this thread brings up an important issue, scheduling for the 4.3
merge window.  I'll bring that up in a separate email later today.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]               ` <55BBB353.7020806-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-07-31 20:18                 ` Jason Gunthorpe
       [not found]                   ` <20150731201828.GA28263-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-07-31 20:18 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, Matan Barak,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

On Fri, Jul 31, 2015 at 01:41:39PM -0400, Doug Ledford wrote:
> Please be more specific here.  What are you objecting to?  Are you
> objecting to a flush_workqueue from a release() context?  Because I
> don't see anything in the kref documentation or the kref
> implementation that prevents or prohibits this.  Or are you
> referring to the fact that they aren't unregistering their event
> handler until well after the parent device is unregistered?

I'm talking about the basic invarient we have for our removal process:
ib_unregister_device must fence all calls into the driver, so the
driver can proceed with the low level remove on the asumption that
nothing will call into it after unregister returns.

Clients create this fence by calling things like flush_workqueue and
ib_unregister_event_handler.

If the fence is moved out of ib_unregister_device and into release
then the whole model breaks.

I've explained this all before on the v1 of these series..

> That's obviously wrong, but it's also easy to fix (the obvious fix
> is that they should be calling ib_cache_cleanup_one from the top of
> ib_unregister_device versus waiting for a kref put).

Well, no, that is not the obvious fix, that puts things back the way
they were in v6. As I explained the kfrees need to be in the release
function, the async fencing needs to remain in the remove_one or
equivalent.

> Regardless though, the reason I'm taking this (and a number of other
> things) into my tree is that waiting for things to be absolutely perfect
> on a submission is an interminable waiting game.

So, I'm confused by this..

I haven't been waiting, I've looked at every release of every one of
these three big core change patch set. And found actual problems.

What I don't understand is your timing on these.. No 'hey, could we
finish these reviews' or anything, just out of the blue, v7 merged! It
was just posted yesterday!

> Not to mention that some of these fixes are quick and easy
> to do and I'd rather quit waiting 24+ hours for a respin turnaround when
> I could just hop in and make the fix myself in 15 minutes and test it
> immediately.

It would have been a lot less work for me to just fix the various
issues than to explain them to the authors. I'm hoping that investment
means the next series around will be good at V1...

> > Looking at your for-rebase branch.. please make sure you get all the
> > attributions this cycle :|
> 
> Yet *another* reason why these v6, v7, v8 patchsets are a huge time
> drain :-/.

Sagi's recent patch set is missing several attributions too..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
  2015-07-31 17:41             ` Doug Ledford
       [not found]               ` <55BBB353.7020806-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-07-31 21:24               ` Or Gerlitz
  2015-07-31 22:01                 ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Or Gerlitz @ 2015-07-31 21:24 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Matan Barak, linux-rdma@vger.kernel.org,
	Sean Hefty, Somnath Kotur, Moni Shoua, talal@mellanox.com,
	Haggai Eran, Linux Netdev List

On Fri, Jul 31, 2015 at 8:41 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 07/31/2015 12:32 PM, Jason Gunthorpe wrote:
>> On Fri, Jul 31, 2015 at 08:50:24AM -0400, Doug Ledford wrote:

>>>> So... are we ready to go with V7 upstream?
>>> Yes.  I've already pulled it in.
>> You are taking the netdev stuff without an ack from netdev??

> I've pulled it in, yes.  Dave Miller/netdev needs to ack it still.  I
> really doubt they will object to the 1st or 3rd patch, but they might
> have comments on the second.  Since they weren't copied on the original
> submission, I'll be pinging them separately.  Dave may prefer if they go
> through his tree, and that's fine, but I still need them in my tree as
> of now for testing purposes.

Doug, netdev are copied since V6 which was posed on June 24, pretty
much enough time for anyone there to comment if needed. So the
remaining thing to take care of is ask Dave if he's OK that these
patches will go upstream through your tree or his.

>> I've been too busy too look at v7, but a quick check of the 'move the
>> cache into core code stuff' looks wrong to
>> me. ib_unregister_event_handler and flush_workqueue should not/can not be
>> called from a kref free'r.

> Please be more specific here.  What are you objecting to?  Are you
> objecting to a flush_workqueue from a release() context?  Because I
> don't see anything in the kref documentation or the kref implementation
> that prevents or prohibits this.  Or are you referring to the fact that
> they aren't unregistering their event handler until well after the
> parent device is unregistered?  That's obviously wrong, but it's also
> easy to fix (the obvious fix is that they should be calling
> ib_cache_cleanup_one from the top of ib_unregister_device versus waiting
> for a kref put).

Yes, the cover letter states the v6 --> v7 changes and if this is new
complaint, then indeed @ this point it would be fair to have it
addressed in incremental patch, as Doug suggested. Jason, it's wrong
to send developers again and again to fix things which were not
perfect in Vn-1 but also not being covered by reviewers on Vn-1, at
some point the reviewer can't load new comments which gate the
acceptance but rather be willing for things to be fixed incrementally.

Or.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
  2015-07-31 21:24               ` Or Gerlitz
@ 2015-07-31 22:01                 ` Jason Gunthorpe
  2015-08-01 21:48                   ` Or Gerlitz
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-07-31 22:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Matan Barak, linux-rdma@vger.kernel.org, Sean Hefty,
	Somnath Kotur, Moni Shoua, talal@mellanox.com, Haggai Eran,
	Linux Netdev List

On Sat, Aug 01, 2015 at 12:24:23AM +0300, Or Gerlitz wrote:

> addressed in incremental patch, as Doug suggested. Jason, it's wrong
> to send developers again and again to fix things which were 
> perfect in Vn-1 but also not being covered by reviewers on Vn-1, at
> some point the reviewer can't load new comments which gate the

I don't even know what you are talking about Or.

v6 had some small problems in the logic and v7 introduces a fairly
serious flaw while trying to fix them. IMHO, you are better to merge
v6 than v7, at least v6's problems are less likely to be serious.

IMHO, it took until around v5/v6 before this series was even worth
taking a detailed look at. I'm certainly not willing to waste my time
doing detailed reviews on other elements when basic things like
locking and refcounting are screwed up.

I think you are really off side with the idea that a review has to see
every problem in Vn or ignore it in Vn+1. That is obviously
unworkable.

> acceptance but rather be willing for things to be fixed incrementally.

That is the same argument you used for the timestamp _ex UAPI mess,
last cycle, where are the incremental fixes for that?

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]                   ` <20150731201828.GA28263-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-08-01  0:30                     ` Doug Ledford
       [not found]                       ` <be961828-39ff-4a15-863e-93cc86dfcf49@email.android.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2015-08-01  0:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Matan Barak,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 8215 bytes --]

On 07/31/2015 04:18 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2015 at 01:41:39PM -0400, Doug Ledford wrote:
>> Please be more specific here.  What are you objecting to?  Are you
>> objecting to a flush_workqueue from a release() context?  Because I
>> don't see anything in the kref documentation or the kref
>> implementation that prevents or prohibits this.  Or are you
>> referring to the fact that they aren't unregistering their event
>> handler until well after the parent device is unregistered?
> 
> I'm talking about the basic invarient we have for our removal process:
> ib_unregister_device must fence all calls into the driver, so the
> driver can proceed with the low level remove on the asumption that
> nothing will call into it after unregister returns.

Agreed.

> Clients create this fence by calling things like flush_workqueue and
> ib_unregister_event_handler.

Agreed.

> If the fence is moved out of ib_unregister_device and into release
> then the whole model breaks.

Agreed.

> I've explained this all before on the v1 of these series..
> 
>> That's obviously wrong, but it's also easy to fix (the obvious fix
>> is that they should be calling ib_cache_cleanup_one from the top of
>> ib_unregister_device versus waiting for a kref put).
> 
> Well, no, that is not the obvious fix, that puts things back the way
> they were in v6. As I explained the kfrees need to be in the release
> function,

Disagree.

> the async fencing needs to remain in the remove_one or
> equivalent.

Agree.

So, my point of disagreement above.  Except in circumstances with a damn
good reason, I expect code to be symmetrical.  If you allocate some
caches in ib_cache_init_one(), I expect you to kfree those caches in
ib_cache_cleanup_one().  Further, if you call ib_cache_init_one() as one
of the last items before returning from ib_register_device(), then I
expect you to call ib_cache_cleanup_one() as one of the first items in
ib_unregister_device().

However, the fencing doesn't belong to ib_cache_cleanup_one(), it
belongs at the head of ib_unregister_device() and ib_cache_cleanup_one()
should only be called after the fence is complete.

And in this respect, removing the device from the various clients is
actually part of the fencing.  The order of ops in ib_register_device is:

ib_device_register_sysfs
ib_cache_init_one
for client_list {
    add_client_context
    client->add
}

Consequently, the order of ops in ib_unregister_device should be:

for client_list
    mark coming down
for client_list
    remove(client)
for client_list
    kfree(client)
ib_cache_cleanup_one <- missing
ib_device_unregister_sysfs

In many respects, I expect the ib_unregister_device() call to mirror the
error unwind found in the register call with the modifications for
dealing with a device that was actually live.

It's certainly possible in certain circumstances to veer from this
paradigm, but I expect a good reason for doing so.  A person who isn't
immediately RDMA aware can come in and fix a locking bug when the code
is like it is above.  If you start playing subtle tricks with the
locking and ordering of removal, you start to trip up anyone that
doesn't have first hand knowledge of those tricks and why they were
implemented.  That negatively impacts the long term support of the code,
so the reason should be sufficient to warrant the drawback and the
subtle trick should be documented in the code so someone coming around
later knows what's going on.

To a certain extent, your comments in the v6 thread about the cache
being expected to work beyond the life of the device and therefore part
of the core stack are somewhat reasonable, but the life of the cache on
a per device basis is no different than the life of the client
registration.  The above flow would remove the cache from the client
list, but will still in effect treat it just like a client.  The only
difference (and one that I think is correct), is that by taking it out
of the client_list and those loops, you are able to make sure it's the
first client that comes up and the last one that goes down and that it
continues to work for as long as it needs to (after all, once all of the
clients are unregistered and removed for that device, and we are in the
ib_unregister_device routine, the device is going away and no more calls
into the cache will arrive for it).

>> Regardless though, the reason I'm taking this (and a number of other
>> things) into my tree is that waiting for things to be absolutely perfect
>> on a submission is an interminable waiting game.
> 
> So, I'm confused by this..
> 
> I haven't been waiting, I've looked at every release of every one of
> these three big core change patch set. And found actual problems.

Some of lesser and greater importance.  Some worth resubmitting the
entire patchset, and some that I would be perfectly OK with an
incremental patch later on.  A problem does not always mean resubmit, it
can mean fix before merge window with another patch (or during merge
window depending on the situation).

> What I don't understand is your timing on these.. No 'hey, could we
> finish these reviews' or anything, just out of the blue, v7 merged! It
> was just posted yesterday!

That's because v6 was mostly done and the v6 -> v7 changes were minor.
I would have taken v6 and then accepted incremental patches had it still
applied.

These branches are from my internal tree and really are WIP branches.
I've pushed those WIP branches to github.  As I stated when I set my two
different repos up, my github repo is my WIP repo.  It's for people to
be able to see where we are, pull it if they wish, and preferably to
test against ahead of time.  I *want* people to be able to see where I'm
going.  But that doesn't mean it is fixed in stone.  For things like the
fix to the v7 version of this patch, I will accept an incremental patch,
but it need not have a complete changelog because if it just fixes one
of the patches in this submission, I'm OK with squashing it down into
the original patch in order to keep the patch count down and
bisectability up.  But that can only be done prior to the merge window.
 After the merge window opens and the code heads to Linus, fixes must be
discreet patches.

So you can actually break the merge/testing window down into two
distinct phases: pre-pull (where we can fix broken patches by squashing
a fix into and thereby preserve bisections) and post-pull (which is what
we are all used to).

This merge window is riskier than most because of the patches queued up
to go in.  By pulling all of this together and publishing my git repo
early, I'm inviting other interested parties to join me in the early
testing process where we can fix problems and also preserve bisection.

So that's why I'm pulling things together and posting my git repo ahead
of the merge window.  To speak more directly to the timing issue though,
there is only about 3 weeks left before the merge window opens, and all
of next week I'm going to be working on other things.  So I wanted to
get something out that people can look at and test with before I have to
redirect for a week, and that will still leave me two weeks to integrate
fixes prior to the merge window opening.

>> Not to mention that some of these fixes are quick and easy
>> to do and I'd rather quit waiting 24+ hours for a respin turnaround when
>> I could just hop in and make the fix myself in 15 minutes and test it
>> immediately.
> 
> It would have been a lot less work for me to just fix the various
> issues than to explain them to the authors. I'm hoping that investment
> means the next series around will be good at V1...
> 
>>> Looking at your for-rebase branch.. please make sure you get all the
>>> attributions this cycle :|
>>
>> Yet *another* reason why these v6, v7, v8 patchsets are a huge time
>> drain :-/.
> 
> Sagi's recent patch set is missing several attributions too..
> 
> Jason
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
  2015-07-31 22:01                 ` Jason Gunthorpe
@ 2015-08-01 21:48                   ` Or Gerlitz
       [not found]                     ` <CAJ3xEMjgvgvc8WyUhHagiH4gNs6KK7pL_TmhXCywNiWEy4p70w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Or Gerlitz @ 2015-08-01 21:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Matan Barak
  Cc: Doug Ledford, linux-rdma@vger.kernel.org, Sean Hefty,
	Somnath Kotur, Moni Shoua, talal@mellanox.com, Haggai Eran,
	Linux Netdev List

On Sat, Aug 1, 2015 at 1:01 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Sat, Aug 01, 2015 at 12:24:23AM +0300, Or Gerlitz wrote:
>
>> addressed in incremental patch, as Doug suggested. Jason, it's wrong
>> to send developers again and again to fix things which were
>> perfect in Vn-1 but also not being covered by reviewers on Vn-1, at
>> some point the reviewer can't load new comments which gate the
>
> I don't even know what you are talking about Or.
>
> v6 had some small problems in the logic and v7 introduces a fairly
> serious flaw while trying to fix them. IMHO, you are better to merge
> v6 than v7, at least v6's problems are less likely to be serious.

Jason, can you be more specific? I don't see any comments from you
expect for the cover-letter, so if something broke out, sure, a fix is
needed, but what is that?

> That is the same argument you used for the timestamp _ex UAPI mess,
> last cycle, where are the incremental fixes for that?

I remember you have provided review comment which pointed that the
time-stamping series stepped on something which was there before needs
some cleanup, not a real mess to my taste. Matan, do have the plan to
do that work?

Or.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]                         ` <be961828-39ff-4a15-863e-93cc86dfcf49-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
@ 2015-08-02  3:03                           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2015-08-02  3:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Or Gerlitz, Matan Barak, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Sean Hefty, Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w, Haggai Eran, Linux Netdev List

> In many respects, I expect the ib_unregister_device() call to mirror the
>  error unwind found in the register call with the modifications for
>  dealing with a device that was actually live.

Yes, it should look like that, I also noticed there were ordering
problems in this area. and we probably have some long standing
use-after-free bugs in here which are hopefully harmless..

However, the kfrees should still be in release.

> To a certain extent, your comments in the v6 thread about the cache
> being expected to work beyond the life of the device and therefore part
> of the core stack are somewhat reasonable, but the life of the cache on
> a per device basis is no different than the life of the client
> registration.

Well, no, later patches in the series add calls into this code from
the drivers, and I looked into those and it wasn't obviously clear
that they were or could be made safe, or that future uses from drivers
would also be safe.

I looked at all of this, and thought about addressing the issue by
suggesting reordering removal, but it just doesn't guarantee safety,
requires too much auditing to prove and is a lot more work. Moving the
kfree is 10 lines or so and is guaranteed correct.

I disagree this is subtle or unexpected design pattern. The standard
kernel pattern with kref is a create/remove/release triplet. When you
have kref controlled structure every single member in that structure
needs to be analyzed for lifetime and the unwind(s) needs to be placed
in remove or release. For this cache code, that analysis tells me the
kfree unwind needs to live in release and the workqueue/event stuff in
remove. That is how I noticed to start with: a kfree of a member
inside a kref'd structure outside of release looks out of place,
especially if there is no locking protecting it...

The idiom is also to often use kref_put as part of the error unwind in
create, and since v7 doesn't do this it introduces a double kfree bug
as well..

At the end of the day this really shouldn't be the problem of the
caller, as long as the ib_device pointer is valid then core API should
be callable without causing a use-after-free.

> These branches are from my internal tree and really are WIP branches.
> I've pushed those WIP branches to github.  As I stated when I set my two

Sure testing sounds reasonable, maybe you need a different phrase for
this situation than 'thanks applied'.

Anyhow, it is a holiday here till Tuesday, hopefully Matan can fix
this up for your testing before then.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]                     ` <CAJ3xEMjgvgvc8WyUhHagiH4gNs6KK7pL_TmhXCywNiWEy4p70w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-02  7:56                       ` Matan Barak
       [not found]                         ` <55BDCD36.6000301-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Barak @ 2015-08-02  7:56 UTC (permalink / raw)
  To: Or Gerlitz, Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sean Hefty, Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List



On 8/2/2015 12:48 AM, Or Gerlitz wrote:
> On Sat, Aug 1, 2015 at 1:01 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>> On Sat, Aug 01, 2015 at 12:24:23AM +0300, Or Gerlitz wrote:
>>
>>> addressed in incremental patch, as Doug suggested. Jason, it's wrong
>>> to send developers again and again to fix things which were
>>> perfect in Vn-1 but also not being covered by reviewers on Vn-1, at
>>> some point the reviewer can't load new comments which gate the
>>
>> I don't even know what you are talking about Or.
>>
>> v6 had some small problems in the logic and v7 introduces a fairly
>> serious flaw while trying to fix them. IMHO, you are better to merge
>> v6 than v7, at least v6's problems are less likely to be serious.
>
> Jason, can you be more specific? I don't see any comments from you
> expect for the cover-letter, so if something broke out, sure, a fix is
> needed, but what is that?
>
>> That is the same argument you used for the timestamp _ex UAPI mess,
>> last cycle, where are the incremental fixes for that?
>
> I remember you have provided review comment which pointed that the
> time-stamping series stepped on something which was there before needs
> some cleanup, not a real mess to my taste. Matan, do have the plan to
> do that work?

Indeed this design flaw was introduced when the first legacy verb was
extended. I think that falling back from extended code to legacy code
should be in the uverbs code. ib_uverbs_write will return -ENOSYS only
if both extended and non-extended don't exist. The uverbs command itself 
will call the non-extended form if the comp_mask is zero and all
data between legacy size and the given size are zero as well.
What do you think?

>
> Or.
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core
       [not found]                         ` <55BDCD36.6000301-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-08-06  4:53                           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2015-08-06  4:53 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty,
	Somnath Kotur, Moni Shoua,
	talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, Haggai Eran,
	Linux Netdev List

On Sun, Aug 02, 2015 at 10:56:38AM +0300, Matan Barak wrote:

> Indeed this design flaw was introduced when the first legacy verb was
> extended. I think that falling back from extended code to legacy code
> should be in the uverbs code. ib_uverbs_write will return -ENOSYS only
> if both extended and non-extended don't exist. The uverbs command itself
> will call the non-extended form if the comp_mask is zero and all
> data between legacy size and the given size are zero as well.
> What do you think?

Yes, that is what I had expected from day 1.. Didn't notice it in the
first introduction.

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-08-06  4:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-30 15:33 [PATCH for-next V7 00/10] Move RoCE GID management to IB/Core Matan Barak
2015-07-30 15:33 ` [PATCH for-next V7 01/10] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
2015-07-30 15:33 ` [PATCH for-next V7 02/10] net: Add info for NETDEV_CHANGEUPPER event Matan Barak
     [not found] ` <1438270411-17648-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-07-30 15:33   ` [PATCH for-next V7 03/10] net/bonding: Export bond_option_active_slave_get_rcu Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 04/10] IB/core: Add rwsem to allow reading device list or client list Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 05/10] IB/core: Add RoCE GID table management Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 06/10] IB/core: Add RoCE table bonding support Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 07/10] net/mlx4: Postpone the registration of net_device Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 08/10] IB/mlx4: Implement ib_device callbacks Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 09/10] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
2015-07-30 15:33   ` [PATCH for-next V7 10/10] RDMA/ocrdma: Incorporate the moving of GID Table mgmt to IB/Core Matan Barak
2015-07-31  9:40 ` [PATCH for-next V7 00/10] Move RoCE GID management " Or Gerlitz
     [not found]   ` <CAJ3xEMh9z=LG7dW0nBP8Zdrbf0SkZ_GCxjCoowWXuw2dueN+xg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-31 12:50     ` Doug Ledford
     [not found]       ` <55BB6F10.5030408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-07-31 16:32         ` Jason Gunthorpe
     [not found]           ` <20150731163201.GA6027-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-07-31 17:41             ` Doug Ledford
     [not found]               ` <55BBB353.7020806-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-07-31 20:18                 ` Jason Gunthorpe
     [not found]                   ` <20150731201828.GA28263-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-08-01  0:30                     ` Doug Ledford
     [not found]                       ` <be961828-39ff-4a15-863e-93cc86dfcf49@email.android.com>
     [not found]                         ` <be961828-39ff-4a15-863e-93cc86dfcf49-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
2015-08-02  3:03                           ` Jason Gunthorpe
2015-07-31 21:24               ` Or Gerlitz
2015-07-31 22:01                 ` Jason Gunthorpe
2015-08-01 21:48                   ` Or Gerlitz
     [not found]                     ` <CAJ3xEMjgvgvc8WyUhHagiH4gNs6KK7pL_TmhXCywNiWEy4p70w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-08-02  7:56                       ` Matan Barak
     [not found]                         ` <55BDCD36.6000301-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-08-06  4:53                           ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox