* [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-17 5:50 Haggai Eran
2015-05-17 5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
` (9 more replies)
0 siblings, 10 replies; 68+ messages in thread
From: Haggai Eran @ 2015-05-17 5:50 UTC (permalink / raw)
To: Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
Haggai Eran
Thanks again everyone for the review comments. I've updated the patch set
accordingly. The main changes are in the first patch to use a read-write
semaphore instead of an SRCU, and with the reference counting of shared
ib_cm_ids.
Please let me know if I missed anything, or if there are other issues with
the series.
Regards,
Haggai
Changes from v3:
- Patch 1 and 3: use read-write semaphore instead of an SRCU.
- Patch 5:
* Use a direct reference count instead of a kref.
* Instead of adding get/put pair for ib_cm_ids, just avoid destroying an
id when it is still in use.
* Squashes these two patches together, since the first one became too
short:
IB/cm: Reference count ib_cm_ids
IB/cm: API to retrieve existing listening CM IDs
- Rebase to Doug's to-be-rebased/for-4.2 branch.
Changes from v2:
- Add patch 1 to change device_mutex to an RCU.
- Remove patch that fixed IPv4 connections to an IPv4/IPv6 listener.
- Limit namespace related changes to RDMA CM and InfiniBand only.
- Rebase on dledford/for-v4.2, with David Ahern's unaligned access patch.
* Use Michael Wang's capability functions where needed.
- Move the struct net argument to be the first in all functions, to match the
networking core scheme.
- Patch 2:
* Remove unwanted braces.
- Patch 4: check the return value of ib_find_cached_pkey.
- Patch 8: verify the address family before calling cm_save_ib_info.
- Patch 10: use generic_net instead of a custom radix tree for having per
network namespace data.
- Minor changes.
Changes from v1:
- Include patch 1 in this series.
- Rebase for v4.1.
Changes from v0:
- Fix code review comments by Yann
- Rebase on top of linux-3.19
RDMA-CM uses IP based addressing and routing to setup RDMA connections between
hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM
must reside in the init_net namespace. This restricts the usage of containers
with RDMA to only work with host network namespace (aka the kernel init_net NS
instance).
This patchset allows using network namespaces with the RDMA-CM.
Each RDMA-CM id keeps a reference to a network namespace.
This reference is based on the process network namespace at the time of the
creation of the object or inherited from the listener.
This network namespace is used to perform all IP and network related
operations. Specifically, the local device lookup, as well as the remote GID
address resolution are done in the context of the RDMA-CM object's namespace.
This allows outgoing connections to reach the right target, even if the same
IP address exists in multiple network namespaces. This can happen if each
network namespace resides on a different P_Key.
Additionally, the network namespace is used to split the listener service ID
table. From the user point of view, each network namespace has a unique,
completely independent table of service IDs. This allows running multiple
instances of a single service on the same machine, using containers. To
implement this, multiple RDMA CM IDs, belonging to different namespaces may
now share their CM ID. When a request on such a CM ID arrives, the RDMA CM
module finds out the correct namespaces and looks for the RDMA CM ID
matching the request's parameters.
The functionality introduced by this series would come into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each namespace.
Multiple IPoIB interfaces can be created and assigned to different RDMA-CM
capable containers, for example using pipework [1].
Full support for RoCE will be introduced in a later stage.
The patches apply against Doug's tree for v4.2.
The patchset is structured as follows:
Patch 1 adds a read-write semaphore in addition to the device mutex in
ib_core to allow traversing the client list without a deadlock in Patch 3.
Patch 2 is a relatively trivial API extension, requiring the callers
of certain ib_addr functions to provide a network namespace, as needed.
Patches 3 and 4 adds the ability to lookup a network namespace according to
the IP address, device and P_Key. It finds the matching IPoIB interfaces, and
safely takes a reference on the network namespace before returning to the
caller.
Patches 5-6 make necessary changes to the CM layer, to allow sharing of a
single CM ID between multiple RDMA CM IDs. This includes adding a reference
count to ib_cm_id structs, add an API to either create a new CM ID or use
an existing one, and expose the service ID to ib_cm clients.
Patches 7-8 do some preliminary refactoring to the rdma_cm module. Patch 7
refactors the logic that extracts the IP address from a connect request to
allow reuse by the namespace lookup code further on. Patch 8 changes the
way RDMA CM module creates CM IDs, to avoid relying on the compare_data
feature of ib_cm. This feature associate a single compare_data struct per
ib_cm_id, so it cannot be used when sharing CM IDs.
Patches 9-12 add proper namespace support to the RDMA-CM module. This
includes adding multiple port space tables, sharing ib_cm_ids between
rdma_cm_ids, adding a network namespace parameter, and finally retrieving
the namespace from the creating process.
[1] https://github.com/jpetazzo/pipework/pull/108
Guy Shapiro (4):
IB/addr: Pass network namespace as a parameter
IB/ipoib: Return IPoIB devices matching connection parameters
IB/cma: Add support for network namespaces
IB/ucma: Take the network namespace from the process
Haggai Eran (7):
IB/core: Add rwsem to allow reading device list or client list
IB/cm: Share listening CM IDs
IB/cm: Expose service ID in request events
IB/cma: Refactor RDMA IP CM private-data parsing code
IB/cma: Add compare_data checks to the RDMA CM module
IB/cma: Separate port allocation to network namespaces
IB/cma: Share CM IDs between namespaces
Yotam Kenneth (1):
IB/core: Find the network namespace matching connection parameters
drivers/infiniband/core/addr.c | 18 +-
drivers/infiniband/core/cm.c | 129 +++++-
drivers/infiniband/core/cma.c | 461 +++++++++++++++------
drivers/infiniband/core/device.c | 76 +++-
drivers/infiniband/core/ucma.c | 4 +-
drivers/infiniband/ulp/ipoib/ipoib_main.c | 139 ++++++-
drivers/infiniband/ulp/iser/iser_verbs.c | 2 +-
drivers/infiniband/ulp/isert/ib_isert.c | 2 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 4 +-
include/rdma/ib_addr.h | 16 +-
include/rdma/ib_cm.h | 8 +
include/rdma/ib_verbs.h | 33 ++
include/rdma/rdma_cm.h | 6 +-
net/9p/trans_rdma.c | 4 +-
net/rds/ib.c | 2 +-
net/rds/ib_cm.c | 2 +-
net/rds/iw.c | 2 +-
net/rds/iw_cm.c | 2 +-
net/rds/rdma_transport.c | 4 +-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +-
net/sunrpc/xprtrdma/verbs.c | 3 +-
21 files changed, 767 insertions(+), 154 deletions(-)
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran @ 2015-05-17 5:50 ` Haggai Eran 2015-05-17 5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran ` (8 subsequent siblings) 9 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:50 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran From: Guy Shapiro <guysh@mellanox.com> Add network namespace support to the ib_addr module. For that, all the address resolution and matching should be done using the appropriate namespace instead of init_net. This is achieved by: 1. Adding an explicit network namespace argument to exported function that require a namespace. 2. Saving the namespace in the rdma_addr_client structure. 3. Using it when calling networking functions. In order to preserve the behavior of calling modules, &init_net is passed as the parameter in calls from other modules. This is modified as namespace support is added on more levels. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> --- drivers/infiniband/core/addr.c | 18 ++++++++++-------- drivers/infiniband/core/cma.c | 1 + include/rdma/ib_addr.h | 16 +++++++++++++++- 3 files changed, 26 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 38339d220d7f..95c0bc4190a3 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, int ret = -EADDRNOTAVAIL; if (dev_addr->bound_dev_if) { - dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if); + dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if); if (!dev) return -ENODEV; ret = rdma_copy_addr(dev_addr, dev, NULL); @@ -138,7 +138,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, switch (addr->sa_family) { case AF_INET: - dev = ip_dev_find(&init_net, + dev = ip_dev_find(dev_addr->net, ((struct sockaddr_in *) addr)->sin_addr.s_addr); if (!dev) @@ -149,12 +149,11 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, *vlan_id = rdma_vlan_dev_vlan_id(dev); dev_put(dev); break; - #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: rcu_read_lock(); - for_each_netdev_rcu(&init_net, dev) { - if (ipv6_chk_addr(&init_net, + for_each_netdev_rcu(dev_addr->net, dev) { + if (ipv6_chk_addr(dev_addr->net, &((struct sockaddr_in6 *) addr)->sin6_addr, dev, 1)) { ret = rdma_copy_addr(dev_addr, dev, NULL); @@ -236,7 +235,7 @@ static int addr4_resolve(struct sockaddr_in *src_in, fl4.daddr = dst_ip; fl4.saddr = src_ip; fl4.flowi4_oif = addr->bound_dev_if; - rt = ip_route_output_key(&init_net, &fl4); + rt = ip_route_output_key(addr->net, &fl4); if (IS_ERR(rt)) { ret = PTR_ERR(rt); goto out; @@ -278,12 +277,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, fl6.saddr = src_in->sin6_addr; fl6.flowi6_oif = addr->bound_dev_if; - dst = ip6_route_output(&init_net, NULL, &fl6); + dst = ip6_route_output(addr->net, NULL, &fl6); if ((ret = dst->error)) goto put; if (ipv6_addr_any(&fl6.saddr)) { - ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev, + ret = ipv6_dev_get_saddr(addr->net, + ip6_dst_idev(dst)->dev, &fl6.daddr, 0, &fl6.saddr); if (ret) goto put; @@ -476,6 +476,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac, rdma_gid2ip(&dgid_addr._sockaddr, dgid); memset(&dev_addr, 0, sizeof(dev_addr)); + dev_addr.net = &init_net; ctx.addr = &dev_addr; init_completion(&ctx.comp); @@ -510,6 +511,7 @@ int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id) rdma_gid2ip(&gid_addr._sockaddr, sgid); memset(&dev_addr, 0, sizeof(dev_addr)); + dev_addr.net = &init_net; ret = rdma_translate_ip(&gid_addr._sockaddr, &dev_addr, vlan_id); if (ret) return ret; diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 1977f601a1ec..c957d53df666 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -524,6 +524,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); + id_priv->id.route.addr.dev_addr.net = &init_net; return &id_priv->id; } diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index ac54c27a2bfd..445a72e1320a 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -47,6 +47,7 @@ #include <rdma/ib_verbs.h> #include <rdma/ib_pack.h> #include <net/ipv6.h> +#include <net/net_namespace.h> struct rdma_addr_client { atomic_t refcount; @@ -64,6 +65,16 @@ void rdma_addr_register_client(struct rdma_addr_client *client); */ void rdma_addr_unregister_client(struct rdma_addr_client *client); +/** + * struct rdma_dev_addr - Contains resolved RDMA hardware addresses + * @src_dev_addr: Source MAC address. + * @dst_dev_addr: Destination MAC address. + * @broadcast: Broadcast address of the device. + * @dev_type: The interface hardware type of the device. + * @bound_dev_if: An optional device interface index. + * @transport: The transport type used. + * @net: Network namespace containing the bound_dev_if net_dev. + */ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; @@ -71,11 +82,14 @@ struct rdma_dev_addr { unsigned short dev_type; int bound_dev_if; enum rdma_transport_type transport; + struct net *net; }; /** * rdma_translate_ip - Translate a local IP address to an RDMA hardware * address. + * + * The dev_addr->net field must be initialized. */ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, u16 *vlan_id); @@ -90,7 +104,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, * @dst_addr: The destination address to resolve. * @addr: A reference to a data location that will receive the resolved * addresses. The data location must remain valid until the callback has - * been invoked. + * been invoked. The net field of the addr struct must be valid. * @timeout_ms: Amount of time to wait for the address resolution to complete. * @callback: Call invoked once address resolution has completed, timed out, * or been canceled. A status of 0 indicates success. -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran 2015-05-17 5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran @ 2015-05-17 5:50 ` Haggai Eran [not found] ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-17 5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran ` (7 subsequent siblings) 9 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:50 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran From: Yotam Kenneth <yotamke@mellanox.com> In the case of IPoIB, and maybe in other cases, the network device is managed by an upper-layer protocol (ULP). In order to expose this network device to other users of the IB device, let ULPs implement a callback that returns network device according to connection parameters. The IB device and port, together with the P_Key and the IP address should be enough to uniquely identify the ULP net device. This function is passed to ib_core as part of the ib_client registration. Using this functionality, add a way to get the network namespace corresponding to a work completion. This is needed so that responses to CM requests can be sent from the same network namespace as the request. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> --- drivers/infiniband/core/device.c | 52 ++++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 33 +++++++++++++++++++++++++ 2 files changed, 85 insertions(+) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 3a44723c6b9d..be184f9feab2 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -38,6 +38,7 @@ #include <linux/slab.h> #include <linux/init.h> #include <linux/mutex.h> +#include <linux/netdevice.h> #include <rdma/rdma_netlink.h> #include "core_priv.h" @@ -748,6 +749,57 @@ int ib_find_pkey(struct ib_device *device, } EXPORT_SYMBOL(ib_find_pkey); +static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr) +{ + struct net_device *ret = NULL; + struct ib_client *client; + + down_read(&lists_rwsem); + + list_for_each_entry_rcu(client, &client_list, list) + if (client->get_net_device_by_port_pkey_ip) { + ret = client->get_net_device_by_port_pkey_ip(dev, port, + pkey, + addr); + if (ret) + break; + } + + up_read(&lists_rwsem); + + return ret; +} + +struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr) +{ + struct net_device *ndev = NULL; + struct net *ns; + + if (rdma_protocol_ib(dev, port)) + ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr); + + if (!ndev) + goto not_found; + + rcu_read_lock(); + ns = maybe_get_net(dev_net(ndev)); + dev_put(ndev); + rcu_read_unlock(); + if (!ns) + goto not_found; + return ns; + +not_found: + return get_net(&init_net); +} +EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip); + static int __init ib_core_init(void) { int ret; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8d59479eea4d..76093a33ed46 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -48,6 +48,7 @@ #include <linux/rwsem.h> #include <linux/scatterlist.h> #include <linux/workqueue.h> +#include <linux/socket.h> #include <uapi/linux/if_ether.h> #include <linux/atomic.h> @@ -1691,6 +1692,24 @@ struct ib_client { void (*add) (struct ib_device *); void (*remove)(struct ib_device *); + /* Returns the net_dev belonging to this ib_client and matching the + * given parameters. + * @dev: An RDMA device that the net_dev use for communication. + * @port: A physical port number on the RDMA device. + * @pkey: P_Key that the net_dev uses if applicable. + * @addr: An IP address the net_dev is configured with. + * + * An ib_client that implements a net_dev on top of RDMA devices + * (such as IP over IB) should implement this callback, allowing the + * rdma_cm module to find the right net_dev for a given request. + * + * The caller is responsible for calling dev_put on the returned + * netdev. */ + struct net_device *(*get_net_device_by_port_pkey_ip)( + struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr); struct list_head list; }; @@ -2835,4 +2854,18 @@ static inline int ib_check_mr_access(int flags) int ib_check_mr_status(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); +/** + * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace + * for a received CM request + * @dev: An RDMA device on which the request has been received. + * @port: Port number on the RDMA device. + * @pkey: The Pkey the request came on. + * @addr: Contains the IP address that the request specified as its + * destination. + */ +struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr); + #endif /* IB_VERBS_H */ -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
[parent not found: <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters [not found] ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-19 18:26 ` Jason Gunthorpe [not found] ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 18:26 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth > + list_for_each_entry_rcu(client, &client_list, list) > + if (client->get_net_device_by_port_pkey_ip) { > + ret = client->get_net_device_by_port_pkey_ip(dev, port, > + pkey, > + > addr); Considering the patch that introduced the rwsem, this doesn't look right. We can't call a client call back on a device that is unregistered, we can't call a client call back on a client that is unregistering. So the ordering of the list_del/add vs call back calls in device.c is not right, please audit everything with an eye toward maintaing sane invarients. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters [not found] ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-20 14:48 ` Haggai Eran 0 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-20 14:48 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 19/05/2015 21:26, Jason Gunthorpe wrote: >> + list_for_each_entry_rcu(client, &client_list, list) >> + if (client->get_net_device_by_port_pkey_ip) { >> + ret = client->get_net_device_by_port_pkey_ip(dev, port, >> + pkey, >> + >> addr); > > Considering the patch that introduced the rwsem, this doesn't look > right. > > We can't call a client call back on a device that is unregistered, we > can't call a client call back on a client that is unregistering. > > So the ordering of the list_del/add vs call back calls in device.c is > not right, please audit everything with an eye toward maintaing sane > invarients. Thanks for pointing that out. I'll do that. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran 2015-05-17 5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran 2015-05-17 5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran [not found] ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-17 5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran ` (6 subsequent siblings) 9 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran From: Guy Shapiro <guysh@mellanox.com> Implement the get_net_device_by_port_pkey_ip callback that returns network device to ib_core according to connection parameters. Check the ipoib device and iterate over all child devices to look for a match. For each ipoib device we iterate through all upper devices when searching for a matching IP, in order to support bonding. Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 139 +++++++++++++++++++++++++++++- 1 file changed, 138 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 3421e42870c3..75def39a4271 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,9 @@ #include <linux/jhash.h> #include <net/arp.h> +#include <net/addrconf.h> +#include <linux/inetdevice.h> +#include <rdma/ib_cache.h> #define DRV_VERSION "1.0.0" @@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); static void ipoib_neigh_reclaim(struct rcu_head *rp); +static struct net_device *ipoib_get_net_device_by_port_pkey_ip( + struct ib_device *dev, u8 port, u16 pkey, + struct sockaddr *addr); static struct ib_client ipoib_client = { .name = "ipoib", .add = ipoib_add_one, - .remove = ipoib_remove_one + .remove = ipoib_remove_one, + .get_net_device_by_port_pkey_ip = ipoib_get_net_device_by_port_pkey_ip, }; int ipoib_open(struct net_device *dev) @@ -222,6 +229,136 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu) return 0; } +/* Called with an RCU read lock taken */ +static bool ipoib_is_dev_match_addr(struct sockaddr *addr, + struct net_device *dev) +{ + struct net *net = dev_net(dev); + struct in_device *in_dev; + struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; +#if IS_ENABLED(CONFIG_IPV6) + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; +#endif + __be32 ret_addr; + + switch (addr->sa_family) { + case AF_INET: + in_dev = in_dev_get(dev); + if (!in_dev) + return false; + + ret_addr = inet_confirm_addr(net, in_dev, 0, + addr_in->sin_addr.s_addr, + RT_SCOPE_HOST); + in_dev_put(in_dev); + if (ret_addr) + return true; + + break; +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) + return true; + + break; +#endif + } + return false; +} + +/** + * Find a net_device matching the given address, which is an upper device of + * the given net_device. + * @addr: IP address to look for. + * @dev: base IPoIB net_device + * + * If found, returns the net_device with a reference held. Otherwise return + * NULL. + */ +static struct net_device *ipoib_get_net_dev_match_addr(struct sockaddr *addr, + struct net_device *dev) +{ + struct net_device *upper, + *result = NULL; + struct list_head *iter; + + rcu_read_lock(); + if (ipoib_is_dev_match_addr(addr, dev)) { + dev_hold(dev); + result = dev; + goto out; + } + + netdev_for_each_all_upper_dev_rcu(dev, upper, iter) { + if (ipoib_is_dev_match_addr(addr, upper)) { + dev_hold(upper); + result = upper; + break; + } + } +out: + rcu_read_unlock(); + return result; +} + +/* returns an IPoIB netdev on top a given ipoib device matching a pkey_index + * and address, if one exists. */ +static struct net_device *ipoib_match_pkey_addr(struct ipoib_dev_priv *priv, + u16 pkey_index, + struct sockaddr *addr) +{ + struct ipoib_dev_priv *child_priv; + struct net_device *net_dev = NULL; + + if (priv->pkey_index == pkey_index) { + net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev); + if (net_dev) + return net_dev; + } + + /* Check child interfaces */ + down_read(&priv->vlan_rwsem); + list_for_each_entry(child_priv, &priv->child_intfs, list) { + net_dev = ipoib_match_pkey_addr(child_priv, pkey_index, addr); + if (net_dev) + break; + } + up_read(&priv->vlan_rwsem); + + return net_dev; +} + +static struct net_device *ipoib_get_net_device_by_port_pkey_ip( + struct ib_device *dev, u8 port, u16 pkey, struct sockaddr *addr) +{ + struct ipoib_dev_priv *priv; + struct list_head *dev_list; + struct net_device *net_dev; + u16 pkey_index; + int ret; + + ret = ib_find_cached_pkey(dev, port, pkey, &pkey_index); + if (ret) + return NULL; + + if (!rdma_protocol_ib(dev, port)) + return NULL; + + dev_list = ib_get_client_data(dev, &ipoib_client); + if (!dev_list) + return NULL; + + list_for_each_entry(priv, dev_list, list) { + if (priv->port != port) + continue; + + net_dev = ipoib_match_pkey_addr(priv, pkey_index, addr); + if (net_dev) + return net_dev; + } + return NULL; +} + int ipoib_set_mode(struct net_device *dev, const char *buf) { struct ipoib_dev_priv *priv = netdev_priv(dev); -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
[parent not found: <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-19 18:28 ` Jason Gunthorpe [not found] ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-19 23:55 ` Jason Gunthorpe 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 18:28 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: > +#if IS_ENABLED(CONFIG_IPV6) > + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; > +#endif > + __be32 ret_addr; > + > + switch (addr->sa_family) { > + case AF_INET: > + in_dev = in_dev_get(dev); > + if (!in_dev) > + return false; > + > + ret_addr = inet_confirm_addr(net, in_dev, 0, > + addr_in->sin_addr.s_addr, > + RT_SCOPE_HOST); > + in_dev_put(in_dev); > + if (ret_addr) > + return true; > + > + break; > +#if IS_ENABLED(CONFIG_IPV6) > + case AF_INET6: > + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) > + return true; > + > + break; > +#endif Can you use if (IS_ENABLED(CONFIG_IPV6)) At the call site instead of the #if guards? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-20 15:17 ` Haggai Eran 0 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-20 15:17 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 19/05/2015 21:28, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: > >> +#if IS_ENABLED(CONFIG_IPV6) >> + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; >> +#endif >> + __be32 ret_addr; >> + >> + switch (addr->sa_family) { >> + case AF_INET: >> + in_dev = in_dev_get(dev); >> + if (!in_dev) >> + return false; >> + >> + ret_addr = inet_confirm_addr(net, in_dev, 0, >> + addr_in->sin_addr.s_addr, >> + RT_SCOPE_HOST); >> + in_dev_put(in_dev); >> + if (ret_addr) >> + return true; >> + >> + break; >> +#if IS_ENABLED(CONFIG_IPV6) >> + case AF_INET6: >> + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) >> + return true; >> + >> + break; >> +#endif > > Can you use > > if (IS_ENABLED(CONFIG_IPV6)) > > At the call site instead of the #if guards? Sure, I'll do that in the next revision of the patch-set. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-19 18:28 ` Jason Gunthorpe @ 2015-05-19 23:55 ` Jason Gunthorpe [not found] ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 23:55 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > Implement the get_net_device_by_port_pkey_ip callback that returns network > device to ib_core according to connection parameters. Check the ipoib > device and iterate over all child devices to look for a match. Can you give a run down on how to actually set this up? Like what shell command do you execute? What are the child devices in the netnamespace? > For each ipoib device we iterate through all upper devices when searching > for a matching IP, in order to support bonding. Checking an IP address in a packet against a device without consulting the routing table is a big red flag for me. Can you elaborate? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-21 5:33 ` Haggai Eran [not found] ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-21 5:48 ` Haggai Eran 1 sibling, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-21 5:33 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 20/05/2015 02:55, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >> > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> > >> > Implement the get_net_device_by_port_pkey_ip callback that returns network >> > device to ib_core according to connection parameters. Check the ipoib >> > device and iterate over all child devices to look for a match. > Can you give a run down on how to actually set this up? Like what > shell command do you execute? Sure. There are two methods to create new child interface for IPoIB. For a specific P_Key, write the desired P_Key to the create_child sysfs file: # echo 0x8000 > /sys/class/net/ib0/create_child This creates a new interface ib0.8000 operating with P_Key 0x8000. To create a new child interface on the default P_Key, its possible to use iproute: # ip link add link ib0 name ib0.1 type ipoib In order to create a new network namespace: # ip netns add ns1 Then, you can assign the new netdev to the namespace: # ip link set ib0.1 netns ns1 You can then set an IP address in the network namespace, and try some RDMA CM applications: # ip netns exec ns1 ip addr add dev ib0.1 192.168.0.1/24 # ip netns exec ns1 rdma_server Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-21 5:48 ` Or Gerlitz [not found] ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-21 17:43 ` Jason Gunthorpe 1 sibling, 1 reply; 68+ messages in thread From: Or Gerlitz @ 2015-05-21 5:48 UTC (permalink / raw) To: Haggai Eran Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 21, 2015 at 8:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On 20/05/2015 02:55, Jason Gunthorpe wrote: >> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >>> > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >>> > >>> > Implement the get_net_device_by_port_pkey_ip callback that returns network >>> > device to ib_core according to connection parameters. Check the ipoib >>> > device and iterate over all child devices to look for a match. >> Can you give a run down on how to actually set this up? Like what >> shell command do you execute? > > Sure. > > There are two methods to create new child interface for IPoIB. > For a specific P_Key, write the desired P_Key to the create_child sysfs > file: > # echo 0x8000 > /sys/class/net/ib0/create_child > This creates a new interface ib0.8000 operating with P_Key 0x8000. 0x8000 is practically zero (bits 15-0), right? not sure this is a valid pkey. > To create a new child interface on the default P_Key, its possible to > use iproute: > # ip link add link ib0 name ib0.1 type ipoib you can use non default pkeys as well here. $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-21 6:33 ` Haggai Eran [not found] ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-21 6:33 UTC (permalink / raw) To: Or Gerlitz Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 21/05/2015 08:48, Or Gerlitz wrote: > On Thu, May 21, 2015 at 8:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: >> On 20/05/2015 02:55, Jason Gunthorpe wrote: >>> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >>>>> From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >>>>> >>>>> Implement the get_net_device_by_port_pkey_ip callback that returns network >>>>> device to ib_core according to connection parameters. Check the ipoib >>>>> device and iterate over all child devices to look for a match. >>> Can you give a run down on how to actually set this up? Like what >>> shell command do you execute? >> >> Sure. >> >> There are two methods to create new child interface for IPoIB. >> For a specific P_Key, write the desired P_Key to the create_child sysfs >> file: >> # echo 0x8000 > /sys/class/net/ib0/create_child >> This creates a new interface ib0.8000 operating with P_Key 0x8000. > > 0x8000 is practically zero (bits 15-0), right? not sure this is a valid pkey. Right, bad example :) > > >> To create a new child interface on the default P_Key, its possible to >> use iproute: >> # ip link add link ib0 name ib0.1 type ipoib > > you can use non default pkeys as well here. > > $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001 > Right. I think when we started development of the namespaces patches these child interfaces (rtnetlink with pkey) didn't work for us, but I checked now on an updated kernel and they do. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-21 10:31 ` Or Gerlitz 0 siblings, 0 replies; 68+ messages in thread From: Or Gerlitz @ 2015-05-21 10:31 UTC (permalink / raw) To: Haggai Eran Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 21, 2015 at 9:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: >> you can use non default pkeys as well here. >> $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001 > Right. I think when we started development of the namespaces patches > these child interfaces (rtnetlink with pkey) didn't work for us, but I > checked now on an updated kernel and they do. The last change to drivers/infiniband/ulp/ipoib/ipoib_netlink.c took place on 3.12, well before you started... could be that you provided wrong command line and such, I haven't seen any bug report nor fix from you group to that area of the kernel. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-21 5:48 ` Or Gerlitz @ 2015-05-21 17:43 ` Jason Gunthorpe [not found] ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-21 17:43 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Hefty, Sean On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote: > To create a new child interface on the default P_Key, its possible to > use iproute: > # ip link add link ib0 name ib0.1 type ipoib Uh.. A key invariant of the IP stack is that is it possible to uniquely identify the ingress device. So the above scheme is fine for IPoIB, because it uses the interface unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN) is the unique ID tuple. The world is happy. But RDMA CM doesn't provide the QPN. So when RDMA CM searches the netdevs for an address it cannot *uniquely* map to a IPoIB interface. This is bad, and *completely wrong*, but today, nobody is going to really notice or care. The cases where it does something you don't want are not very significant. But with containers.. Think this through for a minute: 'In some cases the RDMA CM selecs the wrong child' - that goes from being a minor annoyance to a violation of containment! Worse the criteria for 'selects the wrong child' can be triggered by the contained users. Eg the contained user adds a IP to their child that duplicates another container. Now we've lost control. The very idea of ib_get_net_dev_by_port_pkey_ip is broken. So, I don't know what to say here.. Ideas? 1) Forbid creating more than one pkey per ipoib interface? 2) Somehow extend the RDMA CM to send the IPoIB qpn too? 3) ?? Right now the only case that comes to mind is duplicating IPs, that is already going to cause an ARP collision, so maybe having the RDMA CM randomly select an IP is not the end of the world... But with containers and security, who knows? I'm not confident I've exhaustively thought of all possibilities here. ---------- Anyhow.. looking again through this series and the existing code, the flow is wrong, and really needs to be changed before this starts to make sense to anyone, and is no doubt part of how we got here.. When a REQ arrives RDMA CM needs to run down these steps (this is identical to what ip_input.c does) 1) Locate the netdev associated with the ingress of the packet, in a sane world this is done by only checking the unique (Device,Port,Pkey,QPN) tuple. If we keep our brokeness, we'd do this based on (Device,Port,Pkey,IP) - if there are IP collisions then randomly select a netdev (similar to how ARP collision is handled). 2) Then we do the ip_route_input_noref step, this will set skb_dst to the netdev that will handle the packet, or tell us to drop it. This is not always the same as the netdev that accepts the packet!!! NOTE: This route step is missing today, it does critical things like check that the node is actually listening on the dest IP! 3) Now we can use skb_dst to iterate over the set of all RDMA CM listens: 1) Bound to the skb_dst netdev 2) Unbound in the same namespace as skb_dst netdev The first to match the dst IP + port is the listen that will accept the connection, now we go into the cma_new_conn_id path, and we don't need rdma_translate_ip because we already have the handling netdev. The backwards operation of the current code is part of why this is all so strange looking, and I think is strongly connected to the private data compare issues Sean is talking about. It is very much the wrong flow to look for the RDMA CM listen first, and then try to work backwards to the netdev. The above 3 steps hard require that the ib_cm and rdma_cm maintain different listening lists, because we need the 2nd search in #3. So this gets the ib_cm completely out of looking at the private data. [And now we can think carefully about the best way to refcount the listens in RDMA CM] Once the above is cleaned up dropping in namespaces should just happen naturally. So.. I'm going to suggest you make a cleanup series to fix this: - Introduce the ib_get_net_dev_by_port_pkey_ip and document the breakage it represents - Rework rdma_cm to do steps 1,2,3 above, using ib_get_net_dev_by_port_pkey_ip to drive #1, your patch set is already doing about 50% of this change. - Fix the collateral damage As I said to Matan, clean up series like this should not introduce major functional changes, so stack the few remaining net namespace things in another series after it. ** The same comments apply to RoCE too, but for RoCE step #1 works properly based on the (Device,Port,VLAN) unique tuple Ditto for iWarp ** I think this also moves to address Sean's concern about generality, at least for listen. All three protocols will run down the same common code to locate the netdev and find the correct RDMA CM listen. The only difference is the 'ib_get_net_dev_by_port_pkey_ip' call varies. .. I also wouldn't mind seeing the giant cma.c split, if it makes sense to have a cma_listen.c, for instance, wouldn't that be nice? Sorry Haggi, this is a big change to your patchset :( Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-28 11:51 ` Haggai Eran 2015-05-28 15:45 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-28 11:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Hefty, Sean On 21/05/2015 20:43, Jason Gunthorpe wrote: > On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote: > >> To create a new child interface on the default P_Key, its possible to >> use iproute: >> # ip link add link ib0 name ib0.1 type ipoib > > Uh.. > > A key invariant of the IP stack is that is it possible to uniquely > identify the ingress device. > > So the above scheme is fine for IPoIB, because it uses the interface > unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN) > is the unique ID tuple. The world is happy. > > But RDMA CM doesn't provide the QPN. So when RDMA CM searches the > netdevs for an address it cannot *uniquely* map to a IPoIB interface. This is technically true, but if someone configures their system that way, they will also have ARP conflicts in addition. I don't see why we should support such a configuration. > This is bad, and *completely wrong*, but today, nobody is going to > really notice or care. The cases where it does something you don't > want are not very significant. > > But with containers.. Think this through for a minute: 'In some cases > the RDMA CM selecs the wrong child' - that goes from being a minor > annoyance to a violation of containment! Worse the criteria for > 'selects the wrong child' can be triggered by the contained users. Eg > the contained user adds a IP to their child that duplicates another > container. Now we've lost control. No, this is exactly what would happen in the Ethernet world. If you create a conflicting configuration between two containers on the same Ethernet segment, then one of them could get the traffic that was intended for the other. I don't see this as a violation of containment, because these containers are assigned net_devs that communicate on the same segment, so they behave just as two different hosts would, with or without conflicts. > > The very idea of ib_get_net_dev_by_port_pkey_ip is broken. > > So, I don't know what to say here.. Ideas? > > 1) Forbid creating more than one pkey per ipoib interface? You probably mean more than one IP on the same pkey. The pkey is actually part of the request, so its not an issue. > 2) Somehow extend the RDMA CM to send the IPoIB qpn too? > 3) ?? We can do something crazy in the future like moving all CM requests to run over UDP as in RoCEv2. But both adding the QPN or moving to UDP require a wire protocol change and won't be compatible with today's systems. > > Right now the only case that comes to mind is duplicating IPs, that is > already going to cause an ARP collision, so maybe having the RDMA CM > randomly select an IP is not the end of the world... But with > containers and security, who knows? I'm not confident I've > exhaustively thought of all possibilities here. > > ---------- > > Anyhow.. looking again through this series and the existing code, the > flow is wrong, and really needs to be changed before this starts to > make sense to anyone, and is no doubt part of how we got here.. > > When a REQ arrives RDMA CM needs to run down these steps (this is identical > to what ip_input.c does) > > 1) Locate the netdev associated with the ingress of the packet, > in a sane world this is done by only checking the > unique (Device,Port,Pkey,QPN) tuple. > If we keep our brokeness, we'd do this based on > (Device,Port,Pkey,IP) - if there are IP collisions then randomly > select a netdev (similar to how ARP collision is handled). That's what ib_get_net_dev_by_port_pkey_ip intends to do. > 2) Then we do the ip_route_input_noref step, this will set skb_dst to > the netdev that will handle the packet, or tell us to drop it. > This is not always the same as the netdev that accepts the > packet!!! > > NOTE: This route step is missing today, it does critical things > like check that the node is actually listening on the dest IP! Isn't this a little over-engineered? If all you want is to make sure the net dev is up, can't we use something like netif_running()? Also, this sounds like a major change in behavior even for applications that do not use containers. I think today RDMA CM will accept connections even if the ipoib interface is down. > > 3) Now we can use skb_dst to iterate over the set of all RDMA CM listens: > 1) Bound to the skb_dst netdev > 2) Unbound in the same namespace as skb_dst netdev > The first to match the dst IP + port is the listen that will accept the > connection, now we go into the cma_new_conn_id path, and we don't > need rdma_translate_ip because we already have the handling netdev. You shouldn't be able to bind one listener to a netdev in a namespace and also have a different listener listening for any netdev on that same namespace. (That is what cma_check_port verifies, right?) So when looking for a listener in a namespace there should be only one match. It is true we no longer need the rdma_translate_ip call. > > The backwards operation of the current code is part of why this is all > so strange looking, and I think is strongly connected to the private > data compare issues Sean is talking about. It is very much the wrong > flow to look for the RDMA CM listen first, and then try to work > backwards to the netdev. That's not what the code does. It first finds the netdev and decides on the namespace based on that. It then finds the RDMA CM listener in that namespace. > > The above 3 steps hard require that the ib_cm and rdma_cm > maintain different listening lists, because we need the 2nd search in > #3. So this gets the ib_cm completely out of looking at the private > data. [And now we can think carefully about the best way to refcount > the listens in RDMA CM] The current patch-set already makes sure that rdma_cm doesn't rely on ib_cm's private_data matching. > > Once the above is cleaned up dropping in namespaces should just > happen naturally. > > So.. I'm going to suggest you make a cleanup series to fix this: > - Introduce the ib_get_net_dev_by_port_pkey_ip and document the > breakage it represents > - Rework rdma_cm to do steps 1,2,3 above, using > ib_get_net_dev_by_port_pkey_ip to drive #1, your patch set is > already doing about 50% of this change. > - Fix the collateral damage > > As I said to Matan, clean up series like this should not introduce > major functional changes, so stack the few remaining net namespace > things in another series after it. I think I can refactor the series this way, but I don't think we need step 2. It seems like an overly complicated way of checking whether a netdev is up or not. It doesn't seem to provide any new information over what ib_get_net_dev_by_port_pkey_ip does. As for fixing the collateral damage, I assume you mean cleaning the interfaces of ib_cm now that private data matching is not used? I don't see why this should be part of the series. > > ** The same comments apply to RoCE too, but for RoCE step #1 works > properly based on the (Device,Port,VLAN) unique tuple For RoCE, you could still have multiple macvlan interfaces using the same VLAN, with different IP address. So the unique tuple is (Device,Port,VLAN,IP). With Matan adding a netdev to each GID entry in the RoCE GID table, I think it would be simpler to find the netdev in RoCE. > Ditto for iWarp ** Yes, if iWarp CM can provide the netdev on which a request arrives, I think it would be simple to use it with this design. > > I think this also moves to address Sean's concern about generality, at > least for listen. All three protocols will run down the same common > code to locate the netdev and find the correct RDMA CM listen. The > only difference is the 'ib_get_net_dev_by_port_pkey_ip' call varies. Right. > > .. I also wouldn't mind seeing the giant cma.c split, if it makes > sense to have a cma_listen.c, for instance, wouldn't that be nice? It would be nice, but this patchset is becoming large already, and I don't want to add unnecessary noise. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters 2015-05-28 11:51 ` Haggai Eran @ 2015-05-28 15:45 ` Jason Gunthorpe 0 siblings, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-28 15:45 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Hefty, Sean On Thu, May 28, 2015 at 02:51:51PM +0300, Haggai Eran wrote: > > But RDMA CM doesn't provide the QPN. So when RDMA CM searches the > > netdevs for an address it cannot *uniquely* map to a IPoIB interface. > This is technically true, but if someone configures their system that > way, they will also have ARP conflicts in addition. I don't see why we > should support such a configuration. Based on the dicussion in the other thread about this, the feeling is we should not support same-GUID, same-PKey child interfaces at all. It breaks too much stuff (DHCP,NetworkManager,IPv6,RDMA-CM) > No, this is exactly what would happen in the Ethernet world. If you > create a conflicting configuration between two containers on the same > Ethernet segment, then one of them could get the traffic that was > intended for the other. It is not exactly the same. In Ethernet there is an ARP collision at L3, but the traffic is properly addressed at L2 and unambiguously directed into only one container. There are ways to deal with ARP collisions, but those are only effective if the full LLADDR is consistently used for routing to containers. With RDMA-CM it is an L3 collsion, so our anti-ARP collision stuff doesn't help. Like I said, I don't from a security perspective what to make of this, but it isn't exactly the same of ethernet. > > 1) Locate the netdev associated with the ingress of the packet, > > in a sane world this is done by only checking the > > unique (Device,Port,Pkey,QPN) tuple. > > If we keep our brokeness, we'd do this based on > > (Device,Port,Pkey,IP) - if there are IP collisions then randomly > > select a netdev (similar to how ARP collision is handled). > That's what ib_get_net_dev_by_port_pkey_ip intends to do. Right, almost there. ib_get_net_dev_by_port_pkey_ip needs to work in a very specific way: If there is only one netdev with the (Device,Port,GUID,Pkey) match then that is the answer. (guid comes from the CM_REQ, if we add alias GUID support to IPoIB as Or suggested then it is needed) The IP search should *only* be done if there are two children with identical (Pkey,GUID), and as above, perhaps we should de-support that. > > 2) Then we do the ip_route_input_noref step, this will set skb_dst to > > the netdev that will handle the packet, or tell us to drop it. > > This is not always the same as the netdev that accepts the > > packet!!! > > > > NOTE: This route step is missing today, it does critical things > > like check that the node is actually listening on the dest IP! > > Isn't this a little over-engineered? If all you want is to make sure the > net dev is up, can't we use something like netif_running()? The routing check is not to see if the netdev is up, it is doing all sorts of subtle userspace visible things. Like checking there is no blackhole route configured for the packet, checking that the IP is present in the system, netdevs are up, etc etc. We don't get to pick and choose what netdev behaviors we implement when doing this kind of stuff. Copy the netstack, don't make stuff up. Understand the two layer separation, first with pick a netdev without looking at L3 info, then we feed the netdev and L3 info into routing to complete the process. > Also, this sounds like a major change in behavior even for applications > that do not use containers. I think today RDMA CM will accept > connections even if the ipoib interface is down. Yes, it is a change in behavior, things move closer to alignment with how netdev works. We did a similar change to the output side years ago as well. I guess the input side was missed. Unless I'm missing something this isn't 'major', these are corner case conditions/bugs that nobody sane should rely on. > > 3) Now we can use skb_dst to iterate over the set of all RDMA CM listens: > > 1) Bound to the skb_dst netdev > > 2) Unbound in the same namespace as skb_dst netdev > > The first to match the dst IP + port is the listen that will accept the > > connection, now we go into the cma_new_conn_id path, and we don't > > need rdma_translate_ip because we already have the handling netdev. > You shouldn't be able to bind one listener to a netdev in a namespace > and also have a different listener listening for any netdev on that same > namespace. (That is what cma_check_port verifies, right?) So when > looking for a listener in a namespace there should be only one match. You know, I don't remember off hand the exact semantics of sockets, whatever sockets does :) > > The backwards operation of the current code is part of why this is all > > so strange looking, and I think is strongly connected to the private > > data compare issues Sean is talking about. It is very much the wrong > > flow to look for the RDMA CM listen first, and then try to work > > backwards to the netdev. > That's not what the code does. It first finds the netdev and decides on > the namespace based on that. It then finds the RDMA CM listener in that > namespace. I was talking about the current code, but even with the patch set, the behavior is still backwards: 1) Find the netdev 2) Get the namespace, throw away the netdev 3) Find a CM_ID without a netdev 4) Find a netdev from the CM_ID Instead: 1) Find the netdev 2) Find a CM_ID compatible with that netdev > I think I can refactor the series this way, but I don't think we need > step 2. It seems like an overly complicated way of checking whether a > netdev is up or not. It doesn't seem to provide any new information over > what ib_get_net_dev_by_port_pkey_ip does. If it is hard to do then leave a FIXME comment and go on, if it is easy, it is the right step. It does provide more information because ib_get_net_dev_by_port_pkey_ip should not check the IP. > For RoCE, you could still have multiple macvlan interfaces using the > same VLAN, with different IP address. So the unique tuple is > (Device,Port,VLAN,IP). With Matan adding a netdev to each GID entry in > the RoCE GID table, I think it would be simpler to find the netdev in RoCE. That isn't how I understand macvlan, the tuple will be (Device,Port,MAC,VLAN) You should never search for an IP, that is not how netdev works. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters [not found] ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-21 5:33 ` Haggai Eran @ 2015-05-21 5:48 ` Haggai Eran 1 sibling, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-21 5:48 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 20/05/2015 02:55, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >> For each ipoib device we iterate through all upper devices when searching >> for a matching IP, in order to support bonding. > > Checking an IP address in a packet against a device without consulting > the routing table is a big red flag for me. Can you elaborate? We want to match the incoming CM request to a specific net_dev, and use the network namespace of that net_dev to find the right rdma_cm_id. Because there can be multiple network namespaces, there would also be multiple routing tables, so I'm not sure how we could use them. The solution we used is to walk all the upper net_devs of all the IPoIB child devices of a given IB device and port, and find one that matches the request's IB device, port number, P_Key and IP address. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (2 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-19 18:35 ` Jason Gunthorpe [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> ` (5 subsequent siblings) 9 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran, Jason Gunthorpe Enabling network namespaces for RDMA CM will allow processes on different namespaces to listen on the same port. In order to leave namespace support out of the CM layer, this requires that multiple RDMA CM IDs will be able to share a single CM ID. This patch adds infrastructure to retrieve an existing listening ib_cm_id, based on its device and service ID, or create a new one if one does not already exist. It also adds a reference count for such instances (cm_id_private.sharecount), and prevents cm_destroy_id from destroying a CM if it is still shared. See the relevant discussion [1]. [1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids http://www.spinics.net/lists/netdev/msg328860.html Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> --- drivers/infiniband/core/cm.c | 121 +++++++++++++++++++++++++++++++++++++++++-- include/rdma/ib_cm.h | 4 ++ 2 files changed, 120 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 14423c20c55b..4f936932dd54 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -212,6 +212,8 @@ struct cm_id_private { spinlock_t lock; /* Do not acquire inside cm.lock */ struct completion comp; atomic_t refcount; + /* Number of clients sharing this ib_cm_id. Only valid for listeners. */ + atomic_t sharecount; struct ib_mad_send_buf *msg; struct cm_timewait_info *timewait_info; @@ -720,6 +722,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device *device, INIT_LIST_HEAD(&cm_id_priv->work_list); atomic_set(&cm_id_priv->work_count, -1); atomic_set(&cm_id_priv->refcount, 1); + atomic_set(&cm_id_priv->sharecount, 1); return &cm_id_priv->id; error: @@ -841,6 +844,12 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err) struct cm_work *work; cm_id_priv = container_of(cm_id, struct cm_id_private, id); + + if (!atomic_dec_and_test(&cm_id_priv->sharecount)) { + /* The id is still shared. */ + return; + } + retest: spin_lock_irq(&cm_id_priv->lock); switch (cm_id->state) { @@ -927,11 +936,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, - struct ib_cm_compare_data *compare_data) +/** + * __ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. + * @lock: If set, lock the cm.lock spin-lock when adding the id to the + * listener tree. When false, the caller must already hold the spin-lock, + * and compare_data must be NULL. + */ +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, + __be64 service_mask, + struct ib_cm_compare_data *compare_data, + bool lock) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; - unsigned long flags; + unsigned long flags = 0; int ret = 0; service_mask = service_mask ? service_mask : ~cpu_to_be64(0); @@ -957,7 +987,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->state = IB_CM_LISTEN; - spin_lock_irqsave(&cm.lock, flags); + if (lock) + spin_lock_irqsave(&cm.lock, flags); if (service_id == IB_CM_ASSIGN_SERVICE_ID) { cm_id->service_id = cpu_to_be64(cm.listen_service_id++); cm_id->service_mask = ~cpu_to_be64(0); @@ -966,7 +997,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->service_mask = service_mask; } cur_cm_id_priv = cm_insert_listen(cm_id_priv); - spin_unlock_irqrestore(&cm.lock, flags); + if (lock) + spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_id->state = IB_CM_IDLE; @@ -976,8 +1008,87 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, } return ret; } + +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_compare_data *compare_data) +{ + return __ib_cm_listen(cm_id, service_id, service_mask, compare_data, + true); +} EXPORT_SYMBOL(ib_cm_listen); +/** + * Create a new listening ib_cm_id and listen on the given service ID. + * + * If there's an existing ID listening on that same device and service ID, + * return it. + * + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + */ +struct ib_cm_id *ib_cm_id_create_and_listen( + struct ib_device *device, + ib_cm_handler cm_handler, + __be64 service_id, + __be64 service_mask) +{ + struct cm_id_private *cm_id_priv; + struct ib_cm_id *cm_id; + unsigned long flags; + int err = 0; + + /* Create an ID in advance, since the creation may sleep */ + cm_id = ib_create_cm_id(device, cm_handler, NULL); + if (IS_ERR(cm_id)) + return cm_id; + + spin_lock_irqsave(&cm.lock, flags); + + if (service_id == IB_CM_ASSIGN_SERVICE_ID) + goto new_id; + + /* Find an existing ID */ + cm_id_priv = cm_find_listen(device, service_id, NULL); + if (cm_id_priv) { + if (atomic_inc_return(&cm_id_priv->sharecount) == 1) { + /* This ID is already being destroyed */ + atomic_dec(&cm_id_priv->sharecount); + goto new_id; + } + + spin_unlock_irqrestore(&cm.lock, flags); + ib_destroy_cm_id(cm_id); + cm_id = &cm_id_priv->id; + if (cm_id->cm_handler != cm_handler || cm_id->context) + /* Sharing an ib_cm_id with different handlers is not + * supported */ + return ERR_PTR(-EINVAL); + return cm_id; + } + +new_id: + /* Use newly created ID */ + err = __ib_cm_listen(cm_id, service_id, service_mask, NULL, false); + + spin_unlock_irqrestore(&cm.lock, flags); + + if (err) { + ib_destroy_cm_id(cm_id); + return ERR_PTR(err); + } + return cm_id; +} +EXPORT_SYMBOL(ib_cm_id_create_and_listen); + static __be64 cm_form_tid(struct cm_id_private *cm_id_priv, enum cm_msg_sequence msg_seq) { diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 39ed2d2fbd51..59f534b5f091 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -361,6 +361,10 @@ struct ib_cm_compare_data { int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, struct ib_cm_compare_data *compare_data); +struct ib_cm_id *ib_cm_id_create_and_listen( + struct ib_device *device, ib_cm_handler cm_handler, + __be64 service_id, __be64 service_mask); + struct ib_cm_req_param { struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs 2015-05-17 5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran @ 2015-05-19 18:35 ` Jason Gunthorpe [not found] ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 18:35 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote: > @@ -212,6 +212,8 @@ struct cm_id_private { > spinlock_t lock; /* Do not acquire inside cm.lock */ > struct completion comp; > atomic_t refcount; > + /* Number of clients sharing this ib_cm_id. Only valid for listeners. */ > + atomic_t sharecount; No need for this atomic, hold the lock The use of the atomic looks racy: > + if (!atomic_dec_and_test(&cm_id_priv->sharecount)) { > + /* The id is still shared. */ > + return; > + } Might race with this: > + if (atomic_inc_return(&cm_id_priv->sharecount) == 1) { > + /* This ID is already being destroyed */ > + atomic_dec(&cm_id_priv->sharecount); > + goto new_id; > + } > + Resulting in use-after-free of cm_id_priv->sharecount Don't try and be clever with atomics, it is almost always wrong. The share count should be 'listen_sharecount' because it *only* works for listen. The above test in cm_destroy_id should only be in the listen branch of the if. > + * Create a new listening ib_cm_id and listen on the given service ID. > + * > + * If there's an existing ID listening on that same device and service ID, > + * return it. > + * .. Callers should call cm_destroy_id when done with the listen .. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs [not found] ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-19 22:35 ` Jason Gunthorpe [not found] ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-21 7:07 ` Haggai Eran 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 22:35 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote: > > @@ -212,6 +212,8 @@ struct cm_id_private { > > spinlock_t lock; /* Do not acquire inside cm.lock */ > > struct completion comp; > > atomic_t refcount; > > + /* Number of clients sharing this ib_cm_id. Only valid for listeners. */ > > + atomic_t sharecount; > > No need for this atomic, hold the lock > > The use of the atomic looks racy: > > > + if (!atomic_dec_and_test(&cm_id_priv->sharecount)) { > > + /* The id is still shared. */ > > + return; > > + } > > Might race with this: > > > + if (atomic_inc_return(&cm_id_priv->sharecount) == 1) { > > + /* This ID is already being destroyed */ > > + atomic_dec(&cm_id_priv->sharecount); > > + goto new_id; > > + } > > + > > Resulting in use-after-free of cm_id_priv->sharecount Actually, there is something else odd here.. I mentioned the above because there wasn't obvious ref'ing on the cm_id_priv. Looking closer the cm.lock should prevent use-after-free, but there is still no ref. The more I look at this, the more I think it is sketchy. Don't try and merge sharecount and refcount together, after cm_find_listen is called you have to increment the refcount before dropping cm.lock. Decrement the refcount when destroying a shared listen. I also don't see how the 'goto new_id' can work, if cm_find_listen succeeds then __ib_cm_listen is guarenteed to fail. Fix the locking to make that impossible, associate sharecount with the cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock: case IB_CM_LISTEN: spin_lock_irq(&cm.lock); if (cm_id_priv->sharecount != 0) { cm_id_prv->sharecount--; // paired with in in ib_cm_id_create_and_listen atomic_dec(&cm_id_priv->refcount); spin_unlock_irq(&cm.lock); return; } rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); spin_unlock_irq(&cm.lock); spin_lock_irq(&cm_id_priv->lock); cm_id->state = IB_CM_IDLE; spin_unlock_irq(&cm_id_priv->lock); break; Now that condition is eliminated, the unneeded atomic is gone, and refcount still acts like a proper kref should. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs [not found] ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-21 8:08 ` Haggai Eran 2015-05-21 17:54 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-21 8:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 20/05/2015 01:35, Jason Gunthorpe wrote: > On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote: >> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote: >>> @@ -212,6 +212,8 @@ struct cm_id_private { >>> spinlock_t lock; /* Do not acquire inside cm.lock */ >>> struct completion comp; >>> atomic_t refcount; >>> + /* Number of clients sharing this ib_cm_id. Only valid for listeners. */ >>> + atomic_t sharecount; >> >> No need for this atomic, hold the lock >> >> The use of the atomic looks racy: >> >>> + if (!atomic_dec_and_test(&cm_id_priv->sharecount)) { >>> + /* The id is still shared. */ >>> + return; >>> + } >> >> Might race with this: >> >>> + if (atomic_inc_return(&cm_id_priv->sharecount) == 1) { >>> + /* This ID is already being destroyed */ >>> + atomic_dec(&cm_id_priv->sharecount); >>> + goto new_id; >>> + } >>> + >> >> Resulting in use-after-free of cm_id_priv->sharecount > > Actually, there is something else odd here.. I mentioned the above > because there wasn't obvious ref'ing on the cm_id_priv. Looking closer > the cm.lock should prevent use-after-free, but there is still no ref. > > The more I look at this, the more I think it is sketchy. Don't try and > merge sharecount and refcount together, I'm not sure what you mean here. The way I was thinking about it was that sharecount = num of rdma_cm_ids sharing this listener, while refcount = num of active internal uses of this ID (work items, timers, etc.) Do you want refcount to also include the sharecount? > after cm_find_listen is called > you have to increment the refcount before dropping cm.lock. > > Decrement the refcount when destroying a shared listen. You mean to decrement event if listen_sharecount > 0, and the id isn't destroyed, right? The code already calls cm_deref_id when listen_sharecount = 0 of course. > I also don't see how the 'goto new_id' can work, if cm_find_listen > succeeds then __ib_cm_listen is guarenteed to fail. > > Fix the locking to make that impossible, associate sharecount with the > cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock: > > case IB_CM_LISTEN: > spin_lock_irq(&cm.lock); > if (cm_id_priv->sharecount != 0) { > cm_id_prv->sharecount--; > // paired with in in ib_cm_id_create_and_listen > atomic_dec(&cm_id_priv->refcount); > spin_unlock_irq(&cm.lock); > return; > } > rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); > spin_unlock_irq(&cm.lock); > > spin_lock_irq(&cm_id_priv->lock); > cm_id->state = IB_CM_IDLE; > spin_unlock_irq(&cm_id_priv->lock); > break; > > Now that condition is eliminated, the unneeded atomic is gone, and > refcount still acts like a proper kref should. Thanks, that looks like a better solution. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs 2015-05-21 8:08 ` Haggai Eran @ 2015-05-21 17:54 ` Jason Gunthorpe 0 siblings, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-21 17:54 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 21, 2015 at 11:08:31AM +0300, Haggai Eran wrote: > > The more I look at this, the more I think it is sketchy. Don't try and > > merge sharecount and refcount together, > I'm not sure what you mean here. The way I was thinking about it was > that sharecount = num of rdma_cm_ids sharing this listener, while > refcount = num of active internal uses of this ID (work items, timers, > etc.) Do you want refcount to also include the sharecount? If you hold on to the pointer, then you increment the refcount. The refcount is the 'number of holders of the pointer'. Basic invariant. When the pointer left the lock region for the lockup, the ref must be incremented. What you had was functionally correct because the sharecount was implying a refcount, but it doesn't follow the standard kernel refcounting pattern. > > after cm_find_listen is called > > you have to increment the refcount before dropping cm.lock. > > > > Decrement the refcount when destroying a shared listen. > You mean to decrement event if listen_sharecount > 0, and the id isn't > destroyed, right? The code already calls cm_deref_id when > listen_sharecount = 0 of course. Yes, because cm_destroy_id is the ref'd pair to cm_listen, after it returns the caller must give up their pointer. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs [not found] ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-19 22:35 ` Jason Gunthorpe @ 2015-05-21 7:07 ` Haggai Eran 1 sibling, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-21 7:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 19/05/2015 21:35, Jason Gunthorpe wrote: ... > The share count should be 'listen_sharecount' because it *only* works > for listen. > > The above test in cm_destroy_id should only be in the listen branch of > the if. Okay. > >> + * Create a new listening ib_cm_id and listen on the given service ID. >> + * >> + * If there's an existing ID listening on that same device and service ID, >> + * return it. >> + * > > .. Callers should call cm_destroy_id when done with the listen .. I'll add that (with ib_destroy_cm_id of course). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-17 5:50 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran ` (3 subsequent siblings) 4 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:50 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran, Matan Barak, Jason Gunthorpe Currently the RDMA subsystem's device list and client list are protected by a single mutex. This prevents adding user-facing APIs that iterate these lists, since using them may cause a deadlock. The patch attempts to solve this problem by adding a read-write semaphore to protect the lists. Readers now don't need the mutex, and are safe just by read-locking the semaphore. The ib_register_device, ib_register_client, ib_unregister_device, and ib_unregister_client functions are modified to lock the semaphore for write during their respective list modification This patch attempts to solve a similar need [1] that was seen in the RoCE v2 patch series. [1] http://www.spinics.net/lists/linux-rdma/msg24733.html Cc: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/device.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b360350a0b20..3a44723c6b9d 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -59,13 +59,17 @@ static LIST_HEAD(device_list); static LIST_HEAD(client_list); /* - * device_mutex protects access to both device_list and client_list. - * There's no real point to using multiple locks or something fancier - * like an rwsem: we always access both lists, and we're always - * modifying one list or the other list. In any case this is not a - * hot path so there's no point in trying to optimize. + * device_mutex and lists_rwsem protect access to both device_list and + * client_list. device_mutex protects writer access by device and client + * registration / de-registration. lists_rwsem protects reader access to + * these lists. Iterators of these lists must lock it for read, while updates + * to the lists must be done with a write lock. A special case is when the + * device_mutex is locked. In this case locking the lists for read access as + * the device_mutex implies it. */ static DEFINE_MUTEX(device_mutex); +static DECLARE_RWSEM(lists_rwsem); + static int ib_device_check_mandatory(struct ib_device *device) { @@ -311,7 +315,9 @@ int ib_register_device(struct ib_device *device, goto out; } + down_write(&lists_rwsem); list_add_tail(&device->core_list, &device_list); + up_write(&lists_rwsem); device->reg_state = IB_DEV_REGISTERED; @@ -347,7 +353,9 @@ void ib_unregister_device(struct ib_device *device) if (client->remove) client->remove(device); + down_write(&lists_rwsem); list_del(&device->core_list); + up_write(&lists_rwsem); kfree(device->gid_tbl_len); kfree(device->pkey_tbl_len); @@ -384,7 +392,10 @@ int ib_register_client(struct ib_client *client) mutex_lock(&device_mutex); + down_write(&lists_rwsem); list_add_tail(&client->list, &client_list); + up_write(&lists_rwsem); + list_for_each_entry(device, &device_list, core_list) if (client->add && !add_client_context(device, client)) client->add(device); @@ -423,7 +434,10 @@ void ib_unregister_client(struct ib_client *client) } spin_unlock_irqrestore(&device->client_data_lock, flags); } + + down_write(&lists_rwsem); list_del(&client->list); + up_write(&lists_rwsem); mutex_unlock(&device_mutex); } -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-17 5:50 ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran ` (2 subsequent siblings) 4 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran Expose the service ID on an incoming CM or SIDR request to the event handler. This will allow the RDMA CM module to de-multiplex connection requests based on the information encoded in the service ID. Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/cm.c | 3 +++ include/rdma/ib_cm.h | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 4f936932dd54..04cdc753b763 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1365,6 +1365,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, primary_path->packet_life_time = cm_req_get_primary_local_ack_timeout(req_msg); primary_path->packet_life_time -= (primary_path->packet_life_time > 0); + primary_path->service_id = req_msg->service_id; if (req_msg->alt_local_lid) { memset(alt_path, 0, sizeof *alt_path); @@ -1386,6 +1387,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->packet_life_time = cm_req_get_alt_local_ack_timeout(req_msg); alt_path->packet_life_time -= (alt_path->packet_life_time > 0); + alt_path->service_id = req_msg->service_id; } } @@ -3089,6 +3091,7 @@ static void cm_format_sidr_req_event(struct cm_work *work, param = &work->cm_event.param.sidr_req_rcvd; param->pkey = __be16_to_cpu(sidr_req_msg->pkey); param->listen_id = listen_id; + param->service_id = sidr_req_msg->service_id; param->port = work->port->port_num; work->cm_event.private_data = &sidr_req_msg->private_data; } diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 59f534b5f091..6655016b3855 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -223,6 +223,7 @@ struct ib_cm_apr_event_param { struct ib_cm_sidr_req_event_param { struct ib_cm_id *listen_id; + __be64 service_id; u8 port; u16 pkey; }; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-17 5:50 ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran 4 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran When receiving a connection request, rdma_cm needs to associate the request with a network namespace. To do this, it needs to know the request's destination IP. For this the module needs to allow getting this information from the private data in the request packet, instead of relying on the information already being in the listening RDMA CM ID. When creating a new incoming connection ID, the code in cma_save_ip{4,6}_info can no longer rely on the listener's private data to find the port number, so it reads it from the requested service ID. Signed-off-by: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Yotam Kenneth <yotamke-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/cma.c | 150 ++++++++++++++++++++++++++---------------- 1 file changed, 92 insertions(+), 58 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index c957d53df666..1828903e8388 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -843,97 +843,122 @@ static inline int cma_any_port(struct sockaddr *addr) return !cma_port(addr); } -static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, +static void cma_save_ib_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, struct ib_sa_path_rec *path) { - struct sockaddr_ib *listen_ib, *ib; + struct sockaddr_ib *ib; - listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr; - ib = (struct sockaddr_ib *) &id->route.addr.src_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->sgid, 16); - ib->sib_sid = listen_ib->sib_sid; - ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL); - ib->sib_scope_id = listen_ib->sib_scope_id; - - ib = (struct sockaddr_ib *) &id->route.addr.dst_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->dgid, 16); -} - -static __be16 ss_get_port(const struct sockaddr_storage *ss) -{ - if (ss->ss_family == AF_INET) - return ((struct sockaddr_in *)ss)->sin_port; - else if (ss->ss_family == AF_INET6) - return ((struct sockaddr_in6 *)ss)->sin6_port; - BUG(); + if (src_addr) { + ib = (struct sockaddr_ib *)src_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->sgid, 16); + ib->sib_sid = path->service_id; + ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL); + ib->sib_scope_id = 0; + } + if (dst_addr) { + ib = (struct sockaddr_ib *)dst_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->dgid, 16); + } } -static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip4_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct cma_hdr *hdr, + __be16 local_port) { struct sockaddr_in *ip4; - ip4 = (struct sockaddr_in *) &id->route.addr.src_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; - ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr); + if (src_addr) { + ip4 = (struct sockaddr_in *)src_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; + ip4->sin_port = local_port; + } - ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; - ip4->sin_port = hdr->port; + if (dst_addr) { + ip4 = (struct sockaddr_in *)dst_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; + ip4->sin_port = hdr->port; + } } -static void cma_save_ip6_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip6_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct cma_hdr *hdr, + __be16 local_port) { struct sockaddr_in6 *ip6; - ip6 = (struct sockaddr_in6 *) &id->route.addr.src_addr; - ip6->sin6_family = AF_INET6; - ip6->sin6_addr = hdr->dst_addr.ip6; - ip6->sin6_port = ss_get_port(&listen_id->route.addr.src_addr); + if (src_addr) { + ip6 = (struct sockaddr_in6 *)src_addr; + ip6->sin6_family = AF_INET6; + ip6->sin6_addr = hdr->dst_addr.ip6; + ip6->sin6_port = local_port; + } - ip6 = (struct sockaddr_in6 *) &id->route.addr.dst_addr; - ip6->sin6_family = AF_INET6; - ip6->sin6_addr = hdr->src_addr.ip6; - ip6->sin6_port = hdr->port; + if (dst_addr) { + ip6 = (struct sockaddr_in6 *)dst_addr; + ip6->sin6_family = AF_INET6; + ip6->sin6_addr = hdr->src_addr.ip6; + ip6->sin6_port = hdr->port; + } } -static int cma_save_net_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct ib_cm_event *ib_event) +static u16 cma_port_from_service_id(__be64 service_id) { - struct cma_hdr *hdr; + return be64_to_cpu(service_id); +} - if ((listen_id->route.addr.src_addr.ss_family == AF_IB) && - (ib_event->event == IB_CM_REQ_RECEIVED)) { - cma_save_ib_info(id, listen_id, ib_event->param.req_rcvd.primary_path); - return 0; - } +static int cma_save_ip_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct ib_cm_event *ib_event, + __be64 service_id) +{ + struct cma_hdr *hdr; + __be16 port; hdr = ib_event->private_data; if (hdr->cma_version != CMA_VERSION) return -EINVAL; + port = htons(cma_port_from_service_id(service_id)); + switch (cma_get_ip_ver(hdr)) { case 4: - cma_save_ip4_info(id, listen_id, hdr); + cma_save_ip4_info(src_addr, dst_addr, hdr, port); break; case 6: - cma_save_ip6_info(id, listen_id, hdr); + cma_save_ip6_info(src_addr, dst_addr, hdr, port); break; default: return -EINVAL; } + return 0; } +static int cma_save_net_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct ib_cm_event *ib_event, + sa_family_t sa_family, __be64 service_id) +{ + if (sa_family == AF_IB && ib_event->event == IB_CM_REQ_RECEIVED) { + cma_save_ib_info(src_addr, dst_addr, + ib_event->param.req_rcvd.primary_path); + return 0; + } + + return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); +} + static inline int cma_user_data_offset(struct rdma_id_private *id_priv) { return cma_family(id_priv) == AF_IB ? 0 : sizeof(struct cma_hdr); @@ -1184,6 +1209,9 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, struct rdma_id_private *id_priv; struct rdma_cm_id *id; struct rdma_route *rt; + const sa_family_t ss_family = listen_id->route.addr.src_addr.ss_family; + const __be64 service_id = + ib_event->param.req_rcvd.primary_path->service_id; int ret; id = rdma_create_id(listen_id->event_handler, listen_id->context, @@ -1192,7 +1220,9 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, return NULL; id_priv = container_of(id, struct rdma_id_private, id); - if (cma_save_net_info(id, listen_id, ib_event)) + if (cma_save_net_info((struct sockaddr *)&id->route.addr.src_addr, + (struct sockaddr *)&id->route.addr.dst_addr, + ib_event, ss_family, service_id)) goto err; rt = &id->route; @@ -1238,7 +1268,11 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id, return NULL; id_priv = container_of(id, struct rdma_id_private, id); - if (cma_save_net_info(id, listen_id, ib_event)) + if (cma_save_net_info((struct sockaddr *)&id->route.addr.src_addr, + (struct sockaddr *)&id->route.addr.dst_addr, + ib_event, + listen_id->route.addr.src_addr.ss_family, + ib_event->param.sidr_req_rcvd.service_id)) goto err; if (!cma_any_addr((struct sockaddr *) &id->route.addr.src_addr)) { -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> ` (2 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran 4 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran Previously RDMA CM relied on the CM module to check the incoming requests against "compare data" struct to dispatch events for different RDMA CM IDs based on the request parameters (IP address, address family, etc.). With namespace support, multiple namespaces in RDMA CM will need to share a single CM ID. Such an ID cannot be associated with a specific compare data, because that could create conflicts with other namespaces. The patch adds checks to verify that incoming requests match their RDMA CM ID destination inside the RDMA CM module itself. Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/cm.c | 5 +++-- drivers/infiniband/core/cma.c | 12 +++++++++--- include/rdma/ib_cm.h | 3 +++ 3 files changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 04cdc753b763..9bb1034c356f 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -461,8 +461,8 @@ static int cm_compare_data(struct ib_cm_compare_data *src_data, return memcmp(src, dst, sizeof(src)); } -static int cm_compare_private_data(u32 *private_data, - struct ib_cm_compare_data *dst_data) +int cm_compare_private_data(u32 *private_data, + struct ib_cm_compare_data *dst_data) { u32 src[IB_CM_COMPARE_SIZE]; @@ -472,6 +472,7 @@ static int cm_compare_private_data(u32 *private_data, cm_mask_copy(src, private_data, dst_data->mask); return memcmp(src, dst_data->data, sizeof(src)); } +EXPORT_SYMBOL(cm_compare_private_data); /* * Trivial helpers to strip endian annotation and compare; the diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 1828903e8388..dced18975e2e 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -146,6 +146,7 @@ struct rdma_id_private { u8 tos; u8 reuseaddr; u8 afonly; + struct ib_cm_compare_data compare_data; }; struct cma_multicast { @@ -1319,6 +1320,10 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) int offset, ret; listen_id = cm_id->context; + if (cm_compare_private_data(ib_event->private_data, + &listen_id->compare_data)) + return -EINVAL; + if (!cma_check_req_qp_type(&listen_id->id, ib_event)) return -EINVAL; @@ -1586,7 +1591,6 @@ out: static int cma_ib_listen(struct rdma_id_private *id_priv) { - struct ib_cm_compare_data compare_data; struct sockaddr *addr; struct ib_cm_id *id; __be64 svc_id; @@ -1603,8 +1607,10 @@ static int cma_ib_listen(struct rdma_id_private *id_priv) if (cma_any_addr(addr) && !id_priv->afonly) ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); else { - cma_set_compare_data(id_priv->id.ps, addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); + cma_set_compare_data(id_priv->id.ps, addr, + &id_priv->compare_data); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, + &id_priv->compare_data); } if (ret) { diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 6655016b3855..1f0427310611 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -343,6 +343,9 @@ struct ib_cm_compare_data { u32 mask[IB_CM_COMPARE_SIZE]; }; +int cm_compare_private_data(u32 *private_data, + struct ib_cm_compare_data *dst_data); + /** * ib_cm_listen - Initiates listening on the specified service ID for * connection and service ID resolution requests. -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> ` (3 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 4 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Add support for network namespaces from user space. This is done by passing the network namespace of the process instead of init_net. Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Yotam Kenneth <yotamke-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/ucma.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 44670ad5ba10..46b50c64933d 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -42,6 +42,7 @@ #include <linux/slab.h> #include <linux/sysctl.h> #include <linux/module.h> +#include <linux/nsproxy.h> #include <rdma/rdma_user_cm.h> #include <rdma/ib_marshall.h> @@ -391,8 +392,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf, return -ENOMEM; ctx->uid = cmd.uid; - ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps, - qp_type); + ctx->cm_id = rdma_create_id(current->nsproxy->net_ns, + ucma_event_handler, ctx, cmd.ps, qp_type); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (4 preceding siblings ...) [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-17 5:51 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran ` (3 subsequent siblings) 9 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran Keep a struct for each network namespace containing the IDRs for the RDMA CM port spaces. The struct is created dynamically using the generic_net mechanism. This patch is internal infrastructure work for the following patches. In this patch, init_net is statically used as the network namespace for the new port-space API. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> --- drivers/infiniband/core/cma.c | 142 +++++++++++++++++++++++++++++++++--------- 1 file changed, 113 insertions(+), 29 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index dced18975e2e..a4645a16c9f9 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -44,6 +44,8 @@ #include <linux/module.h> #include <net/route.h> +#include <net/net_namespace.h> +#include <net/netns/generic.h> #include <net/tcp.h> #include <net/ipv6.h> @@ -80,10 +82,37 @@ static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; -static DEFINE_IDR(tcp_ps); -static DEFINE_IDR(udp_ps); -static DEFINE_IDR(ipoib_ps); -static DEFINE_IDR(ib_ps); +static int cma_pernet_id; + +struct cma_pernet { + struct idr tcp_ps; + struct idr udp_ps; + struct idr ipoib_ps; + struct idr ib_ps; +}; + +static struct cma_pernet *cma_pernet(struct net *net) +{ + return net_generic(net, cma_pernet_id); +} + +static struct idr *cma_pernet_idr(struct net *net, enum rdma_port_space ps) +{ + struct cma_pernet *pernet = cma_pernet(net); + + switch (ps) { + case RDMA_PS_TCP: + return &pernet->tcp_ps; + case RDMA_PS_UDP: + return &pernet->udp_ps; + case RDMA_PS_IPOIB: + return &pernet->ipoib_ps; + case RDMA_PS_IB: + return &pernet->ib_ps; + default: + return NULL; + } +} struct cma_device { struct list_head list; @@ -94,11 +123,34 @@ struct cma_device { }; struct rdma_bind_list { - struct idr *ps; + enum rdma_port_space ps; struct hlist_head owners; unsigned short port; }; +static int cma_ps_alloc(struct net *net, enum rdma_port_space ps, + struct rdma_bind_list *bind_list, int snum) +{ + struct idr *idr = cma_pernet_idr(net, ps); + + return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL); +} + +static struct rdma_bind_list *cma_ps_find(struct net *net, + enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_pernet_idr(net, ps); + + return idr_find(idr, snum); +} + +static void cma_ps_remove(struct net *net, enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_pernet_idr(net, ps); + + idr_remove(idr, snum); +} + enum { CMA_OPTION_AFONLY, }; @@ -1027,7 +1079,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - idr_remove(bind_list->ps, bind_list->port); + cma_ps_remove(&init_net, bind_list->ps, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -2327,8 +2379,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list, hlist_add_head(&id_priv->node, &bind_list->owners); } -static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, - unsigned short snum) +static int cma_alloc_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv, unsigned short snum) { struct rdma_bind_list *bind_list; int ret; @@ -2337,7 +2389,7 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, if (!bind_list) return -ENOMEM; - ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL); + ret = cma_ps_alloc(&init_net, ps, bind_list, snum); if (ret < 0) goto err; @@ -2350,7 +2402,8 @@ err: return ret == -ENOSPC ? -EADDRNOTAVAIL : ret; } -static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_alloc_any_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { static unsigned int last_used_port; int low, high, remaining; @@ -2361,7 +2414,7 @@ static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) rover = prandom_u32() % remaining + low; retry: if (last_used_port != rover && - !idr_find(ps, (unsigned short) rover)) { + !cma_ps_find(&init_net, ps, (unsigned short)rover)) { int ret = cma_alloc_port(ps, id_priv, rover); /* * Remember previously used port number in order to avoid @@ -2416,7 +2469,8 @@ static int cma_check_port(struct rdma_bind_list *bind_list, return 0; } -static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_use_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { struct rdma_bind_list *bind_list; unsigned short snum; @@ -2426,7 +2480,7 @@ static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) return -EACCES; - bind_list = idr_find(ps, snum); + bind_list = cma_ps_find(&init_net, ps, snum); if (!bind_list) { ret = cma_alloc_port(ps, id_priv, snum); } else { @@ -2449,25 +2503,24 @@ static int cma_bind_listen(struct rdma_id_private *id_priv) return ret; } -static struct idr *cma_select_inet_ps(struct rdma_id_private *id_priv) +static enum rdma_port_space cma_select_inet_ps( + struct rdma_id_private *id_priv) { switch (id_priv->id.ps) { case RDMA_PS_TCP: - return &tcp_ps; case RDMA_PS_UDP: - return &udp_ps; case RDMA_PS_IPOIB: - return &ipoib_ps; case RDMA_PS_IB: - return &ib_ps; + return id_priv->id.ps; default: - return NULL; + + return 0; } } -static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv) +static enum rdma_port_space cma_select_ib_ps(struct rdma_id_private *id_priv) { - struct idr *ps = NULL; + enum rdma_port_space ps = 0; struct sockaddr_ib *sib; u64 sid_ps, mask, sid; @@ -2477,15 +2530,15 @@ static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv) if ((id_priv->id.ps == RDMA_PS_IB) && (sid == (RDMA_IB_IP_PS_IB & mask))) { sid_ps = RDMA_IB_IP_PS_IB; - ps = &ib_ps; + ps = RDMA_PS_IB; } else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_TCP)) && (sid == (RDMA_IB_IP_PS_TCP & mask))) { sid_ps = RDMA_IB_IP_PS_TCP; - ps = &tcp_ps; + ps = RDMA_PS_TCP; } else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_UDP)) && (sid == (RDMA_IB_IP_PS_UDP & mask))) { sid_ps = RDMA_IB_IP_PS_UDP; - ps = &udp_ps; + ps = RDMA_PS_UDP; } if (ps) { @@ -2498,7 +2551,7 @@ static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv) static int cma_get_port(struct rdma_id_private *id_priv) { - struct idr *ps; + enum rdma_port_space ps; int ret; if (cma_family(id_priv) != AF_IB) @@ -3647,6 +3700,35 @@ static const struct ibnl_client_cbs cma_cb_table[] = { .module = THIS_MODULE }, }; +static int cma_init_net(struct net *net) +{ + struct cma_pernet *pernet = cma_pernet(net); + + idr_init(&pernet->tcp_ps); + idr_init(&pernet->udp_ps); + idr_init(&pernet->ipoib_ps); + idr_init(&pernet->ib_ps); + + return 0; +} + +static void cma_exit_net(struct net *net) +{ + struct cma_pernet *pernet = cma_pernet(net); + + idr_destroy(&pernet->tcp_ps); + idr_destroy(&pernet->udp_ps); + idr_destroy(&pernet->ipoib_ps); + idr_destroy(&pernet->ib_ps); +} + +static struct pernet_operations cma_pernet_operations = { + .init = cma_init_net, + .exit = cma_exit_net, + .id = &cma_pernet_id, + .size = sizeof(struct cma_pernet), +}; + static int __init cma_init(void) { int ret; @@ -3655,6 +3737,10 @@ static int __init cma_init(void) if (!cma_wq) return -ENOMEM; + ret = register_pernet_subsys(&cma_pernet_operations); + if (ret) + goto err_wq; + ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); register_netdevice_notifier(&cma_nb); @@ -3672,6 +3758,7 @@ err: unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); +err_wq: destroy_workqueue(cma_wq); return ret; } @@ -3683,11 +3770,8 @@ static void __exit cma_cleanup(void) unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); + unregister_pernet_subsys(&cma_pernet_operations); destroy_workqueue(cma_wq); - idr_destroy(&tcp_ps); - idr_destroy(&udp_ps); - idr_destroy(&ipoib_ps); - idr_destroy(&ib_ps); } module_init(cma_init); -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (5 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-17 5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran ` (2 subsequent siblings) 9 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran Use ib_cm_id_create_and_listen to create listening IB CM IDs or share existing ones if needed. When given a request on a specific CM ID, the code now needs to find the namespace matching the request, and find the RDMA CM ID based on the namespace and the request parameters, instead of using the context field of ib_cm_id as was previously done. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> --- drivers/infiniband/core/cma.c | 136 +++++++++++++++++++++++++++++++++++------- 1 file changed, 115 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index a4645a16c9f9..e5d389ffa497 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1012,6 +1012,112 @@ static int cma_save_net_info(struct sockaddr *src_addr, return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); } +struct cma_req_info { + struct ib_device *device; + int port; + __be64 service_id; + u16 pkey; +}; + +static int cma_save_req_info(struct ib_cm_event *ib_event, + struct cma_req_info *req) +{ + struct ib_cm_req_event_param *req_param = &ib_event->param.req_rcvd; + struct ib_cm_sidr_req_event_param *sidr_param = + &ib_event->param.sidr_req_rcvd; + + switch (ib_event->event) { + case IB_CM_REQ_RECEIVED: + req->device = req_param->listen_id->device; + req->port = req_param->port; + req->service_id = req_param->primary_path->service_id; + req->pkey = be16_to_cpu(req_param->primary_path->pkey); + break; + case IB_CM_SIDR_REQ_RECEIVED: + req->device = sidr_param->listen_id->device; + req->port = sidr_param->port; + req->service_id = sidr_param->service_id; + req->pkey = sidr_param->pkey; + break; + default: + return -EINVAL; + } + + return 0; +} + +static struct net *cma_get_net_ns(struct ib_cm_event *ib_event, + struct cma_req_info *req) +{ + struct sockaddr_storage addr_storage; + struct sockaddr *listen_addr; + int err = 0; + + listen_addr = (struct sockaddr *)&addr_storage; + err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id); + if (err) + return ERR_PTR(err); + + return ib_get_net_ns_by_port_pkey_ip(req->device, req->port, + req->pkey, listen_addr); +} + +static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id) +{ + return (be64_to_cpu(service_id) >> 16) & 0xffff; +} + +static struct rdma_id_private *cma_find_listener( + struct rdma_bind_list *bind_list, + struct ib_cm_id *cm_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv, *id_priv_dev; + + if (!bind_list) + return ERR_PTR(-EINVAL); + + hlist_for_each_entry(id_priv, &bind_list->owners, node) { + if (!cm_compare_private_data(ib_event->private_data, + &id_priv->compare_data)) { + if (id_priv->id.device == cm_id->device) + return id_priv; + list_for_each_entry(id_priv_dev, + &id_priv->listen_list, + listen_list) { + if (id_priv_dev->id.device == cm_id->device) + return id_priv_dev; + } + } + } + + return ERR_PTR(-EINVAL); +} + +static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id, + struct ib_cm_event *ib_event) +{ + struct cma_req_info req; + struct net *net; + struct rdma_bind_list *bind_list; + struct rdma_id_private *id_priv; + int err; + + err = cma_save_req_info(ib_event, &req); + if (err) + return ERR_PTR(err); + + net = cma_get_net_ns(ib_event, &req); + if (IS_ERR(net)) + return ERR_PTR(PTR_ERR(net)); + + bind_list = cma_ps_find(net, rdma_ps_from_service_id(req.service_id), + cma_port_from_service_id(req.service_id)); + id_priv = cma_find_listener(bind_list, cm_id, ib_event); + put_net(net); + return id_priv; +} + static inline int cma_user_data_offset(struct rdma_id_private *id_priv) { return cma_family(id_priv) == AF_IB ? 0 : sizeof(struct cma_hdr); @@ -1371,10 +1477,9 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) struct rdma_cm_event event; int offset, ret; - listen_id = cm_id->context; - if (cm_compare_private_data(ib_event->private_data, - &listen_id->compare_data)) - return -EINVAL; + listen_id = cma_id_from_event(cm_id, ib_event); + if (IS_ERR(listen_id)) + return PTR_ERR(listen_id); if (!cma_check_req_qp_type(&listen_id->id, ib_event)) return -EINVAL; @@ -1648,27 +1753,16 @@ static int cma_ib_listen(struct rdma_id_private *id_priv) __be64 svc_id; int ret; - id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); - if (IS_ERR(id)) - return PTR_ERR(id); - - id_priv->cm_id.ib = id; - addr = cma_src_addr(id_priv); svc_id = rdma_get_service_id(&id_priv->id, addr); - if (cma_any_addr(addr) && !id_priv->afonly) - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); - else { + if (!cma_any_addr(addr) || id_priv->afonly) cma_set_compare_data(id_priv->id.ps, addr, &id_priv->compare_data); - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, - &id_priv->compare_data); - } - - if (ret) { - ib_destroy_cm_id(id_priv->cm_id.ib); - id_priv->cm_id.ib = NULL; - } + id = ib_cm_id_create_and_listen(id_priv->id.device, cma_req_handler, + svc_id, 0); + if (IS_ERR(id)) + return PTR_ERR(id); + id_priv->cm_id.ib = id; return ret; } -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (6 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran @ 2015-05-17 5:51 ` Haggai Eran 2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud 2015-05-26 13:34 ` Doug Ledford 9 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-17 5:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth, Haggai Eran From: Guy Shapiro <guysh@mellanox.com> Add support for network namespaces in the ib_cma module. This is accomplished by: 1. Adding network namespace parameter for rdma_create_id. This parameter is used to populate the network namespace field in rdma_id_private. rdma_create_id keeps a reference on the network namespace. 2. Using the network namespace from the rdma_id instead of init_net inside of ib_cma, when listening on an ID and when looking for an ID for an incoming request. 3. Decrementing the reference count for the appropriate network namespace when calling rdma_destroy_id. In order to preserve the current behavior init_net is passed when calling from other modules. Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> --- drivers/infiniband/core/cma.c | 42 +++++++++++++--------- drivers/infiniband/core/ucma.c | 3 +- drivers/infiniband/ulp/iser/iser_verbs.c | 2 +- drivers/infiniband/ulp/isert/ib_isert.c | 2 +- .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 4 ++- include/rdma/rdma_cm.h | 6 +++- net/9p/trans_rdma.c | 4 +-- net/rds/ib.c | 2 +- net/rds/ib_cm.c | 2 +- net/rds/iw.c | 2 +- net/rds/iw_cm.c | 2 +- net/rds/rdma_transport.c | 4 +-- net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +-- net/sunrpc/xprtrdma/verbs.c | 3 +- 14 files changed, 50 insertions(+), 32 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index e5d389ffa497..da4b48537111 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -553,7 +553,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv, return 0; } -struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, +struct rdma_cm_id *rdma_create_id(struct net *net, + rdma_cm_event_handler event_handler, void *context, enum rdma_port_space ps, enum ib_qp_type qp_type) { @@ -577,7 +578,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); - id_priv->id.route.addr.dev_addr.net = &init_net; + id_priv->id.route.addr.dev_addr.net = get_net(net); return &id_priv->id; } @@ -1178,6 +1179,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv, static void cma_release_port(struct rdma_id_private *id_priv) { struct rdma_bind_list *bind_list = id_priv->bind_list; + struct net *net = id_priv->id.route.addr.dev_addr.net; if (!bind_list) return; @@ -1185,7 +1187,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - cma_ps_remove(&init_net, bind_list->ps, bind_list->port); + cma_ps_remove(net, bind_list->ps, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -1244,6 +1246,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) cma_deref_id(id_priv->id.context); kfree(id_priv->id.route.path_rec); + put_net(id_priv->id.route.addr.dev_addr.net); kfree(id_priv); } EXPORT_SYMBOL(rdma_destroy_id); @@ -1373,7 +1376,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, ib_event->param.req_rcvd.primary_path->service_id; int ret; - id = rdma_create_id(listen_id->event_handler, listen_id->context, + id = rdma_create_id(listen_id->route.addr.dev_addr.net, + listen_id->event_handler, listen_id->context, listen_id->ps, ib_event->param.req_rcvd.qp_type); if (IS_ERR(id)) return NULL; @@ -1419,9 +1423,10 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id, { struct rdma_id_private *id_priv; struct rdma_cm_id *id; + struct net *net = listen_id->route.addr.dev_addr.net; int ret; - id = rdma_create_id(listen_id->event_handler, listen_id->context, + id = rdma_create_id(net, listen_id->event_handler, listen_id->context, listen_id->ps, IB_QPT_UD); if (IS_ERR(id)) return NULL; @@ -1676,7 +1681,8 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, return -ECONNABORTED; /* Create a new RDMA id for the new IW CM ID */ - new_cm_id = rdma_create_id(listen_id->id.event_handler, + new_cm_id = rdma_create_id(listen_id->id.route.addr.dev_addr.net, + listen_id->id.event_handler, listen_id->id.context, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(new_cm_id)) { @@ -1808,12 +1814,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv, { struct rdma_id_private *dev_id_priv; struct rdma_cm_id *id; + struct net *net = id_priv->id.route.addr.dev_addr.net; int ret; if (cma_family(id_priv) == AF_IB && !rdma_cap_ib_cm(cma_dev->device, 1)) return; - id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps, + id = rdma_create_id(net, cma_listen_handler, id_priv, id_priv->id.ps, id_priv->id.qp_type); if (IS_ERR(id)) return; @@ -2483,7 +2490,8 @@ static int cma_alloc_port(enum rdma_port_space ps, if (!bind_list) return -ENOMEM; - ret = cma_ps_alloc(&init_net, ps, bind_list, snum); + ret = cma_ps_alloc(id_priv->id.route.addr.dev_addr.net, ps, bind_list, + snum); if (ret < 0) goto err; @@ -2502,13 +2510,14 @@ static int cma_alloc_any_port(enum rdma_port_space ps, static unsigned int last_used_port; int low, high, remaining; unsigned int rover; + struct net *net = id_priv->id.route.addr.dev_addr.net; - inet_get_local_port_range(&init_net, &low, &high); + inet_get_local_port_range(net, &low, &high); remaining = (high - low) + 1; rover = prandom_u32() % remaining + low; retry: if (last_used_port != rover && - !cma_ps_find(&init_net, ps, (unsigned short)rover)) { + !cma_ps_find(net, ps, (unsigned short)rover)) { int ret = cma_alloc_port(ps, id_priv, rover); /* * Remember previously used port number in order to avoid @@ -2574,7 +2583,7 @@ static int cma_use_port(enum rdma_port_space ps, if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) return -EACCES; - bind_list = cma_ps_find(&init_net, ps, snum); + bind_list = cma_ps_find(id_priv->id.route.addr.dev_addr.net, ps, snum); if (!bind_list) { ret = cma_alloc_port(ps, id_priv, snum); } else { @@ -2766,8 +2775,11 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) if (addr->sa_family == AF_INET) id_priv->afonly = 1; #if IS_ENABLED(CONFIG_IPV6) - else if (addr->sa_family == AF_INET6) - id_priv->afonly = init_net.ipv6.sysctl.bindv6only; + else if (addr->sa_family == AF_INET6) { + struct net *net = id_priv->id.route.addr.dev_addr.net; + + id_priv->afonly = net->ipv6.sysctl.bindv6only; + } #endif } ret = cma_get_port(id_priv); @@ -3571,6 +3583,7 @@ static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id dev_addr = &id_priv->id.route.addr.dev_addr; if ((dev_addr->bound_dev_if == ndev->ifindex) && + (net_eq(dev_net(ndev), dev_addr->net)) && memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { printk(KERN_INFO "RDMA CM addr change for ndev %s used by id %p\n", ndev->name, &id_priv->id); @@ -3596,9 +3609,6 @@ static int cma_netdev_callback(struct notifier_block *self, unsigned long event, struct rdma_id_private *id_priv; int ret = NOTIFY_DONE; - if (dev_net(ndev) != &init_net) - return NOTIFY_DONE; - if (event != NETDEV_BONDING_FAILOVER) return NOTIFY_DONE; diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index d42b816c781f..44670ad5ba10 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -391,7 +391,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf, return -ENOMEM; ctx->uid = cmd.uid; - ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, cmd.ps, qp_type); + ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps, + qp_type); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index cc2dd35ffbc0..5517333738a8 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -960,7 +960,7 @@ int iser_connect(struct iser_conn *iser_conn, ib_conn->beacon.wr_id = ISER_BEACON_WRID; ib_conn->beacon.opcode = IB_WR_SEND; - ib_conn->cma_id = rdma_create_id(iser_cma_handler, + ib_conn->cma_id = rdma_create_id(&init_net, iser_cma_handler, (void *)iser_conn, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(ib_conn->cma_id)) { diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index 327529ee85eb..746d05ef2cea 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -3043,7 +3043,7 @@ isert_setup_id(struct isert_np *isert_np) sa = (struct sockaddr *)&np->np_sockaddr; isert_dbg("ksockaddr: %p, sa: %p\n", &np->np_sockaddr, sa); - id = rdma_create_id(isert_cma_handler, isert_np, + id = rdma_create_id(&init_net, isert_cma_handler, isert_np, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(id)) { isert_err("rdma_create_id() failed: %ld\n", PTR_ERR(id)); diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h index cd664d025f41..f4aab7485a84 100644 --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h @@ -125,7 +125,9 @@ extern kib_tunables_t kiblnd_tunables; IBLND_CREDIT_HIGHWATER_V1 : \ *kiblnd_tunables.kib_peercredits_hiw) /* when eagerly to return credits */ -#define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(cb, dev, ps, qpt) +#define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(&init_net, \ + cb, dev, \ + ps, qpt) static inline int kiblnd_concurrent_sends_v1(void) diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 1ed2088dc9f5..f45566f8dc1a 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -158,13 +158,17 @@ struct rdma_cm_id { /** * rdma_create_id - Create an RDMA identifier. * + * @net: The network namespace in which to create the new id. * @event_handler: User callback invoked to report events associated with the * returned rdma_id. * @context: User specified context associated with the id. * @ps: RDMA port space. * @qp_type: type of queue pair associated with the id. + * + * The id holds a reference on the network namespace until it is destroyed. */ -struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, +struct rdma_cm_id *rdma_create_id(struct net *net, + rdma_cm_event_handler event_handler, void *context, enum rdma_port_space ps, enum ib_qp_type qp_type); diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c index 3533d2a53ab6..8bc3b5cbedf1 100644 --- a/net/9p/trans_rdma.c +++ b/net/9p/trans_rdma.c @@ -660,8 +660,8 @@ rdma_create_trans(struct p9_client *client, const char *addr, char *args) return -ENOMEM; /* Create the RDMA CM ID */ - rdma->cm_id = rdma_create_id(p9_cm_event_handler, client, RDMA_PS_TCP, - IB_QPT_RC); + rdma->cm_id = rdma_create_id(&init_net, p9_cm_event_handler, client, + RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(rdma->cm_id)) goto error; diff --git a/net/rds/ib.c b/net/rds/ib.c index ba2dffeff608..c68e7b606975 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -326,7 +326,7 @@ static int rds_ib_laddr_check(__be32 addr) /* Create a CMA ID and try to bind it. This catches both * IB and iWARP capable NICs. */ - cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); + cm_id = rdma_create_id(&init_net, NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(cm_id)) return PTR_ERR(cm_id); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 31b74f5e61ad..0640f66d5d9e 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -583,7 +583,7 @@ int rds_ib_conn_connect(struct rds_connection *conn) /* XXX I wonder what affect the port space has */ /* delegate cm event handler to rdma_transport */ - ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, + ic->i_cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, conn, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(ic->i_cm_id)) { ret = PTR_ERR(ic->i_cm_id); diff --git a/net/rds/iw.c b/net/rds/iw.c index 589935661d66..230da79e62a1 100644 --- a/net/rds/iw.c +++ b/net/rds/iw.c @@ -227,7 +227,7 @@ static int rds_iw_laddr_check(__be32 addr) /* Create a CMA ID and try to bind it. This catches both * IB and iWARP capable NICs. */ - cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); + cm_id = rdma_create_id(&init_net, NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(cm_id)) return PTR_ERR(cm_id); diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c index a6c2bea9f8f9..5c18819f13fc 100644 --- a/net/rds/iw_cm.c +++ b/net/rds/iw_cm.c @@ -520,7 +520,7 @@ int rds_iw_conn_connect(struct rds_connection *conn) /* XXX I wonder what affect the port space has */ /* delegate cm event handler to rdma_transport */ - ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, + ic->i_cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, conn, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(ic->i_cm_id)) { ret = PTR_ERR(ic->i_cm_id); diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 6cd9d1deafc3..0c88f24aed85 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -159,8 +159,8 @@ static int rds_rdma_listen_init(void) struct rdma_cm_id *cm_id; int ret; - cm_id = rdma_create_id(rds_rdma_cm_event_handler, NULL, RDMA_PS_TCP, - IB_QPT_RC); + cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, NULL, + RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(cm_id)) { ret = PTR_ERR(cm_id); printk(KERN_ERR "RDS/RDMA: failed to setup listener, " diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 3df8320c6efe..6d5c61c2b381 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -704,8 +704,8 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv, if (!cma_xprt) return ERR_PTR(-ENOMEM); - listen_id = rdma_create_id(rdma_listen_handler, cma_xprt, RDMA_PS_TCP, - IB_QPT_RC); + listen_id = rdma_create_id(&init_net, rdma_listen_handler, cma_xprt, + RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(listen_id)) { ret = PTR_ERR(listen_id); dprintk("svcrdma: rdma_create_id failed = %d\n", ret); diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 4870d272e006..12d225359e8d 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -509,7 +509,8 @@ rpcrdma_create_id(struct rpcrdma_xprt *xprt, init_completion(&ia->ri_done); - id = rdma_create_id(rpcrdma_conn_upcall, xprt, RDMA_PS_TCP, IB_QPT_RC); + id = rdma_create_id(&init_net, rpcrdma_conn_upcall, xprt, RDMA_PS_TCP, + IB_QPT_RC); if (IS_ERR(id)) { rc = PTR_ERR(id); dprintk("RPC: %s: rdma_create_id() failed %i\n", -- 1.7.11.2 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (7 preceding siblings ...) 2015-05-17 5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran @ 2015-05-19 14:30 ` Yann Droneaud 2015-05-19 14:54 ` Haggai Eran 2015-05-26 13:34 ` Doug Ledford 9 siblings, 1 reply; 68+ messages in thread From: Yann Droneaud @ 2015-05-19 14:30 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth Hi, Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit : > Thanks again everyone for the review comments. I've updated the patch > set > accordingly. The main changes are in the first patch to use a read > -write > semaphore instead of an SRCU, and with the reference counting of > shared > ib_cm_ids. > Please let me know if I missed anything, or if there are other issues > with > the series. > > Regards, > Haggai > > Changes from v3: > - Patch 1 and 3: use read-write semaphore instead of an SRCU. > - Patch 5: > * Use a direct reference count instead of a kref. > * Instead of adding get/put pair for ib_cm_ids, just avoid > destroying an > id when it is still in use. > * Squashes these two patches together, since the first one became > too > short: > IB/cm: Reference count ib_cm_ids > IB/cm: API to retrieve existing listening CM IDs > - Rebase to Doug's to-be-rebased/for-4.2 branch. > > Changes from v2: > - Add patch 1 to change device_mutex to an RCU. > - Remove patch that fixed IPv4 connections to an IPv4/IPv6 listener. > - Limit namespace related changes to RDMA CM and InfiniBand only. > - Rebase on dledford/for-v4.2, with David Ahern's unaligned access > patch. > * Use Michael Wang's capability functions where needed. > - Move the struct net argument to be the first in all functions, to > match the > networking core scheme. > - Patch 2: > * Remove unwanted braces. > - Patch 4: check the return value of ib_find_cached_pkey. > - Patch 8: verify the address family before calling cm_save_ib_info. > - Patch 10: use generic_net instead of a custom radix tree for having > per > network namespace data. > - Minor changes. > > Changes from v1: > - Include patch 1 in this series. > - Rebase for v4.1. > > Changes from v0: > - Fix code review comments by Yann > - Rebase on top of linux-3.19 > > RDMA-CM uses IP based addressing and routing to setup RDMA > connections between > hosts. Currently, all of the IP interfaces and addresses used by the > RDMA-CM > must reside in the init_net namespace. This restricts the usage of > containers > with RDMA to only work with host network namespace (aka the kernel > init_net NS > instance). > > This patchset allows using network namespaces with the RDMA-CM. > > Each RDMA-CM id keeps a reference to a network namespace. > > This reference is based on the process network namespace at the time > of the > creation of the object or inherited from the listener. > > This network namespace is used to perform all IP and network related > operations. Specifically, the local device lookup, as well as the > remote GID > address resolution are done in the context of the RDMA-CM object's > namespace. > This allows outgoing connections to reach the right target, even if > the same > IP address exists in multiple network namespaces. This can happen if > each > network namespace resides on a different P_Key. > > Additionally, the network namespace is used to split the listener > service ID > table. From the user point of view, each network namespace has a > unique, > completely independent table of service IDs. This allows running > multiple > instances of a single service on the same machine, using containers. > To > implement this, multiple RDMA CM IDs, belonging to different > namespaces may > now share their CM ID. When a request on such a CM ID arrives, the > RDMA CM > module finds out the correct namespaces and looks for the RDMA CM ID > matching the request's parameters. > > The functionality introduced by this series would come into play when > the > transport is InfiniBand and IPoIB interfaces are assigned to each > namespace. > Multiple IPoIB interfaces can be created and assigned to different > RDMA-CM > capable containers, for example using pipework [1]. > > Full support for RoCE will be introduced in a later stage. > How does this play with iWarp: as iWarp HCA are aware of IP addresses / UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or will it be possible to use the iWarp HCA to access to address/port resources tied to a different namespace ? > The patches apply against Doug's tree for v4.2. > > The patchset is structured as follows: > > Patch 1 adds a read-write semaphore in addition to the device mutex > in > ib_core to allow traversing the client list without a deadlock in > Patch 3. > > Patch 2 is a relatively trivial API extension, requiring the callers > of certain ib_addr functions to provide a network namespace, as > needed. > > Patches 3 and 4 adds the ability to lookup a network namespace > according to > the IP address, device and P_Key. It finds the matching IPoIB > interfaces, and > safely takes a reference on the network namespace before returning to > the > caller. > > Patches 5-6 make necessary changes to the CM layer, to allow sharing > of a > single CM ID between multiple RDMA CM IDs. This includes adding a > reference > count to ib_cm_id structs, add an API to either create a new CM ID or > use > an existing one, and expose the service ID to ib_cm clients. > > Patches 7-8 do some preliminary refactoring to the rdma_cm module. > Patch 7 > refactors the logic that extracts the IP address from a connect > request to > allow reuse by the namespace lookup code further on. Patch 8 changes > the > way RDMA CM module creates CM IDs, to avoid relying on the > compare_data > feature of ib_cm. This feature associate a single compare_data struct > per > ib_cm_id, so it cannot be used when sharing CM IDs. > > Patches 9-12 add proper namespace support to the RDMA-CM module. This > includes adding multiple port space tables, sharing ib_cm_ids between > rdma_cm_ids, adding a network namespace parameter, and finally > retrieving > the namespace from the creating process. > Regards. -- Yann Droneaud OPTEYA ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud @ 2015-05-19 14:54 ` Haggai Eran [not found] ` <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-19 14:54 UTC (permalink / raw) To: Yann Droneaud Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 19/05/2015 17:30, Yann Droneaud wrote: >Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit : >> Full support for RoCE will be introduced in a later stage. >> > How does this play with iWarp: as iWarp HCA are aware of IP addresses / > UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or > will it be possible to use the iWarp HCA to access to address/port > resources tied to a different namespace ? At the moment the patchset doesn't add iWarp namespace support, as I wrote in the discussion about v3 [1]. I suspect that the current iWarp code lets users listen in one namespace and receive requests belonging to the init namespace, or break the namespaces in similar ways. The patchset doesn't attempt to fix this for iWarp or RoCE, only for InfiniBand. Regards, Haggai [1] http://www.spinics.net/lists/linux-rdma/msg25156.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-19 16:39 ` Parav Pandit 2015-05-19 18:01 ` Haggai Eran 2015-05-19 18:38 ` Jason Gunthorpe 0 siblings, 2 replies; 68+ messages in thread From: Parav Pandit @ 2015-05-19 16:39 UTC (permalink / raw) To: Haggai Eran Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 19, 2015 at 8:24 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On 19/05/2015 17:30, Yann Droneaud wrote: >>Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit : >>> Full support for RoCE will be introduced in a later stage. >>> >> How does this play with iWarp: as iWarp HCA are aware of IP addresses / >> UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or >> will it be possible to use the iWarp HCA to access to address/port >> resources tied to a different namespace ? > > At the moment the patchset doesn't add iWarp namespace support, as I > wrote in the discussion about v3 [1]. I suspect that the current iWarp > code lets users listen in one namespace and receive requests belonging > to the init namespace, or break the namespaces in similar ways. The > patchset doesn't attempt to fix this for iWarp or RoCE, only for InfiniBand. > This has come up during our first round of patch set for RoCEv2. However due to last set of changes in this patch set, we decided to move in for later submission. Current high design approach to support this is: A callback function registered to IP stack for IP address assignment/removal. (it could be done possibly done for GID entry update, instead of IP) For each such namespace, for device supporting RoCEv2, one UDP socket with RoCEv2 UDP port would be created and destroyed. > Regards, > Haggai > > [1] http://www.spinics.net/lists/linux-rdma/msg25156.html > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-19 16:39 ` Parav Pandit @ 2015-05-19 18:01 ` Haggai Eran [not found] ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-19 18:38 ` Jason Gunthorpe 1 sibling, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-19 18:01 UTC (permalink / raw) To: Parav Pandit Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tuesday, May 19, 2015 7:39 PM, Parav Pandit <parav.pandit-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org> wrote: > This has come up during our first round of patch set for RoCEv2. > However due to last set of changes in this patch set, we decided to > move in for later submission. > > Current high design approach to support this is: > > A callback function registered to IP stack for IP address > assignment/removal. (it could be done possibly done for GID entry > update, instead of IP) > For each such namespace, for device supporting RoCEv2, one UDP socket > with RoCEv2 UDP port would be created and destroyed. This needs to be done to take ownership of the RoCEv2 UDP port of course, but there is more to RoCE namespace support then that. You will need to de-multiplex rdma_cm_id requests from a shared ib_cm_id similarly to what is done in the InfiniBand patch-set here. Using the new RoCE GID cache should make it relatively easy. In addition, in order to support link-layer address resolution in the verbs layer as it is done today, we may need to add namespace support to the modify QP verb and the create AH verb. Haggai-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-19 18:42 ` Parav Pandit 0 siblings, 0 replies; 68+ messages in thread From: Parav Pandit @ 2015-05-19 18:42 UTC (permalink / raw) To: Haggai Eran Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 19, 2015 at 11:31 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On Tuesday, May 19, 2015 7:39 PM, Parav Pandit <parav.pandit-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org> wrote: >> This has come up during our first round of patch set for RoCEv2. >> However due to last set of changes in this patch set, we decided to >> move in for later submission. >> >> Current high design approach to support this is: >> >> A callback function registered to IP stack for IP address >> assignment/removal. (it could be done possibly done for GID entry >> update, instead of IP) >> For each such namespace, for device supporting RoCEv2, one UDP socket >> with RoCEv2 UDP port would be created and destroyed. > > This needs to be done to take ownership of the RoCEv2 UDP port of course, but there is more to RoCE namespace support then that. You will need to de-multiplex rdma_cm_id requests from a shared ib_cm_id similarly to what is done in the InfiniBand patch-set here. Using the new RoCE GID cache should make it relatively easy. In addition, in order to support link-layer address resolution in the verbs layer as it is done today, we may need to add namespace support to the modify QP verb and the create AH verb. > Yes. We talked sometime back with Liran and Matan on same topic. > Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-19 16:39 ` Parav Pandit 2015-05-19 18:01 ` Haggai Eran @ 2015-05-19 18:38 ` Jason Gunthorpe [not found] ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 18:38 UTC (permalink / raw) To: Parav Pandit Cc: Haggai Eran, Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote: > A callback function registered to IP stack for IP address > assignment/removal. (it could be done possibly done for GID entry > update, instead of IP) > For each such namespace, for device supporting RoCEv2, one UDP socket > with RoCEv2 UDP port would be created and destroyed. You need to consult netdev for whatever scheme you come up with. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-19 18:44 ` Parav Pandit [not found] ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Parav Pandit @ 2015-05-19 18:44 UTC (permalink / raw) To: Jason Gunthorpe Cc: Haggai Eran, Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, May 20, 2015 at 12:08 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote: > >> A callback function registered to IP stack for IP address >> assignment/removal. (it could be done possibly done for GID entry >> update, instead of IP) >> For each such namespace, for device supporting RoCEv2, one UDP socket >> with RoCEv2 UDP port would be created and destroyed. > > You need to consult netdev for whatever scheme you come up with. > Yes. However it won't be an issue for netdev as its going to use register_inetaddr_notifier and friend functions which are standard set of APIs by networking stack. But yes, we will have netdev included in the review. Will send out patch in sometime for review to get comments. We were just waiting for v4 patches to get in first. > Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-19 19:20 ` Jason Gunthorpe 0 siblings, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-19 19:20 UTC (permalink / raw) To: Parav Pandit Cc: Haggai Eran, Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, May 20, 2015 at 12:14:10AM +0530, Parav Pandit wrote: > On Wed, May 20, 2015 at 12:08 AM, Jason Gunthorpe > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote: > > > >> A callback function registered to IP stack for IP address > >> assignment/removal. (it could be done possibly done for GID entry > >> update, instead of IP) > >> For each such namespace, for device supporting RoCEv2, one UDP socket > >> with RoCEv2 UDP port would be created and destroyed. > > > > You need to consult netdev for whatever scheme you come up with. > > > > Yes. However it won't be an issue for netdev as its going to use > register_inetaddr_notifier and friend functions which are standard > set I was talking about your scheme to reserve a UDP port number. Make it clear, and ask specifically about that. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran ` (8 preceding siblings ...) 2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud @ 2015-05-26 13:34 ` Doug Ledford [not found] ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 9 siblings, 1 reply; 68+ messages in thread From: Doug Ledford @ 2015-05-26 13:34 UTC (permalink / raw) To: Haggai Eran Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Attachment #1: Type: text/plain, Size: 2574 bytes --] On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote: > Thanks again everyone for the review comments. I've updated the patch set > accordingly. The main changes are in the first patch to use a read-write > semaphore instead of an SRCU, and with the reference counting of shared > ib_cm_ids. > Please let me know if I missed anything, or if there are other issues with > the series. Hi Haggai, I know you are probably busy reworking this right now on the basis of Jason's comments. However, my biggest issue with this patch set right now is not technical (well, it is, but it's only partially technical). This is a core feature more than anything else. Namespaces for RDMA devices is not unique to IB or RoCE in any way. Yet no thought has been given to how this will work universally across all of the RDMA capable devices (mainly I'm talking about iWARP here...I don't think this is an issue for usNIC as if you want namespace support there, you just start the user space app in a given namespace and you are probably 90% of the way there since the user space application gets its own device and so its own MAC/IP and all of the RDMA transfers are UDP, so the application's namespace should get inherited by all the rest, but Cisco would need to confirm that, hence why I say 90% of the way there, it needs confirmed). So, while you are reworking things right now, you would ideally contact Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this. I know there won't be a lot of overlap between IB and iWARP, but last time you were asked you didn't even know if this setup could be extended to iWARP. For this next statement, I know I'm directing this to you Haggai, but please don't take it that way. I'm really using your patch set to make a broader point to everyone on the list. When I look at patches for support for a given feature, one of the things I'm going to look at is whether or not that feature is specific to a given hardware type, or if it's a generic feature. If it's a generic feature, then I'm going to want to know that the person submitting it has designed it well. A pre-requisite of designing a generic feature well is that it considers all hardware types, not just your specific hardware type. So when you come back with the next version of this patch set, please have an answer for how it should work on each hardware type even if you don't have implementation patches for each hardware type. -- Doug Ledford <dledford@redhat.com> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-26 16:59 ` Jason Gunthorpe 2015-05-26 17:46 ` Doug Ledford 2015-05-28 13:15 ` Haggai Eran 2015-05-26 17:55 ` Christian Benvenuti (benve) 2015-05-28 13:07 ` Haggai Eran 2 siblings, 2 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-26 16:59 UTC (permalink / raw) To: Doug Ledford Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote: > This is a core feature more than anything else. Namespaces for RDMA > devices is not unique to IB or RoCE in any way. Yet no thought has been > given to how this will work universally across all of the RDMA > capable I think if Haggi is able to follow the perscription I gave then things will be general. - All rdma cm ids are associated with a netdev - The output flow uses that netdev to restrict, configure and determine the output RDMA device QP - The input flow locates the netdev as step one and then uses the (netdev,ip,port) tuple to find the rdma listener, which is in turn tied to a netdev and is restricted/configured by it. The technology specific part is the two maps: from (input device,packet) to netdev, from netdev to (output device,packet) After the above clean up is done, namespace enabling is basically providing those two mapping functions for each technology in a way that can locate delegatable netdevs. The trivial case for all the ethernet techs is to provide the above maps that can take the (input device,VLAN) and locate the correct child VLAN specific netdev. The existing code to support VLAN should pretty much immediately enable basic namespace support for all the ethernet families. The big open question for ethernet is how to work without relying on VLAN to create delgated netdevs - typically one would use a bridge and veth's, which do not seem very RDMA compatible. But that doesn't need to be answered right now. Remember, this isn't RDMA namespaces, this is netdev namespace support for RDMA-CM -> very different things. Basically, I'm happy with the generality story, if the clean up work I outlined turns out.. > issue for usNIC as if you want namespace support there, you just start > the user space app in a given namespace and you are probably 90% of > the usNIC has no kernel facing functionality, and no interaction with RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :( Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-26 16:59 ` Jason Gunthorpe @ 2015-05-26 17:46 ` Doug Ledford [not found] ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-28 13:15 ` Haggai Eran 1 sibling, 1 reply; 68+ messages in thread From: Doug Ledford @ 2015-05-26 17:46 UTC (permalink / raw) To: Jason Gunthorpe Cc: Haggai Eran, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Attachment #1: Type: text/plain, Size: 2852 bytes --] On Tue, 2015-05-26 at 10:59 -0600, Jason Gunthorpe wrote: > On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote: > > > This is a core feature more than anything else. Namespaces for RDMA > > devices is not unique to IB or RoCE in any way. Yet no thought has been > > given to how this will work universally across all of the RDMA > > capable > > I think if Haggi is able to follow the perscription I gave then things > will be general. > > - All rdma cm ids are associated with a netdev > - The output flow uses that netdev to restrict, configure and > determine the output RDMA device QP > - The input flow locates the netdev as step one and then uses the > (netdev,ip,port) tuple to find the rdma listener, which is in turn > tied to a netdev and is restricted/configured by it. > > The technology specific part is the two maps: from (input > device,packet) to netdev, from netdev to (output device,packet) > > After the above clean up is done, namespace enabling is basically > providing those two mapping functions for each technology in a way > that can locate delegatable netdevs. > > The trivial case for all the ethernet techs is to provide the above > maps that can take the (input device,VLAN) and locate the correct > child VLAN specific netdev. The existing code to support VLAN should > pretty much immediately enable basic namespace support for all the > ethernet families. > > The big open question for ethernet is how to work without relying on > VLAN to create delgated netdevs - typically one would use a bridge and > veth's, which do not seem very RDMA compatible. But that doesn't need > to be answered right now. > > Remember, this isn't RDMA namespaces, this is netdev namespace support > for RDMA-CM -> very different things. That was the point of my email. This is a very myopic view of the feature. It *should* at least have an idea of these other things too. > Basically, I'm happy with the generality story, if the clean up work I > outlined turns out.. > > > issue for usNIC as if you want namespace support there, you just start > > the user space app in a given namespace and you are probably 90% of > > the > > usNIC has no kernel facing functionality, and no interaction with > RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :( Whether usNIC has a kernel facing functionality or not is irrelevant. This feature isn't kernel only, it effects user space applications launched in a namespace too. And, again, my point was that this discussion is about RDMA-CM and it should be broader (even if the implementation isn't broader). Due to the implementation of usNIC I suspect it would "just work", but it would be better to know so. -- Doug Ledford <dledford@redhat.com> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-26 18:47 ` Jason Gunthorpe 2015-05-28 13:22 ` Haggai Eran 1 sibling, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-26 18:47 UTC (permalink / raw) To: Doug Ledford Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Tue, May 26, 2015 at 01:46:36PM -0400, Doug Ledford wrote: > > Remember, this isn't RDMA namespaces, this is netdev namespace support > > for RDMA-CM -> very different things. > > That was the point of my email. This is a very myopic view of the > feature. It *should* at least have an idea of these other things too. Everything you talked about seems covered: iwarp/roce/ib now have a fairly clear uniform story for CM. usNIC doesn't use core code. I doubt a larger discussion about a 'rdma namespace' is going to substantially change these patches, they are really netdev focused. Anyhow, I've been saving that discussion for when the roce and umad/uverbs namespace stuff is re-sent. It seems more appropriate at that point. I don't know about you, but I am exhausted looking at these huge patch sets, and narrowing the focus is the only way I see to get through. This series has hopefully narrowed to: 'fix the flow in netdev handling for rdma-cm'. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-26 18:47 ` Jason Gunthorpe @ 2015-05-28 13:22 ` Haggai Eran 2015-05-28 15:46 ` Jason Gunthorpe 1 sibling, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-28 13:22 UTC (permalink / raw) To: Doug Ledford, Jason Gunthorpe Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 26/05/2015 20:46, Doug Ledford wrote: >> Remember, this isn't RDMA namespaces, this is netdev namespace support >> > for RDMA-CM -> very different things. > That was the point of my email. This is a very myopic view of the > feature. It *should* at least have an idea of these other things too. We did give some thought to the question of whether an RDMA namespace is needed, and concluded that it isn't. RDMA resources such as QP numbers, memory keys, etc. are allocated by the devices. So different containers wouldn't care if they share the "QP number namespace", etc. RDMA CM ports are different because they are chosen by the applications, but they map directly to the network namespace, so they don't require their own namespace. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-28 13:22 ` Haggai Eran @ 2015-05-28 15:46 ` Jason Gunthorpe [not found] ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-28 15:46 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote: > wouldn't care if they share the "QP number namespace", etc. RDMA CM > ports are different because they are chosen by the applications, but > they map directly to the network namespace, so they don't require their > own namespace. Different containers should have restricted access to the PKey and GID tables, and the presence device itself. Just like in the SRIOV case. That is what the 'RDMA Namespace' would control. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-06-03 10:07 ` Haggai Eran 0 siblings, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-06-03 10:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 28/05/2015 18:46, Jason Gunthorpe wrote: > On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote: >> wouldn't care if they share the "QP number namespace", etc. RDMA CM >> ports are different because they are chosen by the applications, but >> they map directly to the network namespace, so they don't require their >> own namespace. > > Different containers should have restricted access to the PKey and GID > tables, and the presence device itself. Just like in the SRIOV > case. > > That is what the 'RDMA Namespace' would control. We were thinking here that there is a room for an RDMA cgroup. It would limit the amount of RDMA resources a container can use. It can also be used for the restrictions you mentioned, but maybe they are more suitable for a namespace. I'm not sure. In RoCE for instance, a restricted access to the GID table can be derived from the network namespace directly, but perhaps not in InfiniBand. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-05-26 16:59 ` Jason Gunthorpe 2015-05-26 17:46 ` Doug Ledford @ 2015-05-28 13:15 ` Haggai Eran 1 sibling, 0 replies; 68+ messages in thread From: Haggai Eran @ 2015-05-28 13:15 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 26/05/2015 19:59, Jason Gunthorpe wrote: > The big open question for ethernet is how to work without relying on > VLAN to create delgated netdevs - typically one would use a bridge and > veth's, which do not seem very RDMA compatible. But that doesn't need > to be answered right now. I think in Ethernet the first step would be to support macvlan devices. Like IPoIB child devices, they are directly attached to an RDMA device, so they don't require handling a complex virtual bridging topology as veths do. ^ permalink raw reply [flat|nested] 68+ messages in thread
* RE: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-26 16:59 ` Jason Gunthorpe @ 2015-05-26 17:55 ` Christian Benvenuti (benve) 2015-05-28 13:07 ` Haggai Eran 2 siblings, 0 replies; 68+ messages in thread From: Christian Benvenuti (benve) @ 2015-05-26 17:55 UTC (permalink / raw) To: Doug Ledford, Haggai Eran Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1063 bytes --] > -----Original Message----- > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On > Behalf Of Doug Ledford > Sent: Tuesday, May 26, 2015 6:35 AM > To: Haggai Eran > Cc: linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Liran Liss; Guy > Shapiro; Shachar Raindel; Yotam Kenneth > Subject: Re: [PATCH v4 for-next 00/12] Add network namespace support in the > RDMA-CM ... > I don't think this is an issue for usNIC as if you > want namespace support there, you just start the user space app in a given > namespace and you are probably 90% of the way there since the user space > application gets its own device and so its own MAC/IP and all of the RDMA > transfers are UDP, so the application's namespace should get inherited by all > the rest, but Cisco would need to confirm that, hence why I say 90% of the way > there, it needs confirmed). This is correct. Thanks /Chris N§²æìr¸yúèØb²X¬¶Ç§vØ^)Þº{.nÇ+·¥{±Ù{ayº\x1dÊÚë,j\a¢f£¢·h»öì\x17/oSc¾Ú³9uÀ¦æåÈ&jw¨®\x03(éÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þàþf£¢·h§~m ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-26 16:59 ` Jason Gunthorpe 2015-05-26 17:55 ` Christian Benvenuti (benve) @ 2015-05-28 13:07 ` Haggai Eran [not found] ` <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-05-28 13:07 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 26/05/2015 16:34, Doug Ledford wrote: > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote: >> Thanks again everyone for the review comments. I've updated the patch set >> accordingly. The main changes are in the first patch to use a read-write >> semaphore instead of an SRCU, and with the reference counting of shared >> ib_cm_ids. >> Please let me know if I missed anything, or if there are other issues with >> the series. > > Hi Haggai, > > I know you are probably busy reworking this right now on the basis of > Jason's comments. However, my biggest issue with this patch set right > now is not technical (well, it is, but it's only partially technical). Hi, I'm sorry about the late reply. We had a holiday here, and then some other tasks took precedence. I've only got back to working on this today. > > This is a core feature more than anything else. Namespaces for RDMA > devices is not unique to IB or RoCE in any way. Yet no thought has been > given to how this will work universally across all of the RDMA capable > devices (mainly I'm talking about iWARP here... I don't agree. It is true we have are not planning to provide an iWarp implementation for network namespaces, as we lack the capacity and the expertise. However, I think that the changes we proposed to the rdma_cm module will work with iWarp too. Perhaps with some of Jason's suggestions it will be smoother, but even in the current design, I think that if iWarp drivers can provide iw_cm with the network device on which a request is received, then it should be simple to modify it for namespace support without significant change to rdma_cm. > I don't think this is an > issue for usNIC as if you want namespace support there, you just start > the user space app in a given namespace and you are probably 90% of the > way there since the user space application gets its own device and so > its own MAC/IP and all of the RDMA transfers are UDP, so the > application's namespace should get inherited by all the rest, but Cisco > would need to confirm that, hence why I say 90% of the way there, it > needs confirmed). > > So, while you are reworking things right now, you would ideally contact > Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this. > I know there won't be a lot of overlap between IB and iWARP, but last > time you were asked you didn't even know if this setup could be extended > to iWARP. > > For this next statement, I know I'm directing this to you Haggai, but > please don't take it that way. I'm really using your patch set to make > a broader point to everyone on the list. > > When I look at patches for support for a given feature, one of the > things I'm going to look at is whether or not that feature is specific > to a given hardware type, or if it's a generic feature. If it's a > generic feature, then I'm going to want to know that the person > submitting it has designed it well. A pre-requisite of designing a > generic feature well is that it considers all hardware types, not just > your specific hardware type. So when you come back with the next > version of this patch set, please have an answer for how it should work > on each hardware type even if you don't have implementation patches for > each hardware type. Well, because the RDMA subsystem supports a very diverse set of devices, I think there are few people who know the details of all hardware types well. If we are going to evolve the generic parts of the stack, we have to cooperate. We have to rely on the knowledge of people on the mailing list to say whether the feature is well designed for all hardware types, or whether changes are warranted. In this specific case, the patches has been on the list since February. I think it is enough time to allow anyone who is interested in network namespace support to chime in. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-28 14:07 ` Doug Ledford [not found] ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Doug Ledford @ 2015-05-28 14:07 UTC (permalink / raw) To: Haggai Eran Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Attachment #1: Type: text/plain, Size: 2581 bytes --] On Thu, 2015-05-28 at 16:07 +0300, Haggai Eran wrote: > On 26/05/2015 16:34, Doug Ledford wrote: > > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote: > > This is a core feature more than anything else. Namespaces for RDMA > > devices is not unique to IB or RoCE in any way. Yet no thought has been > > given to how this will work universally across all of the RDMA capable > > devices (mainly I'm talking about iWARP here... > I don't agree. It is true we have are not planning to provide an iWarp > implementation for network namespaces, as we lack the capacity and the > expertise. However, I think that the changes we proposed to the rdma_cm > module will work with iWarp too. Perhaps with some of Jason's > suggestions it will be smoother, but even in the current design, I think > that if iWarp drivers can provide iw_cm with the network device on which > a request is received, then it should be simple to modify it for > namespace support without significant change to rdma_cm. My request wasn't for a functional implementation, just a statement that you had in fact thought about it and, as you say here, would expect it to work (and preferably why as well). > Well, because the RDMA subsystem supports a very diverse set of devices, > I think there are few people who know the details of all hardware types > well. If we are going to evolve the generic parts of the stack, we have > to cooperate. We have to rely on the knowledge of people on the mailing > list to say whether the feature is well designed for all hardware types, > or whether changes are warranted. In this specific case, the patches has > been on the list since February. I think it is enough time to allow > anyone who is interested in network namespace support to chime in. You would think that, but sometimes important information comes from totally different places. See mine and Jason's comments back and forth in the SRIOV thread started by Or. Long story short: ip link add dev ib0 name ib0.1 type ipoib is totally broken on at least all Red Hat OSes. It will require reworking of the network scripts and NetworkManager assumptions to make it work. It will also break DHCP on the interface as pkey/guid are the only items that uniquely identify DHCP clients. The net result of our talks was that it is likely that each interface on the same pkey will require an alias GUID per child interface in order to keep things workable. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-28 16:21 ` Or Gerlitz [not found] ` <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Or Gerlitz @ 2015-05-28 16:21 UTC (permalink / raw) To: Doug Ledford Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 5/28/2015 5:07 PM, Doug Ledford wrote: > You would think that, but sometimes important information comes from > totally different places. See mine and Jason's comments back and forth > in the SRIOV thread started by Or. > > Long story short: > > ip link add dev ib0 name ib0.1 type ipoib > > is totally broken on at least all Red Hat OSes. It will require > reworking of the network scripts and NetworkManager assumptions to make > it work. It will also break DHCP on the interface as pkey/guid are the > only items that uniquely identify DHCP clients. The net result of our > talks was that it is likely that each interface on the same pkey will > require an alias GUID per child interface in order to keep things workable. > Doug, Just to make sure we're on the same page, you're saying that the IPoIB DHCP scheme (client + server) used on RH product uses Client-ID which is eight byte long or 20 byte long the four upper bytes masked out (which of them?) and hence is broken when multiple entities use the same ID. Anything else except for that (you said "reworking of the network scripts and NetworkManager assumptions to make it work")?? OTOH we realized that the implementation for same PKEY IPoIB childs which exist for a while is broken with the RH DHCP scheme and should be enhanced. OTOH these childs can serve as nice building blocks for IPoIB containers or virtio-IPoIB scheme. Note that out of the eleven patches that make the series, only ONE relates directly to IPoIB, the rest are either applicable to all the transport supported by the RDMA stack, or to IPoIB + RoCE. Under some assumptions and changes people can test it with DHCP scheme different from RH or with non-DHCP based IP address assignment scheme. So we have a very nice effort and work done by developers, to bring RDMA into containers, accompanied by reviewers providing lots of their brain power to make it robust. I don't see why we should stop the whole RDMA containers support train just b/c we found out the IPoIB DHCP bug which was there for few years before this effort started. How about let this series to go after the rest of the reviewers comments are addressed, s.t under IPoIB it will work on small set of environments, while with macvlan based RoCE support to be introduced later it will work on wider set of environments. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-28 17:43 ` Jason Gunthorpe [not found] ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-05-28 17:43 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote: > Anything else except for that (you said "reworking of the network scripts > and NetworkManager assumptions to make it work")?? IPv6 becomes very broken, child interfaces will generate the same IPv6 addreses for radv and link local resulting in duplicate address scenarios. About the only thing that will work properly is statically assigned IPv4 addresses. > I don't see why we should stop the whole RDMA containers support train just > b/c we found out the IPoIB DHCP bug which was there for few years before > this effort started. I don't think that is what Doug said. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-28 18:22 ` Doug Ledford [not found] ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Doug Ledford @ 2015-05-28 18:22 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Attachment #1: Type: text/plain, Size: 1357 bytes --] On Thu, 2015-05-28 at 11:43 -0600, Jason Gunthorpe wrote: > On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote: > > > Anything else except for that (you said "reworking of the network scripts > > and NetworkManager assumptions to make it work")?? > > IPv6 becomes very broken, child interfaces will generate the same IPv6 > addreses for radv and link local resulting in duplicate address > scenarios. > > About the only thing that will work properly is statically assigned > IPv4 addresses. > > > I don't see why we should stop the whole RDMA containers support train just > > b/c we found out the IPoIB DHCP bug which was there for few years before > > this effort started. > > I don't think that is what Doug said. Indeed. There is no need to scrap things, but if the design as it stands, and the intended means of creating objects for use in containers, is going to result in an unworkable network, then we have to re-evaluate how the container constructs are created, and that then has possible consequences for how we would get from an incoming packet to the proper container. I'm not trying to stop the "support train" here, but at the same time, if the train is headed for a bridge that's out.... -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-28 19:05 ` Or Gerlitz [not found] ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Or Gerlitz @ 2015-05-28 19:05 UTC (permalink / raw) To: Doug Ledford Cc: Jason Gunthorpe, Or Gerlitz, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, May 28, 2015 at 9:22 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> I don't think that is what Doug said. > Indeed. There is no need to scrap things, but if the design as it > stands, and the intended means of creating objects for use in > containers, is going to result in an unworkable network, then we have to > re-evaluate how the container constructs are created, and that then has > possible consequences for how we would get from an incoming packet to > the proper container. To be precise, do we agree that the issue here isn't "in the design as it stands" but rather in a problem we found in the intended way of assigning IP addresses through DHCP for the containers? > I'm not trying to stop the "support train" here, but at the same time, > if the train is headed for a bridge that's out.... So what's your concrete saying here? where should we go from here? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-28 21:55 ` Doug Ledford [not found] ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Doug Ledford @ 2015-05-28 21:55 UTC (permalink / raw) To: Or Gerlitz Cc: Jason Gunthorpe, Or Gerlitz, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth [-- Attachment #1: Type: text/plain, Size: 2871 bytes --] On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote: > On Thu, May 28, 2015 at 9:22 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > >> I don't think that is what Doug said. > > > Indeed. There is no need to scrap things, but if the design as it > > stands, and the intended means of creating objects for use in > > containers, is going to result in an unworkable network, then we have to > > re-evaluate how the container constructs are created, and that then has > > possible consequences for how we would get from an incoming packet to > > the proper container. > > To be precise, do we agree that the issue here isn't "in the design as > it stands" but rather in a problem we found in the intended way of > assigning IP addresses through DHCP for the containers? No, I would say the problem *is* in the design. But the problem is the selected means of identifying the netdev to get to the namespace (and the proposed means of creating non-default namespace devices to exist in the container), not the namespace design itself. > > I'm not trying to stop the "support train" here, but at the same time, > > if the train is headed for a bridge that's out.... > > So what's your concrete saying here? where should we go from here? This excerpt is from the commit log of patch 3/12: The IB device and port, together with the P_Key and the IP address should be enough to uniquely identify the ULP net device. The problem here is that this is wrong. If we allow more than one device per pkey with the same GUID, then DHCP breaks, which is bad in and of itself, but it also breaks ipv6 link local addressing. Which means that this hunk in patch 4/12: +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) + return true; + + break; +#endif can now be tricked into returning true for incorrect devices. Where do we go from here? First, I'm inclined to say we should modify the add_child portion of IPoIB to refuse to add links to a PKey if that GUID is already present on that PKey. You could then use different PKeys on the default GUID for separate namespaces. If you need separate namespaces on the same PKey, then enable alias GUIDs for use on the local adapter and require one GUID per namespace on the same PKey. Then I'm inclined to say that we should map for namespaces using device, port, guid/gid, pkey. And in this situation, since a unique guid/gid on any given pkey maps to a unique dhcp identifier and a unique ipv6 lladdr, this becomes freely interchangeable with device, port, pkey, address mappings that this patchset was built around. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-06-03 10:03 ` Haggai Eran 2015-06-03 16:14 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-06-03 10:03 UTC (permalink / raw) To: Doug Ledford, Or Gerlitz Cc: Jason Gunthorpe, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 29/05/2015 00:55, Doug Ledford wrote: > On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote: >> So what's your concrete saying here? where should we go from here? > > This excerpt is from the commit log of patch 3/12: > > The IB device and port, together with the P_Key and the IP address should > be enough to uniquely identify the ULP net device. > > The problem here is that this is wrong. If we allow more than one > device per pkey with the same GUID, then DHCP breaks, which is bad in > and of itself, but it also breaks ipv6 link local addressing. Which > means that this hunk in patch 4/12: > > +#if IS_ENABLED(CONFIG_IPV6) > + case AF_INET6: > + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) > + return true; > + > + break; > +#endif > > can now be tricked into returning true for incorrect devices. > > Where do we go from here? > > First, I'm inclined to say we should modify the add_child portion of > IPoIB to refuse to add links to a PKey if that GUID is already present > on that PKey. You could then use different PKeys on the default GUID > for separate namespaces. If you need separate namespaces on the same > PKey, then enable alias GUIDs for use on the local adapter and require > one GUID per namespace on the same PKey. I don't think blocking the current add_child implementation is needed. I agree IPv6 SLAAC and DHCP currently don't work well, and adding alias GUID for child interfaces is important, but the current implementation can be used with static IPv4 addresses, so I don't think it must be disabled. > Then I'm inclined to say that we should map for namespaces using device, > port, guid/gid, pkey. And in this situation, since a unique guid/gid on > any given pkey maps to a unique dhcp identifier and a unique ipv6 > lladdr, this becomes freely interchangeable with device, port, pkey, > address mappings that this patchset was built around. What if we change the namespaces patches to map (device, port, GID, P_Key, IP) to netdev / namespace? That is, to use both the GID and the IP address. This would allow people to use namespaces with the current implementation (provided they have a valid configuration with no conflicting IP addresses), and once alias GUIDs are added, the GUIDs will be used to uniquely resolve the namespace even with such misconfigurations. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-06-03 10:03 ` Haggai Eran @ 2015-06-03 16:14 ` Jason Gunthorpe [not found] ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-03 16:14 UTC (permalink / raw) To: Haggai Eran Cc: Doug Ledford, Or Gerlitz, Or Gerlitz, linux-rdma@vger.kernel.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote: > > Then I'm inclined to say that we should map for namespaces using device, > > port, guid/gid, pkey. And in this situation, since a unique guid/gid on > > any given pkey maps to a unique dhcp identifier and a unique ipv6 > > lladdr, this becomes freely interchangeable with device, port, pkey, > > address mappings that this patchset was built around. > > What if we change the namespaces patches to map (device, port, GID, > P_Key, IP) to netdev / namespace? That is, to use both the GID and the > IP address. As I keep saying, you are not supposed to use the IP address as a key to find the netdev, that is the wrong way to use the Linux netdev model. Requiring unique GID/PKey allows the implementation to avoid this wrongness, which would be simplifying and more correct. That is the appeal to blocking this scenario when children are created. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-06-03 19:05 ` Or Gerlitz 2015-06-03 19:53 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Or Gerlitz @ 2015-06-03 19:05 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 3, 2015 at 7:14 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote: >> > Then I'm inclined to say that we should map for namespaces using device, >> > port, guid/gid, pkey. And in this situation, since a unique guid/gid on >> > any given pkey maps to a unique dhcp identifier and a unique ipv6 >> > lladdr, this becomes freely interchangeable with device, port, pkey, >> > address mappings that this patchset was built around. >> >> What if we change the namespaces patches to map (device, port, GID, >> P_Key, IP) to netdev / namespace? That is, to use both the GID and the >> IP address. > > As I keep saying, you are not supposed to use the IP address as a key > to find the netdev, that is the wrong way to use the Linux netdev > model. > > Requiring unique GID/PKey allows the implementation to avoid this > wrongness, which would be simplifying and more correct. > > That is the appeal to blocking this scenario when children are created. Jason, The IPoIB RTNL childs were added around release 3.6/7 of the upstream kernel and are part of the kernel UAPI. They are perfectly used in bunch of schemes: 1. when static IP address assignment is used 2. under PV scheme, when the guest has para-virtual Eth NIC and the host does routing between the back-end (e.g tap or alike) and the IPoIB child. Or when the host does tunneling (vxlan) and alike and sends down the encapsulated packet through a host IP address assigned to the IPoIB child 3. etc few more Indeed the DHCP story isn't working there and to get DHCP work something has to be done. But this issue can't serve for blocking the existing UAPI and introduce regression to working systems. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-06-03 19:05 ` Or Gerlitz @ 2015-06-03 19:53 ` Jason Gunthorpe [not found] ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-03 19:53 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, Haggai Eran, Or Gerlitz, linux-rdma@vger.kernel.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote: > Indeed the DHCP story isn't working there and to get DHCP work > something has to be done. But this issue can't serve for blocking the > existing UAPI and introduce regression to working systems. It is not DHCP that concerns me, it is the fact we can't combine net namespaces, RDMA-CM and duplicate GUID IPoIB children together without adding hacks to the kernel. Searching netdevs by IP is a hack. I'm mostly fine with it as an optional capability, similar to macvlan, I just don't see how to cleanly integrate it with RDMA CM and namespaces. And I don't see what RDMA CM is supposed to do when it hits this case. So, any ideas that don't involve the searching for IP hack?? [And yes, as discussed with Haggie, it is not the worst hack in the world, and maybe we can live with it, but lets understand the trade offs carefully] Also, now that this has been brought up, I think you need to make a patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to modify addrconf_ifid_infiniband to return error if the IPoIB child is sharing a guid. It was not good at all to push the child patches forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-06-03 20:07 ` Or Gerlitz [not found] ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Or Gerlitz @ 2015-06-03 20:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 3, 2015 at 10:53 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote: > >> Indeed the DHCP story isn't working there and to get DHCP work >> something has to be done. But this issue can't serve for blocking the >> existing UAPI and introduce regression to working systems. > > It is not DHCP that concerns me, it is the fact we can't combine net > namespaces, RDMA-CM and duplicate GUID IPoIB children together without > adding hacks to the kernel. Searching netdevs by IP is a hack. > > I'm mostly fine with it as an optional capability, similar to macvlan, > I just don't see how to cleanly integrate it with RDMA CM and > namespaces. And I don't see what RDMA CM is supposed to do when > it hits this case. > > So, any ideas that don't involve the searching for IP hack?? > > [And yes, as discussed with Haggie, it is not the worst hack in the > world, and maybe we can live with it, but lets understand the trade > offs carefully] As Haggai wrote, if we let the using IP address thing to fly up, we have support for RDMA in containers using the RDMA-CM at IPoIB environments. This will let people test, use, experiment, fix, interact (and even production-it when static IP address assignment scheme is used). Later, usage of alias GUIDs for IPoIB RTNL childs would allow to remove the IP thing. Later, the next stage/s in Matan's work on the RoCE GID table would allow to support MACVLAN and hence RoCE too. This is how the Linux kernel being evolved since the 2.5 failure to come up with giant releases -- doing things in relativity small steps. > Also, now that this has been brought up, I think you need to make a > patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to > modify addrconf_ifid_infiniband to return error if the IPoIB child is > sharing a guid. It was not good at all to push the child patches > forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them. Till the alias GUID thing is introduced, maybe we can patch addrconf_ifid_infiniband to use the QPN value from the device HW address to come up with unique IPv6 link local address, agree? where you think we can place the 24 bits QPN? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-06-03 21:45 ` Jason Gunthorpe 2015-06-04 9:41 ` Haggai Eran 2015-06-03 23:48 ` Jason Gunthorpe 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-03 21:45 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: > As Haggai wrote, if we let the using IP address thing to fly up, we have > support for RDMA in containers using the RDMA-CM at IPoIB environments. > This will let people test, use, experiment, fix, interact (and even > production-it when static IP address assignment scheme is used). Sure, I think we all understand the goal, and you've explained some reasonable use cases for the child support. > Later, usage of alias GUIDs for IPoIB RTNL childs would allow to > remove the IP thing. How do we remove it? Along with same-guid child support? What is your idea here? > > Also, now that this has been brought up, I think you need to make a > > patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to > > modify addrconf_ifid_infiniband to return error if the IPoIB child is > > sharing a guid. It was not good at all to push the child patches > > forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them. > > Till the alias GUID thing is introduced, maybe we can patch > addrconf_ifid_infiniband to use the QPN value from the device HW > address to come up with unique IPv6 link local address, agree? where > you think we can place the 24 bits QPN? I don't know if that is a good idea, an unstable SLAAC is not in spirit with the RFCs. The safest bet is to return error and disable SLAAC completely. But I'm just guessing here - I'm only feel strongly that something should be done to address this issue in the existing kernel. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-06-03 21:45 ` Jason Gunthorpe @ 2015-06-04 9:41 ` Haggai Eran 2015-06-04 16:06 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-06-04 9:41 UTC (permalink / raw) To: Jason Gunthorpe, Or Gerlitz Cc: Doug Ledford, Or Gerlitz, linux-rdma@vger.kernel.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 04/06/2015 00:45, Jason Gunthorpe wrote: > On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: >> As Haggai wrote, if we let the using IP address thing to fly up, we have >> support for RDMA in containers using the RDMA-CM at IPoIB environments. >> This will let people test, use, experiment, fix, interact (and even >> production-it when static IP address assignment scheme is used). > > Sure, I think we all understand the goal, and you've explained some > reasonable use cases for the child support. > >> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to >> remove the IP thing. > > How do we remove it? Along with same-guid child support? What is your > idea here? > >>> Also, now that this has been brought up, I think you need to make a >>> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to >>> modify addrconf_ifid_infiniband to return error if the IPoIB child is >>> sharing a guid. It was not good at all to push the child patches >>> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them. >> >> Till the alias GUID thing is introduced, maybe we can patch >> addrconf_ifid_infiniband to use the QPN value from the device HW >> address to come up with unique IPv6 link local address, agree? where >> you think we can place the 24 bits QPN? > > I don't know if that is a good idea, an unstable SLAAC is not in > spirit with the RFCs. The safest bet is to return error and disable > SLAAC completely. Maybe this is a silly question, but doesn't DAD already disable SLAAC addresses when there's a conflict? Haggai ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-06-04 9:41 ` Haggai Eran @ 2015-06-04 16:06 ` Jason Gunthorpe 0 siblings, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-04 16:06 UTC (permalink / raw) To: Haggai Eran Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma@vger.kernel.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, Jun 04, 2015 at 12:41:33PM +0300, Haggai Eran wrote: > On 04/06/2015 00:45, Jason Gunthorpe wrote: > > I don't know if that is a good idea, an unstable SLAAC is not in > > spirit with the RFCs. The safest bet is to return error and disable > > SLAAC completely. > Maybe this is a silly question, but doesn't DAD already disable SLAAC > addresses when there's a conflict? Yes, DAD should certainly trigger and disable the child, but the kernel should not rely on DAD for correctness, it is a safety net, and it isn't guarenteed 100% reliable. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-06-03 21:45 ` Jason Gunthorpe @ 2015-06-03 23:48 ` Jason Gunthorpe [not found] ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 1 sibling, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-03 23:48 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: > > I'm mostly fine with it as an optional capability, similar to macvlan, > > I just don't see how to cleanly integrate it with RDMA CM and > > namespaces. And I don't see what RDMA CM is supposed to do when > > it hits this case. > > > > So, any ideas that don't involve the searching for IP hack?? > > > > [And yes, as discussed with Haggie, it is not the worst hack in the > > world, and maybe we can live with it, but lets understand the trade > > offs carefully] > > As Haggai wrote, if we let the using IP address thing to fly up, we have > support for RDMA in containers using the RDMA-CM at IPoIB environments. > This will let people test, use, experiment, fix, interact (and even > production-it when static IP address assignment scheme is used). I just noticed ipvlan got merged a few months ago.. That certainly changed my view on this topic. It is basically a software version of the same-guid ipoib children scheme. Similar issues: Same MAC address as the parent, IPv6 SLAAC is disabled (?), DHCP has similar issue (solved with RFC4361, and broadcasting fallback, it seems).. The l2/l3 distinction in ipvlan is also very interesting. The L3 mode solves some of the security type issues. What do you think Haggi? Is there any chance standard things like ipvlan and macvlan could be used with rdma-cm if their master devices are IPoIB? Are we even on the right path to do that someday? Is that the plan for roce? Any thoughts on the idea we still need ipoib same-guid children if ipvlan is available? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-06-04 6:24 ` Haggai Eran [not found] ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-06-04 6:24 UTC (permalink / raw) To: Jason Gunthorpe, Or Gerlitz Cc: Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 04/06/2015 02:48, Jason Gunthorpe wrote: > On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: > >>> I'm mostly fine with it as an optional capability, similar to macvlan, >>> I just don't see how to cleanly integrate it with RDMA CM and >>> namespaces. And I don't see what RDMA CM is supposed to do when >>> it hits this case. >>> >>> So, any ideas that don't involve the searching for IP hack?? >>> >>> [And yes, as discussed with Haggie, it is not the worst hack in the >>> world, and maybe we can live with it, but lets understand the trade >>> offs carefully] >> >> As Haggai wrote, if we let the using IP address thing to fly up, we have >> support for RDMA in containers using the RDMA-CM at IPoIB environments. >> This will let people test, use, experiment, fix, interact (and even >> production-it when static IP address assignment scheme is used). > > I just noticed ipvlan got merged a few months ago.. That certainly > changed my view on this topic. It is basically a software > version of the same-guid ipoib children scheme. Similar issues: Same MAC > address as the parent, IPv6 SLAAC is disabled (?), DHCP has similar > issue (solved with RFC4361, and broadcasting fallback, it seems).. > > The l2/l3 distinction in ipvlan is also very interesting. The L3 mode > solves some of the security type issues. What do you think Haggi? I think some issues ipvlan is trying to solve would also affect us using the alias GUIDs solution. ipvlan tries to solve among other the problem of a limited MAC filter table in NICs, and avoid using promiscuous mode. But the GID table is also limited, and we don't have something like promiscuous mode for GIDs in InfiniBand. For large scale use of containers we would need to also allow the current model. As for L3 mode, it does seem more restrictive, as all routing decisions are done in the controlling namespace. Our current ipoib child interface implementation is more like the L2 version of ipvlan. > > Is there any chance standard things like ipvlan and macvlan could be > used with rdma-cm if their master devices are IPoIB? These standard interfaces seem very much connected with Ethernet (both have an ARPHDR_ETHER-only check for their upper devices). I think macvlan's functionality would be covered by adding alias GUIDs to ipoib, and ipvlan L2 is covered by the current behavior. Perhaps it would be beneficial to try and make ipvlan more generic so that it would work over ipoib, giving us support for L3 mode. As for rdma-cm support, the patch I had for ipoib attempts to scan each child's upper devices in order to support such topologies. We only tested it with bonding, but I think it would also work with such devices. > Are we even on > the right path to do that someday? Is that the plan for roce? Yes, for RoCE our goal for the start was to support namespaces in RDMA CM through macvlan devices. As long as we can update the RoCE gid table correctly for macvlan and ipvlan devices, the RDMA CM implementation shouldn't care where the details come from. > Any thoughts on the idea we still need ipoib same-guid children if > ipvlan is available? If we port ipvlan to work over IPoIB interfaces and not just Ethernet, then ipvlan L2 would provide exactly the same functionality. There onyl difference I can think of is that ipvlan would use a single UD QP for all devices (and in connected-mode, a single RC QP between a pair of hosts), while ipoib would use a QP per child device, and multiple RC QPs for such pairs. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-06-04 16:40 ` Jason Gunthorpe [not found] ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-04 16:40 UTC (permalink / raw) To: Haggai Eran Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Thu, Jun 04, 2015 at 09:24:37AM +0300, Haggai Eran wrote: > > The l2/l3 distinction in ipvlan is also very interesting. The L3 mode > > solves some of the security type issues. What do you think Haggi? > I think some issues ipvlan is trying to solve would also affect us using > the alias GUIDs solution. ipvlan tries to solve among other the problem > of a limited MAC filter table in NICs, and avoid using promiscuous mode. > But the GID table is also limited, and we don't have something like > promiscuous mode for GIDs in InfiniBand. For large scale use of > containers we would need to also allow the current model. Yes, that is certainly true. > As for L3 mode, it does seem more restrictive, as all routing decisions > are done in the controlling namespace. Our current ipoib child interface > implementation is more like the L2 version of ipvlan. The ipoib children are exactly like macvlan, because they all have unique LLADDRs. It doesn't start acting like ipvlan until we reach the rdma-cm patches, and where we see the IP stack side act like macvlan and the rdma-cm side try to act like ipvlan - that is why it is so ugly/hacky, > > Is there any chance standard things like ipvlan and macvlan could be > > used with rdma-cm if their master devices are IPoIB? > These standard interfaces seem very much connected with Ethernet (both > have an ARPHDR_ETHER-only check for their upper devices). I think > macvlan's functionality would be covered by adding alias GUIDs to ipoib, > and ipvlan L2 is covered by the current behavior. Perhaps it would be > beneficial to try and make ipvlan more generic so that it would work > over ipoib, giving us support for L3 mode. Yes, macvlan seems very well covered already by IPoIB child interfaces, and I don't see too many reasons to worry about changing that. ipvlan on the other hand, as you observe, is valuable for many reasons. > As for rdma-cm support, the patch I had for ipoib attempts to scan each > child's upper devices in order to support such topologies. We only > tested it with bonding, but I think it would also work with such devices. .. it is so sketchy :| Firstly: I still think the prior discussion is right, and proceeding along the reworking of the ingress side of rdma-cm and focusing on the device,guid,pkey makes 100% sense and will progress things right away. Every other variation seems to build on that. But when we get into bonding and the various vlan things, we loose encapsulation - snooping the children list to guess what the bonding driver is doing seems very hacky. Discussion idea: Can we actually use the netstack to process the RDMA-CM packets? It looks like the netstack wants a skb to do this mid-layer work, so rdma-cm would have to synthesize a skb for the CM packets and pass it through netdev to apply all the transformations and access the various internal states (eg from ipvlan, bonding, etc). rdma-cm would have to 'catch' the skb once it is done traveling and resume its normal processing. Very similar to your notion of using UDP, but without any on-the-wire change. This would fit in that same ingress spot I suggested adding the routing lookup, instead of routing we want the full stack to have a go at figuring out the final netdev. This seems the most general because it will work for all the *vlan type drivers, bonding, and all of the RDMA technologies. (each would have a slightly different way to make the skb, but same basic idea) Lots and lots of details to do that, but conceptually it seems pretty solid? > Yes, for RoCE our goal for the start was to support namespaces in RDMA > CM through macvlan devices. As long as we can update the RoCE gid table > correctly for macvlan and ipvlan devices, the RDMA CM implementation > shouldn't care where the details come from. Hurm, the gid index tagged on the QP1 packet should not be directly used for much on ingress. rdma-cm will have to recover the mac address and vlan to use that as a guide. Synchronizing the gid table and all the internal state in macvlan, ipvlan, bonding seems very hard, I do not envy your task :( > > Any thoughts on the idea we still need ipoib same-guid children if > > ipvlan is available? > If we port ipvlan to work over IPoIB interfaces and not just Ethernet, > then ipvlan L2 would provide exactly the same functionality. There onyl > difference I can think of is that ipvlan would use a single UD QP for > all devices (and in connected-mode, a single RC QP between a pair of > hosts), while ipoib would use a QP per child device, and multiple RC QPs > for such pairs. Agree with this. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM [not found] ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-06-08 7:52 ` Haggai Eran 2015-06-08 16:53 ` Jason Gunthorpe 0 siblings, 1 reply; 68+ messages in thread From: Haggai Eran @ 2015-06-08 7:52 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On 04/06/2015 19:40, Jason Gunthorpe wrote: > Discussion idea: Can we actually use the netstack to process the > RDMA-CM packets? It looks like the netstack wants a skb to do this > mid-layer work, so rdma-cm would have to synthesize a skb for the CM > packets and pass it through netdev to apply all the transformations > and access the various internal states (eg from ipvlan, bonding, > etc). rdma-cm would have to 'catch' the skb once it is done traveling > and resume its normal processing. Very similar to your notion of using > UDP, but without any on-the-wire change. > > This would fit in that same ingress spot I suggested adding the > routing lookup, instead of routing we want the full stack to have a go > at figuring out the final netdev. > > This seems the most general because it will work for all the *vlan > type drivers, bonding, and all of the RDMA technologies. (each would > have a slightly different way to make the skb, but same basic idea) > > Lots and lots of details to do that, but conceptually it seems pretty > solid? The problem is that the network stack can do all sort of changes to the packets (like NAT), and it may be the case that the hardware can't reflect these changes later on when creating a QP. I think it would be best to stick with resolving the net_dev using the request parameters, and the simpler routing lookup. This way RDMA CM remains in control, and if the user configures routing in an unexpected way, it can just block the request. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM 2015-06-08 7:52 ` Haggai Eran @ 2015-06-08 16:53 ` Jason Gunthorpe 0 siblings, 0 replies; 68+ messages in thread From: Jason Gunthorpe @ 2015-06-08 16:53 UTC (permalink / raw) To: Haggai Eran Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma@vger.kernel.org, Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth On Mon, Jun 08, 2015 at 10:52:34AM +0300, Haggai Eran wrote: > On 04/06/2015 19:40, Jason Gunthorpe wrote: > > Discussion idea: Can we actually use the netstack to process the > > RDMA-CM packets? It looks like the netstack wants a skb to do this > > mid-layer work, so rdma-cm would have to synthesize a skb for the CM > > packets and pass it through netdev to apply all the transformations > > and access the various internal states (eg from ipvlan, bonding, > > etc). rdma-cm would have to 'catch' the skb once it is done traveling > > and resume its normal processing. Very similar to your notion of using > > UDP, but without any on-the-wire change. > > > > This would fit in that same ingress spot I suggested adding the > > routing lookup, instead of routing we want the full stack to have a go > > at figuring out the final netdev. > > > > This seems the most general because it will work for all the *vlan > > type drivers, bonding, and all of the RDMA technologies. (each would > > have a slightly different way to make the skb, but same basic idea) > > > > Lots and lots of details to do that, but conceptually it seems pretty > > solid? > > The problem is that the network stack can do all sort of changes to the > packets (like NAT), and it may be the case that the hardware can't > reflect these changes later on when creating a QP. Yes, I am aware of that, but there are also alot of things netdev can do that we can realize, like netfilter rules to block packets, for instance Ignoring NAT is a bad choice as well, the best would be to drop on NAT. It would be easy to detect if the netstack mangled the REQ skb packet, for instance. We can't track netdev after the QP is created, but totally ignoring one thing and while re-implementing others seems like a bad idea, long term... > I think it would be best to stick with resolving the net_dev using > the request parameters, and the simpler routing lookup. This way > RDMA CM remains in control, and if the user configures routing in an > unexpected way, it can just block the request. As I said, I think that is fine for the immediate IB support, but when you start talking about roce and emulating macvlan and ipvlan.. Then it starts to look really bad. At least think it through carefully before posting those series. Jason ^ permalink raw reply [flat|nested] 68+ messages in thread
end of thread, other threads:[~2015-06-08 16:53 UTC | newest]
Thread overview: 68+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-17 5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
2015-05-17 5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
2015-05-17 5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran
[not found] ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:26 ` Jason Gunthorpe
[not found] ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-20 14:48 ` Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran
[not found] ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:28 ` Jason Gunthorpe
[not found] ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-20 15:17 ` Haggai Eran
2015-05-19 23:55 ` Jason Gunthorpe
[not found] ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21 5:33 ` Haggai Eran
[not found] ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 5:48 ` Or Gerlitz
[not found] ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21 6:33 ` Haggai Eran
[not found] ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 10:31 ` Or Gerlitz
2015-05-21 17:43 ` Jason Gunthorpe
[not found] ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-28 11:51 ` Haggai Eran
2015-05-28 15:45 ` Jason Gunthorpe
2015-05-21 5:48 ` Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran
2015-05-19 18:35 ` Jason Gunthorpe
[not found] ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-19 22:35 ` Jason Gunthorpe
[not found] ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21 8:08 ` Haggai Eran
2015-05-21 17:54 ` Jason Gunthorpe
2015-05-21 7:07 ` Haggai Eran
[not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-17 5:50 ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran
2015-05-17 5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran
2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud
2015-05-19 14:54 ` Haggai Eran
[not found] ` <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 16:39 ` Parav Pandit
2015-05-19 18:01 ` Haggai Eran
[not found] ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:42 ` Parav Pandit
2015-05-19 18:38 ` Jason Gunthorpe
[not found] ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-19 18:44 ` Parav Pandit
[not found] ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-19 19:20 ` Jason Gunthorpe
2015-05-26 13:34 ` Doug Ledford
[not found] ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 16:59 ` Jason Gunthorpe
2015-05-26 17:46 ` Doug Ledford
[not found] ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 18:47 ` Jason Gunthorpe
2015-05-28 13:22 ` Haggai Eran
2015-05-28 15:46 ` Jason Gunthorpe
[not found] ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 10:07 ` Haggai Eran
2015-05-28 13:15 ` Haggai Eran
2015-05-26 17:55 ` Christian Benvenuti (benve)
2015-05-28 13:07 ` Haggai Eran
[not found] ` <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-28 14:07 ` Doug Ledford
[not found] ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 16:21 ` Or Gerlitz
[not found] ` <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-28 17:43 ` Jason Gunthorpe
[not found] ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-28 18:22 ` Doug Ledford
[not found] ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 19:05 ` Or Gerlitz
[not found] ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28 21:55 ` Doug Ledford
[not found] ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-03 10:03 ` Haggai Eran
2015-06-03 16:14 ` Jason Gunthorpe
[not found] ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 19:05 ` Or Gerlitz
2015-06-03 19:53 ` Jason Gunthorpe
[not found] ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 20:07 ` Or Gerlitz
[not found] ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-03 21:45 ` Jason Gunthorpe
2015-06-04 9:41 ` Haggai Eran
2015-06-04 16:06 ` Jason Gunthorpe
2015-06-03 23:48 ` Jason Gunthorpe
[not found] ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-04 6:24 ` Haggai Eran
[not found] ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-04 16:40 ` Jason Gunthorpe
[not found] ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-08 7:52 ` Haggai Eran
2015-06-08 16:53 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).