* [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
@ 2015-06-09 14:57 kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-06-09 14:57 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan
From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
A SA cache is undeniably critical for fabric scalability and performance.
In user space, the ibacm application provides a good example of pathrecord
cache for address and route resolution. With the recent implementation of
the provider architecture, ibacm offers more extensibility as a SA cache.
In kernel, ipoib implements its own small cache for pathrecords, which is
however not available for general use. Furthermore, the implementation of
a SA cache in user space offers better flexibility, larger capacity, and
more robustness for the system.
In this patch series, a mechanism is implemented to allow ib_sa to
send pathrecord query to a user application (eg ibacm) through netlink.
Potentially, this mechanism could be easily extended to other SA queries.
With a customized test implemented in rdma_cm module (not included in this
series), it was shown that the time to retrieve 1 million pathrecords
dropped from 46660 jiffies (46.66 seconds) to 16119 jiffies (or 16.119
seconds) on a two-node system, a reduction of more than 60%.
Changes since v3:
- Patch 1: added basic RESOLVE attribute types.
- Patch 4: changed the encoding of the RESOLVE request message based on
the new attribute types and the input comp_mask. Changed the response
handling by iterating all attributes.
Changes since v2:
- Redesigned the communication protocol between the kernel and user space
application. Instead of the MAD packet format, the new protocol uses
netlink message header and attributes to exchange request and
response between the kernel and user space.The design was described
here:
http://www.spinics.net/lists/linux-rdma/msg25621.html
Changes since v1:
- Move kzalloc changes into a separate patch (Patch 3).
- Remove redundant include line (Patch 4).
- Rename struct rdma_nl_resp_msg as structure ib_nl_resp_msg (Patch 4).
Kaike Wan (4):
IB/netlink: Add defines for local service requests through netlink
IB/core: Check the presence of netlink multicast group listeners
IB/sa: Allocate SA query with kzalloc
IB/sa: Route SA pathrecord query through netlink
drivers/infiniband/core/netlink.c | 8 +
drivers/infiniband/core/sa_query.c | 523 +++++++++++++++++++++++++++++++++++-
include/rdma/rdma_netlink.h | 7 +
include/uapi/rdma/rdma_netlink.h | 82 ++++++
4 files changed, 615 insertions(+), 5 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2015-06-09 14:57 ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-2-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-09 14:57 ` [PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners kaike.wan-ral2JQCrhuEAvxtiuMwx3w
` (3 subsequent siblings)
4 siblings, 1 reply; 39+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-06-09 14:57 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny
From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
This patch adds netlink defines for SA client, local service group, local
service operations, and related attributes.
Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
include/uapi/rdma/rdma_netlink.h | 82 ++++++++++++++++++++++++++++++++++++++
1 files changed, 82 insertions(+), 0 deletions(-)
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 6e4bb42..341e9be 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -7,12 +7,14 @@ enum {
RDMA_NL_RDMA_CM = 1,
RDMA_NL_NES,
RDMA_NL_C4IW,
+ RDMA_NL_SA,
RDMA_NL_NUM_CLIENTS
};
enum {
RDMA_NL_GROUP_CM = 1,
RDMA_NL_GROUP_IWPM,
+ RDMA_NL_GROUP_LS,
RDMA_NL_NUM_GROUPS
};
@@ -128,5 +130,85 @@ enum {
IWPM_NLA_ERR_MAX
};
+/* Local service group opcodes */
+enum {
+ RDMA_NL_LS_OP_RESOLVE = 0,
+ RDMA_NL_LS_OP_SET_TIMEOUT,
+ RDMA_NL_LS_NUM_OPS
+};
+
+/* Local service netlink message flags */
+#define RDMA_NL_LS_F_OK 0x0100 /* Success response */
+#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
+
+/* Local service attribute type */
+enum {
+ LS_NLA_TYPE_STATUS = 0,
+ LS_NLA_TYPE_PATH_RECORD,
+ LS_NLA_TYPE_TIMEOUT,
+ LS_NLA_TYPE_SERVICE_ID,
+ LS_NLA_TYPE_DGID,
+ LS_NLA_TYPE_SGID,
+ LS_NLA_TYPE_TCLASS,
+ LS_NLA_TYPE_REVERSIBLE,
+ LS_NLA_TYPE_NUM_PATH,
+ LS_NLA_TYPE_PKEY,
+ LS_NLA_TYPE_QOS_CLASS,
+ LS_NLA_TYPE_MAX
+};
+
+/* Local service status attribute */
+enum {
+ LS_NLA_STATUS_SUCCESS = 0,
+ LS_NLA_STATUS_EINVAL,
+ LS_NLA_STATUS_ENODATA,
+ LS_NLA_STATUS_MAX
+};
+
+struct rdma_nla_ls_status {
+ __u32 status;
+};
+
+/* Local service pathrecord attribute: struct ib_path_rec_data */
+
+/* Local service timeout attribute */
+struct rdma_nla_ls_timeout {
+ __u32 timeout;
+};
+
+/* Local Service ServiceID attribute */
+struct rdma_nla_ls_service_id {
+ __be64 service_id;
+};
+
+/* Local Service DGID/SGID attribute: big endian */
+struct rdma_nla_ls_gid {
+ __u8 gid[16];
+};
+
+/* Local Service Traffic Class attribute */
+struct rdma_nla_ls_tclass {
+ __u8 tclass;
+};
+
+/* Local Service Reversible attribute */
+struct rdma_nla_ls_reversible {
+ __u32 reversible;
+};
+
+/* Local Service numb_path attribute */
+struct rdma_nla_ls_numb_path {
+ __u8 numb_path;
+};
+
+/* Local Service Pkey attribute*/
+struct rdma_nla_ls_pkey {
+ __be16 pkey;
+};
+
+/* Local Service Qos Class attribute */
+struct rdma_nla_ls_qos_class {
+ __be16 qos_class;
+};
#endif /* _UAPI_RDMA_NETLINK_H */
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-09 14:57 ` [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-06-09 14:57 ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-09 14:57 ` [PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
` (2 subsequent siblings)
4 siblings, 0 replies; 39+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-06-09 14:57 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny
From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
This patch adds a function to check if listeners for a netlink multicast
group are present.
Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/infiniband/core/netlink.c | 8 ++++++++
include/rdma/rdma_netlink.h | 7 +++++++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c
index 23dd5a5..e0fc913 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -49,6 +49,14 @@ static DEFINE_MUTEX(ibnl_mutex);
static struct sock *nls;
static LIST_HEAD(client_list);
+int ibnl_chk_listeners(unsigned int group)
+{
+ if (netlink_has_listeners(nls, group) == 0)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(ibnl_chk_listeners);
+
int ibnl_add_client(int index, int nops,
const struct ibnl_client_cbs cb_table[])
{
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 0790882..5852661 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -77,4 +77,11 @@ int ibnl_unicast(struct sk_buff *skb, struct nlmsghdr *nlh,
int ibnl_multicast(struct sk_buff *skb, struct nlmsghdr *nlh,
unsigned int group, gfp_t flags);
+/**
+ * Check if there are any listeners to the netlink group
+ * @group: the netlink group ID
+ * Returns 0 on success or a negative for no listeners.
+ */
+int ibnl_chk_listeners(unsigned int group);
+
#endif /* _RDMA_NETLINK_H */
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-09 14:57 ` [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-09 14:57 ` [PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-06-09 14:57 ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-09 14:57 ` [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-10 13:37 ` [PATCH v4 0/4] Sending kernel pathrecord query to user cache server Hal Rosenstock
4 siblings, 0 replies; 39+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-06-09 14:57 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny
From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Replace kmalloc with kzalloc so that all uninitialized fields in SA query
will be zero-ed out to avoid unintentional consequence. This prepares the
SA query structure to accept new fields in the future.
Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/infiniband/core/sa_query.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index c38f030..17e1cf7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -739,7 +739,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
port = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
- query = kmalloc(sizeof *query, gfp_mask);
+ query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
@@ -861,7 +861,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
method != IB_SA_METHOD_DELETE)
return -EINVAL;
- query = kmalloc(sizeof *query, gfp_mask);
+ query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
@@ -953,7 +953,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
port = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
- query = kmalloc(sizeof *query, gfp_mask);
+ query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
@@ -1050,7 +1050,7 @@ int ib_sa_guid_info_rec_query(struct ib_sa_client *client,
port = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
- query = kmalloc(sizeof *query, gfp_mask);
+ query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (2 preceding siblings ...)
2015-06-09 14:57 ` [PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-06-09 14:57 ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-10 13:37 ` [PATCH v4 0/4] Sending kernel pathrecord query to user cache server Hal Rosenstock
4 siblings, 1 reply; 39+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-06-09 14:57 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny
From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
This patch routes a SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request will be sent
through IB. The decision whether to route the request to netlink first is
determined by the presence of a listener for the local service netlink
multicast group. If the user-space local service netlink multicast group
listener is not present, the request will be sent through IB, just like
what is currently being done.
Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/infiniband/core/sa_query.c | 515 +++++++++++++++++++++++++++++++++++-
include/uapi/rdma/rdma_netlink.h | 2 +-
2 files changed, 515 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 17e1cf7..e063593 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -45,12 +45,21 @@
#include <uapi/linux/if_ether.h>
#include <rdma/ib_pack.h>
#include <rdma/ib_cache.h>
+#include <rdma/rdma_netlink.h>
+#include <net/netlink.h>
+#include <uapi/rdma/ib_user_sa.h>
+#include <rdma/ib_marshall.h>
#include "sa.h"
MODULE_AUTHOR("Roland Dreier");
MODULE_DESCRIPTION("InfiniBand subnet administration query support");
MODULE_LICENSE("Dual BSD/GPL");
+#define IB_SA_LOCAL_SVC_TIMEOUT_MIN 100
+#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT 2000
+#define IB_SA_LOCAL_SVC_TIMEOUT_MAX 200000
+static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
+
struct ib_sa_sm_ah {
struct ib_ah *ah;
struct kref ref;
@@ -80,8 +89,24 @@ struct ib_sa_query {
struct ib_mad_send_buf *mad_buf;
struct ib_sa_sm_ah *sm_ah;
int id;
+ u32 flags;
};
+#define IB_SA_ENABLE_LOCAL_SERVICE 0x00000001
+#define IB_SA_CANCEL 0x00000002
+
+#define IB_SA_LOCAL_SVC_ENABLED(query) \
+ ((query)->flags & IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_ENABLE_LOCAL_SVC(query) \
+ ((query)->flags |= IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_DISABLE_LOCAL_SVC(query) \
+ ((query)->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE)
+
+#define IB_SA_QUERY_CANCELLED(query) \
+ ((query)->flags & IB_SA_CANCEL)
+#define IB_SA_CANCEL_QUERY(query) \
+ ((query)->flags |= IB_SA_CANCEL)
+
struct ib_sa_service_query {
void (*callback)(int, struct ib_sa_service_rec *, void *);
void *context;
@@ -106,6 +131,26 @@ struct ib_sa_mcmember_query {
struct ib_sa_query sa_query;
};
+struct ib_nl_request_info {
+ struct list_head list;
+ u32 seq;
+ unsigned long timeout;
+ struct ib_sa_query *query;
+};
+
+struct ib_nl_attr_info {
+ u16 len; /* Total attr len: header + payload + padding */
+ ib_sa_comp_mask comp_mask;
+ void *input;
+ void (*set_attrs)(struct sk_buff *skb, struct ib_nl_attr_info *info);
+};
+
+static LIST_HEAD(ib_nl_request_list);
+static DEFINE_SPINLOCK(ib_nl_request_lock);
+static atomic_t ib_nl_sa_request_seq;
+static struct workqueue_struct *ib_nl_wq;
+static struct delayed_work ib_nl_timed_work;
+
static void ib_sa_add_one(struct ib_device *device);
static void ib_sa_remove_one(struct ib_device *device);
@@ -381,6 +426,433 @@ static const struct ib_field guidinfo_rec_table[] = {
.size_bits = 512 },
};
+static int ib_nl_send_msg(int opcode, struct ib_nl_attr_info *attrs, u32 seq)
+{
+ struct sk_buff *skb = NULL;
+ struct nlmsghdr *nlh;
+ void *data;
+ int ret = 0;
+
+ if (attrs->len <= 0)
+ return -EMSGSIZE;
+
+ skb = nlmsg_new(attrs->len, GFP_KERNEL);
+ if (!skb) {
+ pr_err("alloc failed ret=%d\n", ret);
+ return -ENOMEM;
+ }
+
+ /* Put nlmsg header only for now */
+ data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA,
+ opcode, GFP_KERNEL);
+ if (!data) {
+ kfree_skb(skb);
+ return -EMSGSIZE;
+ }
+
+ /* Add attributes */
+ attrs->set_attrs(skb, attrs);
+
+ /* Repair the nlmsg header length */
+ nlmsg_end(skb, nlh);
+
+ ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+ if (!ret) {
+ ret = attrs->len;
+ } else {
+ if (ret != -ESRCH)
+ pr_err("ibnl_multicast failed l=%d, r=%d\n",
+ attrs->len, ret);
+ ret = 0;
+ }
+ return ret;
+}
+
+static struct ib_nl_request_info *
+ib_nl_alloc_request(struct ib_sa_query *query)
+{
+ struct ib_nl_request_info *rinfo;
+
+ rinfo = kzalloc(sizeof(*rinfo), GFP_ATOMIC);
+ if (rinfo == NULL)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&rinfo->list);
+ rinfo->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
+ rinfo->query = query;
+
+ return rinfo;
+}
+
+static void ib_nl_set_path_rec_attrs(struct sk_buff *skb,
+ struct ib_nl_attr_info *info)
+{
+ struct ib_sa_path_rec *sa_rec = info->input;
+
+ if (info->comp_mask & IB_SA_PATH_REC_SERVICE_ID)
+ nla_put(skb, LS_NLA_TYPE_SERVICE_ID,
+ sizeof(sa_rec->service_id), &sa_rec->service_id);
+ if (info->comp_mask & IB_SA_PATH_REC_DGID)
+ nla_put(skb, LS_NLA_TYPE_DGID, sizeof(sa_rec->dgid),
+ &sa_rec->dgid);
+ if (info->comp_mask & IB_SA_PATH_REC_SGID)
+ nla_put(skb, LS_NLA_TYPE_SGID, sizeof(sa_rec->sgid),
+ &sa_rec->sgid);
+ if (info->comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
+ nla_put(skb, LS_NLA_TYPE_TCLASS, sizeof(sa_rec->traffic_class),
+ &sa_rec->traffic_class);
+ if (info->comp_mask & IB_SA_PATH_REC_REVERSIBLE)
+ nla_put(skb, LS_NLA_TYPE_REVERSIBLE,
+ sizeof(sa_rec->reversible), &sa_rec->reversible);
+ if (info->comp_mask & IB_SA_PATH_REC_NUMB_PATH)
+ nla_put(skb, LS_NLA_TYPE_NUMB_PATH,
+ sizeof(sa_rec->numb_path), &sa_rec->numb_path);
+ if (info->comp_mask & IB_SA_PATH_REC_PKEY)
+ nla_put(skb, LS_NLA_TYPE_PKEY, sizeof(sa_rec->pkey),
+ &sa_rec->pkey);
+ if (info->comp_mask & IB_SA_PATH_REC_QOS_CLASS)
+ nla_put(skb, LS_NLA_TYPE_QOS_CLASS,
+ sizeof(sa_rec->qos_class), &sa_rec->qos_class);
+}
+
+static int ib_nl_get_path_rec_attrs_len(struct ib_nl_attr_info *info)
+{
+ int len = 0;
+
+ if (info->comp_mask & IB_SA_PATH_REC_SERVICE_ID)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_service_id));
+ if (info->comp_mask & IB_SA_PATH_REC_DGID)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
+ if (info->comp_mask & IB_SA_PATH_REC_SGID)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
+ if (info->comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_tclass));
+ if (info->comp_mask & IB_SA_PATH_REC_REVERSIBLE)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_reversible));
+ if (info->comp_mask & IB_SA_PATH_REC_NUMB_PATH)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_numb_path));
+ if (info->comp_mask & IB_SA_PATH_REC_PKEY)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_pkey));
+ if (info->comp_mask & IB_SA_PATH_REC_QOS_CLASS)
+ len += nla_total_size(sizeof(struct rdma_nla_ls_qos_class));
+
+ return len;
+}
+
+static int ib_nl_send_request(struct ib_nl_request_info *rinfo)
+{
+ struct ib_nl_attr_info info;
+ int opcode;
+ struct ib_sa_mad *mad;
+ unsigned long flags;
+ unsigned long delay;
+ int ret;
+
+ mad = rinfo->query->mad_buf->mad;
+ switch (mad->mad_hdr.attr_id) {
+ case cpu_to_be16(IB_SA_ATTR_PATH_REC):
+ opcode = RDMA_NL_LS_OP_RESOLVE;
+ mad = rinfo->query->mad_buf->mad;
+ info.comp_mask = mad->sa_hdr.comp_mask;
+ info.input = rinfo->query->mad_buf->context[1];
+ rinfo->query->mad_buf->context[1] = NULL;
+ info.len = ib_nl_get_path_rec_attrs_len(&info);
+ info.set_attrs = ib_nl_set_path_rec_attrs;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ ret = ib_nl_send_msg(opcode, &info, rinfo->seq);
+ if (ret <= 0) {
+ ret = -EIO;
+ goto request_out;
+ } else {
+ ret = 0;
+ }
+
+ delay = msecs_to_jiffies(sa_local_svc_timeout_ms);
+ rinfo->timeout = delay + jiffies;
+ list_add_tail(&rinfo->list, &ib_nl_request_list);
+ /* Start the timeout if this is the only request */
+ if (ib_nl_request_list.next == &rinfo->list)
+ queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+
+request_out:
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+ return ret;
+}
+
+static int ib_nl_make_request(struct ib_sa_query *query)
+{
+ struct ib_nl_request_info *rinfo;
+ int ret;
+
+ rinfo = ib_nl_alloc_request(query);
+ if (IS_ERR(rinfo))
+ return -ENOMEM;
+
+ ret = ib_nl_send_request(rinfo);
+ if (ret)
+ kfree(rinfo);
+
+ return ret;
+}
+
+static int ib_nl_cancel_request(struct ib_sa_query *query)
+{
+ unsigned long flags;
+ struct ib_nl_request_info *rinfo;
+ int found = 0;
+
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+ /* Let the timeout to take care of the callback */
+ if (query == rinfo->query) {
+ IB_SA_CANCEL_QUERY(query);
+ rinfo->timeout = jiffies;
+ list_move(&rinfo->list, &ib_nl_request_list);
+ found = 1;
+ mod_delayed_work(ib_nl_wq, &ib_nl_timed_work, 1);
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+ return found;
+}
+
+static void send_handler(struct ib_mad_agent *agent,
+ struct ib_mad_send_wc *mad_send_wc);
+
+static void ib_nl_process_good_resolve_rsp(struct ib_sa_query *query,
+ const struct nlmsghdr *nlh)
+{
+ struct ib_mad_send_wc mad_send_wc;
+ struct ib_sa_mad *mad = NULL;
+ const struct nlattr *head, *curr;
+ struct ib_path_rec_data *rec;
+ int len, rem;
+
+ if (query->callback) {
+ head = (const struct nlattr *) nlmsg_data(nlh);
+ len = nlmsg_len(nlh);
+ nla_for_each_attr(curr, head, len, rem) {
+ if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
+ rec = nla_data(curr);
+ if ((rec->flags && IB_PATH_GMP) &&
+ (rec->flags && IB_PATH_PRIMARY)) {
+ mad = query->mad_buf->mad;
+ mad->mad_hdr.method |=
+ IB_MGMT_METHOD_RESP;
+ memcpy(mad->data, rec->path_rec,
+ sizeof(rec->path_rec));
+ query->callback(query, 0, mad);
+ break;
+ }
+ }
+ }
+ }
+
+ mad_send_wc.send_buf = query->mad_buf;
+ mad_send_wc.status = IB_WC_SUCCESS;
+ send_handler(query->mad_buf->mad_agent, &mad_send_wc);
+}
+
+static void ib_nl_request_timeout(struct work_struct *work)
+{
+ unsigned long flags;
+ struct ib_nl_request_info *rinfo;
+ struct ib_sa_query *query;
+ unsigned long delay;
+ struct ib_mad_send_wc mad_send_wc;
+ int ret;
+
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ while (!list_empty(&ib_nl_request_list)) {
+ rinfo = list_entry(ib_nl_request_list.next,
+ struct ib_nl_request_info, list);
+
+ if (time_after(rinfo->timeout, jiffies)) {
+ delay = rinfo->timeout - jiffies;
+ if ((long)delay <= 0)
+ delay = 1;
+ queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+ break;
+ }
+
+ list_del(&rinfo->list);
+ query = rinfo->query;
+ IB_SA_DISABLE_LOCAL_SVC(query);
+ /* Hold the lock to protect against query cancellation */
+ if (IB_SA_QUERY_CANCELLED(query))
+ ret = -1;
+ else
+ ret = ib_post_send_mad(query->mad_buf, NULL);
+ if (ret) {
+ mad_send_wc.send_buf = query->mad_buf;
+ mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ send_handler(query->port->agent, &mad_send_wc);
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ }
+ kfree(rinfo);
+ }
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+}
+
+static inline int ib_nl_is_good_resolve_resp(const struct nlmsghdr *nlh)
+{
+ const struct nlattr *head, *curr;
+ struct ib_path_rec_data *rec;
+ int len, rem;
+
+ if (nlh->nlmsg_flags & RDMA_NL_LS_F_ERR)
+ return 0;
+
+ if (!(nlh->nlmsg_flags & RDMA_NL_LS_F_OK))
+ return 0;
+
+ if (nlmsg_len(nlh) < nla_attr_size(sizeof(*rec)))
+ return 0;
+
+ head = (const struct nlattr *) nlmsg_data(nlh);
+ len = nlmsg_len(nlh);
+ nla_for_each_attr(curr, head, len, rem) {
+ if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
+ rec = nla_data(curr);
+ if ((rec->flags && IB_PATH_GMP) &&
+ (rec->flags && IB_PATH_PRIMARY))
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
+static int ib_nl_handle_set_timeout(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+ int timeout, delta, abs_delta;
+ const struct nlattr *attr;
+ struct rdma_nla_ls_timeout *to_attr;
+ unsigned long flags;
+ struct ib_nl_request_info *rinfo;
+ long delay = 0;
+
+ if (nlmsg_len(nlh) < nla_attr_size(sizeof(*to_attr)))
+ goto settimeout_out;
+
+ attr = (const struct nlattr *) nlmsg_data(nlh);
+ if (attr->nla_type != LS_NLA_TYPE_TIMEOUT ||
+ nla_len(attr) != sizeof(*to_attr))
+ goto settimeout_out;
+
+ to_attr = (struct rdma_nla_ls_timeout *) nla_data(attr);
+ timeout = (int) to_attr->timeout;
+ if (timeout < IB_SA_LOCAL_SVC_TIMEOUT_MIN)
+ timeout = IB_SA_LOCAL_SVC_TIMEOUT_MIN;
+ if (timeout > IB_SA_LOCAL_SVC_TIMEOUT_MAX)
+ timeout = IB_SA_LOCAL_SVC_TIMEOUT_MAX;
+
+ delta = timeout - sa_local_svc_timeout_ms;
+ if (delta < 0)
+ abs_delta = -delta;
+ else
+ abs_delta = delta;
+
+ if (delta != 0) {
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ sa_local_svc_timeout_ms = timeout;
+ list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+ if (delta < 0 && abs_delta > rinfo->timeout)
+ rinfo->timeout = 0;
+ else
+ rinfo->timeout += delta;
+
+ /* Get the new delay from the first entry */
+ if (!delay) {
+ delay = rinfo->timeout - jiffies;
+ if (delay <= 0)
+ delay = 1;
+ }
+ }
+ if (delay)
+ mod_delayed_work(ib_nl_wq, &ib_nl_timed_work,
+ (unsigned long)delay);
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ }
+
+settimeout_out:
+ return skb->len;
+}
+
+static int ib_nl_handle_resolve_resp(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+ unsigned long flags;
+ struct ib_nl_request_info *rinfo;
+ struct ib_sa_query *query;
+ struct ib_mad_send_buf *send_buf;
+ struct ib_mad_send_wc mad_send_wc;
+ int found = 0;
+ int ret;
+
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+ /*
+ * If the query is cancelled, let the timeout routine
+ * take care of it.
+ */
+ if (nlh->nlmsg_seq == rinfo->seq) {
+ found = !IB_SA_QUERY_CANCELLED(rinfo->query);
+ if (found)
+ list_del(&rinfo->list);
+ break;
+ }
+ }
+
+ if (!found) {
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ goto resp_out;
+ }
+
+ query = rinfo->query;
+ send_buf = query->mad_buf;
+
+ if (!ib_nl_is_good_resolve_resp(nlh)) {
+ /* if the result is a failure, send out the packet via IB */
+ IB_SA_DISABLE_LOCAL_SVC(query);
+ ret = ib_post_send_mad(query->mad_buf, NULL);
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ if (ret) {
+ mad_send_wc.send_buf = send_buf;
+ mad_send_wc.status = IB_WC_GENERAL_ERR;
+ send_handler(query->port->agent, &mad_send_wc);
+ }
+ } else {
+ spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ ib_nl_process_good_resolve_rsp(query, nlh);
+ }
+
+ kfree(rinfo);
+resp_out:
+ return skb->len;
+}
+
+static struct ibnl_client_cbs ib_sa_cb_table[] = {
+ [RDMA_NL_LS_OP_RESOLVE] = {
+ .dump = ib_nl_handle_resolve_resp,
+ .module = THIS_MODULE },
+ [RDMA_NL_LS_OP_SET_TIMEOUT] = {
+ .dump = ib_nl_handle_set_timeout,
+ .module = THIS_MODULE },
+};
+
static void free_sm_ah(struct kref *kref)
{
struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -502,7 +974,13 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query)
mad_buf = query->mad_buf;
spin_unlock_irqrestore(&idr_lock, flags);
- ib_cancel_mad(agent, mad_buf);
+ /*
+ * If the query is still on the netlink request list, schedule
+ * it to be cancelled by the timeout routine. Otherwise, it has been
+ * sent to the MAD layer and has to be cancelled from there.
+ */
+ if (!ib_nl_cancel_request(query))
+ ib_cancel_mad(agent, mad_buf);
}
EXPORT_SYMBOL(ib_sa_cancel_query);
@@ -638,6 +1116,14 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
query->mad_buf->context[0] = query;
query->id = id;
+ if (IB_SA_LOCAL_SVC_ENABLED(query)) {
+ if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
+ if (!ib_nl_make_request(query))
+ return id;
+ }
+ IB_SA_DISABLE_LOCAL_SVC(query);
+ }
+
ret = ib_post_send_mad(query->mad_buf, NULL);
if (ret) {
spin_lock_irqsave(&idr_lock, flags);
@@ -766,6 +1252,9 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
*sa_query = &query->sa_query;
+ IB_SA_ENABLE_LOCAL_SVC(&query->sa_query);
+ query->sa_query.mad_buf->context[1] = rec;
+
ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
if (ret < 0)
goto err2;
@@ -1250,6 +1739,8 @@ static int __init ib_sa_init(void)
get_random_bytes(&tid, sizeof tid);
+ atomic_set(&ib_nl_sa_request_seq, 0);
+
ret = ib_register_client(&sa_client);
if (ret) {
printk(KERN_ERR "Couldn't register ib_sa client\n");
@@ -1262,7 +1753,25 @@ static int __init ib_sa_init(void)
goto err2;
}
+ ib_nl_wq = create_singlethread_workqueue("ib_nl_sa_wq");
+ if (!ib_nl_wq) {
+ ret = -ENOMEM;
+ goto err3;
+ }
+
+ if (ibnl_add_client(RDMA_NL_SA, RDMA_NL_LS_NUM_OPS,
+ ib_sa_cb_table)) {
+ pr_err("Failed to add netlink callback\n");
+ ret = -EINVAL;
+ goto err4;
+ }
+ INIT_DELAYED_WORK(&ib_nl_timed_work, ib_nl_request_timeout);
+
return 0;
+err4:
+ destroy_workqueue(ib_nl_wq);
+err3:
+ mcast_cleanup();
err2:
ib_unregister_client(&sa_client);
err1:
@@ -1271,6 +1780,10 @@ err1:
static void __exit ib_sa_cleanup(void)
{
+ ibnl_remove_client(RDMA_NL_SA);
+ cancel_delayed_work(&ib_nl_timed_work);
+ flush_workqueue(ib_nl_wq);
+ destroy_workqueue(ib_nl_wq);
mcast_cleanup();
ib_unregister_client(&sa_client);
idr_destroy(&query_idr);
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 341e9be..6aa9281 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -151,7 +151,7 @@ enum {
LS_NLA_TYPE_SGID,
LS_NLA_TYPE_TCLASS,
LS_NLA_TYPE_REVERSIBLE,
- LS_NLA_TYPE_NUM_PATH,
+ LS_NLA_TYPE_NUMB_PATH,
LS_NLA_TYPE_PKEY,
LS_NLA_TYPE_QOS_CLASS,
LS_NLA_TYPE_MAX
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (3 preceding siblings ...)
2015-06-09 14:57 ` [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-06-10 13:37 ` Hal Rosenstock
[not found] ` <55783D84.6040709-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
4 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 13:37 UTC (permalink / raw)
To: kaike.wan-ral2JQCrhuEAvxtiuMwx3w; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> A SA cache is undeniably critical for fabric scalability and performance.
> In user space, the ibacm application provides a good example of pathrecord
> cache for address and route resolution. With the recent implementation of
> the provider architecture, ibacm offers more extensibility as a SA cache.
> In kernel, ipoib implements its own small cache for pathrecords, which is
> however not available for general use. Furthermore, the implementation of
> a SA cache in user space offers better flexibility, larger capacity, and
> more robustness for the system.
>
> In this patch series, a mechanism is implemented to allow ib_sa to
> send pathrecord query to a user application (eg ibacm) through netlink.
While this appears to address the current upstream use model for ACM
with it's multicast overlay backend where PRs are static, it does not
appear to address PR changes.
Should aging/revalidation of PRs be supported ? If so, would this make
PRs similar at "high" level to IP neighbor cache in kernel ?
-- Hal
> Potentially, this mechanism could be easily extended to other SA queries.
>
> With a customized test implemented in rdma_cm module (not included in this
> series), it was shown that the time to retrieve 1 million pathrecords
> dropped from 46660 jiffies (46.66 seconds) to 16119 jiffies (or 16.119
> seconds) on a two-node system, a reduction of more than 60%.
>
> Changes since v3:
> - Patch 1: added basic RESOLVE attribute types.
> - Patch 4: changed the encoding of the RESOLVE request message based on
> the new attribute types and the input comp_mask. Changed the response
> handling by iterating all attributes.
>
> Changes since v2:
> - Redesigned the communication protocol between the kernel and user space
> application. Instead of the MAD packet format, the new protocol uses
> netlink message header and attributes to exchange request and
> response between the kernel and user space.The design was described
> here:
> http://www.spinics.net/lists/linux-rdma/msg25621.html
>
> Changes since v1:
> - Move kzalloc changes into a separate patch (Patch 3).
> - Remove redundant include line (Patch 4).
> - Rename struct rdma_nl_resp_msg as structure ib_nl_resp_msg (Patch 4).
>
> Kaike Wan (4):
> IB/netlink: Add defines for local service requests through netlink
> IB/core: Check the presence of netlink multicast group listeners
> IB/sa: Allocate SA query with kzalloc
> IB/sa: Route SA pathrecord query through netlink
>
> drivers/infiniband/core/netlink.c | 8 +
> drivers/infiniband/core/sa_query.c | 523 +++++++++++++++++++++++++++++++++++-
> include/rdma/rdma_netlink.h | 7 +
> include/uapi/rdma/rdma_netlink.h | 82 ++++++
> 4 files changed, 615 insertions(+), 5 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55783D84.6040709-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 14:22 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC2E-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 15:21 ` Hefty, Sean
1 sibling, 1 reply; 39+ messages in thread
From: Wan, Kaike @ 2015-06-10 14:22 UTC (permalink / raw)
To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Wednesday, June 10, 2015 9:37 AM
>
> >
> > A SA cache is undeniably critical for fabric scalability and performance.
> > In user space, the ibacm application provides a good example of
> > pathrecord cache for address and route resolution. With the recent
> > implementation of the provider architecture, ibacm offers more
> extensibility as a SA cache.
> > In kernel, ipoib implements its own small cache for pathrecords, which
> > is however not available for general use. Furthermore, the
> > implementation of a SA cache in user space offers better flexibility,
> > larger capacity, and more robustness for the system.
> >
> > In this patch series, a mechanism is implemented to allow ib_sa to
> > send pathrecord query to a user application (eg ibacm) through netlink.
>
> While this appears to address the current upstream use model for ACM with
> it's multicast overlay backend where PRs are static, it does not appear to
> address PR changes.
> Should aging/revalidation of PRs be supported ? If so, would this make PRs
> similar at "high" level to IP neighbor cache in kernel ?
Even for the default provider acmp, PRs will time out and the length of timeout is configurable. For other providers (eg ibssa), the PR change could be managed correctly and promptly, and this capability is beyond ibacm core itself.
Kaike
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC2E-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 14:39 ` Hal Rosenstock
[not found] ` <55784C35.2020401-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 14:39 UTC (permalink / raw)
To: Wan, Kaike; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 6/10/2015 10:22 AM, Wan, Kaike wrote:
>
>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Wednesday, June 10, 2015 9:37 AM
>>
>>>
>>> A SA cache is undeniably critical for fabric scalability and performance.
>>> In user space, the ibacm application provides a good example of
>>> pathrecord cache for address and route resolution. With the recent
>>> implementation of the provider architecture, ibacm offers more
>> extensibility as a SA cache.
>>> In kernel, ipoib implements its own small cache for pathrecords, which
>>> is however not available for general use. Furthermore, the
>>> implementation of a SA cache in user space offers better flexibility,
>>> larger capacity, and more robustness for the system.
>>>
>>> In this patch series, a mechanism is implemented to allow ib_sa to
>>> send pathrecord query to a user application (eg ibacm) through netlink.
>>
>> While this appears to address the current upstream use model for ACM with
>> it's multicast overlay backend where PRs are static, it does not appear to
>> address PR changes.
>> Should aging/revalidation of PRs be supported ? If so, would this make PRs
>> similar at "high" level to IP neighbor cache in kernel ?
>
> Even for the default provider acmp, PRs will time out and the length of timeout is configurable. For other providers (eg ibssa), the PR change could be managed correctly and promptly, and this capability is beyond ibacm core itself.
That deals with the update of PR in user space (ACM). Doesn't kernel
need some way of knowing PR was updated ?
-- Hal
> Kaike
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55784C35.2020401-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 15:07 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 18:00 ` ira.weiny
1 sibling, 1 reply; 39+ messages in thread
From: Wan, Kaike @ 2015-06-10 15:07 UTC (permalink / raw)
To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Wednesday, June 10, 2015 10:40 AM
> >
> >
> >> -----Original Message-----
> >> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> >> Sent: Wednesday, June 10, 2015 9:37 AM
> >>
> >>>
> >>> A SA cache is undeniably critical for fabric scalability and performance.
> >>> In user space, the ibacm application provides a good example of
> >>> pathrecord cache for address and route resolution. With the recent
> >>> implementation of the provider architecture, ibacm offers more
> >> extensibility as a SA cache.
> >>> In kernel, ipoib implements its own small cache for pathrecords,
> >>> which is however not available for general use. Furthermore, the
> >>> implementation of a SA cache in user space offers better
> >>> flexibility, larger capacity, and more robustness for the system.
> >>>
> >>> In this patch series, a mechanism is implemented to allow ib_sa to
> >>> send pathrecord query to a user application (eg ibacm) through netlink.
> >>
> >> While this appears to address the current upstream use model for ACM
> >> with it's multicast overlay backend where PRs are static, it does not
> >> appear to address PR changes.
> >> Should aging/revalidation of PRs be supported ? If so, would this
> >> make PRs similar at "high" level to IP neighbor cache in kernel ?
> >
> > Even for the default provider acmp, PRs will time out and the length of
> timeout is configurable. For other providers (eg ibssa), the PR change could
> be managed correctly and promptly, and this capability is beyond ibacm core
> itself.
>
> That deals with the update of PR in user space (ACM). Doesn't kernel need
> some way of knowing PR was updated ?
Maybe. If a kernel client is interested in PR changes, it can register for notification and once a notice is received, it can always go back to query for the new PR. But that is the responsibility of the individual client and is beyond the scope of this patch series.
Kaike
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 15:19 ` Hal Rosenstock
[not found] ` <5578558B.1070503-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 15:19 UTC (permalink / raw)
To: Wan, Kaike; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 6/10/2015 11:07 AM, Wan, Kaike wrote:
>
>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Wednesday, June 10, 2015 10:40 AM
>>>
>>>
>>>> -----Original Message-----
>>>> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>>>> Sent: Wednesday, June 10, 2015 9:37 AM
>>>>
>>>>>
>>>>> A SA cache is undeniably critical for fabric scalability and performance.
>>>>> In user space, the ibacm application provides a good example of
>>>>> pathrecord cache for address and route resolution. With the recent
>>>>> implementation of the provider architecture, ibacm offers more
>>>> extensibility as a SA cache.
>>>>> In kernel, ipoib implements its own small cache for pathrecords,
>>>>> which is however not available for general use. Furthermore, the
>>>>> implementation of a SA cache in user space offers better
>>>>> flexibility, larger capacity, and more robustness for the system.
>>>>>
>>>>> In this patch series, a mechanism is implemented to allow ib_sa to
>>>>> send pathrecord query to a user application (eg ibacm) through netlink.
>>>>
>>>> While this appears to address the current upstream use model for ACM
>>>> with it's multicast overlay backend where PRs are static, it does not
>>>> appear to address PR changes.
>>>> Should aging/revalidation of PRs be supported ? If so, would this
>>>> make PRs similar at "high" level to IP neighbor cache in kernel ?
>>>
>>> Even for the default provider acmp, PRs will time out and the length of
>> timeout is configurable. For other providers (eg ibssa), the PR change could
>> be managed correctly and promptly, and this capability is beyond ibacm core
>> itself.
>>
>> That deals with the update of PR in user space (ACM). Doesn't kernel need
>> some way of knowing PR was updated ?
>
> Maybe. If a kernel client is interested in PR changes, it can register for notification
Wouldn't the notification mechanism be via netlink ? What would the
granularity be for the registrations and notifications ?
> and once a notice is received, it can always go back to query for the new PR.
> But that is the responsibility of the individual client and is beyond the scope of this patch series.
I thought that the goal is to have an interface that addresses kernel
<-> userspace needs for PRs in general and ACM was just first consumer.
I'm not sure that it's sufficient for other ACM providers.
-- Hal
> Kaike
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55783D84.6040709-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 14:22 ` Wan, Kaike
@ 2015-06-10 15:21 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6677-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
1 sibling, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 15:21 UTC (permalink / raw)
To: Hal Rosenstock, Wan, Kaike
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> While this appears to address the current upstream use model for ACM
> with it's multicast overlay backend where PRs are static, it does not
> appear to address PR changes.
Although this ties into ibacm, from the viewpoint of the kernel, there's no requirement on the user space implementation.
> Should aging/revalidation of PRs be supported ? If so, would this make
> PRs similar at "high" level to IP neighbor cache in kernel ?
This is requesting a new feature not supported by anything in the kernel today, and would seem to fall well outside the scope of the suggested changes. Is there a specific issue in the patches that you are seeing related to this?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <5578558B.1070503-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 15:49 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCCC5-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Wan, Kaike @ 2015-06-10 15:49 UTC (permalink / raw)
To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>>> A SA cache is undeniably critical for fabric scalability and performance.
> >>>>> In user space, the ibacm application provides a good example of
> >>>>> pathrecord cache for address and route resolution. With the recent
> >>>>> implementation of the provider architecture, ibacm offers more
> >>>> extensibility as a SA cache.
> >>>>> In kernel, ipoib implements its own small cache for pathrecords,
> >>>>> which is however not available for general use. Furthermore, the
> >>>>> implementation of a SA cache in user space offers better
> >>>>> flexibility, larger capacity, and more robustness for the system.
> >>>>>
> >>>>> In this patch series, a mechanism is implemented to allow ib_sa to
> >>>>> send pathrecord query to a user application (eg ibacm) through netlink.
> >>>>
> >>>> While this appears to address the current upstream use model for
> >>>> ACM with it's multicast overlay backend where PRs are static, it
> >>>> does not appear to address PR changes.
> >>>> Should aging/revalidation of PRs be supported ? If so, would this
> >>>> make PRs similar at "high" level to IP neighbor cache in kernel ?
> >>>
> >>> Even for the default provider acmp, PRs will time out and the length
> >>> of
> >> timeout is configurable. For other providers (eg ibssa), the PR
> >> change could be managed correctly and promptly, and this capability
> >> is beyond ibacm core itself.
> >>
> >> That deals with the update of PR in user space (ACM). Doesn't kernel
> >> need some way of knowing PR was updated ?
> >
> > Maybe. If a kernel client is interested in PR changes, it can register
> > for notification
>
> Wouldn't the notification mechanism be via netlink ? What would the
> granularity be for the registrations and notifications ?
Not necessarily. AFAIK, SA notification is not currently supported in kernel. One could register notification through netlink, as long as the user space application (eg: ibacm) supports it. But this would be a new feature, as Sean pointed out in previous e-mail.
Kaike
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCCC5-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 16:46 ` Hal Rosenstock
0 siblings, 0 replies; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 16:46 UTC (permalink / raw)
To: Wan, Kaike; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 6/10/2015 11:49 AM, Wan, Kaike wrote:
>>>>>>> A SA cache is undeniably critical for fabric scalability and performance.
>>>>>>> In user space, the ibacm application provides a good example of
>>>>>>> pathrecord cache for address and route resolution. With the recent
>>>>>>> implementation of the provider architecture, ibacm offers more
>>>>>> extensibility as a SA cache.
>>>>>>> In kernel, ipoib implements its own small cache for pathrecords,
>>>>>>> which is however not available for general use. Furthermore, the
>>>>>>> implementation of a SA cache in user space offers better
>>>>>>> flexibility, larger capacity, and more robustness for the system.
>>>>>>>
>>>>>>> In this patch series, a mechanism is implemented to allow ib_sa to
>>>>>>> send pathrecord query to a user application (eg ibacm) through netlink.
>>>>>>
>>>>>> While this appears to address the current upstream use model for
>>>>>> ACM with it's multicast overlay backend where PRs are static, it
>>>>>> does not appear to address PR changes.
>>>>>> Should aging/revalidation of PRs be supported ? If so, would this
>>>>>> make PRs similar at "high" level to IP neighbor cache in kernel ?
>>>>>
>>>>> Even for the default provider acmp, PRs will time out and the length
>>>>> of
>>>> timeout is configurable. For other providers (eg ibssa), the PR
>>>> change could be managed correctly and promptly, and this capability
>>>> is beyond ibacm core itself.
>>>>
>>>> That deals with the update of PR in user space (ACM). Doesn't kernel
>>>> need some way of knowing PR was updated ?
>>>
>>> Maybe. If a kernel client is interested in PR changes, it can register
>>> for notification
>>
>> Wouldn't the notification mechanism be via netlink ? What would the
>> granularity be for the registrations and notifications ?
>
> Not necessarily. AFAIK, SA notification is not currently supported in kernel. One could register notification through netlink, as long as the user space application (eg: ibacm) supports it. But this would be a new feature, as Sean pointed out in previous e-mail.
Such notifications are not necessarily tied to SA event reporting. In
fact, Todd made argument at 2014 OFA Devcon that such mechanism is
anathema to (exa)scalability.
-- Hal
> Kaike
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6677-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 16:50 ` Hal Rosenstock
[not found] ` <55786ACC.4070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 16:50 UTC (permalink / raw)
To: Hefty, Sean
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 6/10/2015 11:21 AM, Hefty, Sean wrote:
>> While this appears to address the current upstream use model for ACM
>> with it's multicast overlay backend where PRs are static, it does not
>> appear to address PR changes.
>
> Although this ties into ibacm, from the viewpoint of the kernel, there's no requirement on the user space implementation.
True. This is not how it works in the kernel today. It is meant as
future thinking/exploring.
>> Should aging/revalidation of PRs be supported ? If so, would this make
>> PRs similar at "high" level to IP neighbor cache in kernel ?
>
> This is requesting a new feature not supported by anything in the kernel today, and would seem to fall well
> outside the scope of the suggested changes.
Outside scope of suggested changes but where does the kernel need to
head in the longer term ?
> Is there a specific issue in the patches that you are seeing related to this?
Not in the patches themselves but in the general issue when a PR changes.
Do you think this needs addressing or are things fine as they are now ?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55786ACC.4070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 17:04 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE680D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 17:04 UTC (permalink / raw)
To: Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Not in the patches themselves but in the general issue when a PR changes.
>
> Do you think this needs addressing or are things fine as they are now ?
IMO, I think it needs addressing in terms of "can the proposed netlink architecture and design accommodate this sort of request in the future?" We shouldn't design in a limitation up front. I don't see anything in the current approach that would cause an issue. There would likely be a need for new messages and attributes.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <1433861837-26177-2-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2015-06-10 17:47 ` Hal Rosenstock
[not found] ` <55787827.7030003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 17:47 UTC (permalink / raw)
To: kaike.wan-ral2JQCrhuEAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Fleck, Ira Weiny
On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> This patch adds netlink defines for SA client, local service group, local
> service operations, and related attributes.
>
> Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
> include/uapi/rdma/rdma_netlink.h | 82 ++++++++++++++++++++++++++++++++++++++
> 1 files changed, 82 insertions(+), 0 deletions(-)
>
> diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
> index 6e4bb42..341e9be 100644
> --- a/include/uapi/rdma/rdma_netlink.h
> +++ b/include/uapi/rdma/rdma_netlink.h
> @@ -7,12 +7,14 @@ enum {
> RDMA_NL_RDMA_CM = 1,
> RDMA_NL_NES,
> RDMA_NL_C4IW,
> + RDMA_NL_SA,
> RDMA_NL_NUM_CLIENTS
> };
>
> enum {
> RDMA_NL_GROUP_CM = 1,
> RDMA_NL_GROUP_IWPM,
> + RDMA_NL_GROUP_LS,
> RDMA_NL_NUM_GROUPS
> };
>
> @@ -128,5 +130,85 @@ enum {
> IWPM_NLA_ERR_MAX
> };
>
> +/* Local service group opcodes */
> +enum {
> + RDMA_NL_LS_OP_RESOLVE = 0,
> + RDMA_NL_LS_OP_SET_TIMEOUT,
> + RDMA_NL_LS_NUM_OPS
> +};
> +
> +/* Local service netlink message flags */
> +#define RDMA_NL_LS_F_OK 0x0100 /* Success response */
> +#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
> +
> +/* Local service attribute type */
> +enum {
> + LS_NLA_TYPE_STATUS = 0,
> + LS_NLA_TYPE_PATH_RECORD,
> + LS_NLA_TYPE_TIMEOUT,
> + LS_NLA_TYPE_SERVICE_ID,
> + LS_NLA_TYPE_DGID,
> + LS_NLA_TYPE_SGID,
> + LS_NLA_TYPE_TCLASS,
> + LS_NLA_TYPE_REVERSIBLE,
> + LS_NLA_TYPE_NUM_PATH,
Should this be NUMB_PATH rather than NUM_PATH ?
> + LS_NLA_TYPE_PKEY,
> + LS_NLA_TYPE_QOS_CLASS,
Should this include SL too ?
> + LS_NLA_TYPE_MAX
> +};
> +
> +/* Local service status attribute */
> +enum {
> + LS_NLA_STATUS_SUCCESS = 0,
> + LS_NLA_STATUS_EINVAL,
> + LS_NLA_STATUS_ENODATA,
> + LS_NLA_STATUS_MAX
> +};
> +
> +struct rdma_nla_ls_status {
> + __u32 status;
> +};
> +
> +/* Local service pathrecord attribute: struct ib_path_rec_data */
> +
> +/* Local service timeout attribute */
> +struct rdma_nla_ls_timeout {
> + __u32 timeout;
> +};
> +
> +/* Local Service ServiceID attribute */
> +struct rdma_nla_ls_service_id {
> + __be64 service_id;
> +};
> +
> +/* Local Service DGID/SGID attribute: big endian */
> +struct rdma_nla_ls_gid {
> + __u8 gid[16];
> +};
> +
> +/* Local Service Traffic Class attribute */
> +struct rdma_nla_ls_tclass {
> + __u8 tclass;
> +};
> +
> +/* Local Service Reversible attribute */
> +struct rdma_nla_ls_reversible {
> + __u32 reversible;
> +};
Isn't __u8 sufficient for reversible ?
> +
> +/* Local Service numb_path attribute */
> +struct rdma_nla_ls_numb_path {
> + __u8 numb_path;
> +};
> +
> +/* Local Service Pkey attribute*/
> +struct rdma_nla_ls_pkey {
> + __be16 pkey;
> +};
> +
> +/* Local Service Qos Class attribute */
> +struct rdma_nla_ls_qos_class {
> + __be16 qos_class;
> +};
>
> #endif /* _UAPI_RDMA_NETLINK_H */
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <1433861837-26177-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2015-06-10 17:47 ` Hal Rosenstock
[not found] ` <55787838.3020606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 17:47 UTC (permalink / raw)
To: kaike.wan-ral2JQCrhuEAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Fleck, Ira Weiny
On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> This patch routes a SA pathrecord query to netlink first
Should only unicast PRs be done in this manner or should API support
enabling for unicast and/or multicast ?
AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but this
seems like it would help make it future proof and not have to take
timeout on local query unless app supports it.
> and processes the
> response appropriately. If a failure is returned, the request will be sent
> through IB. The decision whether to route the request to netlink first is
> determined by the presence of a listener for the local service netlink
> multicast group. If the user-space local service netlink multicast group
> listener is not present, the request will be sent through IB, just like
> what is currently being done.
>
> Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
> drivers/infiniband/core/sa_query.c | 515 +++++++++++++++++++++++++++++++++++-
> include/uapi/rdma/rdma_netlink.h | 2 +-
> 2 files changed, 515 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index 17e1cf7..e063593 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -45,12 +45,21 @@
> #include <uapi/linux/if_ether.h>
> #include <rdma/ib_pack.h>
> #include <rdma/ib_cache.h>
> +#include <rdma/rdma_netlink.h>
> +#include <net/netlink.h>
> +#include <uapi/rdma/ib_user_sa.h>
> +#include <rdma/ib_marshall.h>
> #include "sa.h"
>
> MODULE_AUTHOR("Roland Dreier");
> MODULE_DESCRIPTION("InfiniBand subnet administration query support");
> MODULE_LICENSE("Dual BSD/GPL");
>
> +#define IB_SA_LOCAL_SVC_TIMEOUT_MIN 100
> +#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT 2000
> +#define IB_SA_LOCAL_SVC_TIMEOUT_MAX 200000
> +static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
> +
> struct ib_sa_sm_ah {
> struct ib_ah *ah;
> struct kref ref;
> @@ -80,8 +89,24 @@ struct ib_sa_query {
> struct ib_mad_send_buf *mad_buf;
> struct ib_sa_sm_ah *sm_ah;
> int id;
> + u32 flags;
> };
>
> +#define IB_SA_ENABLE_LOCAL_SERVICE 0x00000001
> +#define IB_SA_CANCEL 0x00000002
> +
> +#define IB_SA_LOCAL_SVC_ENABLED(query) \
> + ((query)->flags & IB_SA_ENABLE_LOCAL_SERVICE)
> +#define IB_SA_ENABLE_LOCAL_SVC(query) \
> + ((query)->flags |= IB_SA_ENABLE_LOCAL_SERVICE)
> +#define IB_SA_DISABLE_LOCAL_SVC(query) \
> + ((query)->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE)
> +
> +#define IB_SA_QUERY_CANCELLED(query) \
> + ((query)->flags & IB_SA_CANCEL)
> +#define IB_SA_CANCEL_QUERY(query) \
> + ((query)->flags |= IB_SA_CANCEL)
> +
> struct ib_sa_service_query {
> void (*callback)(int, struct ib_sa_service_rec *, void *);
> void *context;
> @@ -106,6 +131,26 @@ struct ib_sa_mcmember_query {
> struct ib_sa_query sa_query;
> };
>
> +struct ib_nl_request_info {
> + struct list_head list;
> + u32 seq;
> + unsigned long timeout;
> + struct ib_sa_query *query;
> +};
> +
> +struct ib_nl_attr_info {
> + u16 len; /* Total attr len: header + payload + padding */
> + ib_sa_comp_mask comp_mask;
> + void *input;
> + void (*set_attrs)(struct sk_buff *skb, struct ib_nl_attr_info *info);
> +};
> +
> +static LIST_HEAD(ib_nl_request_list);
> +static DEFINE_SPINLOCK(ib_nl_request_lock);
> +static atomic_t ib_nl_sa_request_seq;
> +static struct workqueue_struct *ib_nl_wq;
> +static struct delayed_work ib_nl_timed_work;
> +
> static void ib_sa_add_one(struct ib_device *device);
> static void ib_sa_remove_one(struct ib_device *device);
>
> @@ -381,6 +426,433 @@ static const struct ib_field guidinfo_rec_table[] = {
> .size_bits = 512 },
> };
>
> +static int ib_nl_send_msg(int opcode, struct ib_nl_attr_info *attrs, u32 seq)
> +{
> + struct sk_buff *skb = NULL;
> + struct nlmsghdr *nlh;
> + void *data;
> + int ret = 0;
> +
> + if (attrs->len <= 0)
> + return -EMSGSIZE;
> +
> + skb = nlmsg_new(attrs->len, GFP_KERNEL);
> + if (!skb) {
> + pr_err("alloc failed ret=%d\n", ret);
> + return -ENOMEM;
> + }
> +
> + /* Put nlmsg header only for now */
> + data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA,
> + opcode, GFP_KERNEL);
> + if (!data) {
> + kfree_skb(skb);
> + return -EMSGSIZE;
> + }
> +
> + /* Add attributes */
> + attrs->set_attrs(skb, attrs);
> +
> + /* Repair the nlmsg header length */
> + nlmsg_end(skb, nlh);
> +
> + ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
> + if (!ret) {
> + ret = attrs->len;
> + } else {
> + if (ret != -ESRCH)
> + pr_err("ibnl_multicast failed l=%d, r=%d\n",
> + attrs->len, ret);
> + ret = 0;
> + }
> + return ret;
> +}
> +
> +static struct ib_nl_request_info *
> +ib_nl_alloc_request(struct ib_sa_query *query)
> +{
> + struct ib_nl_request_info *rinfo;
> +
> + rinfo = kzalloc(sizeof(*rinfo), GFP_ATOMIC);
> + if (rinfo == NULL)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&rinfo->list);
> + rinfo->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
> + rinfo->query = query;
> +
> + return rinfo;
> +}
> +
> +static void ib_nl_set_path_rec_attrs(struct sk_buff *skb,
> + struct ib_nl_attr_info *info)
> +{
> + struct ib_sa_path_rec *sa_rec = info->input;
> +
> + if (info->comp_mask & IB_SA_PATH_REC_SERVICE_ID)
> + nla_put(skb, LS_NLA_TYPE_SERVICE_ID,
> + sizeof(sa_rec->service_id), &sa_rec->service_id);
> + if (info->comp_mask & IB_SA_PATH_REC_DGID)
> + nla_put(skb, LS_NLA_TYPE_DGID, sizeof(sa_rec->dgid),
> + &sa_rec->dgid);
> + if (info->comp_mask & IB_SA_PATH_REC_SGID)
> + nla_put(skb, LS_NLA_TYPE_SGID, sizeof(sa_rec->sgid),
> + &sa_rec->sgid);
> + if (info->comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
> + nla_put(skb, LS_NLA_TYPE_TCLASS, sizeof(sa_rec->traffic_class),
> + &sa_rec->traffic_class);
> + if (info->comp_mask & IB_SA_PATH_REC_REVERSIBLE)
> + nla_put(skb, LS_NLA_TYPE_REVERSIBLE,
> + sizeof(sa_rec->reversible), &sa_rec->reversible);
> + if (info->comp_mask & IB_SA_PATH_REC_NUMB_PATH)
> + nla_put(skb, LS_NLA_TYPE_NUMB_PATH,
> + sizeof(sa_rec->numb_path), &sa_rec->numb_path);
> + if (info->comp_mask & IB_SA_PATH_REC_PKEY)
> + nla_put(skb, LS_NLA_TYPE_PKEY, sizeof(sa_rec->pkey),
> + &sa_rec->pkey);
> + if (info->comp_mask & IB_SA_PATH_REC_QOS_CLASS)
> + nla_put(skb, LS_NLA_TYPE_QOS_CLASS,
> + sizeof(sa_rec->qos_class), &sa_rec->qos_class);
> +}
> +
> +static int ib_nl_get_path_rec_attrs_len(struct ib_nl_attr_info *info)
> +{
> + int len = 0;
> +
> + if (info->comp_mask & IB_SA_PATH_REC_SERVICE_ID)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_service_id));
> + if (info->comp_mask & IB_SA_PATH_REC_DGID)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
> + if (info->comp_mask & IB_SA_PATH_REC_SGID)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
> + if (info->comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_tclass));
> + if (info->comp_mask & IB_SA_PATH_REC_REVERSIBLE)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_reversible));
> + if (info->comp_mask & IB_SA_PATH_REC_NUMB_PATH)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_numb_path));
> + if (info->comp_mask & IB_SA_PATH_REC_PKEY)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_pkey));
> + if (info->comp_mask & IB_SA_PATH_REC_QOS_CLASS)
> + len += nla_total_size(sizeof(struct rdma_nla_ls_qos_class));
> +
> + return len;
> +}
> +
> +static int ib_nl_send_request(struct ib_nl_request_info *rinfo)
> +{
> + struct ib_nl_attr_info info;
> + int opcode;
> + struct ib_sa_mad *mad;
> + unsigned long flags;
> + unsigned long delay;
> + int ret;
> +
> + mad = rinfo->query->mad_buf->mad;
> + switch (mad->mad_hdr.attr_id) {
> + case cpu_to_be16(IB_SA_ATTR_PATH_REC):
> + opcode = RDMA_NL_LS_OP_RESOLVE;
> + mad = rinfo->query->mad_buf->mad;
> + info.comp_mask = mad->sa_hdr.comp_mask;
> + info.input = rinfo->query->mad_buf->context[1];
> + rinfo->query->mad_buf->context[1] = NULL;
> + info.len = ib_nl_get_path_rec_attrs_len(&info);
> + info.set_attrs = ib_nl_set_path_rec_attrs;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + ret = ib_nl_send_msg(opcode, &info, rinfo->seq);
> + if (ret <= 0) {
> + ret = -EIO;
> + goto request_out;
> + } else {
> + ret = 0;
> + }
> +
> + delay = msecs_to_jiffies(sa_local_svc_timeout_ms);
> + rinfo->timeout = delay + jiffies;
> + list_add_tail(&rinfo->list, &ib_nl_request_list);
> + /* Start the timeout if this is the only request */
> + if (ib_nl_request_list.next == &rinfo->list)
> + queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
> +
> +request_out:
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> +
> + return ret;
> +}
> +
> +static int ib_nl_make_request(struct ib_sa_query *query)
> +{
> + struct ib_nl_request_info *rinfo;
> + int ret;
> +
> + rinfo = ib_nl_alloc_request(query);
> + if (IS_ERR(rinfo))
> + return -ENOMEM;
> +
> + ret = ib_nl_send_request(rinfo);
> + if (ret)
> + kfree(rinfo);
> +
> + return ret;
> +}
> +
> +static int ib_nl_cancel_request(struct ib_sa_query *query)
> +{
> + unsigned long flags;
> + struct ib_nl_request_info *rinfo;
> + int found = 0;
> +
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + list_for_each_entry(rinfo, &ib_nl_request_list, list) {
> + /* Let the timeout to take care of the callback */
> + if (query == rinfo->query) {
> + IB_SA_CANCEL_QUERY(query);
> + rinfo->timeout = jiffies;
> + list_move(&rinfo->list, &ib_nl_request_list);
> + found = 1;
> + mod_delayed_work(ib_nl_wq, &ib_nl_timed_work, 1);
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> +
> + return found;
> +}
> +
> +static void send_handler(struct ib_mad_agent *agent,
> + struct ib_mad_send_wc *mad_send_wc);
> +
> +static void ib_nl_process_good_resolve_rsp(struct ib_sa_query *query,
> + const struct nlmsghdr *nlh)
> +{
> + struct ib_mad_send_wc mad_send_wc;
> + struct ib_sa_mad *mad = NULL;
> + const struct nlattr *head, *curr;
> + struct ib_path_rec_data *rec;
> + int len, rem;
> +
> + if (query->callback) {
> + head = (const struct nlattr *) nlmsg_data(nlh);
> + len = nlmsg_len(nlh);
> + nla_for_each_attr(curr, head, len, rem) {
> + if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
> + rec = nla_data(curr);
> + if ((rec->flags && IB_PATH_GMP) &&
> + (rec->flags && IB_PATH_PRIMARY)) {
> + mad = query->mad_buf->mad;
> + mad->mad_hdr.method |=
> + IB_MGMT_METHOD_RESP;
> + memcpy(mad->data, rec->path_rec,
> + sizeof(rec->path_rec));
> + query->callback(query, 0, mad);
> + break;
> + }
> + }
> + }
> + }
> +
> + mad_send_wc.send_buf = query->mad_buf;
> + mad_send_wc.status = IB_WC_SUCCESS;
> + send_handler(query->mad_buf->mad_agent, &mad_send_wc);
> +}
> +
> +static void ib_nl_request_timeout(struct work_struct *work)
> +{
> + unsigned long flags;
> + struct ib_nl_request_info *rinfo;
> + struct ib_sa_query *query;
> + unsigned long delay;
> + struct ib_mad_send_wc mad_send_wc;
> + int ret;
> +
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + while (!list_empty(&ib_nl_request_list)) {
> + rinfo = list_entry(ib_nl_request_list.next,
> + struct ib_nl_request_info, list);
> +
> + if (time_after(rinfo->timeout, jiffies)) {
> + delay = rinfo->timeout - jiffies;
> + if ((long)delay <= 0)
> + delay = 1;
> + queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
> + break;
> + }
> +
> + list_del(&rinfo->list);
> + query = rinfo->query;
> + IB_SA_DISABLE_LOCAL_SVC(query);
> + /* Hold the lock to protect against query cancellation */
> + if (IB_SA_QUERY_CANCELLED(query))
> + ret = -1;
> + else
> + ret = ib_post_send_mad(query->mad_buf, NULL);
> + if (ret) {
> + mad_send_wc.send_buf = query->mad_buf;
> + mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> + send_handler(query->port->agent, &mad_send_wc);
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + }
> + kfree(rinfo);
> + }
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> +}
> +
> +static inline int ib_nl_is_good_resolve_resp(const struct nlmsghdr *nlh)
> +{
> + const struct nlattr *head, *curr;
> + struct ib_path_rec_data *rec;
> + int len, rem;
> +
> + if (nlh->nlmsg_flags & RDMA_NL_LS_F_ERR)
> + return 0;
> +
> + if (!(nlh->nlmsg_flags & RDMA_NL_LS_F_OK))
> + return 0;
> +
> + if (nlmsg_len(nlh) < nla_attr_size(sizeof(*rec)))
> + return 0;
> +
> + head = (const struct nlattr *) nlmsg_data(nlh);
> + len = nlmsg_len(nlh);
> + nla_for_each_attr(curr, head, len, rem) {
> + if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
> + rec = nla_data(curr);
> + if ((rec->flags && IB_PATH_GMP) &&
> + (rec->flags && IB_PATH_PRIMARY))
> + return 1;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int ib_nl_handle_set_timeout(struct sk_buff *skb,
> + struct netlink_callback *cb)
> +{
> + const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
> + int timeout, delta, abs_delta;
> + const struct nlattr *attr;
> + struct rdma_nla_ls_timeout *to_attr;
> + unsigned long flags;
> + struct ib_nl_request_info *rinfo;
> + long delay = 0;
> +
> + if (nlmsg_len(nlh) < nla_attr_size(sizeof(*to_attr)))
> + goto settimeout_out;
> +
> + attr = (const struct nlattr *) nlmsg_data(nlh);
> + if (attr->nla_type != LS_NLA_TYPE_TIMEOUT ||
> + nla_len(attr) != sizeof(*to_attr))
> + goto settimeout_out;
> +
> + to_attr = (struct rdma_nla_ls_timeout *) nla_data(attr);
> + timeout = (int) to_attr->timeout;
> + if (timeout < IB_SA_LOCAL_SVC_TIMEOUT_MIN)
> + timeout = IB_SA_LOCAL_SVC_TIMEOUT_MIN;
> + if (timeout > IB_SA_LOCAL_SVC_TIMEOUT_MAX)
> + timeout = IB_SA_LOCAL_SVC_TIMEOUT_MAX;
> +
> + delta = timeout - sa_local_svc_timeout_ms;
> + if (delta < 0)
> + abs_delta = -delta;
> + else
> + abs_delta = delta;
> +
> + if (delta != 0) {
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + sa_local_svc_timeout_ms = timeout;
> + list_for_each_entry(rinfo, &ib_nl_request_list, list) {
> + if (delta < 0 && abs_delta > rinfo->timeout)
> + rinfo->timeout = 0;
> + else
> + rinfo->timeout += delta;
> +
> + /* Get the new delay from the first entry */
> + if (!delay) {
> + delay = rinfo->timeout - jiffies;
> + if (delay <= 0)
> + delay = 1;
> + }
> + }
> + if (delay)
> + mod_delayed_work(ib_nl_wq, &ib_nl_timed_work,
> + (unsigned long)delay);
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> + }
> +
> +settimeout_out:
> + return skb->len;
> +}
> +
> +static int ib_nl_handle_resolve_resp(struct sk_buff *skb,
> + struct netlink_callback *cb)
> +{
> + const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
> + unsigned long flags;
> + struct ib_nl_request_info *rinfo;
> + struct ib_sa_query *query;
> + struct ib_mad_send_buf *send_buf;
> + struct ib_mad_send_wc mad_send_wc;
> + int found = 0;
> + int ret;
> +
> + spin_lock_irqsave(&ib_nl_request_lock, flags);
> + list_for_each_entry(rinfo, &ib_nl_request_list, list) {
> + /*
> + * If the query is cancelled, let the timeout routine
> + * take care of it.
> + */
> + if (nlh->nlmsg_seq == rinfo->seq) {
> + found = !IB_SA_QUERY_CANCELLED(rinfo->query);
> + if (found)
> + list_del(&rinfo->list);
> + break;
> + }
> + }
> +
> + if (!found) {
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> + goto resp_out;
> + }
> +
> + query = rinfo->query;
> + send_buf = query->mad_buf;
> +
> + if (!ib_nl_is_good_resolve_resp(nlh)) {
> + /* if the result is a failure, send out the packet via IB */
> + IB_SA_DISABLE_LOCAL_SVC(query);
> + ret = ib_post_send_mad(query->mad_buf, NULL);
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> + if (ret) {
> + mad_send_wc.send_buf = send_buf;
> + mad_send_wc.status = IB_WC_GENERAL_ERR;
> + send_handler(query->port->agent, &mad_send_wc);
> + }
> + } else {
> + spin_unlock_irqrestore(&ib_nl_request_lock, flags);
> + ib_nl_process_good_resolve_rsp(query, nlh);
> + }
> +
> + kfree(rinfo);
> +resp_out:
> + return skb->len;
> +}
> +
> +static struct ibnl_client_cbs ib_sa_cb_table[] = {
> + [RDMA_NL_LS_OP_RESOLVE] = {
> + .dump = ib_nl_handle_resolve_resp,
> + .module = THIS_MODULE },
> + [RDMA_NL_LS_OP_SET_TIMEOUT] = {
> + .dump = ib_nl_handle_set_timeout,
> + .module = THIS_MODULE },
> +};
> +
> static void free_sm_ah(struct kref *kref)
> {
> struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
> @@ -502,7 +974,13 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query)
> mad_buf = query->mad_buf;
> spin_unlock_irqrestore(&idr_lock, flags);
>
> - ib_cancel_mad(agent, mad_buf);
> + /*
> + * If the query is still on the netlink request list, schedule
> + * it to be cancelled by the timeout routine. Otherwise, it has been
> + * sent to the MAD layer and has to be cancelled from there.
> + */
> + if (!ib_nl_cancel_request(query))
> + ib_cancel_mad(agent, mad_buf);
> }
> EXPORT_SYMBOL(ib_sa_cancel_query);
>
> @@ -638,6 +1116,14 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> query->mad_buf->context[0] = query;
> query->id = id;
>
> + if (IB_SA_LOCAL_SVC_ENABLED(query)) {
> + if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
> + if (!ib_nl_make_request(query))
> + return id;
> + }
> + IB_SA_DISABLE_LOCAL_SVC(query);
> + }
> +
> ret = ib_post_send_mad(query->mad_buf, NULL);
> if (ret) {
> spin_lock_irqsave(&idr_lock, flags);
> @@ -766,6 +1252,9 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
>
> *sa_query = &query->sa_query;
>
> + IB_SA_ENABLE_LOCAL_SVC(&query->sa_query);
> + query->sa_query.mad_buf->context[1] = rec;
> +
> ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
> if (ret < 0)
> goto err2;
> @@ -1250,6 +1739,8 @@ static int __init ib_sa_init(void)
>
> get_random_bytes(&tid, sizeof tid);
>
> + atomic_set(&ib_nl_sa_request_seq, 0);
> +
> ret = ib_register_client(&sa_client);
> if (ret) {
> printk(KERN_ERR "Couldn't register ib_sa client\n");
> @@ -1262,7 +1753,25 @@ static int __init ib_sa_init(void)
> goto err2;
> }
>
> + ib_nl_wq = create_singlethread_workqueue("ib_nl_sa_wq");
> + if (!ib_nl_wq) {
> + ret = -ENOMEM;
> + goto err3;
> + }
> +
> + if (ibnl_add_client(RDMA_NL_SA, RDMA_NL_LS_NUM_OPS,
> + ib_sa_cb_table)) {
> + pr_err("Failed to add netlink callback\n");
> + ret = -EINVAL;
> + goto err4;
> + }
> + INIT_DELAYED_WORK(&ib_nl_timed_work, ib_nl_request_timeout);
> +
> return 0;
> +err4:
> + destroy_workqueue(ib_nl_wq);
> +err3:
> + mcast_cleanup();
> err2:
> ib_unregister_client(&sa_client);
> err1:
> @@ -1271,6 +1780,10 @@ err1:
>
> static void __exit ib_sa_cleanup(void)
> {
> + ibnl_remove_client(RDMA_NL_SA);
> + cancel_delayed_work(&ib_nl_timed_work);
> + flush_workqueue(ib_nl_wq);
> + destroy_workqueue(ib_nl_wq);
> mcast_cleanup();
> ib_unregister_client(&sa_client);
> idr_destroy(&query_idr);
> diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
> index 341e9be..6aa9281 100644
> --- a/include/uapi/rdma/rdma_netlink.h
> +++ b/include/uapi/rdma/rdma_netlink.h
> @@ -151,7 +151,7 @@ enum {
> LS_NLA_TYPE_SGID,
> LS_NLA_TYPE_TCLASS,
> LS_NLA_TYPE_REVERSIBLE,
> - LS_NLA_TYPE_NUM_PATH,
> + LS_NLA_TYPE_NUMB_PATH,
Looks to me like this should've been in patch 1.
> LS_NLA_TYPE_PKEY,
> LS_NLA_TYPE_QOS_CLASS,
> LS_NLA_TYPE_MAX
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55784C35.2020401-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 15:07 ` Wan, Kaike
@ 2015-06-10 18:00 ` ira.weiny
[not found] ` <20150610180020.GC13497-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
1 sibling, 1 reply; 39+ messages in thread
From: ira.weiny @ 2015-06-10 18:00 UTC (permalink / raw)
To: Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Wed, Jun 10, 2015 at 10:39:49AM -0400, Hal Rosenstock wrote:
> On 6/10/2015 10:22 AM, Wan, Kaike wrote:
> >
> >
> >> -----Original Message-----
> >> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> >> Sent: Wednesday, June 10, 2015 9:37 AM
> >>
> >>>
> >>> A SA cache is undeniably critical for fabric scalability and performance.
> >>> In user space, the ibacm application provides a good example of
> >>> pathrecord cache for address and route resolution. With the recent
> >>> implementation of the provider architecture, ibacm offers more
> >> extensibility as a SA cache.
> >>> In kernel, ipoib implements its own small cache for pathrecords, which
> >>> is however not available for general use. Furthermore, the
> >>> implementation of a SA cache in user space offers better flexibility,
> >>> larger capacity, and more robustness for the system.
> >>>
> >>> In this patch series, a mechanism is implemented to allow ib_sa to
> >>> send pathrecord query to a user application (eg ibacm) through netlink.
> >>
> >> While this appears to address the current upstream use model for ACM with
> >> it's multicast overlay backend where PRs are static, it does not appear to
> >> address PR changes.
> >> Should aging/revalidation of PRs be supported ? If so, would this make PRs
> >> similar at "high" level to IP neighbor cache in kernel ?
> >
> > Even for the default provider acmp, PRs will time out and the length of timeout is configurable. For other providers (eg ibssa), the PR change could be managed correctly and promptly, and this capability is beyond ibacm core itself.
>
> That deals with the update of PR in user space (ACM). Doesn't kernel
> need some way of knowing PR was updated ?
>
This series does not attempt to optimize the kernel needing to know that a PR
has been updated. There are existing mechanisms for that.
In the future it would be nice to have more local support for such updates but
I don't think this patch precludes that in any way. Right now we are just
taking the first step by making the actual query more efficient.
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <20150610180020.GC13497-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2015-06-10 18:05 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6940-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 18:05 UTC (permalink / raw)
To: Weiny, Ira, Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> This series does not attempt to optimize the kernel needing to know that a
> PR
> has been updated. There are existing mechanisms for that.
Does this exist in the kernel?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <55787827.7030003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 18:31 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCD80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Wan, Kaike @ 2015-06-10 18:31 UTC (permalink / raw)
To: Hal Rosenstock
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John,
Weiny, Ira
> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Wednesday, June 10, 2015 1:47 PM
>
> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> > From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >
> > This patch adds netlink defines for SA client, local service group,
> > local service operations, and related attributes.
> >
> > Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > ---
> > include/uapi/rdma/rdma_netlink.h | 82
> ++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 82 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/uapi/rdma/rdma_netlink.h
> > b/include/uapi/rdma/rdma_netlink.h
> > index 6e4bb42..341e9be 100644
> > --- a/include/uapi/rdma/rdma_netlink.h
> > +++ b/include/uapi/rdma/rdma_netlink.h
> > @@ -7,12 +7,14 @@ enum {
> > RDMA_NL_RDMA_CM = 1,
> > RDMA_NL_NES,
> > RDMA_NL_C4IW,
> > + RDMA_NL_SA,
> > RDMA_NL_NUM_CLIENTS
> > };
> >
> > enum {
> > RDMA_NL_GROUP_CM = 1,
> > RDMA_NL_GROUP_IWPM,
> > + RDMA_NL_GROUP_LS,
> > RDMA_NL_NUM_GROUPS
> > };
> >
> > @@ -128,5 +130,85 @@ enum {
> > IWPM_NLA_ERR_MAX
> > };
> >
> > +/* Local service group opcodes */
> > +enum {
> > + RDMA_NL_LS_OP_RESOLVE = 0,
> > + RDMA_NL_LS_OP_SET_TIMEOUT,
> > + RDMA_NL_LS_NUM_OPS
> > +};
> > +
> > +/* Local service netlink message flags */
> > +#define RDMA_NL_LS_F_OK 0x0100 /* Success response */
> > +#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
> > +
> > +/* Local service attribute type */
> > +enum {
> > + LS_NLA_TYPE_STATUS = 0,
> > + LS_NLA_TYPE_PATH_RECORD,
> > + LS_NLA_TYPE_TIMEOUT,
> > + LS_NLA_TYPE_SERVICE_ID,
> > + LS_NLA_TYPE_DGID,
> > + LS_NLA_TYPE_SGID,
> > + LS_NLA_TYPE_TCLASS,
> > + LS_NLA_TYPE_REVERSIBLE,
> > + LS_NLA_TYPE_NUM_PATH,
>
> Should this be NUMB_PATH rather than NUM_PATH ?
Yes.
>
> > + LS_NLA_TYPE_PKEY,
> > + LS_NLA_TYPE_QOS_CLASS,
>
> Should this include SL too ?
It will be another attribute. All the fields are modeled after struct ib_sa_path_rec and struct ib_user_path_rec, not after struct ibv_path_record in the user space. In addition, only those pathrecord components that are currently used by rdma_cm, ipoib, and srp are list here.
>
> > + LS_NLA_TYPE_MAX
> > +};
> > +
> > +/* Local service status attribute */
> > +enum {
> > + LS_NLA_STATUS_SUCCESS = 0,
> > + LS_NLA_STATUS_EINVAL,
> > + LS_NLA_STATUS_ENODATA,
> > + LS_NLA_STATUS_MAX
> > +};
> > +
> > +struct rdma_nla_ls_status {
> > + __u32 status;
> > +};
> > +
> > +/* Local service pathrecord attribute: struct ib_path_rec_data */
> > +
> > +/* Local service timeout attribute */ struct rdma_nla_ls_timeout {
> > + __u32 timeout;
> > +};
> > +
> > +/* Local Service ServiceID attribute */ struct rdma_nla_ls_service_id
> > +{
> > + __be64 service_id;
> > +};
> > +
> > +/* Local Service DGID/SGID attribute: big endian */ struct
> > +rdma_nla_ls_gid {
> > + __u8 gid[16];
> > +};
> > +
> > +/* Local Service Traffic Class attribute */ struct rdma_nla_ls_tclass
> > +{
> > + __u8 tclass;
> > +};
> > +
> > +/* Local Service Reversible attribute */ struct
> > +rdma_nla_ls_reversible {
> > + __u32 reversible;
> > +};
>
> Isn't __u8 sufficient for reversible ?
Certainly enough. However, reversible is __u32 in struct ib_user_path_rec and int in struct ib_sa_path_rec.
>
> > +
> > +/* Local Service numb_path attribute */ struct rdma_nla_ls_numb_path
> > +{
> > + __u8 numb_path;
> > +};
> > +
> > +/* Local Service Pkey attribute*/
> > +struct rdma_nla_ls_pkey {
> > + __be16 pkey;
> > +};
> > +
> > +/* Local Service Qos Class attribute */ struct rdma_nla_ls_qos_class
> > +{
> > + __be16 qos_class;
> > +};
> >
> > #endif /* _UAPI_RDMA_NETLINK_H */
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <55787838.3020606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 19:10 ` Jason Gunthorpe
[not found] ` <20150610191026.GA28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 19:10 UTC (permalink / raw)
To: Hal Rosenstock
Cc: kaike.wan-ral2JQCrhuEAvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Fleck, Ira Weiny
On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> > From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >
> > This patch routes a SA pathrecord query to netlink first
>
> Should only unicast PRs be done in this manner or should API support
> enabling for unicast and/or multicast ?
>
> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but this
> seems like it would help make it future proof and not have to take
> timeout on local query unless app supports it.
It is a good question. We can clearly extend toward that, using a MGID
as the DGID and adding additional nested netlink fields.
However, does it make sense? I don't think user space should be
joining on behalf of the kernel... Do we ever query without a join?
Jasno
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE680D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 19:14 ` Jason Gunthorpe
[not found] ` <20150610191439.GB28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-10 19:51 ` Hal Rosenstock
1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 19:14 UTC (permalink / raw)
To: Hefty, Sean
Cc: Hal Rosenstock, Wan, Kaike,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Wed, Jun 10, 2015 at 05:04:55PM +0000, Hefty, Sean wrote:
> > Not in the patches themselves but in the general issue when a PR changes.
> >
> > Do you think this needs addressing or are things fine as they are now ?
>
> IMO, I think it needs addressing in terms of "can the proposed
> netlink architecture and design accommodate this sort of request in
> the future?" We shouldn't design in a limitation up front. I don't
> see anything in the current approach that would cause an issue.
> There would likely be a need for new messages and attributes.
I think the kernel netlink side is fine.
But userspace needs to understand what to do if it gets a request with
an attribute it does not understand.
It should tell the kernel 'no, do it yourself', rather than try and
answer.
That also suggests we should have optional and mandatory netlink
nested attributes.
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE680D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 19:14 ` Jason Gunthorpe
@ 2015-06-10 19:51 ` Hal Rosenstock
[not found] ` <55789557.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
1 sibling, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 19:51 UTC (permalink / raw)
To: Hefty, Sean
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 6/10/2015 1:04 PM, Hefty, Sean wrote:
>> Not in the patches themselves but in the general issue when a PR changes.
>>
>> Do you think this needs addressing or are things fine as they are now ?
>
> IMO, I think it needs addressing in terms of "can the proposed netlink architecture and design accommodate this
> sort of request in the future?" We shouldn't design in a limitation up front. I don't see anything in the current approach that would cause an issue. There would likely be a need for new messages and attributes.
The current proposal is focused around the PR attributes/styles
currently used in the kernel. The case I can see is if in future a new
attribute is added to the PR netlink API. How is that handled ? Can user
space say it can't service a request ? That seems a little different
from the no PR case.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <20150610191026.GA28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-10 19:51 ` Hal Rosenstock
[not found] ` <5578955A.4070001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 19:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kaike.wan-ral2JQCrhuEAvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Fleck, Ira Weiny
On 6/10/2015 3:10 PM, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
>> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>
>>> This patch routes a SA pathrecord query to netlink first
>>
>> Should only unicast PRs be done in this manner or should API support
>> enabling for unicast and/or multicast ?
>>
>> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but this
>> seems like it would help make it future proof and not have to take
>> timeout on local query unless app supports it.
>
> It is a good question. We can clearly extend toward that, using a MGID
> as the DGID and adding additional nested netlink fields.
>
> However, does it make sense?
If it doesn't make sense, then should MGIDs as DGIDs never be requested
via the PR netlink API ?
> I don't think user space should be
> joining on behalf of the kernel...
Agreed.
> Do we ever query without a join?
Query is possible prior to join.
The use case I'm aware of is doing query PR for MGID rather than query
MCMR. That case could be viewed just like unicast PRs.
-- Hal
> Jasno
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCD80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 19:54 ` Hal Rosenstock
[not found] ` <55789609.4020906-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-10 19:54 UTC (permalink / raw)
To: Wan, Kaike
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John,
Weiny, Ira
On 6/10/2015 2:31 PM, Wan, Kaike wrote:
>> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Wednesday, June 10, 2015 1:47 PM
>>
>> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>
>>> This patch adds netlink defines for SA client, local service group,
>>> local service operations, and related attributes.
>>>
>>> Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>> Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>> Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>> Reviewed-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>> ---
>>> include/uapi/rdma/rdma_netlink.h | 82
>> ++++++++++++++++++++++++++++++++++++++
>>> 1 files changed, 82 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/uapi/rdma/rdma_netlink.h
>>> b/include/uapi/rdma/rdma_netlink.h
>>> index 6e4bb42..341e9be 100644
>>> --- a/include/uapi/rdma/rdma_netlink.h
>>> +++ b/include/uapi/rdma/rdma_netlink.h
>>> @@ -7,12 +7,14 @@ enum {
>>> RDMA_NL_RDMA_CM = 1,
>>> RDMA_NL_NES,
>>> RDMA_NL_C4IW,
>>> + RDMA_NL_SA,
>>> RDMA_NL_NUM_CLIENTS
>>> };
>>>
>>> enum {
>>> RDMA_NL_GROUP_CM = 1,
>>> RDMA_NL_GROUP_IWPM,
>>> + RDMA_NL_GROUP_LS,
>>> RDMA_NL_NUM_GROUPS
>>> };
>>>
>>> @@ -128,5 +130,85 @@ enum {
>>> IWPM_NLA_ERR_MAX
>>> };
>>>
>>> +/* Local service group opcodes */
>>> +enum {
>>> + RDMA_NL_LS_OP_RESOLVE = 0,
>>> + RDMA_NL_LS_OP_SET_TIMEOUT,
>>> + RDMA_NL_LS_NUM_OPS
>>> +};
>>> +
>>> +/* Local service netlink message flags */
>>> +#define RDMA_NL_LS_F_OK 0x0100 /* Success response */
>>> +#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
>>> +
>>> +/* Local service attribute type */
>>> +enum {
>>> + LS_NLA_TYPE_STATUS = 0,
>>> + LS_NLA_TYPE_PATH_RECORD,
>>> + LS_NLA_TYPE_TIMEOUT,
>>> + LS_NLA_TYPE_SERVICE_ID,
>>> + LS_NLA_TYPE_DGID,
>>> + LS_NLA_TYPE_SGID,
>>> + LS_NLA_TYPE_TCLASS,
>>> + LS_NLA_TYPE_REVERSIBLE,
>>> + LS_NLA_TYPE_NUM_PATH,
>>
>> Should this be NUMB_PATH rather than NUM_PATH ?
>
> Yes.
>
>>
>>> + LS_NLA_TYPE_PKEY,
>>> + LS_NLA_TYPE_QOS_CLASS,
>>
>> Should this include SL too ?
>
> It will be another attribute.
If SL is to be included, maybe it should be SL mask. Maybe this is an
example of future extensibility...
> All the fields are modeled after struct ib_sa_path_rec and struct ib_user_path_rec,
> not after struct ibv_path_record in the user space. In addition, only those pathrecord components that are currently
> used by rdma_cm, ipoib, and srp are list here.
Understood.
>>
>>> + LS_NLA_TYPE_MAX
>>> +};
>>> +
>>> +/* Local service status attribute */
>>> +enum {
>>> + LS_NLA_STATUS_SUCCESS = 0,
>>> + LS_NLA_STATUS_EINVAL,
>>> + LS_NLA_STATUS_ENODATA,
>>> + LS_NLA_STATUS_MAX
>>> +};
>>> +
>>> +struct rdma_nla_ls_status {
>>> + __u32 status;
>>> +};
>>> +
>>> +/* Local service pathrecord attribute: struct ib_path_rec_data */
>>> +
>>> +/* Local service timeout attribute */ struct rdma_nla_ls_timeout {
>>> + __u32 timeout;
>>> +};
>>> +
>>> +/* Local Service ServiceID attribute */ struct rdma_nla_ls_service_id
>>> +{
>>> + __be64 service_id;
>>> +};
>>> +
>>> +/* Local Service DGID/SGID attribute: big endian */ struct
>>> +rdma_nla_ls_gid {
>>> + __u8 gid[16];
>>> +};
>>> +
>>> +/* Local Service Traffic Class attribute */ struct rdma_nla_ls_tclass
>>> +{
>>> + __u8 tclass;
>>> +};
>>> +
>>> +/* Local Service Reversible attribute */ struct
>>> +rdma_nla_ls_reversible {
>>> + __u32 reversible;
>>> +};
>>
>> Isn't __u8 sufficient for reversible ?
> Certainly enough. However, reversible is __u32 in struct ib_user_path_rec and int in struct ib_sa_path_rec.
OK; I hadn't double checked there. So it's "inherited" from those
reversible definitions.
>>
>>> +
>>> +/* Local Service numb_path attribute */ struct rdma_nla_ls_numb_path
>>> +{
>>> + __u8 numb_path;
>>> +};
>>> +
>>> +/* Local Service Pkey attribute*/
>>> +struct rdma_nla_ls_pkey {
>>> + __be16 pkey;
>>> +};
>>> +
>>> +/* Local Service Qos Class attribute */ struct rdma_nla_ls_qos_class
>>> +{
>>> + __be16 qos_class;
>>> +};
>>>
>>> #endif /* _UAPI_RDMA_NETLINK_H */
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <55789609.4020906-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 20:32 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6A9D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 20:32 UTC (permalink / raw)
To: Hal Rosenstock, Wan, Kaike
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John,
Weiny, Ira
> >>> +/* Local Service Reversible attribute */ struct
> >>> +rdma_nla_ls_reversible {
> >>> + __u32 reversible;
> >>> +};
> >>
> >> Isn't __u8 sufficient for reversible ?
> > Certainly enough. However, reversible is __u32 in struct
> ib_user_path_rec and int in struct ib_sa_path_rec.
>
> OK; I hadn't double checked there. So it's "inherited" from those
> reversible definitions.
I don't think we need to adhere to the sizes defined in other structures. I agree with Hal's original comment. A __u8 here seems sufficient, unless we want to introduce some bit fields.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6940-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 21:04 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B12A-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Weiny, Ira @ 2015-06-10 21:04 UTC (permalink / raw)
To: Hefty, Sean, Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>
> > This series does not attempt to optimize the kernel needing to know
> > that a PR has been updated. There are existing mechanisms for that.
>
> Does this exist in the kernel?
At least some support, yes. For example client reregister marks all IPoIB paths as invalid.
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <5578955A.4070001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-10 21:09 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B14D-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Weiny, Ira @ 2015-06-10 21:09 UTC (permalink / raw)
To: Hal Rosenstock, Jason Gunthorpe
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Fleck, John
> On 6/10/2015 3:10 PM, Jason Gunthorpe wrote:
> > On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
> >> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> >>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >>>
> >>> This patch routes a SA pathrecord query to netlink first
> >>
> >> Should only unicast PRs be done in this manner or should API support
> >> enabling for unicast and/or multicast ?
> >>
> >> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but this
> >> seems like it would help make it future proof and not have to take
> >> timeout on local query unless app supports it.
> >
> > It is a good question. We can clearly extend toward that, using a MGID
> > as the DGID and adding additional nested netlink fields.
> >
> > However, does it make sense?
>
> If it doesn't make sense, then should MGIDs as DGIDs never be requested via
> the PR netlink API ?
>
Why should we prevent it? What does it hurt?
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B12A-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 21:11 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6AD4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 21:11 UTC (permalink / raw)
To: Weiny, Ira, Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > This series does not attempt to optimize the kernel needing to know
> > > that a PR has been updated. There are existing mechanisms for that.
> >
> > Does this exist in the kernel?
>
> At least some support, yes. For example client reregister marks all IPoIB
> paths as invalid.
Reregister indicates that all PRs are now invalid?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6AD4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 21:31 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B194-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Weiny, Ira @ 2015-06-10 21:31 UTC (permalink / raw)
To: Hefty, Sean, Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>
> > > > This series does not attempt to optimize the kernel needing to
> > > > know that a PR has been updated. There are existing mechanisms for
> that.
> > >
> > > Does this exist in the kernel?
> >
> > At least some support, yes. For example client reregister marks all
> > IPoIB paths as invalid.
>
> Reregister indicates that all PRs are now invalid?
Not directly. IPoIB treats it that way. I guess to "be safe".
Officially one should register for UnPath/RePath traps. But no one has ever implemented that.
To be clear I am agreeing with Hal that having some sort of "update signal" would be nice. But I don't think that must be done before this series goes in.
I think a PR "update signal" should be an extension to this interface.
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B194-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 21:34 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6B07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hefty, Sean @ 2015-06-10 21:34 UTC (permalink / raw)
To: Weiny, Ira, Hal Rosenstock
Cc: Wan, Kaike, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Not directly. IPoIB treats it that way. I guess to "be safe".
>
> Officially one should register for UnPath/RePath traps. But no one has
> ever implemented that.
>
> To be clear I am agreeing with Hal that having some sort of "update
> signal" would be nice. But I don't think that must be done before this
> series goes in.
>
> I think a PR "update signal" should be an extension to this interface.
I agree. I just wanted to make sure that there wasn't some feature regarding PRs, such as unpath, that a kernel client would lose (i.e. it is currently implemented) by changing how the PRs are retrieved. Basically nothing breaks with this change.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6B07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 21:37 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 21:37 UTC (permalink / raw)
To: Hefty, Sean
Cc: Weiny, Ira, Hal Rosenstock, Wan, Kaike,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Wed, Jun 10, 2015 at 09:34:58PM +0000, Hefty, Sean wrote:
> I agree. I just wanted to make sure that there wasn't some feature
> regarding PRs, such as unpath, that a kernel client would lose
> (i.e. it is currently implemented) by changing how the PRs are
> retrieved. Basically nothing breaks with this change.
That is my expectation, I've seen nothing to contradict it...
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B14D-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-11 12:24 ` Hal Rosenstock
[not found] ` <55797DF3.6050304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-11 12:24 UTC (permalink / raw)
To: Weiny, Ira
Cc: Jason Gunthorpe, Wan, Kaike,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John
On 6/10/2015 5:09 PM, Weiny, Ira wrote:
>> On 6/10/2015 3:10 PM, Jason Gunthorpe wrote:
>>> On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
>>>> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>>>>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>>>
>>>>> This patch routes a SA pathrecord query to netlink first
>>>>
>>>> Should only unicast PRs be done in this manner or should API support
>>>> enabling for unicast and/or multicast ?
>>>>
>>>> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but this
>>>> seems like it would help make it future proof and not have to take
>>>> timeout on local query unless app supports it.
>>>
>>> It is a good question. We can clearly extend toward that, using a MGID
>>> as the DGID and adding additional nested netlink fields.
>>>
>>> However, does it make sense?
>>
>> If it doesn't make sense, then should MGIDs as DGIDs never be requested via
>> the PR netlink API ?
>>
>
> Why should we prevent it? What does it hurt?
It's merely an optimization in that round trip to user space is avoided
with user space needing to perform some validation/lookup which would
fail in case multicast PRs not supported (which is case for ACM with
multicast backend).
If user space PR capabilities (unicast, multicast, both) is supported,
it affects the API. Maybe it's overkill but requires user space to
indicate no PR available for multicast DGIDs. I think this is the
tradeoff to be made/decided.
-- Hal
> Ira
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6A9D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-11 12:24 ` Wan, Kaike
0 siblings, 0 replies; 39+ messages in thread
From: Wan, Kaike @ 2015-06-11 12:24 UTC (permalink / raw)
To: Hefty, Sean, Hal Rosenstock
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John,
Weiny, Ira
> From: Hefty, Sean
> Sent: Wednesday, June 10, 2015 4:32 PM
> To: Hal Rosenstock; Wan, Kaike
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Fleck, John; Weiny, Ira
> Subject: RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests
> through netlink
>
> > >>> +/* Local Service Reversible attribute */ struct
> > >>> +rdma_nla_ls_reversible {
> > >>> + __u32 reversible;
> > >>> +};
> > >>
> > >> Isn't __u8 sufficient for reversible ?
> > > Certainly enough. However, reversible is __u32 in struct
> > ib_user_path_rec and int in struct ib_sa_path_rec.
> >
> > OK; I hadn't double checked there. So it's "inherited" from those
> > reversible definitions.
>
> I don't think we need to adhere to the sizes defined in other structures. I
> agree with Hal's original comment. A __u8 here seems sufficient, unless we
> want to introduce some bit fields.
There are two reasons why we should use __u32 here:
1. Easy to convert from struct ib_sa_path_rec (the input) to netlink attribute:
nla_put(skb, LS_NLA_TYPE_REVERSIBLE,
sizeof(sa_rec->reversible), &sa_rec->reversible);
Instead of:
__u8 tmp;
tmp = (__u8)sa_rec->reversible;
nla_put(skb, , LS_NLA_TYPE_REVERSIBLE, sizeof(tmp), &tmp);
2. Most importantly, sa_rec->reversible is define as type "int", but it really is "__be32" (big-endian), as proven below:
path rec: len 64 :
00 00 00 00 00 00 00 00 fe 80 00 00 00 00 00 00 00 11 75 00
00 78 ad 92 fe 80 00 00 00 00 00 00 00 11 75 00 00 78 a4 24
00 07 00 03 00 00 00 00 00 80 ff ff 00 00 84 87 92 00 00 00
00 00 00 00
sa path_rec: len 88 :
00 00 00 00 00 00 00 00 fe 80 00 00 00 00 00 00 00 11 75 00
00 78 ad 92 fe 80 00 00 00 00 00 00 00 11 75 00 00 78 a4 24
00 07 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
00 00 ff ff 00 00 00 02 04 02 07 02 12 00 00 00 00 00 00 00
00 00 00 00 00 00 ff ff
sa path_rec reversible 0x1000000
The path rec is the pathrecord returned from ibacm in MAD wire format (dumped in sa_query.c). The byte "0x80" before "ff ff" (the third row) is reversible_numpath field (reversible bit 7: 1).
The sa_path_rec is the struct ib_sa_path_rec dumped in cma_query_handler() in cma.c (rdma_cm) when the query callback is invoked. The data "00 00 00 01" (the last four bytes of the third row) is the field ib_sa_path_rec->reversible (there are some alignment padding for fields before it). Again the same field is dumped in user-friendly version as 0x1000000 (or 0x01000000), indicating that it is in big-endian format.
This test was run on x86_64 machines with Opensm as the SM/SA.
I stumped across this during previous patch tests where ibacm returned struct ib_user_path_rec as the response. In that test, I had to manually convert struct ibv_path_record into ib_user_path_rec. Then in ib_sa, I had to convert from struct ib_user_path_rec to struct ib_sa_path_rec. If I set ib_user_path_rec->reversible = 1, then the returned result failed rping; If I set ib_user_path_rec->reversible = 0x01000000, then rping successfully connected to the destination. I also used ib_pack() and ib_unpack() to convert between struct ib_sa_path_rec and pathrecord in MAD wire format (struct ibv_path_record) and compared the result with that from SA. If ib_sa_path_rec->reversible = 1, the ibv_path_record->reversible_numpath=0x00; if ib_sa_path_rec->reversible=0x01000000, then ibv_path_record->r
eversible_numpath=0x80.
As a result, I think that the following code is incorrect in cma_query_ib_route() (cma.c):
path_rec.reversible = 1;
I am surprised that it still works to query SA with IB_SA_PATH_REC_REVERSIBLE bit set in comp_mask. By the way, of the three callers of ib_sa_path_rec_get(), this is the only place that ib_sa_path_rec.reversible is set.
That is why we should use __u32 as the attribute data type for reversible.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <55789557.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-11 12:34 ` Wan, Kaike
0 siblings, 0 replies; 39+ messages in thread
From: Wan, Kaike @ 2015-06-11 12:34 UTC (permalink / raw)
To: Hal Rosenstock, Hefty, Sean
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Wednesday, June 10, 2015 3:52 PM
>
> On 6/10/2015 1:04 PM, Hefty, Sean wrote:
> >> Not in the patches themselves but in the general issue when a PR changes.
> >>
> >> Do you think this needs addressing or are things fine as they are now ?
> >
> > IMO, I think it needs addressing in terms of "can the proposed netlink
> > architecture and design accommodate this sort of request in the future?"
> We shouldn't design in a limitation up front. I don't see anything in the
> current approach that would cause an issue. There would likely be a need for
> new messages and attributes.
>
> Can user space say it can't
> service a request ?
Yes. It can NACK a request.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <55797DF3.6050304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-11 12:54 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABD089-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Wan, Kaike @ 2015-06-11 12:54 UTC (permalink / raw)
To: Hal Rosenstock, Weiny, Ira
Cc: Jason Gunthorpe,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John
> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Thursday, June 11, 2015 8:24 AM
> To: Weiny, Ira
> Cc: Jason Gunthorpe; Wan, Kaike; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Fleck, John
> Subject: Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
>
> On 6/10/2015 5:09 PM, Weiny, Ira wrote:
> >> On 6/10/2015 3:10 PM, Jason Gunthorpe wrote:
> >>> On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
> >>>> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> >>>>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >>>>>
> >>>>> This patch routes a SA pathrecord query to netlink first
> >>>>
> >>>> Should only unicast PRs be done in this manner or should API
> >>>> support enabling for unicast and/or multicast ?
> >>>>
> >>>> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but
> >>>> this seems like it would help make it future proof and not have to
> >>>> take timeout on local query unless app supports it.
> >>>
> >>> It is a good question. We can clearly extend toward that, using a
> >>> MGID as the DGID and adding additional nested netlink fields.
> >>>
> >>> However, does it make sense?
> >>
> >> If it doesn't make sense, then should MGIDs as DGIDs never be
> >> requested via the PR netlink API ?
> >>
> >
> > Why should we prevent it? What does it hurt?
>
> It's merely an optimization in that round trip to user space is avoided with
> user space needing to perform some validation/lookup which would fail in
> case multicast PRs not supported (which is case for ACM with multicast
> backend).
If the kernel client can query SA for mulitcast PRs, why can't ibacm (even backed by acmp)? Is there any code there that prevents a mgid being used as the destination?
>
> If user space PR capabilities (unicast, multicast, both) is supported, it affects
> the API.
How? If we can query SA for multicast PRs without joining the multicast groups, what additional changes in netlink API do we need to support both?
>Maybe it's overkill but requires user space to indicate no PR available
> for multicast DGIDs. I think this is the tradeoff to be made/decided.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
[not found] ` <20150610191439.GB28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11 12:58 ` Wan, Kaike
0 siblings, 0 replies; 39+ messages in thread
From: Wan, Kaike @ 2015-06-11 12:58 UTC (permalink / raw)
To: Jason Gunthorpe, Hefty, Sean
Cc: Hal Rosenstock,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, June 10, 2015 3:15 PM
>
> On Wed, Jun 10, 2015 at 05:04:55PM +0000, Hefty, Sean wrote:
> > > Not in the patches themselves but in the general issue when a PR changes.
> > >
> > > Do you think this needs addressing or are things fine as they are now ?
> >
> > IMO, I think it needs addressing in terms of "can the proposed netlink
> > architecture and design accommodate this sort of request in the
> > future?" We shouldn't design in a limitation up front. I don't see
> > anything in the current approach that would cause an issue.
> > There would likely be a need for new messages and attributes.
>
> I think the kernel netlink side is fine.
>
> But userspace needs to understand what to do if it gets a request with an
> attribute it does not understand.
>
> It should tell the kernel 'no, do it yourself', rather than try and answer.
I will add some code in ibacm to do that.
Kaike
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABD089-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-11 13:25 ` Hal Rosenstock
[not found] ` <55798C31.5090600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Hal Rosenstock @ 2015-06-11 13:25 UTC (permalink / raw)
To: Wan, Kaike
Cc: Weiny, Ira, Jason Gunthorpe,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John
On 6/11/2015 8:54 AM, Wan, Kaike wrote:
>> From: Hal Rosenstock [mailto:hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Thursday, June 11, 2015 8:24 AM
>> To: Weiny, Ira
>> Cc: Jason Gunthorpe; Wan, Kaike; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Fleck, John
>> Subject: Re: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
>>
>> On 6/10/2015 5:09 PM, Weiny, Ira wrote:
>>>> On 6/10/2015 3:10 PM, Jason Gunthorpe wrote:
>>>>> On Wed, Jun 10, 2015 at 01:47:36PM -0400, Hal Rosenstock wrote:
>>>>>> On 6/9/2015 10:57 AM, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>>>>>>> From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>>>>>
>>>>>>> This patch routes a SA pathrecord query to netlink first
>>>>>>
>>>>>> Should only unicast PRs be done in this manner or should API
>>>>>> support enabling for unicast and/or multicast ?
>>>>>>
>>>>>> AFAIK kernel doesn't query multicast PRs now (queries MCMRs) but
>>>>>> this seems like it would help make it future proof and not have to
>>>>>> take timeout on local query unless app supports it.
>>>>>
>>>>> It is a good question. We can clearly extend toward that, using a
>>>>> MGID as the DGID and adding additional nested netlink fields.
>>>>>
>>>>> However, does it make sense?
>>>>
>>>> If it doesn't make sense, then should MGIDs as DGIDs never be
>>>> requested via the PR netlink API ?
>>>>
>>>
>>> Why should we prevent it? What does it hurt?
>>
>> It's merely an optimization in that round trip to user space is avoided with
>> user space needing to perform some validation/lookup which would fail in
>> case multicast PRs not supported (which is case for ACM with multicast
>> backend).
>
> If the kernel client can query SA for mulitcast PRs, why can't ibacm (even backed by acmp)?
It can.
> Is there any code there that prevents a mgid being used as the destination?
No. It's a matter of optimization only.
>>
>> If user space PR capabilities (unicast, multicast, both) is supported, it affects
>> the API.
> How? If we can query SA for multicast PRs without joining the multicast groups,
> what additional changes in netlink API do we need to support both?
Nothing to support both but if we wanted to disable one or other based
on user space we would but it sounds like we don't need/want this but
would use user space rejection/no data for this.
>> Maybe it's overkill but requires user space to indicate no PR available
>> for multicast DGIDs. I think this is the tradeoff to be made/decided.
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
[not found] ` <55798C31.5090600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-11 16:08 ` Weiny, Ira
0 siblings, 0 replies; 39+ messages in thread
From: Weiny, Ira @ 2015-06-11 16:08 UTC (permalink / raw)
To: Hal Rosenstock, Wan, Kaike
Cc: Jason Gunthorpe,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Fleck, John
> >> If user space PR capabilities (unicast, multicast, both) is
> >> supported, it affects the API.
> > How? If we can query SA for multicast PRs without joining the
> > multicast groups, what additional changes in netlink API do we need to
> support both?
>
> Nothing to support both but if we wanted to disable one or other based on user
> space we would but it sounds like we don't need/want this but would use user
> space rejection/no data for this.
>
Yes I think we should allow userspace to decide if this is supported or not. That allows for more flexibility.
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2015-06-11 16:08 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-06-09 14:57 [PATCH v4 0/4] Sending kernel pathrecord query to user cache server kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-09 14:57 ` [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-2-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-10 17:47 ` Hal Rosenstock
[not found] ` <55787827.7030003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 18:31 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCD80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 19:54 ` Hal Rosenstock
[not found] ` <55789609.4020906-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 20:32 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6A9D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-11 12:24 ` Wan, Kaike
2015-06-09 14:57 ` [PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-09 14:57 ` [PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-06-09 14:57 ` [PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1433861837-26177-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-06-10 17:47 ` Hal Rosenstock
[not found] ` <55787838.3020606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 19:10 ` Jason Gunthorpe
[not found] ` <20150610191026.GA28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-10 19:51 ` Hal Rosenstock
[not found] ` <5578955A.4070001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 21:09 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B14D-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-11 12:24 ` Hal Rosenstock
[not found] ` <55797DF3.6050304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-11 12:54 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABD089-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-11 13:25 ` Hal Rosenstock
[not found] ` <55798C31.5090600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-11 16:08 ` Weiny, Ira
2015-06-10 13:37 ` [PATCH v4 0/4] Sending kernel pathrecord query to user cache server Hal Rosenstock
[not found] ` <55783D84.6040709-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 14:22 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC2E-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 14:39 ` Hal Rosenstock
[not found] ` <55784C35.2020401-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 15:07 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCC80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 15:19 ` Hal Rosenstock
[not found] ` <5578558B.1070503-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 15:49 ` Wan, Kaike
[not found] ` <3F128C9216C9B84BB6ED23EF16290AFB0CABCCC5-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 16:46 ` Hal Rosenstock
2015-06-10 18:00 ` ira.weiny
[not found] ` <20150610180020.GC13497-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2015-06-10 18:05 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6940-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 21:04 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B12A-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 21:11 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6AD4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 21:31 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1109B194-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 21:34 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6B07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 21:37 ` Jason Gunthorpe
2015-06-10 15:21 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE6677-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 16:50 ` Hal Rosenstock
[not found] ` <55786ACC.4070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-10 17:04 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A8FE680D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 19:14 ` Jason Gunthorpe
[not found] ` <20150610191439.GB28334-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11 12:58 ` Wan, Kaike
2015-06-10 19:51 ` Hal Rosenstock
[not found] ` <55789557.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-11 12:34 ` Wan, Kaike
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox