All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC/RFT 06/14] soc: amlogic: meson-clk-measure: add G12B second cluster cpu clk
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

Add the G12B second CPU cluster CPU and SYS_PLL measure IDs.

These IDs returns 0Hz on G12A.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/soc/amlogic/meson-clk-measure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/soc/amlogic/meson-clk-measure.c b/drivers/soc/amlogic/meson-clk-measure.c
index c470e24f1dfa..f09b404b39d3 100644
--- a/drivers/soc/amlogic/meson-clk-measure.c
+++ b/drivers/soc/amlogic/meson-clk-measure.c
@@ -324,6 +324,8 @@ static struct meson_msr_id clk_msr_g12a[CLK_MSR_MAX] = {
 	CLK_MSR_ID(84, "co_tx"),
 	CLK_MSR_ID(89, "hdmi_todig"),
 	CLK_MSR_ID(90, "hdmitx_sys"),
+	CLK_MSR_ID(91, "sys_cpub_div16"),
+	CLK_MSR_ID(92, "sys_pll_cpub_div16"),
 	CLK_MSR_ID(94, "eth_phy_rx"),
 	CLK_MSR_ID(95, "eth_phy_pll"),
 	CLK_MSR_ID(96, "vpu_b"),
-- 
2.21.0


_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic

^ permalink raw reply related

* [PATCH v4 13/25] ibtrs: include client and server modules into kernel compilation
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

Add IBTRS Makefile, Kconfig and also corresponding lines into upper
layer infiniband/ulp files.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/Kconfig            |  1 +
 drivers/infiniband/ulp/Makefile       |  1 +
 drivers/infiniband/ulp/ibtrs/Kconfig  | 22 ++++++++++++++++++++++
 drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
 4 files changed, 39 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 8ba41cbf1869..1a271ade9997 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -117,6 +117,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/ibtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 437813c7b481..1c4f10dc8d49 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)		+= srpt/
 obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
+obj-$(CONFIG_INFINIBAND_IBTRS)		+= ibtrs/
diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
new file mode 100644
index 000000000000..1f30c88783e6
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Kconfig
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+config INFINIBAND_IBTRS
+	tristate
+	depends on INFINIBAND_ADDR_TRANS
+
+config INFINIBAND_IBTRS_CLIENT
+	tristate "IBTRS client module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS client allows for simplified data transfer and connection
+	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
+	  READ/WRITE semantics and provides multipath capabilities.
+
+config INFINIBAND_IBTRS_SERVER
+	tristate "IBTRS server module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS server module processing connection and IO requests received
+	  from the IBTRS client module.
diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
new file mode 100644
index 000000000000..d2e6cce8f94f
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Makefile
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+ibtrs-client-y := ibtrs-clt.o \
+		  ibtrs-clt-stats.o \
+		  ibtrs-clt-sysfs.o
+
+ibtrs-server-y := ibtrs-srv.o \
+		  ibtrs-srv-stats.o \
+		  ibtrs-srv-sysfs.o
+
+ibtrs-core-y := ibtrs.o
+
+obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
+obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
+obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 10/25] ibtrs: server: main functionality
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This is main functionality of ibtrs-server module, which accepts
set of RDMA connections (so called IBTRS session), creates/destroys
sysfs entries associated with IBTRS session and notifies upper layer
(user of IBTRS API) about RDMA requests or link events.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1998 ++++++++++++++++++++++
 1 file changed, 1998 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
new file mode 100644
index 000000000000..b6d4906d0f09
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
@@ -0,0 +1,1998 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/mempool.h>
+
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Server");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+/* Must be power of 2, see mask from mr->page_size in ib_sg_to_pages() */
+#define DEFAULT_MAX_CHUNK_SIZE (128 << 10)
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+#define MAX_HDR_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_HDR_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
+		      / sizeof(struct ibtrs_sg_desc))
+
+/* We guarantee to serve 10 paths at least */
+#define CHUNK_POOL_SZ 10
+
+static struct ibtrs_ib_dev_pool dev_pool;
+static mempool_t *chunk_pool;
+struct class *ibtrs_dev_class;
+
+static int retry_count = 7;
+static int __read_mostly max_chunk_size = DEFAULT_MAX_CHUNK_SIZE;
+static int __read_mostly sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+
+module_param_named(max_chunk_size, max_chunk_size, int, 0444);
+MODULE_PARM_DESC(max_chunk_size,
+		 "Max size for each IO request, when change the unit is in byte"
+		 " (default: " __stringify(DEFAULT_MAX_CHUNK_SIZE_KB) "KB)");
+
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+		 "Number of buffers for pending I/O requests to allocate"
+		 " per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
+		 " (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+	int err, ival;
+
+	err = kstrtoint(val, 0, &ival);
+	if (err)
+		return err;
+
+	if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
+		pr_err("Invalid retry count value %d, has to be"
+		       " > %d, < %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
+		return -EINVAL;
+	}
+
+	retry_count = ival;
+	pr_info("QP retry count changed to %d\n", ival);
+
+	return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+	.set		= retry_count_set,
+	.get		= param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 3,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static char cq_affinity_list[256] = "";
+static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
+
+static void init_cq_affinity(void)
+{
+	sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
+}
+
+static int cq_affinity_list_set(const char *val, const struct kernel_param *kp)
+{
+	int ret = 0, len = strlen(val);
+	cpumask_var_t new_value;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	if (len >= sizeof(cq_affinity_list))
+		return -EINVAL;
+	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(val, new_value);
+	if (ret) {
+		pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
+		       ret);
+		goto free_cpumask;
+	}
+
+	strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
+	*strchrnul(cq_affinity_list, '\n') = '\0';
+	cpumask_copy(&cq_affinity_mask, new_value);
+
+	pr_info("cq_affinity_list changed to %*pbl\n",
+		cpumask_pr_args(&cq_affinity_mask));
+free_cpumask:
+	free_cpumask_var(new_value);
+	return ret;
+}
+
+static struct kparam_string cq_affinity_list_kparam_str = {
+	.maxlen	= sizeof(cq_affinity_list),
+	.string	= cq_affinity_list
+};
+
+static const struct kernel_param_ops cq_affinity_list_ops = {
+	.set	= cq_affinity_list_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
+		&cq_affinity_list_kparam_str, 0644);
+MODULE_PARM_DESC(cq_affinity_list, "Sets the list of cpus to use as cq vectors."
+		 "(default: use all possible CPUs)");
+
+static struct workqueue_struct *ibtrs_wq;
+
+static void close_sess(struct ibtrs_srv_sess *sess);
+
+static inline struct ibtrs_srv_con *to_srv_con(struct ibtrs_con *c)
+{
+	return container_of(c, struct ibtrs_srv_con, c);
+}
+
+static inline struct ibtrs_srv_sess *to_srv_sess(struct ibtrs_sess *s)
+{
+	return container_of(s, struct ibtrs_srv_sess, s);
+}
+
+static bool __ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				     enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_SRV_CONNECTED:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSING:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+		case IBTRS_SRV_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSED:
+		switch (old_state) {
+		case IBTRS_SRV_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed)
+		sess->state = new_state;
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state_get_old(struct ibtrs_srv_sess *sess,
+					   enum ibtrs_srv_state new_state,
+					   enum ibtrs_srv_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_lock);
+	*old_state = sess->state;
+	changed = __ibtrs_srv_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_lock);
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				   enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+
+	return ibtrs_srv_change_state_get_old(sess, new_state, &old_state);
+}
+
+static void free_id(struct ibtrs_srv_op *id)
+{
+	if (!id)
+		return;
+	kfree(id->tx_wr);
+	kfree(id->tx_sg);
+	kfree(id);
+}
+
+static void ibtrs_srv_free_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i;
+
+	WARN_ON(atomic_read(&sess->ids_inflight));
+	if (sess->ops_ids) {
+		for (i = 0; i < srv->queue_depth; i++)
+			free_id(sess->ops_ids[i]);
+		kfree(sess->ops_ids);
+		sess->ops_ids = NULL;
+	}
+}
+
+static int ibtrs_srv_alloc_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_op *id;
+	int i;
+
+	sess->ops_ids = kcalloc(srv->queue_depth, sizeof(*sess->ops_ids),
+				GFP_KERNEL);
+	if (unlikely(!sess->ops_ids))
+		goto err;
+
+	for (i = 0; i < srv->queue_depth; ++i) {
+		id = kzalloc(sizeof(*id), GFP_KERNEL);
+		if (unlikely(!id))
+			goto err;
+
+		sess->ops_ids[i] = id;
+		id->tx_wr = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_wr),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_wr))
+			goto err;
+
+		id->tx_sg = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_sg),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_sg))
+			goto err;
+	}
+	init_waitqueue_head(&sess->ids_waitq);
+	atomic_set(&sess->ids_inflight, 0);
+
+	return 0;
+
+err:
+	ibtrs_srv_free_ops_ids(sess);
+	return -ENOMEM;
+}
+
+static void ibtrs_srv_get_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	atomic_inc(&sess->ids_inflight);
+}
+
+static void ibtrs_srv_put_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	if (atomic_dec_and_test(&sess->ids_inflight))
+		wake_up(&sess->ids_waitq);
+}
+
+static void ibtrs_srv_wait_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_srv_rdma_done
+};
+
+/**
+ * rdma_write_sg() - response on successful READ request
+ */
+static int rdma_write_sg(struct ibtrs_srv_op *id)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
+	dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
+	struct ibtrs_srv *srv = sess->srv;
+	struct ib_send_wr inv_wr, imm_wr;
+	struct ib_rdma_wr *wr = NULL;
+	const struct ib_send_wr *bad_wr;
+	enum ib_send_flags flags;
+	size_t sg_cnt;
+	int err, i, offset;
+	bool need_inval;
+	u32 rkey = 0;
+
+	sg_cnt = le16_to_cpu(id->rd_msg->sg_cnt);
+	need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+
+	offset = 0;
+	for (i = 0; i < sg_cnt; i++) {
+		struct ib_sge *list;
+
+		wr		= &id->tx_wr[i];
+		list		= &id->tx_sg[i];
+		list->addr	= dma_addr + offset;
+		list->length	= le32_to_cpu(id->rd_msg->desc[i].len);
+
+		/* WR will fail with length error
+		 * if this is 0
+		 */
+		if (unlikely(list->length == 0)) {
+			ibtrs_err(sess, "Invalid RDMA-Write sg list length 0\n");
+			return -EINVAL;
+		}
+
+		list->lkey = sess->s.dev->ib_pd->local_dma_lkey;
+		offset += list->length;
+
+		wr->wr.wr_cqe	= &io_comp_cqe;
+		wr->wr.sg_list	= list;
+		wr->wr.num_sge	= 1;
+		wr->remote_addr	= le64_to_cpu(id->rd_msg->desc[i].addr);
+		wr->rkey	= le32_to_cpu(id->rd_msg->desc[i].key);
+		if (rkey == 0)
+			rkey = wr->rkey;
+		else
+			/* Only one key is actually used */
+			WARN_ON_ONCE(rkey != wr->rkey);
+
+		if (i < (sg_cnt - 1))
+			wr->wr.next = &id->tx_wr[i + 1].wr;
+		else if (need_inval)
+			wr->wr.next = &inv_wr;
+		else
+			wr->wr.next = &imm_wr;
+
+		wr->wr.opcode = IB_WR_RDMA_WRITE;
+		wr->wr.ex.imm_data = 0;
+		wr->wr.send_flags  = 0;
+	}
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	if (need_inval) {
+		inv_wr.next = &imm_wr;
+		inv_wr.wr_cqe = &io_comp_cqe;
+		inv_wr.sg_list = NULL;
+		inv_wr.num_sge = 0;
+		inv_wr.opcode = IB_WR_SEND_WITH_INV;
+		inv_wr.send_flags = 0;
+		inv_wr.ex.invalidate_rkey = rkey;
+	}
+	imm_wr.next = NULL;
+	imm_wr.wr_cqe = &io_comp_cqe;
+	imm_wr.sg_list = NULL;
+	imm_wr.num_sge = 0;
+	imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	imm_wr.send_flags = flags;
+	imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
+							     0, need_inval));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, dma_addr,
+				      offset, DMA_BIDIRECTIONAL);
+
+	err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
+	if (unlikely(err))
+		ibtrs_err(sess,
+			  "Posting RDMA-Write-Request to QP failed, err: %d\n",
+			  err);
+
+	return err;
+}
+
+/**
+ * send_io_resp_imm() - response with empty IMM on failed READ/WRITE requests or
+ *                      on successful WRITE request.
+ */
+static int send_io_resp_imm(struct ibtrs_srv_con *con, struct ibtrs_srv_op *id,
+			    int errno)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ib_send_wr inv_wr, *wr = NULL;
+	struct ibtrs_srv *srv = sess->srv;
+	bool need_inval = false;
+	enum ib_send_flags flags;
+	u32 imm;
+	int err;
+
+	if (id->dir == READ) {
+		struct ibtrs_msg_rdma_read *rd_msg = id->rd_msg;
+		size_t sg_cnt;
+
+		need_inval = le16_to_cpu(rd_msg->flags) &
+				IBTRS_MSG_NEED_INVAL_F;
+		sg_cnt = le16_to_cpu(rd_msg->sg_cnt);
+
+		if (need_inval) {
+			if (likely(sg_cnt)) {
+				inv_wr.next = NULL;
+				inv_wr.wr_cqe = &io_comp_cqe;
+				inv_wr.sg_list = NULL;
+				inv_wr.num_sge = 0;
+				inv_wr.opcode = IB_WR_SEND_WITH_INV;
+				inv_wr.send_flags = 0;
+				/* Only one key is actually used */
+				inv_wr.ex.invalidate_rkey =
+					le32_to_cpu(rd_msg->desc[0].key);
+				wr = &inv_wr;
+			} else {
+				WARN_ON_ONCE(1);
+				need_inval = false;
+			}
+		}
+	}
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	imm = ibtrs_to_io_rsp_imm(id->msg_id, errno, need_inval);
+	err = ibtrs_post_rdma_write_imm_empty(&con->c, &io_comp_cqe, imm,
+					      flags, wr);
+	if (unlikely(err))
+		ibtrs_err_rl(sess, "ib_post_send(), err: %d\n", err);
+
+	return err;
+}
+
+/*
+ * ibtrs_srv_resp_rdma() - sends response to the client.
+ *
+ * Context: any
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int status)
+{
+	struct ibtrs_srv_con *con = id->con;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	int err;
+
+	if (WARN_ON(!id))
+		return;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Sending I/O response failed, "
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		goto out;
+	}
+	if (status || id->dir == WRITE || !id->rd_msg->sg_cnt)
+		err = send_io_resp_imm(con, id, status);
+	else
+		err = rdma_write_sg(id);
+	if (unlikely(err)) {
+		ibtrs_err_rl(sess, "IO response failed: %d\n", err);
+		close_sess(sess);
+	}
+out:
+	ibtrs_srv_put_ops_ids(sess);
+}
+EXPORT_SYMBOL(ibtrs_srv_resp_rdma);
+
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *srv, void *priv)
+{
+	srv->priv = priv;
+}
+EXPORT_SYMBOL(ibtrs_srv_set_sess_priv);
+
+static void unmap_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	int i;
+
+	for (i = 0; i < sess->mrs_num; i++) {
+		struct ibtrs_srv_mr *srv_mr;
+
+		srv_mr = &sess->mrs[i];
+		ib_dereg_mr(srv_mr->mr);
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, srv_mr->sgt.sgl,
+				srv_mr->sgt.nents, DMA_BIDIRECTIONAL);
+		sg_free_table(&srv_mr->sgt);
+	}
+	kfree(sess->mrs);
+}
+
+static int map_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i, mri, err, mrs_num;
+	unsigned int chunk_bits;
+	int chunks_per_mr;
+
+	/*
+	 * Here we map queue_depth chunks to MR.  Firstly we have to
+	 * figure out how many chunks can we map per MR.
+	 */
+
+	chunks_per_mr = sess->s.dev->ib_dev->attrs.max_fast_reg_page_list_len;
+	mrs_num = DIV_ROUND_UP(srv->queue_depth, chunks_per_mr);
+	chunks_per_mr = DIV_ROUND_UP(srv->queue_depth, mrs_num);
+
+	sess->mrs = kcalloc(mrs_num, sizeof(*sess->mrs), GFP_KERNEL);
+	if (unlikely(!sess->mrs))
+		return -ENOMEM;
+
+	sess->mrs_num = mrs_num;
+
+	for (mri = 0; mri < mrs_num; mri++) {
+		struct ibtrs_srv_mr *srv_mr = &sess->mrs[mri];
+		struct sg_table *sgt = &srv_mr->sgt;
+		struct scatterlist *s;
+		struct ib_mr *mr;
+		int nr, chunks;
+
+		chunks = chunks_per_mr * mri;
+		chunks_per_mr = min_t(int, chunks_per_mr,
+				      srv->queue_depth - chunks);
+
+		err = sg_alloc_table(sgt, chunks_per_mr, GFP_KERNEL);
+		if (unlikely(err))
+			goto err;
+
+		for_each_sg(sgt->sgl, s, chunks_per_mr, i)
+			sg_set_page(s, srv->chunks[chunks + i],
+				    max_chunk_size, 0);
+
+		nr = ib_dma_map_sg(sess->s.dev->ib_dev, sgt->sgl,
+				   sgt->nents, DMA_BIDIRECTIONAL);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto free_sg;
+		}
+		mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				 sgt->nents);
+		if (unlikely(IS_ERR(mr))) {
+			err = PTR_ERR(mr);
+			goto unmap_sg;
+		}
+		nr = ib_map_mr_sg(mr, sgt->sgl, sgt->nents,
+				  NULL, max_chunk_size);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto dereg_mr;
+		}
+
+		/* Eventually dma addr for each chunk can be cached */
+		for_each_sg(sgt->sgl, s, sgt->orig_nents, i)
+			sess->dma_addr[chunks + i] = sg_dma_address(s);
+
+		ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));
+
+		srv_mr->mr = mr;
+
+		continue;
+err:
+		while (mri--) {
+			srv_mr = &sess->mrs[mri];
+			sgt = &srv_mr->sgt;
+			mr = srv_mr->mr;
+dereg_mr:
+			ib_dereg_mr(mr);
+unmap_sg:
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, sgt->sgl,
+					sgt->nents, DMA_BIDIRECTIONAL);
+free_sg:
+			sg_free_table(sgt);
+		}
+		kfree(sess->mrs);
+
+		return err;
+	}
+
+	chunk_bits = ilog2(srv->queue_depth - 1) + 1;
+	sess->mem_bits = (MAX_IMM_PAYL_BITS - chunk_bits);
+
+	return 0;
+}
+
+static void ibtrs_srv_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	(void)err;
+	close_sess(to_srv_sess(c->sess));
+}
+
+static void ibtrs_srv_init_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_srv_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_srv_start_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_srv_stop_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_srv_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+	WARN_ON(wc->opcode != IB_WC_SEND);
+	ibtrs_srv_update_wc_stats(&sess->stats);
+}
+
+static void ibtrs_srv_sess_up(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	int up;
+
+	mutex_lock(&srv->paths_ev_mutex);
+	up = ++srv->paths_up;
+	if (up == 1)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_CONNECTED, NULL);
+	mutex_unlock(&srv->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+}
+
+static void ibtrs_srv_sess_down(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&srv->paths_ev_mutex);
+	WARN_ON(!srv->paths_up);
+	if (--srv->paths_up == 0)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_DISCONNECTED, srv->priv);
+	mutex_unlock(&srv->paths_ev_mutex);
+}
+
+static void ibtrs_srv_reg_mr_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "REG MR failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+}
+
+static struct ib_cqe local_reg_cqe = {
+	.done = ibtrs_srv_reg_mr_done
+};
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess);
+
+static int process_info_req(struct ibtrs_srv_con *con,
+			    struct ibtrs_msg_info_req *msg)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ib_send_wr *reg_wr = NULL;
+	struct ibtrs_msg_info_rsp *rsp;
+	struct ibtrs_iu *tx_iu;
+	struct ib_reg_wr *rwr;
+	int mri, err;
+	size_t tx_sz;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "post_recv_sess(), err: %d\n", err);
+		return err;
+	}
+	rwr = kcalloc(sess->mrs_num, sizeof(*rwr), GFP_KERNEL);
+	if (unlikely(!rwr)) {
+		ibtrs_err(sess, "No memory\n");
+		return -ENOMEM;
+	}
+	memcpy(sess->s.sessname, msg->sessname, sizeof(sess->s.sessname));
+
+	tx_sz  = sizeof(*rsp);
+	tx_sz += sizeof(rsp->desc[0]) * sess->mrs_num;
+	tx_iu = ibtrs_iu_alloc(0, tx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_TO_DEVICE, ibtrs_srv_info_rsp_done);
+	if (unlikely(!tx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(), err: %d\n", -ENOMEM);
+		err = -ENOMEM;
+		goto rwr_free;
+	}
+
+	rsp = tx_iu->buf;
+	rsp->type = cpu_to_le16(IBTRS_MSG_INFO_RSP);
+	rsp->sg_cnt = cpu_to_le16(sess->mrs_num);
+
+	for (mri = 0; mri < sess->mrs_num; mri++) {
+		struct ib_mr *mr = sess->mrs[mri].mr;
+
+		rsp->desc[mri].addr = cpu_to_le64(mr->iova);
+		rsp->desc[mri].key  = cpu_to_le32(mr->rkey);
+		rsp->desc[mri].len  = cpu_to_le32(mr->length);
+
+		/*
+		 * Fill in reg MR request and chain them *backwards*
+		 */
+		rwr[mri].wr.next = mri ? &rwr[mri - 1].wr : NULL;
+		rwr[mri].wr.opcode = IB_WR_REG_MR;
+		rwr[mri].wr.wr_cqe = &local_reg_cqe;
+		rwr[mri].wr.num_sge = 0;
+		rwr[mri].wr.send_flags = 0;
+		rwr[mri].mr = mr;
+		rwr[mri].key = mr->rkey;
+		rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
+				   IB_ACCESS_REMOTE_WRITE);
+		reg_wr = &rwr[mri].wr;
+	}
+
+	err = ibtrs_srv_create_sess_files(sess);
+	if (unlikely(err))
+		goto iu_free;
+
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CONNECTED);
+	ibtrs_srv_start_hb(sess);
+
+	/*
+	 * We do not account number of established connections at the current
+	 * moment, we rely on the client, which should send info request when
+	 * all connections are successfully established.  Thus, simply notify
+	 * listener with a proper event if we are the first path.
+	 */
+	ibtrs_srv_sess_up(sess);
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info response */
+	err = ibtrs_iu_post_send(&con->c, tx_iu, tx_sz, reg_wr);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+iu_free:
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+	}
+rwr_free:
+	kfree(rwr);
+
+	return err;
+}
+
+static void ibtrs_srv_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *iu;
+	int err;
+
+	WARN_ON(con->c.cid);
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request receive failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto close;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info request is malformed: size %d\n",
+			  wc->byte_len);
+		goto close;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_REQ)) {
+		ibtrs_err(sess, "Sess info request is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto close;
+	}
+	err = process_info_req(con, msg);
+	if (unlikely(err))
+		goto close;
+
+out:
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	return;
+close:
+	close_sess(sess);
+	goto out;
+}
+
+static int post_recv_info_req(struct ibtrs_srv_con *con)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *rx_iu;
+	int err;
+
+	rx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req),
+			       GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, ibtrs_srv_info_req_done);
+	if (unlikely(!rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		return -ENOMEM;
+	}
+	/* Prepare for getting info response */
+	err = ibtrs_iu_post_recv(&con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+		return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_io(struct ibtrs_srv_con *con, size_t q_size)
+{
+	int i, err;
+
+	for (i = 0; i < q_size; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	size_t q_size;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = srv->queue_depth;
+
+		err = post_recv_io(to_srv_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void process_read(struct ibtrs_srv_con *con,
+			 struct ibtrs_msg_rdma_read *msg,
+			 u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t usr_len, data_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing read request failed, "
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, READ);
+	id = sess->ops_ids[buf_id];
+	id->con		= con;
+	id->dir		= READ;
+	id->msg_id	= buf_id;
+	id->rd_msg	= msg;
+	usr_len = le16_to_cpu(msg->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, READ, data, data_len,
+			   data + data_len, usr_len);
+
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing read request failed, user "
+			     "module cb reported for msg_id %d, err: %d\n",
+			     buf_id, ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Sending err msg for failed RDMA-Write-Req"
+			     " failed, msg_id %d, err: %d\n", buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_write(struct ibtrs_srv_con *con,
+			  struct ibtrs_msg_rdma_write *req,
+			  u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t data_len, usr_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing write request failed, "
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, WRITE);
+	id = sess->ops_ids[buf_id];
+	id->con    = con;
+	id->dir    = WRITE;
+	id->msg_id = buf_id;
+
+	usr_len = le16_to_cpu(req->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, WRITE, data, data_len,
+			   data + data_len, usr_len);
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing write request failed, user"
+			     " module callback reports err: %d\n", ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Processing write request failed, sending"
+			     " I/O response failed, msg_id %d, err: %d\n",
+			     buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_io_req(struct ibtrs_srv_con *con, void *msg,
+			   u32 id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_msg_rdma_hdr *hdr;
+	unsigned int type;
+
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, sess->dma_addr[id],
+				   max_chunk_size, DMA_BIDIRECTIONAL);
+	hdr = msg;
+	type = le16_to_cpu(hdr->type);
+
+	switch (type) {
+	case IBTRS_MSG_WRITE:
+		process_write(con, msg, id, off);
+		break;
+	case IBTRS_MSG_READ:
+		process_read(con, msg, id, off);
+		break;
+	default:
+		ibtrs_err(sess, "Processing I/O request failed, "
+			  "unknown message type received: 0x%02x\n", type);
+		goto err;
+	}
+
+	return;
+
+err:
+	close_sess(sess);
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	u32 imm_type, imm_payload;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "%s (wr_cqe: %p,"
+				  " type: %d, vendor_err: 0x%x, len: %u)\n",
+				  ib_wc_status_msg(wc->status), wc->wr_cqe,
+				  wc->opcode, wc->vendor_err, wc->byte_len);
+			close_sess(sess);
+		}
+		return;
+	}
+	ibtrs_srv_update_wc_stats(&sess->stats);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv(), err: %d\n", err);
+			close_sess(sess);
+			break;
+		}
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_REQ_IMM)) {
+			u32 msg_id, off;
+			void *data;
+
+			msg_id = imm_payload >> sess->mem_bits;
+			off = imm_payload & ((1 << sess->mem_bits) - 1);
+			if (unlikely(msg_id > srv->queue_depth ||
+				     off > max_chunk_size)) {
+				ibtrs_err(sess, "Wrong msg_id %u, off %u\n",
+					  msg_id, off);
+				close_sess(sess);
+				return;
+			}
+			data = page_address(srv->chunks[msg_id]) + off;
+			process_io_req(con, data, msg_id, off);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *srv, char *sessname, size_t len)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOTCONN;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (sess->state != IBTRS_SRV_CONNECTED)
+			continue;
+		memcpy(sessname, sess->s.sessname,
+		       min_t(size_t, sizeof(sess->s.sessname), len));
+		err = 0;
+		break;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_sess_name);
+
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *srv)
+{
+	return srv->queue_depth;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_queue_depth);
+
+static int find_next_bit_ring(struct ibtrs_srv_sess *sess)
+{
+	struct ib_device *ib_dev = sess->s.dev->ib_dev;
+	int v;
+
+	v = cpumask_next(sess->cur_cq_vector, &cq_affinity_mask);
+	if (v >= nr_cpu_ids || v >= ib_dev->num_comp_vectors)
+		v = cpumask_first(&cq_affinity_mask);
+	return v;
+}
+
+static int ibtrs_srv_get_next_cq_vector(struct ibtrs_srv_sess *sess)
+{
+	sess->cur_cq_vector = find_next_bit_ring(sess);
+
+	return sess->cur_cq_vector;
+}
+
+static struct ibtrs_srv *__alloc_srv(struct ibtrs_srv_ctx *ctx,
+				     const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+	int i;
+
+	srv = kzalloc(sizeof(*srv), GFP_KERNEL);
+	if  (unlikely(!srv))
+		return NULL;
+
+	refcount_set(&srv->refcount, 1);
+	INIT_LIST_HEAD(&srv->paths_list);
+	mutex_init(&srv->paths_mutex);
+	mutex_init(&srv->paths_ev_mutex);
+	uuid_copy(&srv->paths_uuid, paths_uuid);
+	srv->queue_depth = sess_queue_depth;
+	srv->ctx = ctx;
+
+	srv->chunks = kcalloc(srv->queue_depth, sizeof(*srv->chunks),
+			      GFP_KERNEL);
+	if (unlikely(!srv->chunks))
+		goto err_free_srv;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		srv->chunks[i] = mempool_alloc(chunk_pool, GFP_KERNEL);
+		if (unlikely(!srv->chunks[i])) {
+			pr_err("mempool_alloc() failed\n");
+			goto err_free_chunks;
+		}
+	}
+	list_add(&srv->ctx_list, &ctx->srv_list);
+
+	return srv;
+
+err_free_chunks:
+	while (i--)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+
+err_free_srv:
+	kfree(srv);
+
+	return NULL;
+}
+
+static void free_srv(struct ibtrs_srv *srv)
+{
+	int i;
+
+	WARN_ON(refcount_read(&srv->refcount));
+	for (i = 0; i < srv->queue_depth; i++)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+	kfree(srv);
+}
+
+static inline struct ibtrs_srv *__find_srv_and_get(struct ibtrs_srv_ctx *ctx,
+						   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list) {
+		if (uuid_equal(&srv->paths_uuid, paths_uuid) &&
+		    refcount_inc_not_zero(&srv->refcount))
+			return srv;
+	}
+
+	return NULL;
+}
+
+static struct ibtrs_srv *get_or_create_srv(struct ibtrs_srv_ctx *ctx,
+					   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	srv = __find_srv_and_get(ctx, paths_uuid);
+	if (!srv)
+		srv = __alloc_srv(ctx, paths_uuid);
+	mutex_unlock(&ctx->srv_mutex);
+
+	return srv;
+}
+
+static void put_srv(struct ibtrs_srv *srv)
+{
+	if (refcount_dec_and_test(&srv->refcount)) {
+		struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+		WARN_ON(srv->dev.kobj.state_in_sysfs);
+		WARN_ON(srv->kobj_paths.state_in_sysfs);
+
+		mutex_lock(&ctx->srv_mutex);
+		list_del(&srv->ctx_list);
+		mutex_unlock(&ctx->srv_mutex);
+		free_srv(srv);
+	}
+}
+
+static void __add_path_to_srv(struct ibtrs_srv *srv,
+			      struct ibtrs_srv_sess *sess)
+{
+	list_add_tail(&sess->s.entry, &srv->paths_list);
+	srv->paths_num++;
+	WARN_ON(srv->paths_num >= MAX_PATHS_NUM);
+}
+
+static void del_path_from_srv(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	if (WARN_ON(!srv))
+		return;
+
+	mutex_lock(&srv->paths_mutex);
+	list_del(&sess->s.entry);
+	WARN_ON(!srv->paths_num);
+	srv->paths_num--;
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static inline bool __is_path_w_addr_exists(struct ibtrs_srv *srv,
+					   struct rdma_addr *addr)
+{
+	struct ibtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		if (!sockaddr_cmp((struct sockaddr *)&sess->s.dst_addr,
+				  (struct sockaddr *)&addr->dst_addr) &&
+		    !sockaddr_cmp((struct sockaddr *)&sess->s.src_addr,
+				  (struct sockaddr *)&addr->src_addr))
+			return true;
+
+	return false;
+}
+
+static void ibtrs_srv_close_work(struct work_struct *work)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv_ctx *ctx;
+	struct ibtrs_srv_con *con;
+	int i;
+
+	sess = container_of(work, typeof(*sess), close_work);
+	ctx = sess->srv->ctx;
+
+	ibtrs_srv_destroy_sess_files(sess);
+	ibtrs_srv_stop_hb(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		rdma_disconnect(con->c.cm_id);
+		ib_drain_qp(con->c.qp);
+	}
+	/* Wait for all inflights */
+	ibtrs_srv_wait_ops_ids(sess);
+
+	/* Notify upper layer if we are the last path */
+	ibtrs_srv_sess_down(sess);
+
+	unmap_cont_bufs(sess);
+	ibtrs_srv_free_ops_ids(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		ibtrs_cq_qp_destroy(&con->c);
+		rdma_destroy_id(con->c.cm_id);
+		kfree(con);
+	}
+	ibtrs_ib_dev_put(sess->s.dev);
+
+	del_path_from_srv(sess);
+	put_srv(sess->srv);
+	sess->srv = NULL;
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CLOSED);
+
+	kfree(sess->dma_addr);
+	kfree(sess->s.con);
+	kfree(sess);
+}
+
+static int ibtrs_rdma_do_accept(struct ibtrs_srv_sess *sess,
+				struct rdma_cm_id *cm_id)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_msg_conn_rsp msg;
+	struct rdma_conn_param param;
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = retry_count;
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.errno = 0;
+	msg.queue_depth = cpu_to_le16(srv->queue_depth);
+	msg.max_io_size = cpu_to_le32(max_chunk_size - MAX_HDR_SIZE);
+	msg.max_hdr_size = cpu_to_le32(MAX_HDR_SIZE);
+
+	err = rdma_accept(cm_id, &param);
+	if (err)
+		pr_err("rdma_accept(), err: %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_do_reject(struct rdma_cm_id *cm_id, int errno)
+{
+	struct ibtrs_msg_conn_rsp msg;
+	int err;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.errno = cpu_to_le16(errno);
+
+	err = rdma_reject(cm_id, &msg, sizeof(msg));
+	if (err)
+		pr_err("rdma_reject(), err: %d\n", err);
+
+	/* Bounce errno back */
+	return errno;
+}
+
+static struct ibtrs_srv_sess *
+__find_sess(struct ibtrs_srv *srv, const uuid_t *sess_uuid)
+{
+	struct ibtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (uuid_equal(&sess->s.uuid, sess_uuid))
+			return sess;
+	}
+
+	return NULL;
+}
+
+static int create_con(struct ibtrs_srv_sess *sess,
+		      struct rdma_cm_id *cm_id,
+		      unsigned int cid)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_con *con;
+
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con)) {
+		ibtrs_err(sess, "kzalloc() failed\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
+	con->c.cm_id = cm_id;
+	con->c.sess = &sess->s;
+	con->c.cid = cid;
+	atomic_set(&con->wr_cnt, 0);
+
+	if (con->c.cid == 0) {
+		/*
+		 * All receive and all send (each requiring invalidate)
+		 * + 2 for drain and heartbeat
+		 */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+		cq_size = wr_queue_size;
+	} else {
+		/*
+		 * If we have all receive requests posted and
+		 * all write requests posted and each read request
+		 * requires an invalidate request + drain
+		 * and qp gets into error state.
+		 */
+		cq_size = srv->queue_depth * 3 + 1;
+		/*
+		 * In theory we might have queue_depth * 32
+		 * outstanding requests if an unsafe global key is used
+		 * and we have queue_depth read requests each consisting
+		 * of 32 different addresses. div 3 for mlx5.
+		 */
+		wr_queue_size = sess->s.dev->ib_dev->attrs.max_qp_wr / 3;
+	}
+
+	cq_vector = ibtrs_srv_get_next_cq_vector(sess);
+
+	/* TODO: SOFTIRQ can be faster, but be careful with softirq context */
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, 1, cq_vector, cq_size,
+				 wr_queue_size, IB_POLL_WORKQUEUE);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_cq_qp_create(), err: %d\n", err);
+		goto free_con;
+	}
+	if (con->c.cid == 0) {
+		err = post_recv_info_req(con);
+		if (unlikely(err))
+			goto free_cqqp;
+	}
+	WARN_ON(sess->s.con[cid]);
+	sess->s.con[cid] = &con->c;
+
+	/*
+	 * Change context from server to current connection.  The other
+	 * way is to use cm_id->qp->qp_context, which does not work on OFED.
+	 */
+	cm_id->context = &con->c;
+
+	return 0;
+
+free_cqqp:
+	ibtrs_cq_qp_destroy(&con->c);
+free_con:
+	kfree(con);
+
+err:
+	return err;
+}
+
+static struct ibtrs_srv_sess *__alloc_sess(struct ibtrs_srv *srv,
+					   struct rdma_cm_id *cm_id,
+					   unsigned int con_num,
+					   unsigned int recon_cnt,
+					   const uuid_t *uuid)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOMEM;
+
+	if (unlikely(srv->paths_num >= MAX_PATHS_NUM)) {
+		err = -ECONNRESET;
+		goto err;
+	}
+	if (unlikely(__is_path_w_addr_exists(srv, &cm_id->route.addr))) {
+		err = -EEXIST;
+		goto err;
+	}
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	sess->dma_addr = kcalloc(srv->queue_depth, sizeof(*sess->dma_addr),
+				 GFP_KERNEL);
+	if (unlikely(!sess->dma_addr))
+		goto err_free_sess;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_dma_addr;
+
+	sess->state = IBTRS_SRV_CONNECTING;
+	sess->srv = srv;
+	sess->cur_cq_vector = -1;
+	sess->s.dst_addr = cm_id->route.addr.dst_addr;
+	sess->s.src_addr = cm_id->route.addr.src_addr;
+	sess->s.con_num = con_num;
+	sess->s.recon_cnt = recon_cnt;
+	uuid_copy(&sess->s.uuid, uuid);
+	spin_lock_init(&sess->state_lock);
+	INIT_WORK(&sess->close_work, ibtrs_srv_close_work);
+	ibtrs_srv_init_hb(sess);
+
+	sess->s.dev = ibtrs_ib_dev_find_or_add(cm_id->device, &dev_pool);
+	if (unlikely(!sess->s.dev)) {
+		err = -ENOMEM;
+		ibtrs_wrn(sess, "Failed to alloc ibtrs_device\n");
+		goto err_free_con;
+	}
+	err = map_cont_bufs(sess);
+	if (unlikely(err))
+		goto err_put_dev;
+
+	err = ibtrs_srv_alloc_ops_ids(sess);
+	if (unlikely(err))
+		goto err_unmap_bufs;
+
+	__add_path_to_srv(srv, sess);
+
+	return sess;
+
+err_unmap_bufs:
+	unmap_cont_bufs(sess);
+err_put_dev:
+	ibtrs_ib_dev_put(sess->s.dev);
+err_free_con:
+	kfree(sess->s.con);
+err_free_dma_addr:
+	kfree(sess->dma_addr);
+err_free_sess:
+	kfree(sess);
+
+err:
+	return ERR_PTR(err);
+}
+
+static int ibtrs_rdma_connect(struct rdma_cm_id *cm_id,
+			      const struct ibtrs_msg_conn_req *msg,
+			      size_t len)
+{
+	struct ibtrs_srv_ctx *ctx = cm_id->context;
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv *srv;
+
+	u16 version, con_num, cid;
+	u16 recon_cnt;
+	int err;
+
+	if (unlikely(len < sizeof(*msg))) {
+		pr_err("Invalid IBTRS connection request\n");
+		goto reject_w_econnreset;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		pr_err("Invalid IBTRS magic\n");
+		goto reject_w_econnreset;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_PROTO_VER_MAJOR)) {
+		pr_err("Unsupported major IBTRS version: %d, expected %d\n",
+		       version >> 8, IBTRS_PROTO_VER_MAJOR);
+		goto reject_w_econnreset;
+	}
+	con_num = le16_to_cpu(msg->cid_num);
+	if (unlikely(con_num > 4096)) {
+		/* Sanity check */
+		pr_err("Too many connections requested: %d\n", con_num);
+		goto reject_w_econnreset;
+	}
+	cid = le16_to_cpu(msg->cid);
+	if (unlikely(cid >= con_num)) {
+		/* Sanity check */
+		pr_err("Incorrect cid: %d >= %d\n", cid, con_num);
+		goto reject_w_econnreset;
+	}
+	recon_cnt = le16_to_cpu(msg->recon_cnt);
+	srv = get_or_create_srv(ctx, &msg->paths_uuid);
+	if (unlikely(!srv)) {
+		err = -ENOMEM;
+		goto reject_w_err;
+	}
+	mutex_lock(&srv->paths_mutex);
+	sess = __find_sess(srv, &msg->sess_uuid);
+	if (sess) {
+		/* Session already holds a reference */
+		put_srv(srv);
+
+		if (unlikely(sess->s.recon_cnt != recon_cnt)) {
+			ibtrs_err(sess, "Reconnect detected %d != %d, but "
+				  "previous session is still alive, reconnect "
+				  "later\n", sess->s.recon_cnt, recon_cnt);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_ebusy;
+		}
+		if (unlikely(sess->state != IBTRS_SRV_CONNECTING)) {
+			ibtrs_err(sess, "Session in wrong state: %s\n",
+				  ibtrs_srv_state_str(sess->state));
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		/*
+		 * Sanity checks
+		 */
+		if (unlikely(con_num != sess->s.con_num ||
+			     cid >= sess->s.con_num)) {
+			ibtrs_err(sess, "Incorrect request: %d, %d\n",
+				  cid, con_num);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		if (unlikely(sess->s.con[cid])) {
+			ibtrs_err(sess, "Connection already exists: %d\n",
+				  cid);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+	} else {
+		sess = __alloc_sess(srv, cm_id, con_num, recon_cnt,
+				    &msg->sess_uuid);
+		if (unlikely(IS_ERR(sess))) {
+			mutex_unlock(&srv->paths_mutex);
+			put_srv(srv);
+			err = PTR_ERR(sess);
+			goto reject_w_err;
+		}
+	}
+	err = create_con(sess, cm_id, cid);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since session has other connections we follow normal way
+		 * through workqueue, but still return an error to tell cma.c
+		 * to call rdma_destroy_id() for current connection.
+		 */
+		goto close_and_return_err;
+	}
+	err = ibtrs_rdma_do_accept(sess, cm_id);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since current connection was successfully added to the
+		 * session we follow normal way through workqueue to close the
+		 * session, thus return 0 to tell cma.c we call
+		 * rdma_destroy_id() ourselves.
+		 */
+		err = 0;
+		goto close_and_return_err;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return 0;
+
+reject_w_err:
+	return ibtrs_rdma_do_reject(cm_id, err);
+
+reject_w_econnreset:
+	return ibtrs_rdma_do_reject(cm_id, -ECONNRESET);
+
+reject_w_ebusy:
+	return ibtrs_rdma_do_reject(cm_id, -EBUSY);
+
+close_and_return_err:
+	close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static int ibtrs_srv_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_srv_sess *sess = NULL;
+
+	if (ev->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+		struct ibtrs_con *c = cm_id->context;
+
+		sess = to_srv_sess(c->sess);
+	}
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		/*
+		 * In case of error cma.c will destroy cm_id,
+		 * see cma_process_remove()
+		 */
+		return ibtrs_rdma_connect(cm_id, ev->param.conn.private_data,
+					  ev->param.conn.private_data_len);
+	case RDMA_CM_EVENT_ESTABLISHED:
+		/* Nothing here */
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_err(sess, "CM error (CM event: %s, err: %d)\n",
+			  rdma_event_msg(ev->event), ev->status);
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		close_sess(sess);
+		break;
+	default:
+		pr_err("Ignoring unexpected CM event %s, err %d\n",
+		       rdma_event_msg(ev->event), ev->status);
+		break;
+	}
+
+	return 0;
+}
+
+static struct rdma_cm_id *ibtrs_srv_cm_init(struct ibtrs_srv_ctx *ctx,
+					    struct sockaddr *addr,
+					    enum rdma_ucm_port_space ps)
+{
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_srv_rdma_cm_handler,
+			       ctx, ps, IB_QPT_RC);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		pr_err("Creating id for RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_out;
+	}
+	ret = rdma_bind_addr(cm_id, addr);
+	if (ret) {
+		pr_err("Binding RDMA address failed, err: %d\n", ret);
+		goto err_cm;
+	}
+	ret = rdma_listen(cm_id, 64);
+	if (ret) {
+		pr_err("Listening on RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_cm;
+	}
+
+	return cm_id;
+
+err_cm:
+	rdma_destroy_id(cm_id);
+err_out:
+
+	return ERR_PTR(ret);
+}
+
+static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
+{
+	struct sockaddr_in6 sin = {
+		.sin6_family	= AF_INET6,
+		.sin6_addr	= IN6ADDR_ANY_INIT,
+		.sin6_port	= htons(port),
+	};
+	struct sockaddr_ib sib = {
+		.sib_family			= AF_IB,
+		.sib_addr.sib_subnet_prefix	= 0ULL,
+		.sib_addr.sib_interface_id	= 0ULL,
+		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
+		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
+		.sib_pkey	= cpu_to_be16(0xffff),
+	};
+	struct rdma_cm_id *cm_ip, *cm_ib;
+	int ret;
+
+	/*
+	 * We accept both IPoIB and IB connections, so we need to keep
+	 * two cm id's, one for each socket type and port space.
+	 * If the cm initialization of one of the id's fails, we abort
+	 * everything.
+	 */
+	cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
+	if (unlikely(IS_ERR(cm_ip)))
+		return PTR_ERR(cm_ip);
+
+	cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
+	if (unlikely(IS_ERR(cm_ib))) {
+		ret = PTR_ERR(cm_ib);
+		goto free_cm_ip;
+	}
+
+	ctx->cm_id_ip = cm_ip;
+	ctx->cm_id_ib = cm_ib;
+
+	return 0;
+
+free_cm_ip:
+	rdma_destroy_id(cm_ip);
+
+	return ret;
+}
+
+static struct ibtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
+					   link_ev_fn *link_ev)
+{
+	struct ibtrs_srv_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->rdma_ev = rdma_ev;
+	ctx->link_ev = link_ev;
+	mutex_init(&ctx->srv_mutex);
+	INIT_LIST_HEAD(&ctx->srv_list);
+
+	return ctx;
+}
+
+static void free_srv_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	WARN_ON(!list_empty(&ctx->srv_list));
+	kfree(ctx);
+}
+
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port)
+{
+	struct ibtrs_srv_ctx *ctx;
+	int err;
+
+	ctx = alloc_srv_ctx(rdma_ev, link_ev);
+	if (unlikely(!ctx))
+		return ERR_PTR(-ENOMEM);
+
+	err = ibtrs_srv_rdma_init(ctx, port);
+	if (unlikely(err)) {
+		free_srv_ctx(ctx);
+		return ERR_PTR(err);
+	}
+	/* Do not let module be unloaded if server context is alive */
+	__module_get(THIS_MODULE);
+
+	return ctx;
+}
+EXPORT_SYMBOL(ibtrs_srv_open);
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess)
+{
+	close_sess(sess);
+}
+
+static void close_sess(struct ibtrs_srv_sess *sess)
+{
+	enum ibtrs_srv_state old_state;
+
+	if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
+					   &old_state))
+		queue_work(ibtrs_wq, &sess->close_work);
+	WARN_ON(sess->state != IBTRS_SRV_CLOSING);
+}
+
+static void close_sessions(struct ibtrs_srv *srv)
+{
+	struct ibtrs_srv_sess *sess;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void close_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list)
+		close_sessions(srv);
+	mutex_unlock(&ctx->srv_mutex);
+	flush_workqueue(ibtrs_wq);
+}
+
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx)
+{
+	rdma_destroy_id(ctx->cm_id_ip);
+	rdma_destroy_id(ctx->cm_id_ib);
+	close_ctx(ctx);
+	free_srv_ctx(ctx);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_srv_close);
+
+static int check_module_params(void)
+{
+	if (sess_queue_depth < 1 || sess_queue_depth > MAX_SESS_QUEUE_DEPTH) {
+		pr_err("Invalid sess_queue_depth value %d, has to be"
+		       " >= %d, <= %d.\n",
+		       sess_queue_depth, 1, MAX_SESS_QUEUE_DEPTH);
+		return -EINVAL;
+	}
+	if (max_chunk_size < 4096 || !is_power_of_2(max_chunk_size)) {
+		pr_err("Invalid max_chunk_size value %d, has to be"
+		       " >= %d and should be power of two.\n",
+		       max_chunk_size, 4096);
+		return -EINVAL;
+	}
+
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and the
+	 * offset inside the memory chunk
+	 */
+	if ((ilog2(sess_queue_depth - 1) + 1) +
+	    (ilog2(max_chunk_size - 1) + 1) > MAX_IMM_PAYL_BITS) {
+		pr_err("RDMA immediate size (%db) not enough to encode "
+		       "%d buffers of size %dB. Reduce 'sess_queue_depth' "
+		       "or 'max_chunk_size' parameters.\n", MAX_IMM_PAYL_BITS,
+		       sess_queue_depth, max_chunk_size);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init ibtrs_server_init(void)
+{
+	int err;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	pr_info("Loading module %s, version %s, proto %s: "
+		"(retry_count: %d, cq_affinity_list: %s, "
+		"max_chunk_size: %d (pure IO %ld, headers %ld) , "
+		"sess_queue_depth: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING, IBTRS_PROTO_VER_STRING,
+		retry_count, cq_affinity_list, max_chunk_size,
+		max_chunk_size - MAX_HDR_SIZE, MAX_HDR_SIZE,
+		sess_queue_depth);
+
+	ibtrs_ib_dev_pool_init(0, &dev_pool);
+
+	err = check_module_params();
+	if (err) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	chunk_pool = mempool_create_page_pool(sess_queue_depth * CHUNK_POOL_SZ,
+					      get_order(max_chunk_size));
+	if (unlikely(!chunk_pool)) {
+		pr_err("Failed preallocate pool of chunks\n");
+		return -ENOMEM;
+	}
+	ibtrs_dev_class = class_create(THIS_MODULE, "ibtrs-server");
+	if (unlikely(IS_ERR(ibtrs_dev_class))) {
+		pr_err("Failed to create ibtrs-server dev class\n");
+		err = PTR_ERR(ibtrs_dev_class);
+		goto out_chunk_pool;
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_server_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!ibtrs_wq)) {
+		pr_err("Failed to load module, alloc ibtrs_server_wq failed\n");
+		goto out_dev_class;
+	}
+
+	return 0;
+
+out_dev_class:
+	class_destroy(ibtrs_dev_class);
+out_chunk_pool:
+	mempool_destroy(chunk_pool);
+
+	return err;
+}
+
+static void __exit ibtrs_server_exit(void)
+{
+	destroy_workqueue(ibtrs_wq);
+	class_destroy(ibtrs_dev_class);
+	mempool_destroy(chunk_pool);
+	ibtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(ibtrs_server_init);
+module_exit(ibtrs_server_exit);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 12/25] ibtrs: server: sysfs interface functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This is the sysfs interface to IBTRS sessions on server side:

  /sys/devices/virtual/ibtrs-server/<SESS-NAME>/
    *** IBTRS session accepted from a client peer
    |
    |- paths/<SRC@DST>/
       *** established paths from a client in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- hca_name
       |  *** HCA name
       |
       |- hca_port
       |  *** HCA port
       |
       |- stats/
          *** current path statistics
          |
	  |- rdma
	  |- reset_all
	  |- wc_completions

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c    | 303 ++++++++++++++++++
 1 file changed, 303 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
new file mode 100644
index 000000000000..c48c368c1906
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_srv_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_srv_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_srv_sess *sess;
+	char str[MAXHOSTNAMELEN];
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: invalid value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	ibtrs_info(sess, "disconnect for path %s requested\n", str);
+	ibtrs_srv_queue_close(sess);
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_srv_disconnect_attr =
+	__ATTR(disconnect, 0644,
+	       ibtrs_srv_disconnect_show, ibtrs_srv_disconnect_store);
+
+static ssize_t ibtrs_srv_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_con *usr_con;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+	usr_con = sess->s.con[0];
+
+	return scnprintf(page, PAGE_SIZE, "%u\n",
+			 usr_con->cm_id->port_num);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_port_attr =
+	__ATTR(hca_port, 0444, ibtrs_srv_hca_port_show, NULL);
+
+static ssize_t ibtrs_srv_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 sess->s.dev->ib_dev->name);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_name_attr =
+	__ATTR(hca_name, 0444, ibtrs_srv_hca_name_show, NULL);
+
+static ssize_t ibtrs_srv_src_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute ibtrs_srv_src_addr_attr =
+	__ATTR(src_addr, 0444, ibtrs_srv_src_addr_show, NULL);
+
+static ssize_t ibtrs_srv_dst_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute ibtrs_srv_dst_addr_attr =
+	__ATTR(dst_addr, 0444, ibtrs_srv_dst_addr_show, NULL);
+
+static struct attribute *ibtrs_srv_sess_attrs[] = {
+	&ibtrs_srv_hca_name_attr.attr,
+	&ibtrs_srv_hca_port_attr.attr,
+	&ibtrs_srv_src_addr_attr.attr,
+	&ibtrs_srv_dst_addr_attr.attr,
+	&ibtrs_srv_disconnect_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_sess_attr_group = {
+	.attrs = ibtrs_srv_sess_attrs,
+};
+
+STAT_ATTR(struct ibtrs_srv_sess, rdma,
+	  ibtrs_srv_stats_rdma_to_str,
+	  ibtrs_srv_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, wc_completion,
+	  ibtrs_srv_stats_wc_completion_to_str,
+	  ibtrs_srv_reset_wc_completion_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, reset_all,
+	  ibtrs_srv_reset_all_help,
+	  ibtrs_srv_reset_all_stats);
+
+static struct attribute *ibtrs_srv_stats_attrs[] = {
+	&rdma_attr.attr,
+	&wc_completion_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_stats_attr_group = {
+	.attrs = ibtrs_srv_stats_attrs,
+};
+
+static void ibtrs_srv_dev_release(struct device *dev)
+{
+	/* Nobody plays with device references, so nop */
+}
+
+static int ibtrs_srv_create_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int err = 0;
+
+	mutex_lock(&srv->paths_mutex);
+	if (srv->dev_ref++) {
+		/*
+		 * Just increase device reference.  We can't use get_device()
+		 * because we need to unregister device when ref goes to 0,
+		 * not just to put it.
+		 */
+		goto unlock;
+	}
+	srv->dev.class = ibtrs_dev_class;
+	srv->dev.release = ibtrs_srv_dev_release;
+	dev_set_name(&srv->dev, "%s", sess->s.sessname);
+
+	err = device_register(&srv->dev);
+	if (unlikely(err)) {
+		pr_err("device_register(): %d\n", err);
+		goto unlock;
+	}
+	err = kobject_init_and_add(&srv->kobj_paths, &ktype,
+				   &srv->dev.kobj, "paths");
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		device_unregister(&srv->dev);
+		goto unlock;
+	}
+unlock:
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static void
+ibtrs_srv_destroy_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	mutex_lock(&srv->paths_mutex);
+	if (!--srv->dev_ref) {
+		kobject_del(&srv->kobj_paths);
+		kobject_put(&srv->kobj_paths);
+		device_unregister(&srv->dev);
+	}
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static int ibtrs_srv_create_stats_files(struct ibtrs_srv_sess *sess)
+{
+	int err;
+
+	err = kobject_init_and_add(&sess->kobj_stats, &ktype,
+				   &sess->kobj, "stats");
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj_stats,
+				 &ibtrs_srv_stats_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(&sess->kobj_stats);
+	kobject_put(&sess->kobj_stats);
+
+	return err;
+}
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	char str[NAME_MAX];
+	int err, cnt;
+
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      str, sizeof(str));
+	cnt += scnprintf(str + cnt, sizeof(str) - cnt, "@");
+	sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			str + cnt, sizeof(str) - cnt);
+
+	err = ibtrs_srv_create_once_sysfs_root_folders(sess);
+	if (unlikely(err))
+		return err;
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &srv->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		goto destroy_root;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_srv_create_stats_files(sess);
+	if (unlikely(err))
+		goto remove_group;
+
+	return 0;
+
+remove_group:
+	sysfs_remove_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+destroy_root:
+	ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+
+	return err;
+}
+
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+
+		ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+	}
+}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 06/25] ibtrs: client: main functionality
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This is main functionality of ibtrs-client module, which manages
set of RDMA connections for each IBTRS session, does multipathing,
load balancing and failover of RDMA requests.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c | 2844 ++++++++++++++++++++++
 1 file changed, 2844 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
new file mode 100644
index 000000000000..721f9a4fafdc
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
@@ -0,0 +1,2844 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/rculist.h>
+#include <linux/blkdev.h> /* for BLK_MAX_SEGMENT_SIZE */
+
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define IBTRS_CONNECT_TIMEOUT_MS 30000
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Client");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static ushort nr_cons_per_session;
+module_param(nr_cons_per_session, ushort, 0444);
+MODULE_PARM_DESC(nr_cons_per_session, "Number of connections per session."
+		 " (default: nr_cpu_ids)");
+
+static int retry_cnt = 7;
+module_param_named(retry_cnt, retry_cnt, int, 0644);
+MODULE_PARM_DESC(retry_cnt, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 7,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static int __read_mostly noreg_cnt;
+module_param_named(noreg_cnt, noreg_cnt, int, 0444);
+MODULE_PARM_DESC(noreg_cnt, "Max number of SG entries when MR registration "
+		 "does not happen (default: 0)");
+
+static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
+static struct ibtrs_ib_dev_pool dev_pool = {
+	.ops = &dev_pool_ops
+};
+
+static struct workqueue_struct *ibtrs_wq;
+static struct class *ibtrs_dev_class;
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev);
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait);
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
+
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
+{
+	return sess->state == IBTRS_CLT_CONNECTED;
+}
+
+static inline bool ibtrs_clt_is_connected(const struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess;
+	bool connected = false;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry)
+		connected |= ibtrs_clt_sess_is_connected(sess);
+	rcu_read_unlock();
+
+	return connected;
+}
+
+static inline struct ibtrs_tag *
+__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
+{
+	size_t max_depth = clt->queue_depth;
+	struct ibtrs_tag *tag;
+	int cpu, bit;
+
+	cpu = get_cpu();
+	do {
+		bit = find_first_zero_bit(clt->tags_map, max_depth);
+		if (unlikely(bit >= max_depth)) {
+			put_cpu();
+			return NULL;
+		}
+
+	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
+	put_cpu();
+
+	tag = GET_TAG(clt, bit);
+	WARN_ON(tag->mem_id != bit);
+	tag->cpu_id = cpu;
+	tag->con_type = con_type;
+
+	return tag;
+}
+
+static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
+				   struct ibtrs_tag *tag)
+{
+	clear_bit_unlock(tag->mem_id, clt->tags_map);
+}
+
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
+				    enum ibtrs_clt_con_type con_type,
+				    int can_wait)
+{
+	struct ibtrs_tag *tag;
+	DEFINE_WAIT(wait);
+
+	tag = __ibtrs_get_tag(clt, con_type);
+	if (likely(tag) || !can_wait)
+		return tag;
+
+	do {
+		prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
+		tag = __ibtrs_get_tag(clt, con_type);
+		if (likely(tag))
+			break;
+
+		io_schedule();
+	} while (1);
+
+	finish_wait(&clt->tags_wait, &wait);
+
+	return tag;
+}
+EXPORT_SYMBOL(ibtrs_clt_get_tag);
+
+void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
+{
+	if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
+		return;
+
+	__ibtrs_put_tag(clt, tag);
+
+	/*
+	 * Putting a tag is a barrier, so we will observe
+	 * new entry in the wait list, no worries.
+	 */
+	if (waitqueue_active(&clt->tags_wait))
+		wake_up(&clt->tags_wait);
+}
+EXPORT_SYMBOL(ibtrs_clt_put_tag);
+
+struct ibtrs_tag *ibtrs_tag_from_pdu(void *pdu)
+{
+	return pdu - sizeof(struct ibtrs_tag);
+}
+EXPORT_SYMBOL(ibtrs_tag_from_pdu);
+
+void *ibtrs_tag_to_pdu(struct ibtrs_tag *tag)
+{
+	return tag + 1;
+}
+EXPORT_SYMBOL(ibtrs_tag_to_pdu);
+
+/**
+ * ibtrs_tag_to_clt_con() - returns RDMA connection id by the tag
+ *
+ * Note:
+ *     IO connection starts from 1.
+ *     0 connection is for user messages.
+ */
+static struct ibtrs_clt_con *ibtrs_tag_to_clt_con(struct ibtrs_clt_sess *sess,
+						  struct ibtrs_tag *tag)
+{
+	int id = 0;
+
+	if (likely(tag->con_type == IBTRS_IO_CON))
+		id = (tag->cpu_id % (sess->s.con_num - 1)) + 1;
+
+	return to_clt_con(sess->s.con[id]);
+}
+
+static void ibtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Failed IB_WR_REG_MR: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_rdma_error_recovery(con);
+	}
+}
+
+static struct ib_cqe fast_reg_cqe = {
+	.done = ibtrs_clt_fast_reg_done
+};
+
+static void ibtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_io_req *req =
+		container_of(wc->wr_cqe, typeof(*req), inv_cqe);
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Failed IB_WR_LOCAL_INV: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_rdma_error_recovery(con);
+	}
+	req->need_inv = false;
+	if (likely(req->need_inv_comp))
+		complete(&req->inv_comp);
+	else
+		/* Complete request from INV callback */
+		complete_rdma_req(req, req->inv_errno, true, false);
+}
+
+static int ibtrs_inv_rkey(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	const struct ib_send_wr *bad_wr;
+	struct ib_send_wr wr = {
+		.opcode		    = IB_WR_LOCAL_INV,
+		.wr_cqe		    = &req->inv_cqe,
+		.next		    = NULL,
+		.num_sge	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
+		.ex.invalidate_rkey = req->mr->rkey,
+	};
+	req->inv_cqe.done = ibtrs_clt_inv_rkey_done;
+
+	return ib_post_send(con->c.qp, &wr, &bad_wr);
+}
+
+static int ibtrs_post_send_rdma(struct ibtrs_clt_con *con,
+				struct ibtrs_clt_io_req *req,
+				struct ibtrs_rbuf *rbuf, u32 off,
+				u32 imm, struct ib_send_wr *wr)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	struct ib_sge sge;
+
+	if (unlikely(!req->sg_size)) {
+		ibtrs_wrn(sess, "Doing RDMA Write failed, no data supplied\n");
+		return -EINVAL;
+	}
+
+	/* user data and user message in the first list element */
+	sge.addr   = req->iu->dma_addr;
+	sge.length = req->sg_size;
+	sge.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      req->sg_size, DMA_TO_DEVICE);
+
+	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, &sge, 1,
+					    rbuf->rkey, rbuf->addr + off,
+					    imm, flags, wr);
+}
+
+static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess;
+	struct ibtrs_clt *clt;
+	int err;
+
+	if (WARN_ON(!req->in_use))
+		return;
+	if (WARN_ON(!req->con))
+		return;
+	sess = to_clt_sess(con->c.sess);
+	clt = sess->clt;
+
+	if (req->sg_cnt) {
+		if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
+			/*
+			 * We are here to invalidate RDMA read requests
+			 * ourselves.  In normal scenario server should
+			 * send INV for all requested RDMA reads, but
+			 * we are here, thus two things could happen:
+			 *
+			 *    1.  this is failover, when errno != 0
+			 *        and can_wait == 1,
+			 *
+			 *    2.  something totally bad happened and
+			 *        server forgot to send INV, so we
+			 *        should do that ourselves.
+			 */
+
+			if (likely(can_wait)) {
+				req->need_inv_comp = true;
+			} else {
+				/* This should be IO path, so always notify */
+				WARN_ON(!notify);
+				/* Save errno for INV callback */
+				req->inv_errno = errno;
+			}
+
+			err = ibtrs_inv_rkey(req);
+			if (unlikely(err)) {
+				ibtrs_err(sess, "Send INV WR key=%#x: %d\n",
+					  req->mr->rkey, err);
+			} else if (likely(can_wait)) {
+				wait_for_completion(&req->inv_comp);
+			} else {
+				/*
+				 * Something went wrong, so request will be
+				 * completed from INV callback.
+				 */
+				WARN_ON_ONCE(1);
+
+				return;
+			}
+		}
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+				req->sg_cnt, req->dir);
+	}
+	if (sess->stats.enable_rdma_lat)
+		ibtrs_clt_update_rdma_lat(&sess->stats,
+				req->dir == DMA_FROM_DEVICE,
+				jiffies_to_msecs(jiffies - req->start_jiffies));
+	ibtrs_clt_decrease_inflight(&sess->stats);
+
+	req->in_use = false;
+	req->con = NULL;
+
+	if (notify)
+		req->conf(req->priv, errno);
+}
+
+static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id,
+			   s16 errno, bool w_inval)
+{
+	struct ibtrs_clt_io_req *req;
+
+	if (WARN_ON(msg_id >= sess->queue_depth))
+		return;
+
+	req = &sess->reqs[msg_id];
+	/* Drop need_inv if server responsed with invalidation */
+	req->need_inv &= !w_inval;
+	complete_rdma_req(req, errno, true, false);
+}
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_clt_rdma_done
+};
+
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u32 imm_type, imm_payload;
+	bool w_inval = false;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "RDMA failed: %s\n",
+				  ib_wc_status_msg(wc->status));
+			ibtrs_rdma_error_recovery(con);
+		}
+		return;
+	}
+	ibtrs_clt_update_wc_stats(con);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV:
+		/*
+		 * Key invalidations from server side
+		 */
+		WARN_ON(!(wc->wc_flags & IB_WC_WITH_INVALIDATE));
+		WARN_ON(wc->wr_cqe != &io_comp_cqe);
+		break;
+
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_RSP_IMM ||
+			   imm_type == IBTRS_IO_RSP_W_INV_IMM)) {
+			u32 msg_id;
+
+			w_inval = (imm_type == IBTRS_IO_RSP_W_INV_IMM);
+			ibtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
+			process_io_rsp(sess, msg_id, err, w_inval);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		if (w_inval)
+			/*
+			 * Post x2 empty WRs: first is for this RDMA with IMM,
+			 * second is for RECV with INV, which happened earlier.
+			 */
+			err = ibtrs_post_recv_empty_x2(&con->c, &io_comp_cqe);
+		else
+			err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv_empty(): %d\n", err);
+			ibtrs_rdma_error_recovery(con);
+			break;
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+static int post_recv_io(struct ibtrs_clt_con *con, size_t q_size)
+{
+	int err, i;
+
+	for (i = 0; i < q_size; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_clt_sess *sess)
+{
+	size_t q_size;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = sess->queue_depth;
+
+		/*
+		 * x2 for RDMA read responses + FR key invalidations,
+		 * RDMA writes do not require any FR registrations.
+		 */
+		q_size *= 2;
+
+		err = post_recv_io(to_clt_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+struct path_it {
+	int i;
+	struct list_head skip_list;
+	struct ibtrs_clt *clt;
+	struct ibtrs_clt_sess *(*next_path)(struct path_it *it);
+};
+
+#define do_each_path(path, clt, it) {					\
+	path_it_init(it, clt);						\
+	rcu_read_lock();						\
+	for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&		\
+			  (it)->i < (it)->clt->paths_num;		\
+	     (it)->i++)
+
+#define while_each_path(it)						\
+	path_it_deinit(it);						\
+	rcu_read_unlock();						\
+	}
+
+/**
+ * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @memb:       the name of the list_head within the struct.
+ *
+ * Next element returned in round-robin fashion, i.e. head will be skipped,
+ * but if list is observed as empty, NULL will be returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
+({ \
+	list_next_or_null_rcu(head, ptr, type, memb) ?: \
+		list_next_or_null_rcu(head, READ_ONCE((ptr)->next), \
+				      type, memb); \
+})
+
+/**
+ * get_next_path_rr() - Returns path in round-robin fashion.
+ *
+ * Related to @MP_POLICY_RR
+ *
+ * Locks:
+ *    rcu_read_lock() must be hold.
+ */
+static struct ibtrs_clt_sess *get_next_path_rr(struct path_it *it)
+{
+	struct ibtrs_clt_sess __rcu **ppcpu_path;
+	struct ibtrs_clt_sess *path;
+	struct ibtrs_clt *clt;
+
+	clt = it->clt;
+
+	/*
+	 * Here we use two RCU objects: @paths_list and @pcpu_path
+	 * pointer.  See ibtrs_clt_remove_path_from_arr() for details
+	 * how that is handled.
+	 */
+
+	ppcpu_path = this_cpu_ptr(clt->pcpu_path);
+	path = rcu_dereference(*ppcpu_path);
+	if (unlikely(!path))
+		path = list_first_or_null_rcu(&clt->paths_list,
+					      typeof(*path), s.entry);
+	else
+		path = list_next_or_null_rr_rcu(&clt->paths_list,
+						&path->s.entry,
+						typeof(*path),
+						s.entry);
+	rcu_assign_pointer(*ppcpu_path, path);
+
+	return path;
+}
+
+/**
+ * get_next_path_min_inflight() - Returns path with minimal inflight count.
+ *
+ * Related to @MP_POLICY_MIN_INFLIGHT
+ *
+ * Locks:
+ *    rcu_read_lock() must be hold.
+ */
+static struct ibtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
+{
+	struct ibtrs_clt_sess *min_path = NULL;
+	struct ibtrs_clt *clt = it->clt;
+	struct ibtrs_clt_sess *sess;
+	int min_inflight = INT_MAX;
+	int inflight;
+
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
+		if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
+			continue;
+
+		inflight = atomic_read(&sess->stats.inflight);
+
+		if (inflight < min_inflight) {
+			min_inflight = inflight;
+			min_path = sess;
+		}
+	}
+
+	/*
+	 * add the path to the skip list, so that next time we can get
+	 * a different one
+	 */
+	if (min_path)
+		list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
+
+	return min_path;
+}
+
+static inline void path_it_init(struct path_it *it, struct ibtrs_clt *clt)
+{
+	INIT_LIST_HEAD(&it->skip_list);
+	it->clt = clt;
+	it->i = 0;
+
+	if (clt->mp_policy == MP_POLICY_RR)
+		it->next_path = get_next_path_rr;
+	else
+		it->next_path = get_next_path_min_inflight;
+}
+
+static inline void path_it_deinit(struct path_it *it)
+{
+	struct list_head *skip, *tmp;
+	/*
+	 * The skip_list is used only for the MIN_INFLIGHT policy.
+	 * We need to remove paths from it, so that next IO can insert
+	 * paths (->mp_skip_entry) into a skip_list again.
+	 */
+	list_for_each_safe(skip, tmp, &it->skip_list)
+		list_del_init(skip);
+}
+
+static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
+				      struct ibtrs_clt_sess *sess,
+				      ibtrs_conf_fn *conf,
+				      struct ibtrs_tag *tag, void *priv,
+				      const struct kvec *vec, size_t usr_len,
+				      struct scatterlist *sg, size_t sg_cnt,
+				      size_t data_len, int dir)
+{
+	struct iov_iter iter;
+	size_t len;
+
+	req->tag = tag;
+	req->in_use = true;
+	req->usr_len = usr_len;
+	req->data_len = data_len;
+	req->sglist = sg;
+	req->sg_cnt = sg_cnt;
+	req->priv = priv;
+	req->dir = dir;
+	req->con = ibtrs_tag_to_clt_con(sess, tag);
+	req->conf = conf;
+	req->need_inv = false;
+	req->need_inv_comp = false;
+	req->inv_errno = 0;
+
+	iov_iter_kvec(&iter, READ, vec, 1, usr_len);
+	len = _copy_from_iter(req->iu->buf, usr_len, &iter);
+	WARN_ON(len != usr_len);
+
+	reinit_completion(&req->inv_comp);
+	if (sess->stats.enable_rdma_lat)
+		req->start_jiffies = jiffies;
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_req(struct ibtrs_clt_sess *sess, ibtrs_conf_fn *conf,
+		  struct ibtrs_tag *tag, void *priv,
+		  const struct kvec *vec, size_t usr_len,
+		  struct scatterlist *sg, size_t sg_cnt,
+		  size_t data_len, int dir)
+{
+	struct ibtrs_clt_io_req *req;
+
+	req = &sess->reqs[tag->mem_id];
+	ibtrs_clt_init_req(req, sess, conf, tag, priv, vec, usr_len,
+			   sg, sg_cnt, data_len, dir);
+	return req;
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_copy_req(struct ibtrs_clt_sess *alive_sess,
+		       struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_io_req *req;
+	struct kvec vec = {
+		.iov_base = fail_req->iu->buf,
+		.iov_len  = fail_req->usr_len
+	};
+
+	req = &alive_sess->reqs[fail_req->tag->mem_id];
+	ibtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->tag,
+			   fail_req->priv, &vec, fail_req->usr_len,
+			   fail_req->sglist, fail_req->sg_cnt,
+			   fail_req->data_len, fail_req->dir);
+	return req;
+}
+
+static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
+				  struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_sess *alive_sess;
+	struct ibtrs_clt_io_req *req;
+	int err = -ECONNABORTED;
+	struct path_it it;
+
+	do_each_path(alive_sess, clt, &it) {
+		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
+		if (req->dir == DMA_TO_DEVICE)
+			err = ibtrs_clt_write_req(req);
+		else
+			err = ibtrs_clt_read_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_io_req *req;
+	int i, err;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (!req->in_use)
+			continue;
+
+		/*
+		 * Safely (without notification) complete failed request.
+		 * After completion this request is still usebale and can
+		 * be failovered to another path.
+		 */
+		complete_rdma_req(req, -ECONNABORTED, false, true);
+
+		err = ibtrs_clt_failover_req(clt, req);
+		if (unlikely(err))
+			/* Failover failed, notify anyway */
+			req->conf(req->priv, err);
+	}
+}
+
+static void free_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	int i;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (req->mr)
+			ib_dereg_mr(req->mr);
+		kfree(req->sge);
+		ibtrs_iu_free(req->iu, DMA_TO_DEVICE,
+			      sess->s.dev->ib_dev);
+	}
+	kfree(sess->reqs);
+	sess->reqs = NULL;
+}
+
+static int alloc_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt *clt = sess->clt;
+	int i, err = -ENOMEM;
+
+	sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
+			     GFP_KERNEL);
+	if (unlikely(!sess->reqs))
+		return -ENOMEM;
+
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		req->iu = ibtrs_iu_alloc(i, sess->max_hdr_size, GFP_KERNEL,
+					 sess->s.dev->ib_dev, DMA_TO_DEVICE,
+					 ibtrs_clt_rdma_done);
+		if (unlikely(!req->iu))
+			goto out;
+
+		req->sge = kmalloc_array(clt->max_segments + 1,
+					 sizeof(*req->sge), GFP_KERNEL);
+		if (unlikely(!req->sge))
+			goto out;
+
+		req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				      sess->max_pages_per_mr);
+		if (unlikely(IS_ERR(req->mr))) {
+			err = PTR_ERR(req->mr);
+			req->mr = NULL;
+			pr_err("Failed to alloc sess->max_pages_per_mr %d\n",
+			       sess->max_pages_per_mr);
+			goto out;
+		}
+
+		init_completion(&req->inv_comp);
+	}
+
+	return 0;
+
+out:
+	free_sess_reqs(sess);
+
+	return err;
+}
+
+static int alloc_tags(struct ibtrs_clt *clt)
+{
+	unsigned int chunk_bits;
+	int err, i;
+
+	clt->tags_map = kcalloc(BITS_TO_LONGS(clt->queue_depth), sizeof(long),
+				GFP_KERNEL);
+	if (unlikely(!clt->tags_map)) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+	clt->tags = kcalloc(clt->queue_depth, TAG_SIZE(clt), GFP_KERNEL);
+	if (unlikely(!clt->tags)) {
+		err = -ENOMEM;
+		goto err_map;
+	}
+	chunk_bits = ilog2(clt->queue_depth - 1) + 1;
+	for (i = 0; i < clt->queue_depth; i++) {
+		struct ibtrs_tag *tag;
+
+		tag = GET_TAG(clt, i);
+		tag->mem_id = i;
+		tag->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
+	}
+
+	return 0;
+
+err_map:
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+out_err:
+	return err;
+}
+
+static void free_tags(struct ibtrs_clt *clt)
+{
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+	kfree(clt->tags);
+	clt->tags = NULL;
+}
+
+static void query_fast_reg_mode(struct ibtrs_clt_sess *sess)
+{
+	struct ib_device *ib_dev;
+	u64 max_pages_per_mr;
+	int mr_page_shift;
+
+	ib_dev = sess->s.dev->ib_dev;
+
+	/*
+	 * Use the smallest page size supported by the HCA, down to a
+	 * minimum of 4096 bytes. We're unlikely to build large sglists
+	 * out of smaller entries.
+	 */
+	mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
+	max_pages_per_mr   = ib_dev->attrs.max_mr_size;
+	do_div(max_pages_per_mr, (1ull << mr_page_shift));
+	sess->max_pages_per_mr =
+		min3(sess->max_pages_per_mr, (u32)max_pages_per_mr,
+		     ib_dev->attrs.max_fast_reg_page_list_len);
+	sess->max_send_sge = ib_dev->attrs.max_send_sge;
+}
+
+static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				     enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_CLT_CONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_RECONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_RECONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTED:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTED:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTING_ERR:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_RECONNECTING:
+		case IBTRS_CLT_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSED:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_DEAD:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed) {
+		sess->state = new_state;
+		wake_up_locked(&sess->state_wq);
+	}
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_from_to(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state old_state,
+					   enum ibtrs_clt_state new_state)
+{
+	bool changed = false;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	if (sess->state == old_state)
+		changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_get_old(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state new_state,
+					   enum ibtrs_clt_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	*old_state = sess->state;
+	changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				   enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+
+	return ibtrs_clt_change_state_get_old(sess, new_state, &old_state);
+}
+
+static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state state;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	state = sess->state;
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return state;
+}
+
+static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	struct ibtrs_clt_con *con;
+
+	(void)err;
+	con = container_of(c, typeof(*con), c);
+	ibtrs_rdma_error_recovery(con);
+}
+
+static void ibtrs_clt_init_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_clt_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_clt_start_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_clt_stop_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work);
+static void ibtrs_clt_close_work(struct work_struct *work);
+
+static struct ibtrs_clt_sess *alloc_sess(struct ibtrs_clt *clt,
+					 const struct ibtrs_addr *path,
+					 size_t con_num, u16 max_segments)
+{
+	struct ibtrs_clt_sess *sess;
+	int err = -ENOMEM;
+	int cpu;
+
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	/* Extra connection for user messages */
+	con_num += 1;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_sess;
+
+	mutex_init(&sess->init_mutex);
+	uuid_gen(&sess->s.uuid);
+	memcpy(&sess->s.dst_addr, path->dst,
+	       rdma_addr_size((struct sockaddr *)path->dst));
+
+	/*
+	 * rdma_resolve_addr() passes src_addr to cma_bind_addr, which
+	 * checks the sa_family to be non-zero. If user passed src_addr=NULL
+	 * the sess->src_addr will contain only zeros, which is then fine.
+	 */
+	if (path->src)
+		memcpy(&sess->s.src_addr, path->src,
+		       rdma_addr_size((struct sockaddr *)path->src));
+	strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
+	sess->s.con_num = con_num;
+	sess->clt = clt;
+	sess->max_pages_per_mr = max_segments * BLK_MAX_SEGMENT_SIZE >> 12;
+	init_waitqueue_head(&sess->state_wq);
+	sess->state = IBTRS_CLT_CONNECTING;
+	atomic_set(&sess->connected_cnt, 0);
+	INIT_WORK(&sess->close_work, ibtrs_clt_close_work);
+	INIT_DELAYED_WORK(&sess->reconnect_dwork, ibtrs_clt_reconnect_work);
+	ibtrs_clt_init_hb(sess);
+
+	sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
+	if (unlikely(!sess->mp_skip_entry))
+		goto err_free_con;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
+
+	err = ibtrs_clt_init_stats(&sess->stats);
+	if (unlikely(err))
+		goto err_free_percpu;
+
+	return sess;
+
+err_free_percpu:
+	free_percpu(sess->mp_skip_entry);
+err_free_con:
+	kfree(sess->s.con);
+err_free_sess:
+	kfree(sess);
+err:
+	return ERR_PTR(err);
+}
+
+static void free_sess(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_free_stats(&sess->stats);
+	free_percpu(sess->mp_skip_entry);
+	kfree(sess->s.con);
+	kfree(sess->rbufs);
+	kfree(sess);
+}
+
+static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
+{
+	struct ibtrs_clt_con *con;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con))
+		return -ENOMEM;
+
+	/* Map first two connections to the first CPU */
+	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
+	con->c.cid = cid;
+	con->c.sess = &sess->s;
+	atomic_set(&con->io_cnt, 0);
+
+	sess->s.con[cid] = &con->c;
+
+	return 0;
+}
+
+static void destroy_con(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	sess->s.con[con->c.cid] = NULL;
+	kfree(con);
+}
+
+static int create_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u16 wr_queue_size;
+	int err, cq_vector;
+
+	/*
+	 * This function can fail, but still destroy_con_cq_qp() should
+	 * be called, this is because create_con_cq_qp() is called on cm
+	 * event path, thus caller/waiter never knows: have we failed before
+	 * create_con_cq_qp() or after.  To solve this dilemma without
+	 * creating any additional flags just allow destroy_con_cq_qp() be
+	 * called many times.
+	 */
+
+	if (con->c.cid == 0) {
+		/*
+		 * One completion for each receive and two for each send
+		 * (send request + registration)
+		 * + 2 for drain and heartbeat
+		 * in case qp gets into error state
+		 */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+		/* We must be the first here */
+		if (WARN_ON(sess->s.dev))
+			return -EINVAL;
+
+		/*
+		 * The whole session uses device from user connection.
+		 * Be careful not to close user connection before ib dev
+		 * is gracefully put.
+		 */
+		sess->s.dev = ibtrs_ib_dev_find_or_add(con->c.cm_id->device,
+						       &dev_pool);
+		if (unlikely(!sess->s.dev)) {
+			ibtrs_wrn(sess,
+				  "ibtrs_ib_dev_find_get_or_add(): no memory\n");
+			return -ENOMEM;
+		}
+		sess->s.dev_ref = 1;
+		query_fast_reg_mode(sess);
+	} else {
+		/*
+		 * Here we assume that session members are correctly set.
+		 * This is always true if user connection (cid == 0) is
+		 * established first.
+		 */
+		if (WARN_ON(!sess->s.dev))
+			return -EINVAL;
+		if (WARN_ON(!sess->queue_depth))
+			return -EINVAL;
+
+		/* Shared between connections */
+		sess->s.dev_ref++;
+		wr_queue_size =
+			min_t(int, sess->s.dev->ib_dev->attrs.max_qp_wr,
+			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
+			      sess->queue_depth * 3 + 1);
+	}
+	cq_vector = con->cpu % sess->s.dev->ib_dev->num_comp_vectors;
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, sess->max_send_sge,
+				 cq_vector, wr_queue_size, wr_queue_size,
+				 IB_POLL_SOFTIRQ);
+	/*
+	 * In case of error we do not bother to clean previous allocations,
+	 * since destroy_con_cq_qp() must be called.
+	 */
+
+	if (unlikely(err))
+		return err;
+
+	return err;
+}
+
+static void destroy_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	/*
+	 * Be careful here: destroy_con_cq_qp() can be called even
+	 * create_con_cq_qp() failed, see comments there.
+	 */
+
+	ibtrs_cq_qp_destroy(&con->c);
+	if (sess->s.dev_ref && !--sess->s.dev_ref) {
+		ibtrs_ib_dev_put(sess->s.dev);
+		sess->s.dev = NULL;
+	}
+}
+
+static void stop_cm(struct ibtrs_clt_con *con)
+{
+	rdma_disconnect(con->c.cm_id);
+	if (con->c.qp)
+		ib_drain_qp(con->c.qp);
+}
+
+static void destroy_cm(struct ibtrs_clt_con *con)
+{
+	rdma_destroy_id(con->c.cm_id);
+	con->c.cm_id = NULL;
+}
+
+static int create_cm(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rdma_cm_id *cm_id;
+	int err;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_clt_rdma_cm_handler, con,
+			       sess->s.dst_addr.ss_family == AF_IB ?
+			       RDMA_PS_IB : RDMA_PS_TCP, IB_QPT_RC);
+	if (unlikely(IS_ERR(cm_id))) {
+		err = PTR_ERR(cm_id);
+		ibtrs_err(sess, "Failed to create CM ID, err: %d\n", err);
+
+		return err;
+	}
+	con->c.cm_id = cm_id;
+	con->cm_err = 0;
+	/* allow the port to be reused */
+	err = rdma_set_reuseaddr(cm_id, 1);
+	if (err != 0) {
+		ibtrs_err(sess, "Set address reuse failed, err: %d\n", err);
+		goto destroy_cm;
+	}
+	err = rdma_resolve_addr(cm_id, (struct sockaddr *)&sess->s.src_addr,
+				(struct sockaddr *)&sess->s.dst_addr,
+				IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Failed to resolve address, err: %d\n", err);
+		goto destroy_cm;
+	}
+	/*
+	 * Combine connection status and session events. This is needed
+	 * for waiting two possible cases: cm_err has something meaningful
+	 * or session state was really changed to error by device removal.
+	 */
+	err = wait_event_interruptible_timeout(sess->state_wq,
+			con->cm_err || sess->state != IBTRS_CLT_CONNECTING,
+			msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(err == 0 || err == -ERESTARTSYS)) {
+		if (err == 0)
+			err = -ETIMEDOUT;
+		/* Timedout or interrupted */
+		goto errr;
+	}
+	if (unlikely(con->cm_err < 0)) {
+		err = con->cm_err;
+		goto errr;
+	}
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTING)) {
+		/* Device removal */
+		err = -ECONNABORTED;
+		goto errr;
+	}
+
+	return 0;
+
+errr:
+	stop_cm(con);
+	/* Is safe to call destroy if cq_qp is not inited */
+	destroy_con_cq_qp(con);
+destroy_cm:
+	destroy_cm(con);
+
+	return err;
+}
+
+static void ibtrs_clt_sess_up(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	int up;
+
+	/*
+	 * We can fire RECONNECTED event only when all paths were
+	 * connected on ibtrs_clt_open(), then each was disconnected
+	 * and the first one connected again.  That's why this nasty
+	 * game with counter value.
+	 */
+
+	mutex_lock(&clt->paths_ev_mutex);
+	up = ++clt->paths_up;
+	/*
+	 * Here it is safe to access paths num directly since up counter
+	 * is greater than MAX_PATHS_NUM only while ibtrs_clt_open() is
+	 * in progress, thus paths removals are impossible.
+	 */
+	if (up > MAX_PATHS_NUM && up == MAX_PATHS_NUM + clt->paths_num)
+		clt->paths_up = clt->paths_num;
+	else if (up == 1)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_RECONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+	sess->reconnect_attempts = 0;
+	sess->stats.reconnects.successful_cnt++;
+}
+
+static void ibtrs_clt_sess_down(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&clt->paths_ev_mutex);
+	WARN_ON(!clt->paths_up);
+	if (--clt->paths_up == 0)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_DISCONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+}
+
+static void ibtrs_clt_stop_and_destroy_conns(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_con *con;
+	unsigned int cid;
+
+	WARN_ON(sess->state == IBTRS_CLT_CONNECTED);
+
+	/*
+	 * Possible race with ibtrs_clt_open(), when DEVICE_REMOVAL comes
+	 * exactly in between.  Start destroying after it finishes.
+	 */
+	mutex_lock(&sess->init_mutex);
+	mutex_unlock(&sess->init_mutex);
+
+	/*
+	 * All IO paths must observe !CONNECTED state before we
+	 * free everything.
+	 */
+	synchronize_rcu();
+
+	ibtrs_clt_stop_hb(sess);
+
+	/*
+	 * The order it utterly crucial: firstly disconnect and complete all
+	 * rdma requests with error (thus set in_use=false for requests),
+	 * then fail outstanding requests checking in_use for each, and
+	 * eventually notify upper layer about session disconnection.
+	 */
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		stop_cm(con);
+	}
+	fail_all_outstanding_reqs(sess);
+	free_sess_reqs(sess);
+	ibtrs_clt_sess_down(sess);
+
+	/*
+	 * Wait for graceful shutdown, namely when peer side invokes
+	 * rdma_disconnect(). 'connected_cnt' is decremented only on
+	 * CM events, thus if other side had crashed and hb has detected
+	 * something is wrong, here we will stuck for exactly timeout ms,
+	 * since CM does not fire anything.  That is fine, we are not in
+	 * hurry.
+	 */
+	wait_event_timeout(sess->state_wq, !atomic_read(&sess->connected_cnt),
+			   msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+}
+
+static inline bool xchg_sessions(struct ibtrs_clt_sess __rcu **rcu_ppcpu_path,
+				 struct ibtrs_clt_sess *sess,
+				 struct ibtrs_clt_sess *next)
+{
+	struct ibtrs_clt_sess **ppcpu_path;
+
+	/* Call cmpxchg() without sparse warnings */
+	ppcpu_path = (typeof(ppcpu_path))rcu_ppcpu_path;
+	return (sess == cmpxchg(ppcpu_path, sess, next));
+}
+
+static void ibtrs_clt_remove_path_from_arr(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_sess *next;
+	bool wait_for_grace = false;
+	int cpu;
+
+	mutex_lock(&clt->paths_mutex);
+	list_del_rcu(&sess->s.entry);
+
+	/* Make sure everybody observes path removal. */
+	synchronize_rcu();
+
+	/*
+	 * At this point nobody sees @sess in the list, but still we have
+	 * dangling pointer @pcpu_path which _can_ point to @sess.  Since
+	 * nobody can observe @sess in the list, we guarantee that IO path
+	 * will not assign @sess to @pcpu_path, i.e. @pcpu_path can be equal
+	 * to @sess, but can never again become @sess.
+	 */
+
+	/*
+	 * Decrement paths number only after grace period, because
+	 * caller of do_each_path() must firstly observe list without
+	 * path and only then decremented paths number.
+	 *
+	 * Otherwise there can be the following situation:
+	 *    o Two paths exist and IO is coming.
+	 *    o One path is removed:
+	 *      CPU#0                          CPU#1
+	 *      do_each_path():                ibtrs_clt_remove_path_from_arr():
+	 *          path = get_next_path()
+	 *          ^^^                            list_del_rcu(path)
+	 *          [!CONNECTED path]              clt->paths_num--
+	 *                                              ^^^^^^^^^
+	 *          load clt->paths_num                 from 2 to 1
+	 *                    ^^^^^^^^^
+	 *                    sees 1
+	 *
+	 *      path is observed as !CONNECTED, but do_each_path() loop
+	 *      ends, because expression i < clt->paths_num is false.
+	 */
+	clt->paths_num--;
+
+	/*
+	 * Get @next connection from current @sess which is going to be
+	 * removed.  If @sess is the last element, then @next is NULL.
+	 */
+	next = list_next_or_null_rr_rcu(&clt->paths_list, &sess->s.entry,
+					typeof(*next), s.entry);
+
+	/*
+	 * @pcpu paths can still point to the path which is going to be
+	 * removed, so change the pointer manually.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_sess __rcu **ppcpu_path;
+
+		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
+		if (rcu_dereference(*ppcpu_path) != sess)
+			/*
+			 * synchronize_rcu() was called just after deleting
+			 * entry from the list, thus IO code path cannot
+			 * change pointer back to the pointer which is going
+			 * to be removed, we are safe here.
+			 */
+			continue;
+
+		/*
+		 * We race with IO code path, which also changes pointer,
+		 * thus we have to be careful not to overwrite it.
+		 */
+		if (xchg_sessions(ppcpu_path, sess, next))
+			/*
+			 * @ppcpu_path was successfully replaced with @next,
+			 * that means that someone could also pick up the
+			 * @sess and dereferencing it right now, so wait for
+			 * a grace period is required.
+			 */
+			wait_for_grace = true;
+	}
+	if (wait_for_grace)
+		synchronize_rcu();
+
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static void ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
+				      struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt *clt = sess->clt;
+
+	mutex_lock(&clt->paths_mutex);
+	clt->paths_num++;
+
+	/*
+	 * Firstly increase paths_num, wait for GP and then
+	 * add path to the list.  Why?  Since we add path with
+	 * !CONNECTED state explanation is similar to what has
+	 * been written in ibtrs_clt_remove_path_from_arr().
+	 */
+	synchronize_rcu();
+
+	list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static void ibtrs_clt_close_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(work, struct ibtrs_clt_sess, close_work);
+
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_stop_and_destroy_conns(sess);
+	/*
+	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
+	 *
+	 *   #CPU0                              #CPU1
+	 *   1.  CONNECTED->RECONNECTING
+	 *   2.                                 RECONNECTING->CLOSING
+	 *   3.  queue_work(&reconnect_dwork)
+	 *   4.                                 queue_work(&close_work);
+	 *   5.  reconnect_work();              close_work();
+	 *
+	 * To avoid that case do cancel twice: before and after.
+	 */
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
+}
+
+static void ibtrs_clt_close_conns(struct ibtrs_clt_sess *sess, bool wait)
+{
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSING))
+		queue_work(ibtrs_wq, &sess->close_work);
+	if (wait)
+		flush_work(&sess->close_work);
+}
+
+static int init_conns(struct ibtrs_clt_sess *sess)
+{
+	unsigned int cid;
+	int err;
+
+	/*
+	 * On every new session connections increase reconnect counter
+	 * to avoid clashes with previous sessions not yet closed
+	 * sessions on a server side.
+	 */
+	sess->s.recon_cnt++;
+
+	/* Establish all RDMA connections  */
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = create_con(sess, cid);
+		if (unlikely(err))
+			goto destroy;
+
+		err = create_cm(to_clt_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			destroy_con(to_clt_con(sess->s.con[cid]));
+			goto destroy;
+		}
+	}
+	err = alloc_sess_reqs(sess);
+	if (unlikely(err))
+		goto destroy;
+
+	ibtrs_clt_start_hb(sess);
+
+	return 0;
+
+destroy:
+	while (cid--) {
+		struct ibtrs_clt_con *con = to_clt_con(sess->s.con[cid]);
+
+		stop_cm(con);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+	/*
+	 * If we've never taken async path and got an error, say,
+	 * doing rdma_resolve_addr(), switch to CONNECTION_ERR state
+	 * manually to keep reconnecting.
+	 */
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+static int ibtrs_rdma_addr_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int err;
+
+	err = create_con_cq_qp(con);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "create_con_cq_qp(), err: %d\n", err);
+		return err;
+	}
+	err = rdma_resolve_route(con->c.cm_id, IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Resolving route failed, err: %d\n", err);
+		destroy_con_cq_qp(con);
+	}
+
+	return err;
+}
+
+static int ibtrs_rdma_route_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_msg_conn_req msg;
+	struct rdma_conn_param param;
+
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = clamp(retry_cnt, MIN_RTR_CNT, MAX_RTR_CNT);
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	/*
+	 * Those two are the part of struct cma_hdr which is shared
+	 * with private_data in case of AF_IB, so put zeroes to avoid
+	 * wrong validation inside cma.c on receiver side.
+	 */
+	msg.__cma_version = 0;
+	msg.__ip_version = 0;
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.cid = cpu_to_le16(con->c.cid);
+	msg.cid_num = cpu_to_le16(sess->s.con_num);
+	msg.recon_cnt = cpu_to_le16(sess->s.recon_cnt);
+	uuid_copy(&msg.sess_uuid, &sess->s.uuid);
+	uuid_copy(&msg.paths_uuid, &clt->paths_uuid);
+
+	err = rdma_connect(con->c.cm_id, &param);
+	if (err)
+		ibtrs_err(sess, "rdma_connect(): %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_conn_established(struct ibtrs_clt_con *con,
+				       struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt *clt = sess->clt;
+	const struct ibtrs_msg_conn_rsp *msg;
+	u16 version, queue_depth;
+	int errno;
+	u8 len;
+
+	msg = ev->param.conn.private_data;
+	len = ev->param.conn.private_data_len;
+	if (unlikely(len < sizeof(*msg))) {
+		ibtrs_err(sess, "Invalid IBTRS connection response\n");
+		return -ECONNRESET;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		ibtrs_err(sess, "Invalid IBTRS magic\n");
+		return -ECONNRESET;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_PROTO_VER_MAJOR)) {
+		ibtrs_err(sess, "Unsupported major IBTRS version: %d, expected %d\n",
+			  version >> 8, IBTRS_PROTO_VER_MAJOR);
+		return -ECONNRESET;
+	}
+	errno = le16_to_cpu(msg->errno);
+	if (unlikely(errno)) {
+		ibtrs_err(sess, "Invalid IBTRS message: errno %d\n",
+			  errno);
+		return -ECONNRESET;
+	}
+	if (con->c.cid == 0) {
+		queue_depth = le16_to_cpu(msg->queue_depth);
+
+		if (queue_depth > MAX_SESS_QUEUE_DEPTH) {
+			ibtrs_err(sess, "Invalid IBTRS message: queue=%d\n",
+				  queue_depth);
+			return -ECONNRESET;
+		}
+		if (!sess->rbufs || sess->queue_depth < queue_depth) {
+			kfree(sess->rbufs);
+			sess->rbufs = kcalloc(queue_depth, sizeof(*sess->rbufs),
+					      GFP_KERNEL);
+			if (unlikely(!sess->rbufs)) {
+				ibtrs_err(sess, "Failed to allocate "
+					  "queue_depth=%d\n", queue_depth);
+				return -ENOMEM;
+			}
+		}
+		sess->queue_depth = queue_depth;
+		sess->max_hdr_size = le32_to_cpu(msg->max_hdr_size);
+		sess->max_io_size = le32_to_cpu(msg->max_io_size);
+		sess->chunk_size = sess->max_io_size + sess->max_hdr_size;
+
+		/*
+		 * Global queue depth and IO size is always a minimum.
+		 * If while a reconnection server sends us a value a bit
+		 * higher - client does not care and uses cached minimum.
+		 *
+		 * Since we can have several sessions (paths) restablishing
+		 * connections in parallel, use lock.
+		 */
+		mutex_lock(&clt->paths_mutex);
+		clt->queue_depth = min_not_zero(sess->queue_depth,
+						clt->queue_depth);
+		clt->max_io_size = min_not_zero(sess->max_io_size,
+						clt->max_io_size);
+		mutex_unlock(&clt->paths_mutex);
+
+		/*
+		 * Cache the hca_port and hca_name for sysfs
+		 */
+		sess->hca_port = con->c.cm_id->port_num;
+		scnprintf(sess->hca_name, sizeof(sess->hca_name),
+			  sess->s.dev->ib_dev->name);
+		sess->s.src_addr = con->c.cm_id->route.addr.src_addr;
+	}
+
+	return 0;
+}
+
+static int ibtrs_rdma_conn_rejected(struct ibtrs_clt_con *con,
+				    struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	const struct ibtrs_msg_conn_rsp *msg;
+	const char *rej_msg;
+	int status, errno;
+	u8 data_len;
+
+	status = ev->status;
+	rej_msg = rdma_reject_msg(con->c.cm_id, status);
+	msg = rdma_consumer_reject_data(con->c.cm_id, ev, &data_len);
+
+	if (msg && data_len >= sizeof(*msg)) {
+		errno = (int16_t)le16_to_cpu(msg->errno);
+		if (errno == -EBUSY)
+			ibtrs_err(sess,
+				  "Previous session is still exists on the "
+				  "server, please reconnect later\n");
+		else
+			ibtrs_err(sess,
+				  "Connect rejected: status %d (%s), ibtrs "
+				  "errno %d\n", status, rej_msg, errno);
+	} else {
+		ibtrs_err(sess,
+			  "Connect rejected but with malformed message: "
+			  "status %d (%s)\n", status, rej_msg);
+	}
+
+	return -ECONNRESET;
+}
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (ibtrs_clt_change_state_from_to(sess,
+					   IBTRS_CLT_CONNECTED,
+					   IBTRS_CLT_RECONNECTING)) {
+		/*
+		 * Normal scenario, reconnect if we were successfully connected
+		 */
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	} else {
+		/*
+		 * Error can happen just on establishing new connection,
+		 * so notify waiter with error state, waiter is responsible
+		 * for cleaning the rest and reconnect if needed.
+		 */
+		ibtrs_clt_change_state_from_to(sess,
+					       IBTRS_CLT_CONNECTING,
+					       IBTRS_CLT_CONNECTING_ERR);
+	}
+}
+
+static inline void flag_success_on_conn(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	atomic_inc(&sess->connected_cnt);
+	con->cm_err = 1;
+}
+
+static inline void flag_error_on_conn(struct ibtrs_clt_con *con, int cm_err)
+{
+	if (con->cm_err == 1) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = to_clt_sess(con->c.sess);
+		if (atomic_dec_and_test(&sess->connected_cnt))
+			wake_up(&sess->state_wq);
+	}
+	con->cm_err = cm_err;
+}
+
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_con *con = cm_id->context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int cm_err = 0;
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		cm_err = ibtrs_rdma_addr_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		cm_err = ibtrs_rdma_route_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ESTABLISHED:
+		con->cm_err = ibtrs_rdma_conn_established(con, ev);
+		if (likely(!con->cm_err)) {
+			/*
+			 * Report success and wake up. Here we abuse state_wq,
+			 * i.e. wake up without state change, but we set cm_err.
+			 */
+			flag_success_on_conn(con);
+			wake_up(&sess->state_wq);
+			return 0;
+		}
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+		cm_err = ibtrs_rdma_conn_rejected(con, ev);
+		break;
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_wrn(sess, "CM error event %d\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+		cm_err = -EHOSTUNREACH;
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		/*
+		 * Device removal is a special case.  Queue close and return 0.
+		 */
+		ibtrs_clt_close_conns(sess, false);
+		return 0;
+	default:
+		ibtrs_err(sess, "Unexpected RDMA CM event (%d)\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	}
+
+	if (cm_err) {
+		/*
+		 * cm error makes sense only on connection establishing,
+		 * in other cases we rely on normal procedure of reconnecting.
+		 */
+		flag_error_on_conn(con, cm_err);
+		ibtrs_rdma_error_recovery(con);
+	}
+
+	return 0;
+}
+
+static void ibtrs_clt_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+		return;
+	}
+
+	ibtrs_clt_update_wc_stats(con);
+}
+
+static int process_info_rsp(struct ibtrs_clt_sess *sess,
+			    const struct ibtrs_msg_info_rsp *msg)
+{
+	unsigned int sg_cnt, total_len;
+	int i, sgi;
+
+	sg_cnt = le16_to_cpu(msg->sg_cnt);
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and
+	 * the offset inside the memory chunk.
+	 */
+	if (unlikely((ilog2(sg_cnt - 1) + 1) +
+		     (ilog2(sess->chunk_size - 1) + 1) >
+		     MAX_IMM_PAYL_BITS)) {
+		ibtrs_err(sess, "RDMA immediate size (%db) not enough to "
+			  "encode %d buffers of size %dB\n",  MAX_IMM_PAYL_BITS,
+			  sg_cnt, sess->chunk_size);
+		return -EINVAL;
+	}
+	if (unlikely(!sg_cnt || (sess->queue_depth % sg_cnt))) {
+		ibtrs_err(sess, "Incorrect sg_cnt %d, is not multiple\n",
+			  sg_cnt);
+		return -EINVAL;
+	}
+	total_len = 0;
+	for (sgi = 0, i = 0; sgi < sg_cnt && i < sess->queue_depth; sgi++) {
+		const struct ibtrs_sg_desc *desc = &msg->desc[sgi];
+		u32 len, rkey;
+		u64 addr;
+
+		addr = le64_to_cpu(desc->addr);
+		rkey = le32_to_cpu(desc->key);
+		len  = le32_to_cpu(desc->len);
+
+		total_len += len;
+
+		if (unlikely(!len || (len % sess->chunk_size))) {
+			ibtrs_err(sess, "Incorrect [%d].len %d\n", sgi, len);
+			return -EINVAL;
+		}
+		for ( ; len && i < sess->queue_depth; i++) {
+			sess->rbufs[i].addr = addr;
+			sess->rbufs[i].rkey = rkey;
+
+			len  -= sess->chunk_size;
+			addr += sess->chunk_size;
+		}
+	}
+	/* Sanity check */
+	if (unlikely(sgi != sg_cnt || i != sess->queue_depth)) {
+		ibtrs_err(sess, "Incorrect sg vector, not fully mapped\n");
+		return -EINVAL;
+	}
+	if (unlikely(total_len != sess->chunk_size * sess->queue_depth)) {
+		ibtrs_err(sess, "Incorrect total_len %d\n", total_len);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void ibtrs_clt_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_info_rsp *msg;
+	enum ibtrs_clt_state state;
+	struct ibtrs_iu *iu;
+	size_t rx_sz;
+	int err;
+
+	state = IBTRS_CLT_CONNECTING_ERR;
+
+	WARN_ON(con->c.cid);
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response recv failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto out;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_RSP)) {
+		ibtrs_err(sess, "Sess info response is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto out;
+	}
+	rx_sz  = sizeof(*msg);
+	rx_sz += sizeof(msg->desc[0]) * le16_to_cpu(msg->sg_cnt);
+	if (unlikely(wc->byte_len < rx_sz)) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	err = process_info_rsp(sess, msg);
+	if (unlikely(err))
+		goto out;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err))
+		goto out;
+
+	state = IBTRS_CLT_CONNECTED;
+
+out:
+	ibtrs_clt_update_wc_stats(con);
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	ibtrs_clt_change_state(sess, state);
+}
+
+static int ibtrs_send_sess_info(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_con *usr_con = to_clt_con(sess->s.con[0]);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *tx_iu, *rx_iu;
+	size_t rx_sz;
+	int err;
+
+	rx_sz  = sizeof(struct ibtrs_msg_info_rsp);
+	rx_sz += sizeof(u64) * MAX_SESS_QUEUE_DEPTH;
+
+	tx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req), GFP_KERNEL,
+			       sess->s.dev->ib_dev, DMA_TO_DEVICE,
+			       ibtrs_clt_info_req_done);
+	rx_iu = ibtrs_iu_alloc(0, rx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, ibtrs_clt_info_rsp_done);
+	if (unlikely(!tx_iu || !rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		err = -ENOMEM;
+		goto out;
+	}
+	/* Prepare for getting info response */
+	err = ibtrs_iu_post_recv(&usr_con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		goto out;
+	}
+	rx_iu = NULL;
+
+	msg = tx_iu->buf;
+	msg->type = cpu_to_le16(IBTRS_MSG_INFO_REQ);
+	memcpy(msg->sessname, sess->s.sessname, sizeof(msg->sessname));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info request */
+	err = ibtrs_iu_post_send(&usr_con->c, tx_iu, sizeof(*msg), NULL);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+		goto out;
+	}
+	tx_iu = NULL;
+
+	/* Wait for state change */
+	wait_event_interruptible_timeout(sess->state_wq,
+				sess->state != IBTRS_CLT_CONNECTING,
+				msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTED)) {
+		if (sess->state == IBTRS_CLT_CONNECTING_ERR)
+			err = -ECONNRESET;
+		else
+			err = -ETIMEDOUT;
+		goto out;
+	}
+
+out:
+	if (tx_iu)
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+	if (rx_iu)
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	if (unlikely(err))
+		/* If we've never taken async path because of malloc problems */
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+/**
+ * init_sess() - establishes all session connections and does handshake
+ *
+ * In case of error full close or reconnect procedure should be taken,
+ * because reconnect or close async works can be started.
+ */
+static int init_sess(struct ibtrs_clt_sess *sess)
+{
+	int err;
+
+	mutex_lock(&sess->init_mutex);
+	err = init_conns(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "init_conns(), err: %d\n", err);
+		goto out;
+	}
+	err = ibtrs_send_sess_info(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_send_sess_info(), err: %d\n", err);
+		goto out;
+	}
+	ibtrs_clt_sess_up(sess);
+out:
+	mutex_unlock(&sess->init_mutex);
+
+	return err;
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+	struct ibtrs_clt *clt;
+	unsigned int delay_ms;
+	int err;
+
+	sess = container_of(to_delayed_work(work), struct ibtrs_clt_sess,
+			    reconnect_dwork);
+	clt = sess->clt;
+
+	if (ibtrs_clt_state(sess) == IBTRS_CLT_CLOSING)
+		/* User requested closing */
+		return;
+
+	if (sess->reconnect_attempts >= clt->max_reconnect_attempts) {
+		/* Close a session completely if max attempts is reached */
+		ibtrs_clt_close_conns(sess, false);
+		return;
+	}
+	sess->reconnect_attempts++;
+
+	/* Stop everything */
+	ibtrs_clt_stop_and_destroy_conns(sess);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto reconnect_again;
+
+	return;
+
+reconnect_again:
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_RECONNECTING)) {
+		sess->stats.reconnects.fail_cnt++;
+		delay_ms = clt->reconnect_delay_sec * 1000;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork,
+				   msecs_to_jiffies(delay_ms));
+	}
+}
+
+static void ibtrs_clt_dev_release(struct device *dev)
+{
+	/* Nobody plays with device references, so nop */
+}
+
+static struct ibtrs_clt *alloc_clt(const char *sessname, size_t paths_num,
+				   short port, size_t pdu_sz,
+				   void *priv, link_clt_ev_fn *link_ev,
+				   unsigned int max_segments,
+				   unsigned int reconnect_delay_sec,
+				   unsigned int max_reconnect_attempts)
+{
+	struct ibtrs_clt *clt;
+	int err;
+
+	if (unlikely(!paths_num || paths_num > MAX_PATHS_NUM))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(strlen(sessname) >= sizeof(clt->sessname)))
+		return ERR_PTR(-EINVAL);
+
+	clt = kzalloc(sizeof(*clt), GFP_KERNEL);
+	if (unlikely(!clt))
+		return ERR_PTR(-ENOMEM);
+
+	clt->pcpu_path = alloc_percpu(typeof(*clt->pcpu_path));
+	if (unlikely(!clt->pcpu_path)) {
+		kfree(clt);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	uuid_gen(&clt->paths_uuid);
+	INIT_LIST_HEAD_RCU(&clt->paths_list);
+	clt->paths_num = paths_num;
+	clt->paths_up = MAX_PATHS_NUM;
+	clt->port = port;
+	clt->pdu_sz = pdu_sz;
+	clt->max_segments = max_segments;
+	clt->reconnect_delay_sec = reconnect_delay_sec;
+	clt->max_reconnect_attempts = max_reconnect_attempts;
+	clt->priv = priv;
+	clt->link_ev = link_ev;
+	clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	strlcpy(clt->sessname, sessname, sizeof(clt->sessname));
+	init_waitqueue_head(&clt->tags_wait);
+	mutex_init(&clt->paths_ev_mutex);
+	mutex_init(&clt->paths_mutex);
+
+	clt->dev.class = ibtrs_dev_class;
+	clt->dev.release = ibtrs_clt_dev_release;
+	dev_set_name(&clt->dev, "%s", sessname);
+
+	err = device_register(&clt->dev);
+	if (unlikely(err))
+		goto percpu_free;
+
+	err = ibtrs_clt_create_sysfs_root_folders(clt);
+	if (unlikely(err))
+		goto dev_unregister;
+
+	return clt;
+
+dev_unregister:
+	/* Nobody plays with dev refs, so dev.release() is nop */
+	device_unregister(&clt->dev);
+percpu_free:
+	free_percpu(clt->pcpu_path);
+	kfree(clt);
+
+	return ERR_PTR(err);
+}
+
+static void wait_for_inflight_tags(struct ibtrs_clt *clt)
+{
+	if (clt->tags_map) {
+		size_t sz = clt->queue_depth;
+
+		wait_event(clt->tags_wait,
+			   find_first_bit(clt->tags_map, sz) >= sz);
+	}
+}
+
+static void free_clt(struct ibtrs_clt *clt)
+{
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+	wait_for_inflight_tags(clt);
+	free_tags(clt);
+	free_percpu(clt->pcpu_path);
+	/* Nobody plays with dev refs, so dev.release() is nop */
+	device_unregister(&clt->dev);
+	kfree(clt);
+}
+
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t paths_num,
+				 short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+	struct ibtrs_clt *clt;
+	int err, i;
+
+	clt = alloc_clt(sessname, paths_num, port, pdu_sz, priv, link_ev,
+			max_segments, reconnect_delay_sec,
+			max_reconnect_attempts);
+	if (unlikely(IS_ERR(clt))) {
+		err = PTR_ERR(clt);
+		goto out;
+	}
+	for (i = 0; i < paths_num; i++) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = alloc_sess(clt, &paths[i], nr_cons_per_session,
+				  max_segments);
+		if (unlikely(IS_ERR(sess))) {
+			err = PTR_ERR(sess);
+			ibtrs_err(clt, "alloc_sess(), err: %d\n", err);
+			goto close_all_sess;
+		}
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+
+		err = init_sess(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+
+		err = ibtrs_clt_create_sess_files(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+	}
+	err = alloc_tags(clt);
+	if (unlikely(err)) {
+		ibtrs_err(clt, "alloc_tags(), err: %d\n", err);
+		goto close_all_sess;
+	}
+	err = ibtrs_clt_create_sysfs_root_files(clt);
+	if (unlikely(err))
+		goto close_all_sess;
+
+	/*
+	 * There is a race if someone decides to completely remove just
+	 * newly created path using sysfs entry.  To avoid the race we
+	 * use simple 'opened' flag, see ibtrs_clt_remove_path_from_sysfs().
+	 */
+	clt->opened = true;
+
+	/* Do not let module be unloaded if client is alive */
+	__module_get(THIS_MODULE);
+
+	return clt;
+
+close_all_sess:
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(ibtrs_clt_open);
+
+void ibtrs_clt_close(struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+
+	/* Firstly forbid sysfs access */
+	ibtrs_clt_destroy_sysfs_root_files(clt);
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+
+	/* Now it is safe to iterate over all paths without locks */
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_clt_close);
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state old_state;
+	int err = -EBUSY;
+	bool changed;
+
+	changed = ibtrs_clt_change_state_get_old(sess, IBTRS_CLT_RECONNECTING,
+						 &old_state);
+	if (changed) {
+		sess->reconnect_attempts = 0;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	}
+	if (changed || old_state == IBTRS_CLT_RECONNECTING) {
+		/*
+		 * flush_delayed_work() queues pending work for immediate
+		 * execution, so do the flush if we have queued something
+		 * right now or work is pending.
+		 */
+		flush_delayed_work(&sess->reconnect_dwork);
+		err = ibtrs_clt_sess_is_connected(sess) ? 0 : -ENOTCONN;
+	}
+
+	return err;
+}
+
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_close_conns(sess, true);
+
+	return 0;
+}
+
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	enum ibtrs_clt_state old_state;
+	bool changed;
+
+	/*
+	 * That can happen only when userspace tries to remove path
+	 * very early, when ibtrs_clt_open() is not yet finished.
+	 */
+	if (unlikely(!clt->opened))
+		return -EBUSY;
+
+	/*
+	 * Continue stopping path till state was changed to DEAD or
+	 * state was observed as DEAD:
+	 * 1. State was changed to DEAD - we were fast and nobody
+	 *    invoked ibtrs_clt_reconnect(), which can again start
+	 *    reconnecting.
+	 * 2. State was observed as DEAD - we have someone in parallel
+	 *    removing the path.
+	 */
+	do {
+		ibtrs_clt_close_conns(sess, true);
+	} while (!(changed = ibtrs_clt_change_state_get_old(sess,
+							    IBTRS_CLT_DEAD,
+							    &old_state)) &&
+		   old_state != IBTRS_CLT_DEAD);
+
+	/*
+	 * If state was successfully changed to DEAD, commit suicide.
+	 */
+	if (likely(changed)) {
+		ibtrs_clt_destroy_sess_files(sess, sysfs_self);
+		ibtrs_clt_remove_path_from_arr(sess);
+		free_sess(sess);
+	}
+
+	return 0;
+}
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value)
+{
+	clt->max_reconnect_attempts = (unsigned int)value;
+}
+
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt)
+{
+	return (int)clt->max_reconnect_attempts;
+}
+
+static int ibtrs_post_rdma_write_sg(struct ibtrs_clt_con *con,
+				    struct ibtrs_clt_io_req *req,
+				    struct ibtrs_rbuf *rbuf,
+				    u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_sge *sge = req->sge;
+	enum ib_send_flags flags;
+	struct scatterlist *sg;
+	size_t num_sge;
+	int i;
+
+	for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+		sge[i].addr   = sg_dma_address(sg);
+		sge[i].length = sg_dma_len(sg);
+		sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+	}
+	sge[i].addr   = req->iu->dma_addr;
+	sge[i].length = size;
+	sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	num_sge = 1 + req->sg_cnt;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      size, DMA_TO_DEVICE);
+
+	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, sge, num_sge,
+					    rbuf->rkey, rbuf->addr, imm,
+					    flags, NULL);
+}
+
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_write *msg;
+
+	struct ibtrs_rbuf *rbuf;
+	int ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Write request failed, size too big %zu > %d\n",
+			  tsize, sess->chunk_size);
+		return -EMSGSIZE;
+	}
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(sess->s.dev->ib_dev, req->sglist,
+				      req->sg_cnt, req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Write request failed, map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put ibtrs msg after sg and user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_WRITE);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	/* ibtrs message on server side will be after user data and message */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+	req->sg_size = tsize;
+	rbuf = &sess->rbufs[buf_id];
+
+	/*
+	 * Update stats now, after request is successfully sent it is not
+	 * safe anymore to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, WRITE);
+
+	ret = ibtrs_post_rdma_write_sg(req->con, req, rbuf,
+				       req->usr_len + sizeof(*msg),
+				       imm);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Write request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+static int ibtrs_map_sg_fr(struct ibtrs_clt_io_req *req, size_t count)
+{
+	int nr;
+
+	/* Align the MR to a 4K page size to match the block virt boundary */
+	nr = ib_map_mr_sg(req->mr, req->sglist, count, NULL, SZ_4K);
+	if (unlikely(nr < req->sg_cnt)) {
+		if (nr < 0)
+			return nr;
+		return -EINVAL;
+	}
+	ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
+
+	return nr;
+}
+
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_read *msg;
+	struct ibtrs_ib_dev *dev;
+	struct scatterlist *sg;
+
+	struct ib_reg_wr rwr;
+	struct ib_send_wr *wr = NULL;
+
+	int i, ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	dev = sess->s.dev;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Read request failed, message size is"
+			  " %zu, bigger than CHUNK_SIZE %d\n", tsize,
+			  sess->chunk_size);
+		return -EMSGSIZE;
+	}
+
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+				      req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Read request failed, "
+				  "dma map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put our message into req->buf after user message*/
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_READ);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	if (count > noreg_cnt) {
+		ret = ibtrs_map_sg_fr(req, count);
+		if (ret < 0) {
+			ibtrs_err_rl(sess,
+				     "Read request failed, failed to map "
+				     " fast reg. data, err: %d\n", ret);
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+					req->dir);
+			return ret;
+		}
+		memset(&rwr, 0, sizeof(rwr));
+		rwr.wr.next = NULL;
+		rwr.wr.opcode = IB_WR_REG_MR;
+		rwr.wr.wr_cqe = &fast_reg_cqe;
+		rwr.wr.num_sge = 0;
+		rwr.mr = req->mr;
+		rwr.key = req->mr->rkey;
+		rwr.access = (IB_ACCESS_LOCAL_WRITE |
+			      IB_ACCESS_REMOTE_WRITE);
+		wr = &rwr.wr;
+
+		msg->sg_cnt = cpu_to_le16(1);
+		msg->flags = cpu_to_le16(ibtrs_invalidate_flag());
+
+		msg->desc[0].addr = cpu_to_le64(req->mr->iova);
+		msg->desc[0].key = cpu_to_le32(req->mr->rkey);
+		msg->desc[0].len = cpu_to_le32(req->mr->length);
+
+		/* Further invalidation is required */
+		req->need_inv = !!ibtrs_invalidate_flag();
+
+	} else {
+		msg->sg_cnt = cpu_to_le16(count);
+		msg->flags = 0;
+
+		for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+			msg->desc[i].addr = cpu_to_le64(sg_dma_address(sg));
+			msg->desc[i].key =
+				cpu_to_le32(dev->ib_pd->unsafe_global_rkey);
+			msg->desc[i].len = cpu_to_le32(sg_dma_len(sg));
+		}
+	}
+	/*
+	 * ibtrs message will be after the space reserved for disk data and
+	 * user message
+	 */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+
+	req->sg_size  = sizeof(*msg);
+	req->sg_size += le16_to_cpu(msg->sg_cnt) * sizeof(struct ibtrs_sg_desc);
+	req->sg_size += req->usr_len;
+
+	/*
+	 * Update stats now, after request is successfully sent it is not
+	 * safe anymore to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, READ);
+
+	ret = ibtrs_post_send_rdma(req->con, req, &sess->rbufs[buf_id],
+				   req->data_len, imm, wr);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Read request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		req->need_inv = false;
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *clt,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t data_len, struct scatterlist *sg,
+		      unsigned int sg_cnt)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt_sess *sess;
+
+	enum dma_data_direction dma_dir;
+	int err = -ECONNABORTED, i;
+	size_t usr_len, hdr_len;
+	struct path_it it;
+
+	/* Get kvec length */
+	for (i = 0, usr_len = 0; i < nr; i++)
+		usr_len += vec[i].iov_len;
+
+	if (dir == READ) {
+		hdr_len = sizeof(struct ibtrs_msg_rdma_read) +
+			  sg_cnt * sizeof(struct ibtrs_sg_desc);
+		dma_dir = DMA_FROM_DEVICE;
+	} else {
+		hdr_len = sizeof(struct ibtrs_msg_rdma_write);
+		dma_dir = DMA_TO_DEVICE;
+	}
+
+	do_each_path(sess, clt, &it) {
+		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+
+		if (unlikely(usr_len + hdr_len > sess->max_hdr_size)) {
+			ibtrs_wrn_rl(sess, "%s request failed, user message "
+				     "size is %zu and header length %zu, but "
+				     "max size is %u\n",
+				     dir == READ ? "Read" : "Write",
+				     usr_len, hdr_len, sess->max_hdr_size);
+			err = -EMSGSIZE;
+			break;
+		}
+		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
+					sg, sg_cnt, data_len, dma_dir);
+		if (dir == READ)
+			err = ibtrs_clt_read_req(req);
+		else
+			err = ibtrs_clt_write_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+EXPORT_SYMBOL(ibtrs_clt_request);
+
+int ibtrs_clt_query(struct ibtrs_clt *clt, struct ibtrs_attrs *attr)
+{
+	if (unlikely(!ibtrs_clt_is_connected(clt)))
+		return -ECOMM;
+
+	attr->queue_depth      = clt->queue_depth;
+	attr->max_io_size      = clt->max_io_size;
+	attr->sess_kobj	       = &clt->dev.kobj;
+	strlcpy(attr->sessname, clt->sessname, sizeof(attr->sessname));
+
+	return 0;
+}
+EXPORT_SYMBOL(ibtrs_clt_query);
+
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt_sess *sess;
+	int err;
+
+	sess = alloc_sess(clt, addr, nr_cons_per_session, clt->max_segments);
+	if (unlikely(IS_ERR(sess)))
+		return PTR_ERR(sess);
+
+	/*
+	 * It is totally safe to add path in CONNECTING state: coming
+	 * IO will never grab it.  Also it is very important to add
+	 * path before init, since init fires LINK_CONNECTED event.
+	 */
+	ibtrs_clt_add_path_to_arr(sess, addr);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	err = ibtrs_clt_create_sess_files(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	return 0;
+
+close_sess:
+	ibtrs_clt_remove_path_from_arr(sess);
+	ibtrs_clt_close_conns(sess, true);
+	free_sess(sess);
+
+	return err;
+}
+
+static int check_module_params(void)
+{
+	if (nr_cons_per_session == 0)
+		nr_cons_per_session = min_t(unsigned int, nr_cpu_ids, U16_MAX);
+
+	return 0;
+}
+
+static int ibtrs_clt_ib_dev_init(struct ibtrs_ib_dev *dev)
+{
+	if (!(dev->ib_dev->attrs.device_cap_flags &
+	      IB_DEVICE_MEM_MGT_EXTENSIONS)) {
+		pr_err("Memory registrations not supported.\n");
+		return -ENOTSUPP;
+	}
+
+	return 0;
+}
+
+static const struct ibtrs_ib_dev_pool_ops dev_pool_ops = {
+	.init = ibtrs_clt_ib_dev_init
+};
+
+static int __init ibtrs_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s, proto %s: "
+		"(retry_cnt: %d, noreg_cnt: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING, IBTRS_PROTO_VER_STRING,
+		retry_cnt, noreg_cnt);
+
+	ibtrs_ib_dev_pool_init(noreg_cnt ? IB_PD_UNSAFE_GLOBAL_RKEY : 0,
+			       &dev_pool);
+
+	err = check_module_params();
+	if (unlikely(err)) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	ibtrs_dev_class = class_create(THIS_MODULE, "ibtrs-client");
+	if (unlikely(IS_ERR(ibtrs_dev_class))) {
+		pr_err("Failed to create ibtrs-client dev class\n");
+		return PTR_ERR(ibtrs_dev_class);
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_client_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!ibtrs_wq)) {
+		pr_err("Failed to load module, alloc ibtrs_client_wq failed\n");
+		class_destroy(ibtrs_dev_class);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void __exit ibtrs_client_exit(void)
+{
+	destroy_workqueue(ibtrs_wq);
+	class_destroy(ibtrs_dev_class);
+	ibtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(ibtrs_client_init);
+module_exit(ibtrs_client_exit);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 11/25] ibtrs: server: statistics functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This introduces set of functions used on server side to account
statistics of RDMA data sent/received.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c    | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
new file mode 100644
index 000000000000..47f8d6d2d88d
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-srv.h"
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s,
+				 size_t size, int d)
+{
+	atomic64_inc(&s->rdma_stats.dir[d].cnt);
+	atomic64_add(size, &s->rdma_stats.dir[d].size_total);
+}
+
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s)
+{
+	atomic64_inc(&s->wc_comp.calls);
+	atomic64_inc(&s->wc_comp.total_wc_cnt);
+}
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+
+		memset(r, 0, sizeof(*r));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(stats, typeof(*sess), stats);
+
+	return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
+			 (s64)atomic64_read(&r->dir[READ].cnt),
+			 (s64)atomic64_read(&r->dir[READ].size_total),
+			 (s64)atomic64_read(&r->dir[WRITE].cnt),
+			 (s64)atomic64_read(&r->dir[WRITE].size_total),
+			 atomic_read(&sess->ids_inflight));
+}
+
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable)
+{
+	if (enable) {
+		memset(&stats->wc_comp, 0, sizeof(stats->wc_comp));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats,
+					 char *buf, size_t len)
+{
+	return snprintf(buf, len, "%lld %lld\n",
+			(s64)atomic64_read(&stats->wc_comp.total_wc_cnt),
+			(s64)atomic64_read(&stats->wc_comp.calls));
+}
+
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len)
+{
+	return scnprintf(page, PAGE_SIZE, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		ibtrs_srv_reset_wc_completion_stats(stats, enable);
+		ibtrs_srv_reset_rdma_stats(stats, enable);
+		return 0;
+	}
+
+	return -EINVAL;
+}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 09/25] ibtrs: server: private header with server structs and functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This header describes main structs and functions used by ibtrs-server
module, mainly for accepting IBTRS sessions, creating/destroying
sysfs entries, accounting statistics on server side.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h | 170 +++++++++++++++++++++++
 1 file changed, 170 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
new file mode 100644
index 000000000000..6d3b77541d77
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
@@ -0,0 +1,170 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#ifndef IBTRS_SRV_H
+#define IBTRS_SRV_H
+
+#include <linux/device.h>
+#include <linux/refcount.h>
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_srv_state - Server states.
+ */
+enum ibtrs_srv_state {
+	IBTRS_SRV_CONNECTING,
+	IBTRS_SRV_CONNECTED,
+	IBTRS_SRV_CLOSING,
+	IBTRS_SRV_CLOSED,
+};
+
+static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
+{
+	switch (state) {
+	case IBTRS_SRV_CONNECTING:
+		return "IBTRS_SRV_CONNECTING";
+	case IBTRS_SRV_CONNECTED:
+		return "IBTRS_SRV_CONNECTED";
+	case IBTRS_SRV_CLOSING:
+		return "IBTRS_SRV_CLOSING";
+	case IBTRS_SRV_CLOSED:
+		return "IBTRS_SRV_CLOSED";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+struct ibtrs_stats_wc_comp {
+	atomic64_t	calls;
+	atomic64_t	total_wc_cnt;
+};
+
+struct ibtrs_srv_stats_rdma_stats {
+	struct {
+		atomic64_t	cnt;
+		atomic64_t	size_total;
+	} dir[2];
+};
+
+struct ibtrs_srv_stats {
+	struct ibtrs_srv_stats_rdma_stats	rdma_stats;
+	atomic_t				apm_cnt;
+	struct ibtrs_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_srv_con {
+	struct ibtrs_con	c;
+	atomic_t		wr_cnt;
+};
+
+struct ibtrs_srv_op {
+	struct ibtrs_srv_con		*con;
+	u32				msg_id;
+	u8				dir;
+	struct ibtrs_msg_rdma_read	*rd_msg;
+	struct ib_rdma_wr		*tx_wr;
+	struct ib_sge			*tx_sg;
+};
+
+struct ibtrs_srv_mr {
+	struct ib_mr	*mr;
+	struct sg_table	sgt;
+};
+
+struct ibtrs_srv_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_srv	*srv;
+	struct work_struct	close_work;
+	enum ibtrs_srv_state	state;
+	spinlock_t		state_lock;
+	int			cur_cq_vector;
+	struct ibtrs_srv_op	**ops_ids;
+	atomic_t		ids_inflight;
+	wait_queue_head_t	ids_waitq;
+	struct ibtrs_srv_mr	*mrs;
+	unsigned int		mrs_num;
+	dma_addr_t		*dma_addr;
+	bool			established;
+	unsigned int		mem_bits;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_srv_stats	stats;
+};
+
+struct ibtrs_srv {
+	struct list_head	paths_list;
+	int			paths_up;
+	struct mutex		paths_ev_mutex;
+	size_t			paths_num;
+	struct mutex		paths_mutex;
+	uuid_t			paths_uuid;
+	refcount_t		refcount;
+	struct ibtrs_srv_ctx	*ctx;
+	struct list_head	ctx_list;
+	void			*priv;
+	size_t			queue_depth;
+	struct page		**chunks;
+	struct device		dev;
+	u32			dev_ref;
+	struct kobject		kobj_paths;
+};
+
+struct ibtrs_srv_ctx {
+	rdma_ev_fn *rdma_ev;
+	link_ev_fn *link_ev;
+	struct rdma_cm_id *cm_id_ip;
+	struct rdma_cm_id *cm_id_ib;
+	struct mutex srv_mutex;
+	struct list_head srv_list;
+};
+
+extern struct class *ibtrs_dev_class;
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_srv_sess *, s.sessname))
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess);
+
+/* ibtrs-srv-stats.c */
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s, size_t size, int d);
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s);
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len);
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable);
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-srv-sysfs.c */
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess);
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess);
+
+#endif /* IBTRS_SRV_H */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 08/25] ibtrs: client: sysfs interface functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This is the sysfs interface to IBTRS sessions on client side:

  /sys/devices/virtual/ibtrs-client/<SESS-NAME>/
    *** IBTRS session created by ibtrs_clt_open() API call
    |
    |- max_reconnect_attempts
    |  *** number of reconnect attempts for session
    |
    |- add_path
    |  *** adds another connection path into IBTRS session
    |
    |- paths/<SRC@DST>/
       *** established paths to server in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- reconnect
       |  *** reconnect path
       |
       |- remove_path
       |  *** remove current path
       |
       |- state
       |  *** retrieve current path state
       |
       |- hca_port
       |  *** HCA port number
       |
       |- hca_name
       |  *** HCA name
       |
       |- stats/
          *** current path statistics
          |
	  |- cpu_migration
	  |- rdma
	  |- rdma_lat
	  |- reconnects
	  |- reset_all
	  |- sg_entries
	  |- wc_completions

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c    | 514 ++++++++++++++++++
 1 file changed, 514 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
new file mode 100644
index 000000000000..1f7b6c28e6b4
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
@@ -0,0 +1,514 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define MIN_MAX_RECONN_ATT -1
+#define MAX_MAX_RECONN_ATT 9999
+
+static struct kobj_type ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t max_reconnect_attempts_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	return sprintf(page, "%d\n", ibtrs_clt_get_max_reconnect_attempts(clt));
+}
+
+static ssize_t max_reconnect_attempts_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf,
+					    size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (unlikely(ret)) {
+		ibtrs_err(clt, "%s: failed to convert string '%s' to int\n",
+			  attr->attr.name, buf);
+		return ret;
+	}
+	if (unlikely(value > MAX_MAX_RECONN_ATT ||
+		     value < MIN_MAX_RECONN_ATT)) {
+		ibtrs_err(clt, "%s: invalid range"
+			  " (provided: '%s', accepted: min: %d, max: %d)\n",
+			  attr->attr.name, buf, MIN_MAX_RECONN_ATT,
+			  MAX_MAX_RECONN_ATT);
+		return -EINVAL;
+	}
+	ibtrs_clt_set_max_reconnect_attempts(clt, value);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(max_reconnect_attempts);
+
+static ssize_t mpath_policy_show(struct device *dev,
+				 struct device_attribute *attr,
+				 char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	switch (clt->mp_policy) {
+	case MP_POLICY_RR:
+		return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
+	case MP_POLICY_MIN_INFLIGHT:
+		return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
+	default:
+		return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
+	}
+}
+
+static ssize_t mpath_policy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf,
+				  size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (!ret && (value == MP_POLICY_RR ||
+		     value == MP_POLICY_MIN_INFLIGHT)) {
+		clt->mp_policy = value;
+		return count;
+	}
+
+	if (!strncasecmp(buf, "round-robin", 11) ||
+	    !strncasecmp(buf, "rr", 2))
+		clt->mp_policy = MP_POLICY_RR;
+	else if (!strncasecmp(buf, "min-inflight", 12) ||
+		 !strncasecmp(buf, "mi", 2))
+		clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(mpath_policy);
+
+static ssize_t add_path_show(struct device *dev,
+			     struct device_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo"
+			 " [<source addr>,]<destination addr> > %s\n\n"
+			"*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static ssize_t add_path_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct sockaddr_storage srcaddr, dstaddr;
+	struct ibtrs_addr addr = {
+		.src = &srcaddr,
+		.dst = &dstaddr
+	};
+	struct ibtrs_clt *clt;
+	const char *nl;
+	size_t len;
+	int err;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	nl = strchr(buf, '\n');
+	if (nl)
+		len = nl - buf;
+	else
+		len = count;
+	err = ibtrs_addr_to_sockaddr(buf, len, clt->port, &addr);
+	if (unlikely(err))
+		return -EINVAL;
+
+	err = ibtrs_clt_create_path_from_sysfs(clt, &addr);
+	if (unlikely(err))
+		return err;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(add_path);
+
+static ssize_t ibtrs_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (ibtrs_clt_sess_is_connected(sess))
+		return sprintf(page, "connected\n");
+
+	return sprintf(page, "disconnected\n");
+}
+
+static struct kobj_attribute ibtrs_clt_state_attr =
+	__ATTR(state, 0444, ibtrs_clt_state_show, NULL);
+
+static ssize_t ibtrs_clt_reconnect_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_reconnect_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_reconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_reconnect_attr =
+	__ATTR(reconnect, 0644, ibtrs_clt_reconnect_show,
+	       ibtrs_clt_reconnect_store);
+
+static ssize_t ibtrs_clt_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_disconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_disconnect_attr =
+	__ATTR(disconnect, 0644, ibtrs_clt_disconnect_show,
+	       ibtrs_clt_disconnect_store);
+
+static ssize_t ibtrs_clt_remove_path_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_remove_path_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_remove_path_from_sysfs(sess, &attr->attr);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_remove_path_attr =
+	__ATTR(remove_path, 0644, ibtrs_clt_remove_path_show,
+	       ibtrs_clt_remove_path_store);
+
+STAT_ATTR(struct ibtrs_clt_sess, cpu_migration,
+	  ibtrs_clt_stats_migration_cnt_to_str,
+	  ibtrs_clt_reset_cpu_migr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, sg_entries,
+	  ibtrs_clt_stats_sg_list_distr_to_str,
+	  ibtrs_clt_reset_sg_list_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reconnects,
+	  ibtrs_clt_stats_reconnects_to_str,
+	  ibtrs_clt_reset_reconnects_stat);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma_lat,
+	  ibtrs_clt_stats_rdma_lat_distr_to_str,
+	  ibtrs_clt_reset_rdma_lat_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, wc_completion,
+	  ibtrs_clt_stats_wc_completion_to_str,
+	  ibtrs_clt_reset_wc_comp_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma,
+	  ibtrs_clt_stats_rdma_to_str,
+	  ibtrs_clt_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reset_all,
+	  ibtrs_clt_reset_all_help,
+	  ibtrs_clt_reset_all_stats);
+
+static struct attribute *ibtrs_clt_stats_attrs[] = {
+	&sg_entries_attr.attr,
+	&cpu_migration_attr.attr,
+	&reconnects_attr.attr,
+	&rdma_lat_attr.attr,
+	&wc_completion_attr.attr,
+	&rdma_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_stats_attr_group = {
+	.attrs = ibtrs_clt_stats_attrs,
+};
+
+static int ibtrs_clt_create_stats_files(struct kobject *kobj,
+					struct kobject *kobj_stats)
+{
+	int ret;
+
+	ret = kobject_init_and_add(kobj_stats, &ktype, kobj, "stats");
+	if (ret) {
+		pr_err("Failed to init and add stats kobject, err: %d\n",
+		       ret);
+		return ret;
+	}
+
+	ret = sysfs_create_group(kobj_stats, &ibtrs_clt_stats_attr_group);
+	if (ret) {
+		pr_err("failed to create stats sysfs group, err: %d\n",
+		       ret);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(kobj_stats);
+	kobject_put(kobj_stats);
+
+	return ret;
+}
+
+static ssize_t ibtrs_clt_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%u\n", sess->hca_port);
+}
+
+static struct kobj_attribute ibtrs_clt_hca_port_attr =
+	__ATTR(hca_port, 0444, ibtrs_clt_hca_port_show, NULL);
+
+static ssize_t ibtrs_clt_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess->hca_name);
+}
+
+static struct kobj_attribute ibtrs_clt_hca_name_attr =
+	__ATTR(hca_name, 0444, ibtrs_clt_hca_name_show, NULL);
+
+static ssize_t ibtrs_clt_src_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute ibtrs_clt_src_addr_attr =
+	__ATTR(src_addr, 0444, ibtrs_clt_src_addr_show, NULL);
+
+static ssize_t ibtrs_clt_dst_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute ibtrs_clt_dst_addr_attr =
+	__ATTR(dst_addr, 0444, ibtrs_clt_dst_addr_show, NULL);
+
+static struct attribute *ibtrs_clt_sess_attrs[] = {
+	&ibtrs_clt_hca_name_attr.attr,
+	&ibtrs_clt_hca_port_attr.attr,
+	&ibtrs_clt_src_addr_attr.attr,
+	&ibtrs_clt_dst_addr_attr.attr,
+	&ibtrs_clt_state_attr.attr,
+	&ibtrs_clt_reconnect_attr.attr,
+	&ibtrs_clt_disconnect_attr.attr,
+	&ibtrs_clt_remove_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_sess_attr_group = {
+	.attrs = ibtrs_clt_sess_attrs,
+};
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	char str[NAME_MAX];
+	int err, cnt;
+
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      str, sizeof(str));
+	cnt += scnprintf(str + cnt, sizeof(str) - cnt, "@");
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			str + cnt, sizeof(str) - cnt);
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &clt->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add: %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_clt_sess_attr_group);
+	if (unlikely(err)) {
+		pr_err("sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_clt_create_stats_files(&sess->kobj, &sess->kobj_stats);
+	if (unlikely(err))
+		goto put_kobj;
+
+	return 0;
+
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+
+	return err;
+}
+
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		if (sysfs_self)
+			/* To avoid deadlock firstly commit suicide */
+			sysfs_remove_file_self(&sess->kobj, sysfs_self);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+	}
+}
+
+static struct attribute *ibtrs_clt_attrs[] = {
+	&dev_attr_max_reconnect_attempts.attr,
+	&dev_attr_mpath_policy.attr,
+	&dev_attr_add_path.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_attr_group = {
+	.attrs = ibtrs_clt_attrs,
+};
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	return kobject_init_and_add(&clt->kobj_paths, &ktype,
+				    &clt->dev.kobj, "paths");
+}
+
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	return sysfs_create_group(&clt->dev.kobj, &ibtrs_clt_attr_group);
+}
+
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	if (clt->kobj_paths.state_in_sysfs) {
+		kobject_del(&clt->kobj_paths);
+		kobject_put(&clt->kobj_paths);
+	}
+}
+
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	sysfs_remove_group(&clt->dev.kobj, &ibtrs_clt_attr_group);
+}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 07/25] ibtrs: client: statistics functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This introduces set of functions used on client side to account
statistics of RDMA data sent/received, amount of IOs inflight,
latency, cpu migrations, etc.  Almost all statistics is collected
using percpu variables.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c    | 447 ++++++++++++++++++
 1 file changed, 447 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
new file mode 100644
index 000000000000..fbeb1549aaf4
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-clt.h"
+
+static inline int ibtrs_clt_ms_to_id(unsigned long ms)
+{
+	int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
+
+	return clamp(id, 0, LOG_LAT_SZ - 1);
+}
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
+			       unsigned long ms)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int id;
+
+	id = ibtrs_clt_ms_to_id(ms);
+	s = this_cpu_ptr(stats->pcpu_stats);
+	if (read) {
+		s->rdma_lat_distr[id].read++;
+		if (s->rdma_lat_max.read < ms)
+			s->rdma_lat_max.read = ms;
+	} else {
+		s->rdma_lat_distr[id].write++;
+		if (s->rdma_lat_max.write < ms)
+			s->rdma_lat_max.write = ms;
+	}
+}
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *stats)
+{
+	atomic_dec(&stats->inflight);
+}
+
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	cpu = raw_smp_processor_id();
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->wc_comp.cnt++;
+	s->wc_comp.total_cnt++;
+	if (unlikely(con->cpu != cpu)) {
+		s->cpu_migr.to++;
+
+		/* Careful here, override s pointer */
+		s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
+		atomic_inc(&s->cpu_migr.from);
+	}
+}
+
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.failover_cnt++;
+}
+
+static inline u32 ibtrs_clt_stats_get_avg_wc_cnt(struct ibtrs_clt_stats *stats)
+{
+	u32 cnt = 0;
+	u64 sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_stats_pcpu *s;
+
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		sum += s->wc_comp.total_cnt;
+		cnt += s->wc_comp.cnt;
+	}
+
+	return cnt ? sum / cnt : 0;
+}
+
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%u\n",
+			 ibtrs_clt_stats_get_avg_wc_cnt(stats));
+}
+
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma_lat res[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat max;
+	struct ibtrs_clt_stats_pcpu *s;
+
+	ssize_t cnt = 0;
+	int i, cpu;
+
+	max.write = 0;
+	max.read = 0;
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+		if (max.write < s->rdma_lat_max.write)
+			max.write = s->rdma_lat_max.write;
+		if (max.read < s->rdma_lat_max.read)
+			max.read = s->rdma_lat_max.read;
+	}
+	for (i = 0; i < ARRAY_SIZE(res); i++) {
+		res[i].write = 0;
+		res[i].read = 0;
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+			res[i].write += s->rdma_lat_distr[i].write;
+			res[i].read += s->rdma_lat_distr[i].read;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(res) - 1; i++)
+		cnt += scnprintf(page + cnt, len - cnt,
+				 "< %6d ms: %llu %llu\n",
+				 1 << (i + MIN_LOG_LAT), res[i].read,
+				 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, ">= %5d ms: %llu %llu\n",
+			 1 << (i - 1 + MIN_LOG_LAT), res[i].read,
+			 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, " maximum ms: %llu %llu\n",
+			 max.read, max.write);
+
+	return cnt;
+}
+
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	size_t used;
+	int cpu;
+
+	used = scnprintf(buf, len, "    ");
+	for_each_possible_cpu(cpu)
+		used += scnprintf(buf + used, len - used, " CPU%u", cpu);
+
+	used += scnprintf(buf + used, len - used, "\nfrom:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  atomic_read(&s->cpu_migr.from));
+	}
+
+	used += scnprintf(buf + used, len - used, "\nto  :");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  s->cpu_migr.to);
+	}
+	used += scnprintf(buf + used, len - used, "\n");
+
+	return used;
+}
+
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len)
+{
+	return scnprintf(buf, len, "%d %d\n",
+			 stats->reconnects.successful_cnt,
+			 stats->reconnects.fail_cnt);
+}
+
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma sum;
+	struct ibtrs_clt_stats_rdma *r;
+	int cpu;
+
+	memset(&sum, 0, sizeof(sum));
+
+	for_each_possible_cpu(cpu) {
+		r = &per_cpu_ptr(stats->pcpu_stats, cpu)->rdma;
+
+		sum.dir[READ].cnt	  += r->dir[READ].cnt;
+		sum.dir[READ].size_total  += r->dir[READ].size_total;
+		sum.dir[WRITE].cnt	  += r->dir[WRITE].cnt;
+		sum.dir[WRITE].size_total += r->dir[WRITE].size_total;
+		sum.failover_cnt	  += r->failover_cnt;
+	}
+
+	return scnprintf(page, len, "%llu %llu %llu %llu %u %llu\n",
+			 sum.dir[READ].cnt, sum.dir[READ].size_total,
+			 sum.dir[WRITE].cnt, sum.dir[WRITE].size_total,
+			 atomic_read(&stats->inflight), sum.failover_cnt);
+}
+
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	int i, cpu, cnt;
+
+	cnt = scnprintf(buf, len, "n\\cpu:");
+	for_each_possible_cpu(cpu)
+		cnt += scnprintf(buf + cnt, len - cnt, "%5d", cpu);
+
+	for (i = 0; i < SG_DISTR_SZ; i++) {
+		if (i <= MAX_LIN_SG)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n= %3d:", i);
+		else if (i < SG_DISTR_SZ - 1)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n< %3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG));
+		else
+			cnt += scnprintf(buf + cnt, len - cnt, "\n>=%3d:",
+					 1 << (i + MIN_LOG_SG -
+					       MAX_LIN_SG - 1));
+
+		for_each_possible_cpu(cpu) {
+			unsigned int p, p_i, p_f;
+			u64 total, distr;
+
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			total = s->sg_list_total;
+			distr = s->sg_list_distr[i];
+
+			p = total ? distr * 1000 / total : 0;
+			p_i = p / 10;
+			p_f = p % 10;
+
+			if (distr)
+				cnt += scnprintf(buf + cnt, len - cnt,
+						 " %2u.%01u", p_i, p_f);
+			else
+				cnt += scnprintf(buf + cnt, len - cnt, "    0");
+		}
+	}
+
+	cnt += scnprintf(buf + cnt, len - cnt, "\ntotal:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		cnt += scnprintf(buf + cnt, len - cnt, " %llu",
+				 s->sg_list_total);
+	}
+	cnt += scnprintf(buf + cnt, len - cnt, "\n");
+
+	return cnt;
+}
+
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *s,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->rdma, 0, sizeof(s->rdma));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (enable) {
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			memset(&s->rdma_lat_max, 0, sizeof(s->rdma_lat_max));
+			memset(&s->rdma_lat_distr, 0,
+			       sizeof(s->rdma_lat_distr));
+		}
+	}
+	stats->enable_rdma_lat = enable;
+
+	return 0;
+}
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->sg_list_total, 0, sizeof(s->sg_list_total));
+		memset(&s->sg_list_distr, 0, sizeof(s->sg_list_distr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->cpu_migr, 0, sizeof(s->cpu_migr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable)
+{
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	memset(&stats->reconnects, 0, sizeof(stats->reconnects));
+
+	return 0;
+}
+
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->wc_comp, 0, sizeof(s->wc_comp));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *s, bool enable)
+{
+	if (enable) {
+		ibtrs_clt_reset_rdma_stats(s, enable);
+		ibtrs_clt_reset_rdma_lat_distr_stats(s, enable);
+		ibtrs_clt_reset_sg_list_distr_stats(s, enable);
+		ibtrs_clt_reset_cpu_migr_stats(s, enable);
+		ibtrs_clt_reset_reconnects_stat(s, enable);
+		ibtrs_clt_reset_wc_comp_stats(s, enable);
+		atomic_set(&s->inflight, 0);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static inline void ibtrs_clt_record_sg_distr(u64 stat[SG_DISTR_SZ], u64 *total,
+					     unsigned int cnt)
+{
+	int i;
+
+	i = cnt > MAX_LIN_SG ? ilog2(cnt) + MAX_LIN_SG - MIN_LOG_SG + 1 : cnt;
+	i = i < SG_DISTR_SZ ? i : SG_DISTR_SZ - 1;
+
+	stat[i]++;
+	(*total)++;
+}
+
+static inline void ibtrs_clt_update_rdma_stats(struct ibtrs_clt_stats *stats,
+					       size_t size, int d)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.dir[d].cnt++;
+	s->rdma.dir[d].size_total += size;
+}
+
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	unsigned int len;
+
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	ibtrs_clt_record_sg_distr(s->sg_list_distr, &s->sg_list_total,
+				  req->sg_cnt);
+	len = req->usr_len + req->data_len;
+	ibtrs_clt_update_rdma_stats(stats, len, dir);
+	atomic_inc(&stats->inflight);
+}
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats)
+{
+	stats->enable_rdma_lat = false;
+	stats->pcpu_stats = alloc_percpu(typeof(*stats->pcpu_stats));
+	if (unlikely(!stats->pcpu_stats))
+		return -ENOMEM;
+
+	/*
+	 * successful_cnt will be set to 0 after session
+	 * is established for the first time
+	 */
+	stats->reconnects.successful_cnt = -1;
+
+	return 0;
+}
+
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats)
+{
+	free_percpu(stats->pcpu_stats);
+}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, linux-kernel
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

Function is going to be used in transport over RDMA module
in subsequent patches.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 fs/sysfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 130fc6fbcc03..1ff4672d7746 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -492,6 +492,7 @@ bool sysfs_remove_file_self(struct kobject *kobj, const struct attribute *attr)
 	kernfs_put(kn);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sysfs_remove_file_self);
 
 void sysfs_remove_files(struct kobject *kobj, const struct attribute * const *ptr)
 {
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

These are common private headers with IBTRS protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h |  84 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 463 +++++++++++++++++++++++
 2 files changed, 547 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-log.h b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
new file mode 100644
index 000000000000..fec816c935bc
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#ifndef IBTRS_LOG_H
+#define IBTRS_LOG_H
+
+#define P1 )
+#define P2 ))
+#define P3 )))
+#define P4 ))))
+#define P(N) P ## N
+
+#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
+#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
+
+#define LIST(...)						\
+	__VA_ARGS__,						\
+	({ unknown_type(); NULL; })				\
+	CAT(P, COUNT_ARGS(__VA_ARGS__))				\
+
+#define EMPTY()
+#define DEFER(id) id EMPTY()
+
+#define _CASE(obj, type, member)				\
+	__builtin_choose_expr(					\
+	__builtin_types_compatible_p(				\
+		typeof(obj), type),				\
+		((type)obj)->member
+#define CASE(o, t, m) DEFER(_CASE)(o, t, m)
+
+/*
+ * Below we define retrieving of sessname from common IBTRS types.
+ * Client or server related types have to be defined by special
+ * TYPES_TO_SESSNAME macro.
+ */
+
+void unknown_type(void);
+
+#ifndef TYPES_TO_SESSNAME
+#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
+#endif
+
+#define ibtrs_prefix(obj)					\
+	_CASE(obj, struct ibtrs_con *,  sess->sessname),	\
+	_CASE(obj, struct ibtrs_sess *, sessname),		\
+	TYPES_TO_SESSNAME(obj)					\
+	))
+
+#define ibtrs_log(fn, obj, fmt, ...)				\
+	fn("<%s>: " fmt, ibtrs_prefix(obj), ##__VA_ARGS__)
+
+#define ibtrs_err(obj, fmt, ...)	\
+	ibtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_err_rl(obj, fmt, ...)	\
+	ibtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn(obj, fmt, ...)	\
+	ibtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn_rl(obj, fmt, ...) \
+	ibtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info(obj, fmt, ...) \
+	ibtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info_rl(obj, fmt, ...) \
+	ibtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* IBTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
new file mode 100644
index 000000000000..5a180e5e19bc
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
@@ -0,0 +1,463 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#ifndef IBTRS_PRI_H
+#define IBTRS_PRI_H
+
+#include <linux/uuid.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib.h>
+
+#include "ibtrs.h"
+
+#define IBTRS_PROTO_VER_MAJOR 2
+#define IBTRS_PROTO_VER_MINOR 0
+
+#define IBTRS_PROTO_VER_STRING __stringify(IBTRS_PROTO_VER_MAJOR) "." \
+			       __stringify(IBTRS_PROTO_VER_MINOR)
+
+#ifndef IBTRS_VER_STRING
+#define IBTRS_VER_STRING __stringify(IBTRS_PROTO_VER_MAJOR) "." \
+			 __stringify(IBTRS_PROTO_VER_MINOR)
+#endif
+
+enum ibtrs_imm_const {
+	MAX_IMM_TYPE_BITS = 4,
+	MAX_IMM_TYPE_MASK = ((1 << MAX_IMM_TYPE_BITS) - 1),
+	MAX_IMM_PAYL_BITS = 28,
+	MAX_IMM_PAYL_MASK = ((1 << MAX_IMM_PAYL_BITS) - 1),
+};
+
+enum ibtrs_imm_type {
+	IBTRS_IO_REQ_IMM       = 0, /* client to server */
+	IBTRS_IO_RSP_IMM       = 1, /* server to client */
+	IBTRS_IO_RSP_W_INV_IMM = 2, /* server to client */
+
+	IBTRS_HB_MSG_IMM = 8,
+	IBTRS_HB_ACK_IMM = 9,
+
+	IBTRS_LAST_IMM,
+};
+
+enum {
+	SERVICE_CON_QUEUE_DEPTH = 512,
+
+	MIN_RTR_CNT = 1,
+	MAX_RTR_CNT = 7,
+
+	MAX_PATHS_NUM = 128,
+
+	/*
+	 * With the current size of the tag allocated on the client, 4K
+	 * is the maximum number of tags we can allocate.  This number is
+	 * also used on the client to allocate the IU for the user connection
+	 * to receive the RDMA addresses from the server.
+	 */
+	MAX_SESS_QUEUE_DEPTH = 4096,
+
+	IBTRS_HB_INTERVAL_MS = 5000,
+	IBTRS_HB_MISSED_MAX = 5,
+
+	IBTRS_MAGIC = 0x1BBD,
+	IBTRS_PROTO_VER = (IBTRS_PROTO_VER_MAJOR << 8) | IBTRS_PROTO_VER_MINOR,
+};
+
+struct ibtrs_ib_dev;
+
+struct ibtrs_ib_dev_pool_ops {
+	struct ibtrs_ib_dev *(*alloc)(void);
+	void (*free)(struct ibtrs_ib_dev *);
+	int (*init)(struct ibtrs_ib_dev *);
+	void (*deinit)(struct ibtrs_ib_dev *);
+};
+
+struct ibtrs_ib_dev_pool {
+	struct mutex		mutex;
+	struct list_head	list;
+	enum ib_pd_flags	pd_flags;
+	const struct ibtrs_ib_dev_pool_ops *ops;
+};
+
+struct ibtrs_ib_dev {
+	struct ib_device	 *ib_dev;
+	struct ib_pd		 *ib_pd;
+	struct kref		 ref;
+	struct list_head	 entry;
+	struct ibtrs_ib_dev_pool *pool;
+};
+
+struct ibtrs_con {
+	struct ibtrs_sess	*sess;
+	struct ib_qp		*qp;
+	struct ib_cq		*cq;
+	struct rdma_cm_id	*cm_id;
+	u32			cid;
+};
+
+typedef void (ibtrs_hb_handler_t)(struct ibtrs_con *con, int err);
+
+struct ibtrs_sess {
+	struct list_head	entry;
+	struct sockaddr_storage dst_addr;
+	struct sockaddr_storage src_addr;
+	char			sessname[NAME_MAX];
+	uuid_t			uuid;
+	struct ibtrs_con	**con;
+	u32			con_num;
+	u32		recon_cnt;
+	struct ibtrs_ib_dev	*dev;
+	int			dev_ref;
+	struct ib_cqe		*hb_cqe;
+	ibtrs_hb_handler_t	*hb_err_handler;
+	struct workqueue_struct *hb_wq;
+	struct delayed_work	hb_dwork;
+	u32			hb_interval_ms;
+	u32			hb_missed_cnt;
+	u32			hb_missed_max;
+};
+
+struct ibtrs_iu {
+	struct list_head        list;
+	struct ib_cqe           cqe;
+	dma_addr_t              dma_addr;
+	void                    *buf;
+	size_t                  size;
+	enum dma_data_direction direction;
+	u32			tag;
+};
+
+/**
+ * enum ibtrs_msg_types - IBTRS message types.
+ * @IBTRS_MSG_INFO_REQ:		Client additional info request to the server
+ * @IBTRS_MSG_INFO_RSP:		Server additional info response to the client
+ * @IBTRS_MSG_WRITE:		Client writes data per RDMA to server
+ * @IBTRS_MSG_READ:		Client requests data transfer from server
+ */
+enum ibtrs_msg_types {
+	IBTRS_MSG_INFO_REQ,
+	IBTRS_MSG_INFO_RSP,
+	IBTRS_MSG_WRITE,
+	IBTRS_MSG_READ,
+};
+
+/**
+ * enum ibtrs_msg_flags - IBTRS message flags.
+ * @IBTRS_NEED_INVAL:	Send invalidation in response.
+ */
+enum ibtrs_msg_flags {
+	IBTRS_MSG_NEED_INVAL_F = 1<<0
+};
+
+/**
+ * struct ibtrs_sg_desc - RDMA-Buffer entry description
+ * @addr:	Address of RDMA destination buffer
+ * @key:	Authorization rkey to write to the buffer
+ * @len:	Size of the buffer
+ */
+struct ibtrs_sg_desc {
+	__le64			addr;
+	__le32			key;
+	__le32			len;
+};
+
+/**
+ * struct ibtrs_msg_conn_req - Client connection request to the server
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @cid:	   Current connection id
+ * @cid_num:	   Number of connections per session
+ * @recon_cnt:	   Reconnections counter
+ * @sess_uuid:	   UUID of a session (path)
+ * @paths_uuid:	   UUID of a group of sessions (paths)
+ *
+ * NOTE: max size 56 bytes, see man rdma_connect().
+ */
+struct ibtrs_msg_conn_req {
+	u8		__cma_version; /* Is set to 0 by cma.c in case of
+					* AF_IB, do not touch that. */
+	u8		__ip_version;  /* On sender side that should be
+					* set to 0, or cma_save_ip_info()
+					* extract garbage and will fail. */
+	__le16		magic;
+	__le16		version;
+	__le16		cid;
+	__le16		cid_num;
+	__le16		recon_cnt;
+	uuid_t		sess_uuid;
+	uuid_t		paths_uuid;
+	u8		reserved[12];
+};
+
+/**
+ * struct ibtrs_msg_conn_rsp - Server connection response to the client
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
+ * @queue_depth:   max inflight messages (queue-depth) in this session
+ * @max_io_size:   max io size server supports
+ * @max_hdr_size:  max msg header size server supports
+ *
+ * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
+ */
+struct ibtrs_msg_conn_rsp {
+	__le16		magic;
+	__le16		version;
+	__le16		errno;
+	__le16		queue_depth;
+	__le32		max_io_size;
+	__le32		max_hdr_size;
+	u8		reserved[40];
+};
+
+/**
+ * struct ibtrs_msg_info_req
+ * @type:		@IBTRS_MSG_INFO_REQ
+ * @sessname:		Session name chosen by client
+ */
+struct ibtrs_msg_info_req {
+	__le16		type;
+	u8		sessname[NAME_MAX];
+	u8		reserved[15];
+};
+
+/**
+ * struct ibtrs_msg_info_rsp
+ * @type:		@IBTRS_MSG_INFO_RSP
+ * @sg_cnt:		Number of @desc entries
+ * @desc:		RDMA buffers where the client can write to server
+ */
+struct ibtrs_msg_info_rsp {
+	__le16		type;
+	__le16          sg_cnt;
+	u8              reserved[4];
+	struct ibtrs_sg_desc desc[];
+};
+
+/**
+ * struct ibtrs_msg_rdma_read - RDMA data transfer request from client
+ * @type:		always @IBTRS_MSG_READ
+ * @usr_len:		length of user payload
+ * @sg_cnt:		number of @desc entries
+ * @desc:		RDMA buffers where the server can write the result to
+ */
+struct ibtrs_msg_rdma_read {
+	__le16			type;
+	__le16			usr_len;
+	__le16			flags;
+	__le16			sg_cnt;
+	struct ibtrs_sg_desc    desc[];
+};
+
+/**
+ * struct_msg_rdma_write - Message transferred to server with RDMA-Write
+ * @type:		always @IBTRS_MSG_WRITE
+ * @usr_len:		length of user payload
+ */
+struct ibtrs_msg_rdma_write {
+	__le16			type;
+	__le16			usr_len;
+};
+
+/**
+ * struct_msg_rdma_hdr - header for read or write request
+ * @type:		@IBTRS_MSG_WRITE | @IBTRS_MSG_READ
+ */
+struct ibtrs_msg_rdma_hdr {
+	__le16			type;
+};
+
+/* ibtrs.c */
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t t,
+				struct ib_device *dev, enum dma_data_direction,
+				void (*done)(struct ib_cq *cq, struct ib_wc *wc));
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *dev);
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu);
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size,
+		       struct ib_send_wr *head);
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags,
+				 struct ib_send_wr *head);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe);
+int ibtrs_post_recv_empty_x2(struct ibtrs_con *con, struct ib_cqe *cqe);
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags,
+				    struct ib_send_wr *head);
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *ibtrs_sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx);
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con);
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   u32 interval_ms, u32 missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq);
+void ibtrs_start_hb(struct ibtrs_sess *sess);
+void ibtrs_stop_hb(struct ibtrs_sess *sess);
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess);
+
+void ibtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			    struct ibtrs_ib_dev_pool *pool);
+void ibtrs_ib_dev_pool_deinit(struct ibtrs_ib_dev_pool *pool);
+
+struct ibtrs_ib_dev *ibtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+					      struct ibtrs_ib_dev_pool *pool);
+int ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev);
+
+static inline int sockaddr_cmp(const struct sockaddr *a,
+			       const struct sockaddr *b)
+{
+	switch (a->sa_family) {
+	case AF_IB:
+		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
+			      &((struct sockaddr_ib *)b)->sib_addr,
+			      sizeof(struct ib_addr));
+	case AF_INET:
+		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
+			      &((struct sockaddr_in *)b)->sin_addr,
+			      sizeof(struct in_addr));
+	case AF_INET6:
+		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
+			      &((struct sockaddr_in6 *)b)->sin6_addr,
+			      sizeof(struct in6_addr));
+	default:
+		return -ENOENT;
+	}
+}
+
+static inline int sockaddr_to_str(const struct sockaddr *addr,
+				   char *buf, size_t len)
+{
+	int cnt;
+
+	switch (addr->sa_family) {
+	case AF_IB:
+		cnt = scnprintf(buf, len, "gid:%pI6",
+			&((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
+		return cnt;
+	case AF_INET:
+		cnt = scnprintf(buf, len, "ip:%pI4",
+			&((struct sockaddr_in *)addr)->sin_addr);
+		return cnt;
+	case AF_INET6:
+		cnt = scnprintf(buf, len, "ip:%pI6c",
+			  &((struct sockaddr_in6 *)addr)->sin6_addr);
+		return cnt;
+	}
+	cnt = scnprintf(buf, len, "<invalid address family>");
+	pr_err("Invalid address family\n");
+	return cnt;
+}
+
+/**
+ * ibtrs_invalidate_flag() - returns proper flags for invalidation
+ *
+ * NOTE: This function is needed for compat layer, so think twice before
+ *       rename or remove.
+ */
+static inline u32 ibtrs_invalidate_flag(void)
+{
+	return IBTRS_MSG_NEED_INVAL_F;
+}
+
+static inline u32 ibtrs_to_imm(u32 type, u32 payload)
+{
+	BUILD_BUG_ON(MAX_IMM_PAYL_BITS + MAX_IMM_TYPE_BITS != 32);
+	BUILD_BUG_ON(IBTRS_LAST_IMM > (1<<MAX_IMM_TYPE_BITS));
+	return ((type & MAX_IMM_TYPE_MASK) << MAX_IMM_PAYL_BITS) |
+		(payload & MAX_IMM_PAYL_MASK);
+}
+
+static inline void ibtrs_from_imm(u32 imm, u32 *type, u32 *payload)
+{
+	*payload = (imm & MAX_IMM_PAYL_MASK);
+	*type = (imm >> MAX_IMM_PAYL_BITS);
+}
+
+static inline u32 ibtrs_to_io_req_imm(u32 addr)
+{
+	return ibtrs_to_imm(IBTRS_IO_REQ_IMM, addr);
+}
+
+static inline u32 ibtrs_to_io_rsp_imm(u32 msg_id, int errno, bool w_inval)
+{
+	enum ibtrs_imm_type type;
+	u32 payload;
+
+	/* 9 bits for errno, 19 bits for msg_id */
+	payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
+	type = (w_inval ? IBTRS_IO_RSP_W_INV_IMM : IBTRS_IO_RSP_IMM);
+
+	return ibtrs_to_imm(type, payload);
+}
+
+static inline void ibtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
+{
+	/* 9 bits for errno, 19 bits for msg_id */
+	*msg_id = (payload & 0x7ffff);
+	*errno = -(int)((payload >> 19) & 0x1ff);
+}
+
+#define STAT_STORE_FUNC(type, store, reset)				\
+static ssize_t store##_store(struct kobject *kobj,			\
+			     struct kobj_attribute *attr,		\
+			     const char *buf, size_t count)		\
+{									\
+	int ret = -EINVAL;						\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	if (sysfs_streq(buf, "1"))					\
+		ret = reset(&sess->stats, true);			\
+	else if (sysfs_streq(buf, "0"))					\
+		ret = reset(&sess->stats, false);			\
+	if (ret)							\
+		return ret;						\
+									\
+	return count;							\
+}
+
+#define STAT_SHOW_FUNC(type, show, print)				\
+static ssize_t show##_show(struct kobject *kobj,			\
+			   struct kobj_attribute *attr,			\
+			   char *page)					\
+{									\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	return print(&sess->stats, page, PAGE_SIZE);			\
+}
+
+#define STAT_ATTR(type, stat, print, reset)				\
+STAT_STORE_FUNC(type, stat, reset)					\
+STAT_SHOW_FUNC(type, stat, print)					\
+static struct kobj_attribute stat##_attr =				\
+		__ATTR(stat, 0644,					\
+		       stat##_show,					\
+		       stat##_store)
+
+#endif /* IBTRS_PRI_H */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This is a set of library functions existing as a ibtrs-core module,
used by client and server modules.

Mainly these functions wrap IB and RDMA calls and provide a bit higher
abstraction for implementing of IBTRS protocol on client or server
sides.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.c | 610 +++++++++++++++++++++++++++
 1 file changed, 610 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.c b/drivers/infiniband/ulp/ibtrs/ibtrs.c
new file mode 100644
index 000000000000..f6879daa5bb9
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.c
@@ -0,0 +1,610 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/inet.h>
+
+#include "ibtrs-pri.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Core");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t gfp_mask,
+				struct ib_device *dma_dev,
+				enum dma_data_direction direction,
+				void (*done)(struct ib_cq *cq,
+					     struct ib_wc *wc))
+{
+	struct ibtrs_iu *iu;
+
+	iu = kmalloc(sizeof(*iu), gfp_mask);
+	if (unlikely(!iu))
+		return NULL;
+
+	iu->buf = kzalloc(size, gfp_mask);
+	if (unlikely(!iu->buf))
+		goto err1;
+
+	iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, direction);
+	if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+		goto err2;
+
+	iu->cqe.done  = done;
+	iu->size      = size;
+	iu->direction = direction;
+	iu->tag       = tag;
+
+	return iu;
+
+err2:
+	kfree(iu->buf);
+err1:
+	kfree(iu);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_alloc);
+
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *ibdev)
+{
+	if (!iu)
+		return;
+
+	ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+	kfree(iu->buf);
+	kfree(iu);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_free);
+
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_recv_wr wr;
+	const struct ib_recv_wr *bad_wr;
+	struct ib_sge list;
+
+	list.addr   = iu->dma_addr;
+	list.length = iu->size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	if (WARN_ON(list.length == 0)) {
+		ibtrs_wrn(con, "Posting receive work request failed,"
+			  " sg list is empty\n");
+		return -EINVAL;
+	}
+
+	wr.next    = NULL;
+	wr.wr_cqe  = &iu->cqe;
+	wr.sg_list = &list;
+	wr.num_sge = 1;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_recv);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr;
+	const struct ib_recv_wr *bad_wr;
+
+	wr.next    = NULL;
+	wr.wr_cqe  = cqe;
+	wr.sg_list = NULL;
+	wr.num_sge = 0;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
+
+int ibtrs_post_recv_empty_x2(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr_arr[2], *wr;
+	const struct ib_recv_wr *bad_wr;
+	int i;
+
+	memset(wr_arr, 0, sizeof(wr_arr));
+	for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
+		wr = &wr_arr[i];
+		wr->wr_cqe  = cqe;
+		if (i)
+			/* Chain backwards */
+			wr->next = &wr_arr[i - 1];
+	}
+
+	return ib_post_recv(con->qp, wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty_x2);
+
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size,
+		       struct ib_send_wr *head)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_send_wr wr;
+	const struct ib_send_wr *bad_wr;
+	struct ib_sge list;
+
+	if ((WARN_ON(size == 0)))
+		return -EINVAL;
+
+	list.addr   = iu->dma_addr;
+	list.length = size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.next       = NULL;
+	wr.wr_cqe     = &iu->cqe;
+	wr.sg_list    = &list;
+	wr.num_sge    = 1;
+	wr.opcode     = IB_WR_SEND;
+	wr.send_flags = IB_SEND_SIGNALED;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	} else {
+		head = &wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_send);
+
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags,
+				 struct ib_send_wr *head)
+{
+	const struct ib_send_wr *bad_wr;
+	struct ib_rdma_wr wr;
+	int i;
+
+	wr.wr.next	  = NULL;
+	wr.wr.wr_cqe	  = &iu->cqe;
+	wr.wr.sg_list	  = sge;
+	wr.wr.num_sge	  = num_sge;
+	wr.rkey		  = rkey;
+	wr.remote_addr	  = rdma_addr;
+	wr.wr.opcode	  = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.wr.ex.imm_data = cpu_to_be32(imm_data);
+	wr.wr.send_flags  = flags;
+
+	/*
+	 * If one of the sges has 0 size, the operation will fail with an
+	 * length error
+	 */
+	for (i = 0; i < num_sge; i++)
+		if (WARN_ON(sge[i].length == 0))
+			return -EINVAL;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr.wr;
+	} else {
+		head = &wr.wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
+
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags,
+				    struct ib_send_wr *head)
+{
+	struct ib_send_wr wr;
+	const struct ib_send_wr *bad_wr;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.wr_cqe	= cqe;
+	wr.send_flags	= flags;
+	wr.opcode	= IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.ex.imm_data	= cpu_to_be32(imm_data);
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	} else {
+		head = &wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);
+
+static void qp_event_handler(struct ib_event *ev, void *ctx)
+{
+	struct ibtrs_con *con = ctx;
+
+	switch (ev->event) {
+	case IB_EVENT_COMM_EST:
+		ibtrs_info(con, "QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		rdma_notify(con->cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		ibtrs_info(con, "Unhandled QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		break;
+	}
+}
+
+static int create_cq(struct ibtrs_con *con, int cq_vector, u16 cq_size,
+		     enum ib_poll_context poll_ctx)
+{
+	struct rdma_cm_id *cm_id = con->cm_id;
+	struct ib_cq *cq;
+
+	cq = ib_alloc_cq(cm_id->device, con, cq_size,
+			 cq_vector, poll_ctx);
+	if (unlikely(IS_ERR(cq))) {
+		ibtrs_err(con, "Creating completion queue failed, errno: %ld\n",
+			  PTR_ERR(cq));
+		return PTR_ERR(cq);
+	}
+	con->cq = cq;
+
+	return 0;
+}
+
+static int create_qp(struct ibtrs_con *con, struct ib_pd *pd,
+		     u16 wr_queue_size, u32 max_sge)
+{
+	struct ib_qp_init_attr init_attr = {NULL};
+	struct rdma_cm_id *cm_id = con->cm_id;
+	int ret;
+
+	init_attr.cap.max_send_wr = wr_queue_size;
+	init_attr.cap.max_recv_wr = wr_queue_size;
+	init_attr.cap.max_recv_sge = 1;
+	init_attr.event_handler = qp_event_handler;
+	init_attr.qp_context = con;
+#undef max_send_sge
+	init_attr.cap.max_send_sge = max_sge;
+
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.send_cq = con->cq;
+	init_attr.recv_cq = con->cq;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+
+	ret = rdma_create_qp(cm_id, pd, &init_attr);
+	if (unlikely(ret)) {
+		ibtrs_err(con, "Creating QP failed, err: %d\n", ret);
+		return ret;
+	}
+	con->qp = cm_id->qp;
+
+	return ret;
+}
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx)
+{
+	int err;
+
+	err = create_cq(con, cq_vector, cq_size, poll_ctx);
+	if (unlikely(err))
+		return err;
+
+	err = create_qp(con, sess->dev->ib_pd, wr_queue_size, max_send_sge);
+	if (unlikely(err)) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+		return err;
+	}
+	con->sess = sess;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_create);
+
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con)
+{
+	if (con->qp) {
+		rdma_destroy_qp(con->cm_id);
+		con->qp = NULL;
+	}
+	if (con->cq) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_destroy);
+
+static void schedule_hb(struct ibtrs_sess *sess)
+{
+	queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
+			   msecs_to_jiffies(sess->hb_interval_ms));
+}
+
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
+{
+	struct ibtrs_con *usr_con = sess->con[0];
+	u32 imm;
+	int err;
+
+	imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);
+
+static void hb_work(struct work_struct *work)
+{
+	struct ibtrs_con *usr_con;
+	struct ibtrs_sess *sess;
+	u32 imm;
+	int err;
+
+	sess = container_of(to_delayed_work(work), typeof(*sess), hb_dwork);
+	usr_con = sess->con[0];
+
+	if (sess->hb_missed_cnt > sess->hb_missed_max) {
+		sess->hb_err_handler(usr_con, -ETIMEDOUT);
+		return;
+	}
+	if (sess->hb_missed_cnt++) {
+		/* Reschedule work without sending hb */
+		schedule_hb(sess);
+		return;
+	}
+	imm = ibtrs_to_imm(IBTRS_HB_MSG_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+
+	schedule_hb(sess);
+}
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned int interval_ms, unsigned int missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq)
+{
+	sess->hb_cqe = cqe;
+	sess->hb_interval_ms = interval_ms;
+	sess->hb_err_handler = err_handler;
+	sess->hb_wq = wq;
+	sess->hb_missed_max = missed_max;
+	sess->hb_missed_cnt = 0;
+	INIT_DELAYED_WORK(&sess->hb_dwork, hb_work);
+}
+EXPORT_SYMBOL_GPL(ibtrs_init_hb);
+
+void ibtrs_start_hb(struct ibtrs_sess *sess)
+{
+	schedule_hb(sess);
+}
+EXPORT_SYMBOL_GPL(ibtrs_start_hb);
+
+void ibtrs_stop_hb(struct ibtrs_sess *sess)
+{
+	cancel_delayed_work_sync(&sess->hb_dwork);
+	sess->hb_missed_cnt = 0;
+	sess->hb_missed_max = 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_stop_hb);
+
+static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
+				     short port, struct sockaddr_storage *dst)
+{
+	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
+	int ret;
+
+	/*
+	 * We can use some of the I6 functions since GID is a valid
+	 * IPv6 address format
+	 */
+	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
+	if (ret == 0)
+		return -EINVAL;
+
+	dst_ib->sib_family = AF_IB;
+	/*
+	 * Use the same TCP server port number as the IB service ID
+	 * on the IB port space range
+	 */
+	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
+	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+	dst_ib->sib_pkey = cpu_to_be16(0xffff);
+
+	return 0;
+}
+
+/**
+ * ibtrs_str_to_sockaddr() - Convert ibtrs address string to sockaddr
+ * @addr	String representation of an addr (IPv4, IPv6 or IB GID):
+ *              - "ip:192.168.1.1"
+ *              - "ip:fe80::200:5aee:feaa:20a2"
+ *              - "gid:fe80::200:5aee:feaa:20a2"
+ * @len         String address length
+ * @port	Destination port
+ * @dst		Destination sockaddr structure
+ *
+ * Returns 0 if conversion successful. Non-zero on error.
+ */
+static int ibtrs_str_to_sockaddr(const char *addr, size_t len,
+				 short port, struct sockaddr_storage *dst)
+{
+	if (strncmp(addr, "gid:", 4) == 0) {
+		return ibtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
+	} else if (strncmp(addr, "ip:", 3) == 0) {
+		char port_str[8];
+		char *cpy;
+		int err;
+
+		snprintf(port_str, sizeof(port_str), "%u", port);
+		cpy = kstrndup(addr + 3, len - 3, GFP_KERNEL);
+		err = cpy ? inet_pton_with_scope(&init_net, AF_UNSPEC,
+						 cpy, port_str, dst) : -ENOMEM;
+		kfree(cpy);
+
+		return err;
+	}
+	return -EPROTONOSUPPORT;
+}
+
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr)
+{
+	const char *d;
+	int ret;
+
+	d = strchr(str, ',');
+	if (!d)
+		d = strchr(str, '@');
+	if (d) {
+		if (ibtrs_str_to_sockaddr(str, d - str, 0, addr->src))
+			return -EINVAL;
+		d += 1;
+		len -= d - str;
+		str  = d;
+
+	} else {
+		addr->src = NULL;
+	}
+	ret = ibtrs_str_to_sockaddr(str, len, port, addr->dst);
+
+	return ret;
+}
+EXPORT_SYMBOL(ibtrs_addr_to_sockaddr);
+
+void ibtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			    struct ibtrs_ib_dev_pool *pool)
+{
+	WARN_ON(pool->ops && (!pool->ops->alloc ^ !pool->ops->free));
+	INIT_LIST_HEAD(&pool->list);
+	mutex_init(&pool->mutex);
+	pool->pd_flags = pd_flags;
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_pool_init);
+
+void ibtrs_ib_dev_pool_deinit(struct ibtrs_ib_dev_pool *pool)
+{
+	WARN_ON(!list_empty(&pool->list));
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_pool_deinit);
+
+static void dev_free(struct kref *ref)
+{
+	struct ibtrs_ib_dev_pool *pool;
+	struct ibtrs_ib_dev *dev;
+
+	dev = container_of(ref, typeof(*dev), ref);
+	pool = dev->pool;
+
+	mutex_lock(&pool->mutex);
+	list_del(&dev->entry);
+	mutex_unlock(&pool->mutex);
+
+	if (pool->ops && pool->ops->deinit)
+		pool->ops->deinit(dev);
+
+	ib_dealloc_pd(dev->ib_pd);
+
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+}
+
+int ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev)
+{
+	return kref_put(&dev->ref, dev_free);
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_put);
+
+static int ibtrs_ib_dev_get(struct ibtrs_ib_dev *dev)
+{
+	return kref_get_unless_zero(&dev->ref);
+}
+
+struct ibtrs_ib_dev *
+ibtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+			 struct ibtrs_ib_dev_pool *pool)
+{
+	struct ibtrs_ib_dev *dev;
+
+	mutex_lock(&pool->mutex);
+	list_for_each_entry(dev, &pool->list, entry) {
+		if (dev->ib_dev->node_guid == ib_dev->node_guid &&
+		    ibtrs_ib_dev_get(dev))
+			goto out_unlock;
+	}
+	if (pool->ops && pool->ops->alloc)
+		dev = pool->ops->alloc();
+	else
+		dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (unlikely(IS_ERR_OR_NULL(dev)))
+		goto out_err;
+
+	kref_init(&dev->ref);
+	dev->pool = pool;
+	dev->ib_dev = ib_dev;
+	dev->ib_pd = ib_alloc_pd(ib_dev, pool->pd_flags);
+	if (unlikely(IS_ERR(dev->ib_pd)))
+		goto out_free_dev;
+
+	if (pool->ops && pool->ops->init && pool->ops->init(dev))
+		goto out_free_pd;
+
+	list_add(&dev->entry, &pool->list);
+out_unlock:
+	mutex_unlock(&pool->mutex);
+	return dev;
+
+out_free_pd:
+	ib_dealloc_pd(dev->ib_pd);
+out_free_dev:
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+out_err:
+	mutex_unlock(&pool->mutex);
+	return NULL;
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_find_or_add);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 05/25] ibtrs: client: private header with client structs and functions
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

This header describes main structs and functions used by ibtrs-client
module, mainly for managing IBTRS sessions, creating/destroying sysfs
entries, accounting statistics on client side.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 308 +++++++++++++++++++++++
 1 file changed, 308 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
new file mode 100644
index 000000000000..f9e65f5eb5ab
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
@@ -0,0 +1,308 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#ifndef IBTRS_CLT_H
+#define IBTRS_CLT_H
+
+#include <linux/device.h>
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_clt_state - Client states.
+ */
+enum ibtrs_clt_state {
+	IBTRS_CLT_CONNECTING,
+	IBTRS_CLT_CONNECTING_ERR,
+	IBTRS_CLT_RECONNECTING,
+	IBTRS_CLT_CONNECTED,
+	IBTRS_CLT_CLOSING,
+	IBTRS_CLT_CLOSED,
+	IBTRS_CLT_DEAD,
+};
+
+static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
+{
+	switch (state) {
+	case IBTRS_CLT_CONNECTING:
+		return "IBTRS_CLT_CONNECTING";
+	case IBTRS_CLT_CONNECTING_ERR:
+		return "IBTRS_CLT_CONNECTING_ERR";
+	case IBTRS_CLT_RECONNECTING:
+		return "IBTRS_CLT_RECONNECTING";
+	case IBTRS_CLT_CONNECTED:
+		return "IBTRS_CLT_CONNECTED";
+	case IBTRS_CLT_CLOSING:
+		return "IBTRS_CLT_CLOSING";
+	case IBTRS_CLT_CLOSED:
+		return "IBTRS_CLT_CLOSED";
+	case IBTRS_CLT_DEAD:
+		return "IBTRS_CLT_DEAD";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+enum ibtrs_mp_policy {
+	MP_POLICY_RR,
+	MP_POLICY_MIN_INFLIGHT,
+};
+
+struct ibtrs_clt_stats_reconnects {
+	int successful_cnt;
+	int fail_cnt;
+};
+
+struct ibtrs_clt_stats_wc_comp {
+	u32 cnt;
+	u64 total_cnt;
+};
+
+struct ibtrs_clt_stats_cpu_migr {
+	atomic_t from;
+	int to;
+};
+
+struct ibtrs_clt_stats_rdma {
+	struct {
+		u64 cnt;
+		u64 size_total;
+	} dir[2];
+
+	u64 failover_cnt;
+};
+
+struct ibtrs_clt_stats_rdma_lat {
+	u64 read;
+	u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct ibtrs_clt_stats_pcpu {
+	struct ibtrs_clt_stats_cpu_migr		cpu_migr;
+	struct ibtrs_clt_stats_rdma		rdma;
+	u64					sg_list_total;
+	u64					sg_list_distr[SG_DISTR_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_distr[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_max;
+	struct ibtrs_clt_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_clt_stats {
+	bool					enable_rdma_lat;
+	struct ibtrs_clt_stats_pcpu    __percpu	*pcpu_stats;
+	struct ibtrs_clt_stats_reconnects	reconnects;
+	atomic_t				inflight;
+};
+
+struct ibtrs_clt_con {
+	struct ibtrs_con	c;
+	u32			cpu;
+	atomic_t		io_cnt;
+	int			cm_err;
+};
+
+/**
+ * ibtrs_tag - tags the memory allocation for future RDMA operation
+ */
+struct ibtrs_tag {
+	enum ibtrs_clt_con_type con_type;
+	u32			cpu_id;
+	u32			mem_id;
+	u32			mem_off;
+};
+
+struct ibtrs_clt_io_req {
+	struct list_head        list;
+	struct ibtrs_iu		*iu;
+	struct scatterlist	*sglist; /* list holding user data */
+	u32			sg_cnt;
+	u32			sg_size;
+	u32			data_len;
+	u32			usr_len;
+	void			*priv;
+	bool			in_use;
+	struct ibtrs_clt_con	*con;
+	struct ibtrs_sg_desc	*desc;
+	struct ib_sge		*sge;
+	struct ibtrs_tag	*tag;
+	enum dma_data_direction dir;
+	ibtrs_conf_fn		*conf;
+	unsigned long		start_jiffies;
+
+	struct ib_mr		*mr;
+	struct ib_cqe		inv_cqe;
+	struct completion	inv_comp;
+	int			inv_errno;
+	bool			need_inv_comp;
+	bool			need_inv;
+};
+
+struct ibtrs_rbuf {
+	u64 addr;
+	u32 rkey;
+};
+
+struct ibtrs_clt_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_clt	*clt;
+	wait_queue_head_t	state_wq;
+	enum ibtrs_clt_state	state;
+	atomic_t		connected_cnt;
+	struct mutex		init_mutex;
+	struct ibtrs_clt_io_req	*reqs;
+	struct delayed_work	reconnect_dwork;
+	struct work_struct	close_work;
+	u32			reconnect_attempts;
+	bool			established;
+	struct ibtrs_rbuf	*rbufs;
+	size_t			max_io_size;
+	u32			max_hdr_size;
+	u32			chunk_size;
+	size_t			queue_depth;
+	u32			max_pages_per_mr;
+	int			max_send_sge;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_clt_stats  stats;
+	/* cache hca_port and hca_name to display in sysfs */
+	u8			hca_port;
+	char                    hca_name[IB_DEVICE_NAME_MAX];
+	struct list_head __percpu
+				*mp_skip_entry;
+};
+
+struct ibtrs_clt {
+	struct list_head   /* __rcu */ paths_list;
+	size_t			       paths_num;
+	struct ibtrs_clt_sess
+		      __rcu * __percpu *pcpu_path;
+
+	bool			opened;
+	uuid_t			paths_uuid;
+	int			paths_up;
+	struct mutex		paths_mutex;
+	struct mutex		paths_ev_mutex;
+	char			sessname[NAME_MAX];
+	short			port;
+	u32			max_reconnect_attempts;
+	u32		reconnect_delay_sec;
+	u32		max_segments;
+	void			*tags;
+	unsigned long		*tags_map;
+	size_t			queue_depth;
+	size_t			max_io_size;
+	wait_queue_head_t	tags_wait;
+	size_t			pdu_sz;
+	void			*priv;
+	link_clt_ev_fn		*link_ev;
+	struct device		dev;
+	struct kobject		kobj_paths;
+	enum ibtrs_mp_policy	mp_policy;
+};
+
+static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
+{
+	return container_of(c, struct ibtrs_clt_con, c);
+}
+
+static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
+{
+	return container_of(s, struct ibtrs_clt_sess, s);
+}
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),		\
+	     CASE(obj, struct ibtrs_clt *, sessname))
+
+#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
+#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr);
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self);
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value);
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt);
+
+/* ibtrs-clt-stats.c */
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats);
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats);
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *s);
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *s);
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *s, bool read,
+			       unsigned long ms);
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con);
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir);
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable);
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable);
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len);
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len);
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len);
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess);
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-clt-sysfs.c */
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt);
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt);
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess);
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self);
+
+#endif /* IBTRS_CLT_H */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, Jack Wang
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

From: Roman Pen <roman.penyaev@profitbricks.com>

Introduce public header which provides set of API functions to
establish RDMA connections from client to server machine using
IBTRS protocol, which manages RDMA connections for each session,
does multipathing and load balancing.

Main functions for client (active) side:

 ibtrs_clt_open() - Creates set of RDMA connections incapsulated
                    in IBTRS session and returns pointer on IBTRS
		    session object.
 ibtrs_clt_close() - Closes RDMA connections associated with IBTRS
                     session.
 ibtrs_clt_request() - Requests zero-copy RDMA transfer to/from
                       server.

Main functions for server (passive) side:

 ibtrs_srv_open() - Starts listening for IBTRS clients on specified
                    port and invokes IBTRS callbacks for incoming
		    RDMA requests or link events.
 ibtrs_srv_close() - Closes IBTRS server context.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.h | 318 +++++++++++++++++++++++++++
 1 file changed, 318 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.h b/drivers/infiniband/ulp/ibtrs/ibtrs.h
new file mode 100644
index 000000000000..f5434f0bb85c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.h
@@ -0,0 +1,318 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ * Authors: Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Jack Wang <jinpu.wang@cloud.ionos.com>
+ *          Danil Kipnis <danil.kipnis@cloud.ionos.com>
+ */
+
+#ifndef IBTRS_H
+#define IBTRS_H
+
+#include <linux/socket.h>
+#include <linux/scatterlist.h>
+
+struct ibtrs_tag;
+struct ibtrs_clt;
+struct ibtrs_srv_ctx;
+struct ibtrs_srv;
+struct ibtrs_srv_op;
+
+/*
+ * Here goes IBTRS client API
+ */
+
+/**
+ * enum ibtrs_clt_link_ev - Events about connectivity state of a client
+ * @IBTRS_CLT_LINK_EV_RECONNECTED	Client was reconnected.
+ * @IBTRS_CLT_LINK_EV_DISCONNECTED	Client was disconnected.
+ */
+enum ibtrs_clt_link_ev {
+	IBTRS_CLT_LINK_EV_RECONNECTED,
+	IBTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct ibtrs_addr {
+	struct sockaddr_storage *src;
+	struct sockaddr_storage *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum ibtrs_clt_link_ev ev);
+/**
+ * ibtrs_clt_open() - Open a session to a IBTRS client
+ * @priv:		User supplied private data.
+ * @link_ev:		Event notification for connection state changes
+ *	@priv:			user supplied data that was passed to
+ *				ibtrs_clt_open()
+ *	@ev:			Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elemnts in the @paths array
+ * @port: port to be used by the IBTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ *			    up, 0 for * disabled, -1 for forever
+ *
+ * Starts session establishment with the ibtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t path_cnt, short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts);
+
+/**
+ * ibtrs_clt_close() - Close a session
+ * @sess: Session handler, is freed on return
+ */
+void ibtrs_clt_close(struct ibtrs_clt *sess);
+
+/**
+ * ibtrs_tag_from_pdu() - converts opaque pdu pointer to ibtrs_tag
+ * @pdu: opaque pointer
+ */
+struct ibtrs_tag *ibtrs_tag_from_pdu(void *pdu);
+
+/**
+ * ibtrs_tag_to_pdu() - converts ibtrs_tag to opaque pdu pointer
+ * @tag: IBTRS tag pointer
+ */
+void *ibtrs_tag_to_pdu(struct ibtrs_tag *tag);
+
+enum {
+	IBTRS_TAG_NOWAIT = 0,
+	IBTRS_TAG_WAIT   = 1,
+};
+
+/**
+ * enum ibtrs_clt_con_type() type of ib connection to use with a given tag
+ * @USR_CON - use connection reserved vor "service" messages
+ * @IO_CON - use a connection reserved for IO
+ */
+enum ibtrs_clt_con_type {
+	IBTRS_USR_CON,
+	IBTRS_IO_CON
+};
+
+/**
+ * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
+ * @sess:	Current session
+ * @con_type:	Type of connection to use with the tag
+ * @wait:	Wait type
+ *
+ * Description:
+ *    Allocates tag for the following RDMA operation.  Tag is used
+ *    to preallocate all resources and to propagate memory pressure
+ *    up earlier.
+ *
+ * Context:
+ *    Can sleep if @wait == IBTRS_TAG_WAIT
+ */
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
+				    enum ibtrs_clt_con_type con_type,
+				    int wait);
+
+/**
+ * ibtrs_clt_put_tag() - puts allocated tag
+ * @sess:	Current session
+ * @tag:	Tag to be freed
+ *
+ * Context:
+ *    Does not matter
+ */
+void ibtrs_clt_put_tag(struct ibtrs_clt *sess, struct ibtrs_tag *tag);
+
+typedef void (ibtrs_conf_fn)(void *priv, int errno);
+/**
+ * ibtrs_clt_request() - Request data transfer to/from server via RDMA.
+ *
+ * @dir:	READ/WRITE
+ * @conf:	callback function to be called as confirmation
+ * @sess:	Session
+ * @tag:	Preallocated tag
+ * @priv:	User provided data, passed back with corresponding
+ *		@(conf) confirmation.
+ * @vec:	Message that is send to server together with the request.
+ *		Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
+ *		Since the msg is copied internally it can be allocated on stack.
+ * @nr:		Number of elements in @vec.
+ * @len:	length of data send to/from server
+ * @sg:		Pages to be sent/received to/from server.
+ * @sg_cnt:	Number of elements in the @sg
+ *
+ * Return:
+ * 0:		Success
+ * <0:		Error
+ *
+ * On dir=READ ibtrs client will request a data transfer from Server to client.
+ * The data that the server will respond with will be stored in @sg when
+ * the user receives an %IBTRS_CLT_RDMA_EV_RDMA_REQUEST_WRITE_COMPL event.
+ * On dir=WRITE ibtrs client will rdma write data in sg to server side.
+ */
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *sess,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t len, struct scatterlist *sg,
+		      unsigned int sg_cnt);
+
+/**
+ * ibtrs_attrs - IBTRS session attributes
+ */
+struct ibtrs_attrs {
+	u32	queue_depth;
+	u32	max_io_size;
+	u8	sessname[NAME_MAX];
+	struct kobject *sess_kobj;
+};
+
+/**
+ * ibtrs_clt_query() - queries IBTRS session attributes
+ *
+ * Returns:
+ *    0 on success
+ *    -ECOMM		no connection to the server
+ */
+int ibtrs_clt_query(struct ibtrs_clt *sess, struct ibtrs_attrs *attr);
+
+/*
+ * Here goes IBTRS server API
+ */
+
+/**
+ * enum ibtrs_srv_link_ev - Server link events
+ * @IBTRS_SRV_LINK_EV_CONNECTED:	Connection from client established
+ * @IBTRS_SRV_LINK_EV_DISCONNECTED:	Connection was disconnected, all
+ *					connection IBTRS resources were freed.
+ */
+enum ibtrs_srv_link_ev {
+	IBTRS_SRV_LINK_EV_CONNECTED,
+	IBTRS_SRV_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * rdma_ev_fn():	Event notification for RDMA operations
+ *			If the callback returns a value != 0, an error message
+ *			for the data transfer will be sent to the client.
+
+ *	@sess:		Session
+ *	@priv:		Private data set by ibtrs_srv_set_sess_priv()
+ *	@id:		internal IBTRS operation id
+ *	@dir:		READ/WRITE
+ *	@data:		Pointer to (bidirectional) rdma memory area:
+ *			- in case of %IBTRS_SRV_RDMA_EV_RECV contains
+ *			data sent by the client
+ *			- in case of %IBTRS_SRV_RDMA_EV_WRITE_REQ points to the
+ *			memory area where the response is to be written to
+ *	@datalen:	Size of the memory area in @data
+ *	@usr:		The extra user message sent by the client (%vec)
+ *	@usrlen:	Size of the user message
+ */
+typedef int (rdma_ev_fn)(struct ibtrs_srv *sess, void *priv,
+			 struct ibtrs_srv_op *id, int dir,
+			 void *data, size_t datalen, const void *usr,
+			 size_t usrlen);
+
+/**
+ * link_ev_fn():	Events about connective state changes
+ *			If the callback returns != 0 and the event
+ *			%IBTRS_SRV_LINK_EV_CONNECTED the corresponding session
+ *			will be destroyed.
+ *	@sess:		Session
+ *	@ev:		event
+ *	@priv:		Private data from user if previously set with
+ *			ibtrs_srv_set_sess_priv()
+ */
+typedef int (link_ev_fn)(struct ibtrs_srv *sess, enum ibtrs_srv_link_ev ev,
+			 void *priv);
+
+/**
+ * ibtrs_srv_open() - open IBTRS server context
+ * @ops:		callback functions
+ *
+ * Creates server context with specified callbacks.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port);
+
+/**
+ * ibtrs_srv_close() - close IBTRS server context
+ * @ctx: pointer to server context
+ *
+ * Closes IBTRS server context with all client sessions.
+ */
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx);
+
+/**
+ * ibtrs_srv_resp_rdma() - Finish an RDMA request
+ *
+ * @id:		Internal IBTRS operation identifier
+ * @errno:	Response Code send to the other side for this operation.
+ *		0 = success, <=0 error
+ *
+ * Finish a RDMA operation. A message is sent to the client and the
+ * corresponding memory areas will be released.
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int errno);
+
+/**
+ * ibtrs_srv_set_sess_priv() - Set private pointer in ibtrs_srv.
+ * @sess:	Session
+ * @priv:	The private pointer that is associated with the session.
+ */
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *sess, void *priv);
+
+/**
+ * ibtrs_srv_get_sess_qdepth() - Get ibtrs_srv qdepth.
+ * @sess:	Session
+ */
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *sess);
+
+/**
+ * ibtrs_srv_get_sess_name() - Get ibtrs_srv peer hostname.
+ * @sess:	Session
+ * @sessname:	Sessname buffer
+ * @len:	Length of sessname buffer
+ */
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *sess, char *sessname, size_t len);
+
+/**
+ * ibtrs_addr_to_sockaddr() - convert path string "src,dst" to sockaddreses
+ * @str		string containing source and destination addr of a path
+ *		separated by comma. I.e. "ip:1.1.1.1,ip:1.1.1.2". If str
+ *		contains only one address it's considered to be destination.
+ * @len		string length
+ * @addr->dst	will be set to the destination sockadddr.
+ * @addr->src	will be set to the source address or to NULL
+ *		if str doesn't contain any sorce address.
+ *
+ * Returns zero if conversion successful. Non-zero otherwise.
+ */
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr);
+#endif
-- 
2.17.1


^ permalink raw reply related

* [RFC/RFT 06/14] soc: amlogic: meson-clk-measure: add G12B second cluster cpu clk
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

Add the G12B second CPU cluster CPU and SYS_PLL measure IDs.

These IDs returns 0Hz on G12A.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/soc/amlogic/meson-clk-measure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/soc/amlogic/meson-clk-measure.c b/drivers/soc/amlogic/meson-clk-measure.c
index c470e24f1dfa..f09b404b39d3 100644
--- a/drivers/soc/amlogic/meson-clk-measure.c
+++ b/drivers/soc/amlogic/meson-clk-measure.c
@@ -324,6 +324,8 @@ static struct meson_msr_id clk_msr_g12a[CLK_MSR_MAX] = {
 	CLK_MSR_ID(84, "co_tx"),
 	CLK_MSR_ID(89, "hdmi_todig"),
 	CLK_MSR_ID(90, "hdmitx_sys"),
+	CLK_MSR_ID(91, "sys_cpub_div16"),
+	CLK_MSR_ID(92, "sys_pll_cpub_div16"),
 	CLK_MSR_ID(94, "eth_phy_rx"),
 	CLK_MSR_ID(95, "eth_phy_pll"),
 	CLK_MSR_ID(96, "vpu_b"),
-- 
2.21.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev

Hi all,

Here is v4 of IBNBD/IBTRS patches, which have minor changes

 Changelog
 ---------
v4:
  o Protocol extended to transport IO priorities
  o Support for Mellanox ConnectX-4/X-5
  o Minor sysfs extentions (display access mode on server side)
  o Bug fixes: cleaning up sysfs folders, race on deallocation of resources
  o Style fixes

v3:
  o Sparse fixes:
     - le32 -> le16 conversion
     - pcpu and RCU wrong declaration
     - sysfs: dynamically alloc array of sockaddr structures to reduce
	   size of a stack frame

  o Rename sysfs folder on client and server sides to show source and
    destination addresses of the connection, i.e.:
	   .../<session-name>/paths/<src@dst>/

  o Remove external inclusions from Makefiles.
  * https://lwn.net/Articles/756994/

v2:
  o IBNBD:
     - No legacy request IO mode, only MQ is left.

  o IBTRS:
     - No FMR registration, only FR is left.

  * https://lwn.net/Articles/755075/

v1:
  - IBTRS: load-balancing and IO fail-over using multipath features were added.

  - Major parts of the code were rewritten, simplified and overall code
    size was reduced by a quarter.

  * https://lwn.net/Articles/746342/

v0:
  - Initial submission

  * https://lwn.net/Articles/718181/


 Introduction
 -------------

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA. It is based on RDMA-CM, so expect also to support RoCE
and iWARP, but we mainly tested in IB environment. It is optimized to
transfer (read/write) IO blocks in the sense that it follows the BIO 
semantics of providing the possibility to either write data from a 
scatter-gather list to the remote side or to request ("read") data
transfer from the remote side into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality, i.e. in IBTRS terminology, an IBTRS path is a set of RDMA
CMs and particular path is selected according to the load-balancing policy.
It can be used for other components not bind to IBNBD.


IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.


   - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
     thus internal protocol is simple.
   - IBTRS was developed as an independent RDMA transport library, which
     supports fail-over and load-balancing policies using multipath, thus
     it can be used for any other IO needs rather than only for block
     device.
   - IBNBD/IBTRS is fast.
     Old comparison results:
     https://www.spinics.net/lists/linux-rdma/msg48799.html
     New comparison results: see performance measurements section below.

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round trip latency.
   - Simplified memory management: memory allocation happens once on
     server side when IBTRS session is established.

o IO fail-over and load-balancing by using multipath.  According to
  our test loads additional path brings ~20% of bandwidth.  

o Simple configuration of IBNBD:
   - Server side is completely passive: volumes do not need to be
     explicitly exported.
   - Only IB port GID and device path needed on client side to map
     a block device.
   - A device is remapped automatically i.e. after storage reboot.

Commits for kernel can be found here:
   https://github.com/ionos-enterprise/ibnbd/tree/linux-5.2-rc3--ibnbd-v4
The out-of-tree modules are here:
   https://github.com/ionos-enterprise/ibnbd

Vault 2017 presentation:
  https://events.static.linuxfound.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

 Performance measurements
 ------------------------

o IBNBD and NVMEoRDMA

  Performance results for the v5.2-rc3 kernel
  link: https://github.com/ionos-enterprise/ibnbd/tree/develop/performance/v4-v5.2-rc3

Roman Pen (25):
  sysfs: export sysfs_remove_file_self()
  ibtrs: public interface header to establish RDMA connections
  ibtrs: private headers with IBTRS protocol structs and helpers
  ibtrs: core: lib functions shared between client and server modules
  ibtrs: client: private header with client structs and functions
  ibtrs: client: main functionality
  ibtrs: client: statistics functions
  ibtrs: client: sysfs interface functions
  ibtrs: server: private header with server structs and functions
  ibtrs: server: main functionality
  ibtrs: server: statistics functions
  ibtrs: server: sysfs interface functions
  ibtrs: include client and server modules into kernel compilation
  ibtrs: a bit of documentation
  ibnbd: private headers with IBNBD protocol structs and helpers
  ibnbd: client: private header with client structs and functions
  ibnbd: client: main functionality
  ibnbd: client: sysfs interface functions
  ibnbd: server: private header with server structs and functions
  ibnbd: server: main functionality
  ibnbd: server: functionality for IO submission to file or block dev
  ibnbd: server: sysfs interface functions
  ibnbd: include client and server modules into kernel compilation
  ibnbd: a bit of documentation
  MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

 MAINTAINERS                                   |   14 +
 drivers/block/Kconfig                         |    2 +
 drivers/block/Makefile                        |    1 +
 drivers/block/ibnbd/Kconfig                   |   24 +
 drivers/block/ibnbd/Makefile                  |   13 +
 drivers/block/ibnbd/README                    |  315 ++
 drivers/block/ibnbd/ibnbd-clt-sysfs.c         |  691 ++++
 drivers/block/ibnbd/ibnbd-clt.c               | 1832 +++++++++++
 drivers/block/ibnbd/ibnbd-clt.h               |  166 +
 drivers/block/ibnbd/ibnbd-log.h               |   59 +
 drivers/block/ibnbd/ibnbd-proto.h             |  378 +++
 drivers/block/ibnbd/ibnbd-srv-dev.c           |  408 +++
 drivers/block/ibnbd/ibnbd-srv-dev.h           |  143 +
 drivers/block/ibnbd/ibnbd-srv-sysfs.c         |  270 ++
 drivers/block/ibnbd/ibnbd-srv.c               |  945 ++++++
 drivers/block/ibnbd/ibnbd-srv.h               |   94 +
 drivers/infiniband/Kconfig                    |    1 +
 drivers/infiniband/ulp/Makefile               |    1 +
 drivers/infiniband/ulp/ibtrs/Kconfig          |   22 +
 drivers/infiniband/ulp/ibtrs/Makefile         |   15 +
 drivers/infiniband/ulp/ibtrs/README           |  385 +++
 .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c    |  447 +++
 .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c    |  514 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c      | 2844 +++++++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h      |  308 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h      |   84 +
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h      |  463 +++
 .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c    |  103 +
 .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c    |  303 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c      | 1998 ++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h      |  170 +
 drivers/infiniband/ulp/ibtrs/ibtrs.c          |  610 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs.h          |  318 ++
 fs/sysfs/file.c                               |    1 +
 34 files changed, 13942 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile
 create mode 100644 drivers/block/ibnbd/README
 create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
 create mode 100644 drivers/block/ibnbd/ibnbd-log.h
 create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
 create mode 100644 drivers/infiniband/ulp/ibtrs/README
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

-- 
2.17.1


^ permalink raw reply

* [RFC/RFT 05/14] soc: amlogic: meson-clk-measure: protect measure with a mutex
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

In order to protect clock measuring when multiple process asks for
a mesure, protect the main measure function with mutexes.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/soc/amlogic/meson-clk-measure.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/amlogic/meson-clk-measure.c b/drivers/soc/amlogic/meson-clk-measure.c
index 19d4cbc93a17..c470e24f1dfa 100644
--- a/drivers/soc/amlogic/meson-clk-measure.c
+++ b/drivers/soc/amlogic/meson-clk-measure.c
@@ -11,6 +11,8 @@
 #include <linux/debugfs.h>
 #include <linux/regmap.h>
 
+static DEFINE_MUTEX(measure_lock);
+
 #define MSR_CLK_DUTY		0x0
 #define MSR_CLK_REG0		0x4
 #define MSR_CLK_REG1		0x8
@@ -360,6 +362,10 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 	unsigned int val;
 	int ret;
 
+	ret = mutex_lock_interruptible(&measure_lock);
+	if (ret)
+		return ret;
+
 	regmap_write(priv->regmap, MSR_CLK_REG0, 0);
 
 	/* Set measurement duration */
@@ -377,8 +383,10 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 
 	ret = regmap_read_poll_timeout(priv->regmap, MSR_CLK_REG0,
 				       val, !(val & MSR_BUSY), 10, 10000);
-	if (ret)
+	if (ret) {
+		mutex_unlock(&measure_lock);
 		return ret;
+	}
 
 	/* Disable */
 	regmap_update_bits(priv->regmap, MSR_CLK_REG0, MSR_ENABLE, 0);
@@ -386,6 +394,8 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 	/* Get the value in multiple of gate time counts */
 	regmap_read(priv->regmap, MSR_CLK_REG2, &val);
 
+	mutex_unlock(&measure_lock);
+
 	if (val >= MSR_VAL_MASK)
 		return -EINVAL;
 
-- 
2.21.0


_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic

^ permalink raw reply related

* [RFC/RFT 05/14] soc: amlogic: meson-clk-measure: protect measure with a mutex
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

In order to protect clock measuring when multiple process asks for
a mesure, protect the main measure function with mutexes.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/soc/amlogic/meson-clk-measure.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/amlogic/meson-clk-measure.c b/drivers/soc/amlogic/meson-clk-measure.c
index 19d4cbc93a17..c470e24f1dfa 100644
--- a/drivers/soc/amlogic/meson-clk-measure.c
+++ b/drivers/soc/amlogic/meson-clk-measure.c
@@ -11,6 +11,8 @@
 #include <linux/debugfs.h>
 #include <linux/regmap.h>
 
+static DEFINE_MUTEX(measure_lock);
+
 #define MSR_CLK_DUTY		0x0
 #define MSR_CLK_REG0		0x4
 #define MSR_CLK_REG1		0x8
@@ -360,6 +362,10 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 	unsigned int val;
 	int ret;
 
+	ret = mutex_lock_interruptible(&measure_lock);
+	if (ret)
+		return ret;
+
 	regmap_write(priv->regmap, MSR_CLK_REG0, 0);
 
 	/* Set measurement duration */
@@ -377,8 +383,10 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 
 	ret = regmap_read_poll_timeout(priv->regmap, MSR_CLK_REG0,
 				       val, !(val & MSR_BUSY), 10, 10000);
-	if (ret)
+	if (ret) {
+		mutex_unlock(&measure_lock);
 		return ret;
+	}
 
 	/* Disable */
 	regmap_update_bits(priv->regmap, MSR_CLK_REG0, MSR_ENABLE, 0);
@@ -386,6 +394,8 @@ static int meson_measure_id(struct meson_msr_id *clk_msr_id,
 	/* Get the value in multiple of gate time counts */
 	regmap_read(priv->regmap, MSR_CLK_REG2, &val);
 
+	mutex_unlock(&measure_lock);
+
 	if (val >= MSR_VAL_MASK)
 		return -EINVAL;
 
-- 
2.21.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* Re: [PATCH] kvm: tests: Sort tests in the Makefile alphabetically
From: Aaron Lewis @ 2019-06-20 15:03 UTC (permalink / raw)
  To: Krish Sadhukhan; +Cc: Jim Mattson, Peter Shier, Marc Orr, kvm
In-Reply-To: <63c62aa7-2dd1-f133-d8bc-c7a3528b7598@oracle.com>

On Tue, Jun 18, 2019 at 12:47 PM Krish Sadhukhan
<krish.sadhukhan@oracle.com> wrote:
>
>
>
> On 06/18/2019 07:14 AM, Aaron Lewis wrote:
> > On Fri, May 31, 2019 at 9:37 AM Aaron Lewis <aaronlewis@google.com> wrote:
> >> On Tue, May 21, 2019 at 10:14 AM Aaron Lewis <aaronlewis@google.com> wrote:
> >>> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> >>> Reviewed-by: Peter Shier <pshier@google.com>
> >>> ---
> >>>   tools/testing/selftests/kvm/Makefile | 20 ++++++++++----------
> >>>   1 file changed, 10 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> >>> index 79c524395ebe..234f679fa5ad 100644
> >>> --- a/tools/testing/selftests/kvm/Makefile
> >>> +++ b/tools/testing/selftests/kvm/Makefile
> >>> @@ -10,23 +10,23 @@ LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/ucall.c lib/sparsebi
> >>>   LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c
> >>>   LIBKVM_aarch64 = lib/aarch64/processor.c
> >>>
> >>> -TEST_GEN_PROGS_x86_64 = x86_64/platform_info_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/cr4_cpuid_sync_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/state_test
> >>> +TEST_GEN_PROGS_x86_64 = x86_64/cr4_cpuid_sync_test
> >>>   TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test
> >>>   TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> >>> -TEST_GEN_PROGS_x86_64 += x86_64/smm_test
> >>>   TEST_GEN_PROGS_x86_64 += x86_64/kvm_create_max_vcpus
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/smm_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/state_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> >>>   TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
> >>> -TEST_GEN_PROGS_x86_64 += dirty_log_test
> >>> +TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> >>>   TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
> >>> +TEST_GEN_PROGS_x86_64 += dirty_log_test
>
> May be, place the last two at the beginning if you are arranging them
> alphabetically ?

The original scheme had everything in x86_64 first then everything in
the root folder second.  I wanted to maintain this scheme, so that's
why it is sorted this way.

>
> >>>
> >>> -TEST_GEN_PROGS_aarch64 += dirty_log_test
> >>>   TEST_GEN_PROGS_aarch64 += clear_dirty_log_test
> >>> +TEST_GEN_PROGS_aarch64 += dirty_log_test
>
> May be, put the aarch64 ones above the x86_64 ones to arrange them
> alphabetically ?

I want to make a minimal change, so sorting the tags isn't a priority,
just the files.  The goal is to have a more predictable place to add
new tests, and I think this helps.

>
> >>>
> >>>   TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
> >>>   LIBKVM += $(LIBKVM_$(UNAME_M))
> >>> --
> >>> 2.21.0.1020.gf2820cf01a-goog
> >>>
> >> Does this look okay?  It's just a simple reordering of the list.  It
> >> helps when adding new tests.
> > ping
>

^ permalink raw reply

* [RFC/RFT 04/14] clk: meson: eeclk: add setup callback
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

Add a setup() callback in the eeclk structure, to call an optional
call() function at end of eeclk probe to setup clocks.

It's used for the G12A clock controller to setup the CPU clock notifiers.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/clk/meson/meson-eeclk.c | 6 ++++++
 drivers/clk/meson/meson-eeclk.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/clk/meson/meson-eeclk.c b/drivers/clk/meson/meson-eeclk.c
index 6ba2094be257..81fd2abcd173 100644
--- a/drivers/clk/meson/meson-eeclk.c
+++ b/drivers/clk/meson/meson-eeclk.c
@@ -61,6 +61,12 @@ int meson_eeclkc_probe(struct platform_device *pdev)
 		}
 	}
 
+	if (data->setup) {
+		ret = data->setup(pdev);
+		if (ret)
+			return ret;
+	}
+
 	return devm_of_clk_add_hw_provider(dev, of_clk_hw_onecell_get,
 					   data->hw_onecell_data);
 }
diff --git a/drivers/clk/meson/meson-eeclk.h b/drivers/clk/meson/meson-eeclk.h
index 9ab5d6fa7ccb..c22b57781e3e 100644
--- a/drivers/clk/meson/meson-eeclk.h
+++ b/drivers/clk/meson/meson-eeclk.h
@@ -20,6 +20,7 @@ struct meson_eeclkc_data {
 	const struct reg_sequence	*init_regs;
 	unsigned int			init_count;
 	struct clk_hw_onecell_data	*hw_onecell_data;
+	int				(*setup)(struct platform_device *pdev); 
 };
 
 int meson_eeclkc_probe(struct platform_device *pdev);
-- 
2.21.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* [RFC/RFT 04/14] clk: meson: eeclk: add setup callback
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

Add a setup() callback in the eeclk structure, to call an optional
call() function at end of eeclk probe to setup clocks.

It's used for the G12A clock controller to setup the CPU clock notifiers.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/clk/meson/meson-eeclk.c | 6 ++++++
 drivers/clk/meson/meson-eeclk.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/clk/meson/meson-eeclk.c b/drivers/clk/meson/meson-eeclk.c
index 6ba2094be257..81fd2abcd173 100644
--- a/drivers/clk/meson/meson-eeclk.c
+++ b/drivers/clk/meson/meson-eeclk.c
@@ -61,6 +61,12 @@ int meson_eeclkc_probe(struct platform_device *pdev)
 		}
 	}
 
+	if (data->setup) {
+		ret = data->setup(pdev);
+		if (ret)
+			return ret;
+	}
+
 	return devm_of_clk_add_hw_provider(dev, of_clk_hw_onecell_get,
 					   data->hw_onecell_data);
 }
diff --git a/drivers/clk/meson/meson-eeclk.h b/drivers/clk/meson/meson-eeclk.h
index 9ab5d6fa7ccb..c22b57781e3e 100644
--- a/drivers/clk/meson/meson-eeclk.h
+++ b/drivers/clk/meson/meson-eeclk.h
@@ -20,6 +20,7 @@ struct meson_eeclkc_data {
 	const struct reg_sequence	*init_regs;
 	unsigned int			init_count;
 	struct clk_hw_onecell_data	*hw_onecell_data;
+	int				(*setup)(struct platform_device *pdev); 
 };
 
 int meson_eeclkc_probe(struct platform_device *pdev);
-- 
2.21.0


_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic

^ permalink raw reply related

* [RFC/RFT 03/14] clk: meson: regmap: export regmap_div ops functions
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

The G12A CPU Clock Postmux divider needs a custom div_set_rate() call.

Export the clk_regmap_div_round_rate() and clk_regmap_div_recalc_rate()
to be able to override the default clk_regmap_div_set_rate() callback.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/clk/meson/clk-regmap.c | 10 ++++++----
 drivers/clk/meson/clk-regmap.h |  5 +++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/clk/meson/clk-regmap.c b/drivers/clk/meson/clk-regmap.c
index dcd1757cc5df..26c8c74a8cf0 100644
--- a/drivers/clk/meson/clk-regmap.c
+++ b/drivers/clk/meson/clk-regmap.c
@@ -56,8 +56,8 @@ const struct clk_ops clk_regmap_gate_ro_ops = {
 };
 EXPORT_SYMBOL_GPL(clk_regmap_gate_ro_ops);
 
-static unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
-						unsigned long prate)
+unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
+					 unsigned long prate)
 {
 	struct clk_regmap *clk = to_clk_regmap(hw);
 	struct clk_regmap_div_data *div = clk_get_regmap_div_data(clk);
@@ -74,9 +74,10 @@ static unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
 	return divider_recalc_rate(hw, prate, val, div->table, div->flags,
 				   div->width);
 }
+EXPORT_SYMBOL_GPL(clk_regmap_div_recalc_rate);
 
-static long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
-				      unsigned long *prate)
+long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
+			       unsigned long *prate)
 {
 	struct clk_regmap *clk = to_clk_regmap(hw);
 	struct clk_regmap_div_data *div = clk_get_regmap_div_data(clk);
@@ -100,6 +101,7 @@ static long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
 	return divider_round_rate(hw, rate, prate, div->table, div->width,
 				  div->flags);
 }
+EXPORT_SYMBOL_GPL(clk_regmap_div_round_rate);
 
 static int clk_regmap_div_set_rate(struct clk_hw *hw, unsigned long rate,
 				   unsigned long parent_rate)
diff --git a/drivers/clk/meson/clk-regmap.h b/drivers/clk/meson/clk-regmap.h
index 1dd0abe3ba91..d22b83fb9bad 100644
--- a/drivers/clk/meson/clk-regmap.h
+++ b/drivers/clk/meson/clk-regmap.h
@@ -78,6 +78,11 @@ clk_get_regmap_div_data(struct clk_regmap *clk)
 	return (struct clk_regmap_div_data *)clk->data;
 }
 
+unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
+					 unsigned long prate);
+long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
+			       unsigned long *prate);
+
 extern const struct clk_ops clk_regmap_divider_ops;
 extern const struct clk_ops clk_regmap_divider_ro_ops;
 
-- 
2.21.0


_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic

^ permalink raw reply related

* [RFC/RFT 03/14] clk: meson: regmap: export regmap_div ops functions
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

The G12A CPU Clock Postmux divider needs a custom div_set_rate() call.

Export the clk_regmap_div_round_rate() and clk_regmap_div_recalc_rate()
to be able to override the default clk_regmap_div_set_rate() callback.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/clk/meson/clk-regmap.c | 10 ++++++----
 drivers/clk/meson/clk-regmap.h |  5 +++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/clk/meson/clk-regmap.c b/drivers/clk/meson/clk-regmap.c
index dcd1757cc5df..26c8c74a8cf0 100644
--- a/drivers/clk/meson/clk-regmap.c
+++ b/drivers/clk/meson/clk-regmap.c
@@ -56,8 +56,8 @@ const struct clk_ops clk_regmap_gate_ro_ops = {
 };
 EXPORT_SYMBOL_GPL(clk_regmap_gate_ro_ops);
 
-static unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
-						unsigned long prate)
+unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
+					 unsigned long prate)
 {
 	struct clk_regmap *clk = to_clk_regmap(hw);
 	struct clk_regmap_div_data *div = clk_get_regmap_div_data(clk);
@@ -74,9 +74,10 @@ static unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
 	return divider_recalc_rate(hw, prate, val, div->table, div->flags,
 				   div->width);
 }
+EXPORT_SYMBOL_GPL(clk_regmap_div_recalc_rate);
 
-static long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
-				      unsigned long *prate)
+long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
+			       unsigned long *prate)
 {
 	struct clk_regmap *clk = to_clk_regmap(hw);
 	struct clk_regmap_div_data *div = clk_get_regmap_div_data(clk);
@@ -100,6 +101,7 @@ static long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
 	return divider_round_rate(hw, rate, prate, div->table, div->width,
 				  div->flags);
 }
+EXPORT_SYMBOL_GPL(clk_regmap_div_round_rate);
 
 static int clk_regmap_div_set_rate(struct clk_hw *hw, unsigned long rate,
 				   unsigned long parent_rate)
diff --git a/drivers/clk/meson/clk-regmap.h b/drivers/clk/meson/clk-regmap.h
index 1dd0abe3ba91..d22b83fb9bad 100644
--- a/drivers/clk/meson/clk-regmap.h
+++ b/drivers/clk/meson/clk-regmap.h
@@ -78,6 +78,11 @@ clk_get_regmap_div_data(struct clk_regmap *clk)
 	return (struct clk_regmap_div_data *)clk->data;
 }
 
+unsigned long clk_regmap_div_recalc_rate(struct clk_hw *hw,
+					 unsigned long prate);
+long clk_regmap_div_round_rate(struct clk_hw *hw, unsigned long rate,
+			       unsigned long *prate);
+
 extern const struct clk_ops clk_regmap_divider_ops;
 extern const struct clk_ops clk_regmap_divider_ro_ops;
 
-- 
2.21.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* [RFC/RFT 02/14] clk: core: introduce clk_hw_set_parent()
From: Neil Armstrong @ 2019-06-20 15:00 UTC (permalink / raw)
  To: jbrunet, khilman
  Cc: Neil Armstrong, martin.blumenstingl, linux-kernel, linux-amlogic,
	linux-clk, linux-arm-kernel
In-Reply-To: <20190620150013.13462-1-narmstrong@baylibre.com>

Introduce the clk_hw_set_parent() provider call to change parent of
a clock by using the clk_hw pointers.

This eases the clock reparenting from clock rate notifiers and
implementing DVFS with simpler code avoiding the boilerplates
functions as __clk_lookup(clk_hw_get_name()) then clk_set_parent().

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 drivers/clk/clk.c            | 5 +++++
 include/linux/clk-provider.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index aa51756fd4d6..3e98f7dec626 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -2490,6 +2490,11 @@ static int clk_core_set_parent_nolock(struct clk_core *core,
 	return ret;
 }
 
+int clk_hw_set_parent(struct clk_hw *hw, struct clk_hw *parent)
+{
+	return clk_core_set_parent_nolock(hw->core, parent->core);
+}
+
 /**
  * clk_set_parent - switch the parent of a mux clk
  * @clk: the mux clk whose input we are switching
diff --git a/include/linux/clk-provider.h b/include/linux/clk-provider.h
index bb6118f79784..8a453380f9a4 100644
--- a/include/linux/clk-provider.h
+++ b/include/linux/clk-provider.h
@@ -812,6 +812,7 @@ unsigned int clk_hw_get_num_parents(const struct clk_hw *hw);
 struct clk_hw *clk_hw_get_parent(const struct clk_hw *hw);
 struct clk_hw *clk_hw_get_parent_by_index(const struct clk_hw *hw,
 					  unsigned int index);
+int clk_hw_set_parent(struct clk_hw *hw, struct clk_hw *new_parent);
 unsigned int __clk_get_enable_count(struct clk *clk);
 unsigned long clk_hw_get_rate(const struct clk_hw *hw);
 unsigned long __clk_get_flags(struct clk *clk);
-- 
2.21.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* Re: [PATCH][next] RDMA: check for null return from call to ib_get_client_data
From: Jason Gunthorpe @ 2019-06-20 15:02 UTC (permalink / raw)
  To: Colin King
  Cc: Doug Ledford, Leon Romanovsky, Parav Pandit, linux-rdma,
	kernel-janitors, linux-kernel
In-Reply-To: <20190620135052.27367-1-colin.king@canonical.com>

On Thu, Jun 20, 2019 at 02:50:52PM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> The return from ib_get_client_data can potentially be null, so add a null
> check on umad_dev and return -ENODEV in this unlikely case to avoid any
> null pointer deferences.

It would be a kernel bug if NULL is seen here.

Jason

^ permalink raw reply


This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.