public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* RFC: CQ pools and implicit CQ resource allocation
@ 2016-09-09 12:36 Christoph Hellwig
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

This series adds support to the RDMA core to implicitly allocate the required
CQEs when creating a QP.  The primary driver for that was to implement a
common scheme for CQ pooling, which improves resource usage for
server / target style drivers that have many outstanding connections.  In
fact the first version of this code from Sagi did just that: add a CQ pool
API, and convert drivers that were using some form of pooling (iSER initiator
& target, NVMe target) to that API.  But looking at the API I felt that
there was still way too much logic in the individual ULPs, and looked into a
way to make that boilerplate code go away.  It turns out that we can simply
create CQs underneath if we know the poll context that the ULP requires, so
this series shows an approach that makes CQs mostly invisible to ULPs.

One concern Sagi had when reviewing an earlier version of my edits was that
the over-allocation as part of using a pool might be harmful for client-side
drivers that only have one (or a few) connections. I played around with iSER
a bit and could not measure any slowdown, but that doesn't mean there aren't
any downsides - I'd really like some feedback from the hardware driver
people on what CQ sizing / allocation preferences their hardware has.

This series applies on top of the "move unsafe global rkey handling to the
RDMA core V2" series.

A git tree is available at:

    git://git.infradead.org/users/hch/rdma.git rdma-cq


Gitweb:

    http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-cq

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] IB/core: add implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 2/6] nvmet-rdma: use " Christoph Hellwig
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

From: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Currently RDMA ULPs need to explicitly allocate CQ resources, which
leads to both code churn in the drivers, and the need to explicitly
handle load balancing between CQs and completion vectors, which is
especially painful for server / target side ULPs that serve many
connections.

This API instead allows the RDMA core to implicitly allocate CQ
resources when QPs are created - the ULP just has to specify the
poll context, and the core will find a suitable CQ for the number
of entries calculated from the number of WRs and RDMA contexts
specified by the ULP.

CQs are allocated lazily on first use, and may be shared between
connections and ULPs.  For client-side ULPs that want to make use
of spread-out completion vectors, we also allow manually specifying
a completion vector for a QP.

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: rewrote Sagi's CQ pool to be hidden behind QP creation / destruction]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/core/core_priv.h |   6 ++
 drivers/infiniband/core/cq.c        | 125 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c    |   4 ++
 drivers/infiniband/core/verbs.c     |  93 +++++++++++++++++++++------
 include/rdma/ib_verbs.h             |  31 +++++++--
 5 files changed, 234 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 19d499d..685bcdf 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -137,6 +137,12 @@ static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 	return _upper == upper;
 }
 
+void ib_init_cq_pools(struct ib_device *dev);
+void ib_free_cq_pools(struct ib_device *dev);
+struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nents,
+		enum ib_poll_context poll_ctx, int comp_vector);
+void ib_put_cq(struct ib_cq *cq, unsigned int nents);
+
 int addr_init(void);
 void addr_cleanup(void);
 
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index a754fc7..11d2ff8 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -15,6 +15,9 @@
 #include <linux/slab.h>
 #include <rdma/ib_verbs.h>
 
+/* wild guess - should not be too large or too small to avoid wastage */
+#define IB_CQE_BATCH			1024
+
 /* # of WCs to poll for with a single call to ib_poll_cq */
 #define IB_POLL_BATCH			16
 
@@ -143,6 +146,8 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
 	cq->cq_context = private;
 	cq->poll_ctx = poll_ctx;
 	atomic_set(&cq->usecnt, 0);
+	cq->cqe_used = 0;
+	cq->comp_vector = comp_vector;
 
 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
 	if (!cq->wc)
@@ -188,6 +193,8 @@ void ib_free_cq(struct ib_cq *cq)
 
 	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
 		return;
+	if (WARN_ON_ONCE(cq->cqe_used != 0))
+		return;
 
 	switch (cq->poll_ctx) {
 	case IB_POLL_DIRECT:
@@ -207,3 +214,121 @@ void ib_free_cq(struct ib_cq *cq)
 	WARN_ON_ONCE(ret);
 }
 EXPORT_SYMBOL(ib_free_cq);
+
+void ib_init_cq_pools(struct ib_device *dev)
+{
+	int i;
+
+	spin_lock_init(&dev->cq_lock);
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
+		INIT_LIST_HEAD(&dev->cq_pools[i]);
+}
+
+void ib_free_cq_pools(struct ib_device *dev)
+{
+	struct ib_cq *cq, *n;
+	LIST_HEAD(tmp_list);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&dev->cq_lock, flags);
+		list_splice_init(&dev->cq_pools[i], &tmp_list);
+		spin_unlock_irqrestore(&dev->cq_lock, flags);
+	}
+
+	list_for_each_entry_safe(cq, n, &tmp_list, pool_entry)
+		ib_free_cq(cq);
+}
+
+static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
+		enum ib_poll_context poll_ctx)
+{
+	LIST_HEAD(tmp_list);
+	struct ib_cq *cq;
+	unsigned long flags;
+	int nr_cqs, ret, i;
+
+	/*
+	 * Allocate at least as many CQEs as requested, and otherwise
+	 * a reasonable batch size so that we can share CQs between
+	 * multiple users instead of allocating a larger number of CQs.
+	 */
+	nr_cqes = max(nr_cqes, min(dev->attrs.max_cqe, IB_CQE_BATCH));
+	nr_cqs = min_t(int, dev->num_comp_vectors, num_possible_cpus());
+	for (i = 0; i < nr_cqs; i++) {
+		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
+		if (IS_ERR(cq)) {
+			ret = PTR_ERR(cq);
+			pr_err("%s: failed to create CQ ret=%d\n",
+				__func__, ret);
+			goto out_free_cqs;
+		}
+		list_add_tail(&cq->pool_entry, &tmp_list);
+	}
+
+	spin_lock_irqsave(&dev->cq_lock, flags);
+	list_splice(&tmp_list, &dev->cq_pools[poll_ctx]);
+	spin_unlock_irqrestore(&dev->cq_lock, flags);
+
+	return 0;
+
+out_free_cqs:
+	list_for_each_entry(cq, &tmp_list, pool_entry)
+		ib_free_cq(cq);
+	return ret;
+}
+
+/*
+ * Find the least used completion queue that fits nents, and claim nents
+ * entries in it for us.  In case there aren't enough CQEs available, allocate
+ * another array of CQs and restart the search.
+ */
+struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nents,
+		enum ib_poll_context poll_ctx, int comp_vector)
+{
+	struct ib_cq *cq, *found;
+	unsigned long flags;
+	int ret;
+
+	if (poll_ctx >= ARRAY_SIZE(dev->cq_pools))
+		return ERR_PTR(-EINVAL);
+
+	comp_vector %= num_possible_cpus();
+restart:
+	found = NULL;
+	spin_lock_irqsave(&dev->cq_lock, flags);
+	list_for_each_entry(cq, &dev->cq_pools[poll_ctx], pool_entry) {
+		if (comp_vector != -1 && comp_vector != cq->comp_vector)
+			continue;
+		if (cq->cqe_used + nents > cq->cqe)
+			continue;
+		if (found && cq->cqe_used >= found->cqe_used)
+			continue;
+		found = cq;
+	}
+
+	if (found) {
+		found->cqe_used += nents;
+		spin_unlock_irqrestore(&dev->cq_lock, flags);
+		return found;
+	}
+	spin_unlock_irqrestore(&dev->cq_lock, flags);
+
+	/* Ran out of CQs, allocate a new array of CQs */
+	ret = ib_alloc_cqs(dev, nents, poll_ctx);
+	if (ret)
+		return ERR_PTR(ret);
+	goto restart;
+}
+
+void ib_put_cq(struct ib_cq *cq, unsigned int nents)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cq->device->cq_lock, flags);
+	WARN_ON_ONCE(cq->cqe_used < nents);
+	cq->cqe_used -= nents;
+	spin_unlock_irqrestore(&cq->device->cq_lock, flags);
+}
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 760ef60..5a15ba8 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -337,6 +337,8 @@ int ib_register_device(struct ib_device *device,
 	struct ib_client *client;
 	struct ib_udata uhw = {.outlen = 0, .inlen = 0};
 
+	ib_init_cq_pools(device);
+
 	mutex_lock(&device_mutex);
 
 	if (strchr(device->name, '%')) {
@@ -435,6 +437,8 @@ void ib_unregister_device(struct ib_device *device)
 	up_write(&lists_rwsem);
 
 	device->reg_state = IB_DEV_UNREGISTERED;
+
+	ib_free_cq_pools(device);
 }
 EXPORT_SYMBOL(ib_unregister_device);
 
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index d4ef3b1..abd580b 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -777,14 +777,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr)
 {
 	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
+	struct ib_cq *cq = NULL;
 	struct ib_qp *qp;
-	int ret;
+	u32 nr_cqes = 0;
+	int ret = -EINVAL;
 
 	if (qp_init_attr->rwq_ind_tbl &&
 	    (qp_init_attr->recv_cq ||
 	    qp_init_attr->srq || qp_init_attr->cap.max_recv_wr ||
 	    qp_init_attr->cap.max_recv_sge))
-		return ERR_PTR(-EINVAL);
+		goto out;
 
 	/*
 	 * If the callers is using the RDMA API calculate the resources
@@ -795,15 +797,58 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	if (qp_init_attr->cap.max_rdma_ctxs)
 		rdma_rw_init_qp(device, qp_init_attr);
 
+	if (qp_init_attr->create_flags & IB_QP_CREATE_ASSIGN_CQS) {
+		int vector = -1;
+
+		if (WARN_ON(qp_init_attr->recv_cq))
+			goto out;
+		if (WARN_ON(qp_init_attr->send_cq))
+			goto out;
+
+		if (qp_init_attr->create_flags & IB_QP_CREATE_COMP_VECTOR)
+			vector = qp_init_attr->comp_vector;
+
+		nr_cqes = qp_init_attr->cap.max_recv_wr +
+			  qp_init_attr->cap.max_send_wr;
+		if (nr_cqes) {
+			cq = ib_find_get_cq(device, nr_cqes,
+					    qp_init_attr->poll_ctx, vector);
+			if (IS_ERR(cq)) {
+				ret = PTR_ERR(cq);
+				goto out;
+			}
+
+			if (qp_init_attr->cap.max_recv_wr) {
+				qp_init_attr->recv_cq = cq;
+
+				/*
+				 * Low-level drivers expect max_recv_wr == 0
+				 * for the SRQ case:
+				 */
+				if (qp_init_attr->srq)
+					qp_init_attr->cap.max_recv_wr = 0;
+			}
+
+			if (qp_init_attr->cap.max_send_wr)
+				qp_init_attr->send_cq = cq;
+		}
+
+		qp_init_attr->create_flags &=
+			~(IB_QP_CREATE_ASSIGN_CQS | IB_QP_CREATE_COMP_VECTOR);
+	}
+
 	qp = device->create_qp(pd, qp_init_attr, NULL);
-	if (IS_ERR(qp))
-		return qp;
+	if (IS_ERR(qp)) {
+		ret = PTR_ERR(qp);
+		goto out_put_cq;
+	}
 
 	qp->device     = device;
 	qp->real_qp    = qp;
 	qp->uobject    = NULL;
 	qp->qp_type    = qp_init_attr->qp_type;
 	qp->rwq_ind_tbl = qp_init_attr->rwq_ind_tbl;
+	qp->nr_cqes    = nr_cqes;
 
 	atomic_set(&qp->usecnt, 0);
 	qp->mrs_used = 0;
@@ -842,8 +887,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 		ret = rdma_rw_init_mrs(qp, qp_init_attr);
 		if (ret) {
 			pr_err("failed to init MR pool ret= %d\n", ret);
-			ib_destroy_qp(qp);
-			qp = ERR_PTR(ret);
+			goto out_destroy_qp;
 		}
 	}
 
@@ -857,6 +901,14 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 				 device->attrs.max_sge_rd);
 
 	return qp;
+
+out_destroy_qp:
+	ib_destroy_qp(qp);
+out_put_cq:
+	if (cq)
+		ib_put_cq(cq, nr_cqes);
+out:
+	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL(ib_create_qp);
 
@@ -1346,20 +1398,23 @@ int ib_destroy_qp(struct ib_qp *qp)
 		rdma_rw_cleanup_mrs(qp);
 
 	ret = qp->device->destroy_qp(qp);
-	if (!ret) {
-		if (pd)
-			atomic_dec(&pd->usecnt);
-		if (scq)
-			atomic_dec(&scq->usecnt);
-		if (rcq)
-			atomic_dec(&rcq->usecnt);
-		if (srq)
-			atomic_dec(&srq->usecnt);
-		if (ind_tbl)
-			atomic_dec(&ind_tbl->usecnt);
-	}
+	if (ret)
+		return ret;
 
-	return ret;
+	if (qp->nr_cqes)
+		ib_put_cq(rcq ? rcq : scq, qp->nr_cqes);
+
+	if (pd)
+		atomic_dec(&pd->usecnt);
+	if (scq)
+		atomic_dec(&scq->usecnt);
+	if (rcq)
+		atomic_dec(&rcq->usecnt);
+	if (srq)
+		atomic_dec(&srq->usecnt);
+	if (ind_tbl)
+		atomic_dec(&ind_tbl->usecnt);
+	return 0;
 }
 EXPORT_SYMBOL(ib_destroy_qp);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0a6c708..147c451 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -991,11 +991,22 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_SIGNATURE_EN		= 1 << 6,
 	IB_QP_CREATE_USE_GFP_NOIO		= 1 << 7,
 	IB_QP_CREATE_SCATTER_FCS		= 1 << 8,
+
+	/* only used by the core, not passed to low-level drivers */
+	IB_QP_CREATE_ASSIGN_CQS			= 1 << 16,
+	IB_QP_CREATE_COMP_VECTOR		= 1 << 17,
+
 	/* reserve bits 26-31 for low level drivers' internal use */
 	IB_QP_CREATE_RESERVED_START		= 1 << 26,
 	IB_QP_CREATE_RESERVED_END		= 1 << 31,
 };
 
+enum ib_poll_context {
+	IB_POLL_SOFTIRQ,	/* poll from softirq context */
+	IB_POLL_WORKQUEUE,	/* poll from workqueue */
+	IB_POLL_DIRECT,		/* caller context, no hw completions */
+};
+
 /*
  * Note: users may not call ib_close_qp or ib_destroy_qp from the event_handler
  * callback to destroy the passed in QP.
@@ -1017,6 +1028,13 @@ struct ib_qp_init_attr {
 	 * Only needed for special QP types, or when using the RW API.
 	 */
 	u8			port_num;
+
+	/*
+	 * Only needed when not passing in explicit CQs.
+	 */
+	enum ib_poll_context	poll_ctx;
+	int			comp_vector;
+
 	struct ib_rwq_ind_table *rwq_ind_tbl;
 };
 
@@ -1400,12 +1418,6 @@ struct ib_ah {
 
 typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context);
 
-enum ib_poll_context {
-	IB_POLL_DIRECT,		/* caller context, no hw completions */
-	IB_POLL_SOFTIRQ,	/* poll from softirq context */
-	IB_POLL_WORKQUEUE,	/* poll from workqueue */
-};
-
 struct ib_cq {
 	struct ib_device       *device;
 	struct ib_uobject      *uobject;
@@ -1413,9 +1425,12 @@ struct ib_cq {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void                   *cq_context;
 	int               	cqe;
+	unsigned int		cqe_used;
 	atomic_t          	usecnt; /* count number of work queues */
 	enum ib_poll_context	poll_ctx;
+	int			comp_vector;
 	struct ib_wc		*wc;
+	struct list_head	pool_entry;
 	union {
 		struct irq_poll		iop;
 		struct work_struct	work;
@@ -1526,6 +1541,7 @@ struct ib_qp {
 	u32			max_read_sge;
 	enum ib_qp_type		qp_type;
 	struct ib_rwq_ind_table *rwq_ind_tbl;
+	u32			nr_cqes;
 };
 
 struct ib_mr {
@@ -2050,6 +2066,9 @@ struct ib_device {
 	struct attribute_group	     *hw_stats_ag;
 	struct rdma_hw_stats         *hw_stats;
 
+	spinlock_t		     cq_lock;
+	struct list_head	     cq_pools[IB_POLL_WORKQUEUE + 1];
+
 	/**
 	 * The following mandatory functions are used only at device
 	 * registration.  Keep functions such as these at the end of this
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] nvmet-rdma: use implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2016-09-09 12:36   ` [PATCH 1/6] IB/core: add implicit CQ allocation Christoph Hellwig
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 3/6] IB/isert: " Christoph Hellwig
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

From: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/nvme/target/rdma.c | 67 ++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 50 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 187763a..fc7ed0a 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -83,7 +83,6 @@ enum nvmet_rdma_queue_state {
 struct nvmet_rdma_queue {
 	struct rdma_cm_id	*cm_id;
 	struct nvmet_port	*port;
-	struct ib_cq		*cq;
 	atomic_t		sq_wr_avail;
 	struct nvmet_rdma_device *dev;
 	spinlock_t		state_lock;
@@ -548,7 +547,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_rsp *rsp =
 		container_of(wc->wr_cqe, struct nvmet_rdma_rsp, read_cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 
 	WARN_ON(rsp->n_rdma <= 0);
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
@@ -722,7 +721,7 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_cmd *cmd =
 		container_of(wc->wr_cqe, struct nvmet_rdma_cmd, cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 	struct nvmet_rdma_rsp *rsp;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -877,62 +876,41 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	struct ib_qp_init_attr qp_attr;
 	struct nvmet_rdma_device *ndev = queue->dev;
-	int comp_vector, nr_cqe, ret, i;
-
-	/*
-	 * Spread the io queues across completion vectors,
-	 * but still keep all admin queues on vector 0.
-	 */
-	comp_vector = !queue->host_qid ? 0 :
-		queue->idx % ndev->device->num_comp_vectors;
-
-	/*
-	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
-	 */
-	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
-
-	queue->cq = ib_alloc_cq(ndev->device, queue,
-			nr_cqe + 1, comp_vector,
-			IB_POLL_WORKQUEUE);
-	if (IS_ERR(queue->cq)) {
-		ret = PTR_ERR(queue->cq);
-		pr_err("failed to create CQ cqe= %d ret= %d\n",
-		       nr_cqe + 1, ret);
-		goto out;
-	}
+	int ret, i;
 
 	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	qp_attr.qp_context = queue;
 	qp_attr.event_handler = nvmet_rdma_qp_event;
-	qp_attr.send_cq = queue->cq;
-	qp_attr.recv_cq = queue->cq;
 	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_attr.qp_type = IB_QPT_RC;
+	qp_attr.poll_ctx = IB_POLL_WORKQUEUE;
+
 	/* +1 for drain */
 	qp_attr.cap.max_send_wr = queue->send_queue_size + 1;
 	qp_attr.cap.max_rdma_ctxs = queue->send_queue_size;
 	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
 					ndev->device->attrs.max_sge);
 
-	if (ndev->srq) {
+	/* +1 for drain */
+	qp_attr.cap.max_recv_wr = queue->recv_queue_size + 1;
+
+	if (ndev->srq)
 		qp_attr.srq = ndev->srq;
-	} else {
-		/* +1 for drain */
-		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
+	else
 		qp_attr.cap.max_recv_sge = 2;
-	}
 
 	ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
 	if (ret) {
 		pr_err("failed to create_qp ret= %d\n", ret);
-		goto err_destroy_cq;
+		return ret;
 	}
 
 	atomic_set(&queue->sq_wr_avail, qp_attr.cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d cm_id= %p\n",
-		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
-		 qp_attr.cap.max_send_wr, queue->cm_id);
+	pr_debug("%s: max_sge= %d sq_size = %d cm_id=%p\n", __func__,
+		qp_attr.cap.max_send_sge, qp_attr.cap.max_send_wr,
+		queue->cm_id);
 
 	if (!ndev->srq) {
 		for (i = 0; i < queue->recv_queue_size; i++) {
@@ -941,18 +919,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 		}
 	}
 
-out:
-	return ret;
-
-err_destroy_cq:
-	ib_free_cq(queue->cq);
-	goto out;
-}
-
-static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
-{
-	rdma_destroy_qp(queue->cm_id);
-	ib_free_cq(queue->cq);
+	return 0;
 }
 
 static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
@@ -961,7 +928,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 
 	nvmet_sq_destroy(&queue->nvme_sq);
 
-	nvmet_rdma_destroy_queue_ib(queue);
+	rdma_destroy_qp(queue->cm_id);
 	if (!queue->dev->srq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				queue->recv_queue_size,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] IB/isert: use implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2016-09-09 12:36   ` [PATCH 1/6] IB/core: add implicit CQ allocation Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 2/6] nvmet-rdma: use " Christoph Hellwig
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 4/6] IB/srpt: " Christoph Hellwig
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

From: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/ulp/isert/ib_isert.c | 164 ++++----------------------------
 drivers/infiniband/ulp/isert/ib_isert.h |  16 ----
 2 files changed, 18 insertions(+), 162 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 8df608e..78f6b2a 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -35,8 +35,6 @@
 #define ISER_MAX_RX_CQ_LEN	(ISERT_QP_MAX_RECV_DTOS * ISERT_MAX_CONN)
 #define ISER_MAX_TX_CQ_LEN \
 	((ISERT_QP_MAX_REQ_DTOS + ISCSI_DEF_XMIT_CMDS_MAX) * ISERT_MAX_CONN)
-#define ISER_MAX_CQ_LEN		(ISER_MAX_RX_CQ_LEN + ISER_MAX_TX_CQ_LEN + \
-				 ISERT_MAX_CONN)
 
 static int isert_debug_level;
 module_param_named(debug_level, isert_debug_level, int, 0644);
@@ -89,55 +87,26 @@ isert_qp_event_callback(struct ib_event *e, void *context)
 	}
 }
 
-static struct isert_comp *
-isert_comp_get(struct isert_conn *isert_conn)
-{
-	struct isert_device *device = isert_conn->device;
-	struct isert_comp *comp;
-	int i, min = 0;
-
-	mutex_lock(&device_list_mutex);
-	for (i = 0; i < device->comps_used; i++)
-		if (device->comps[i].active_qps <
-		    device->comps[min].active_qps)
-			min = i;
-	comp = &device->comps[min];
-	comp->active_qps++;
-	mutex_unlock(&device_list_mutex);
-
-	isert_info("conn %p, using comp %p min_index: %d\n",
-		   isert_conn, comp, min);
-
-	return comp;
-}
-
-static void
-isert_comp_put(struct isert_comp *comp)
-{
-	mutex_lock(&device_list_mutex);
-	comp->active_qps--;
-	mutex_unlock(&device_list_mutex);
-}
-
 static struct ib_qp *
-isert_create_qp(struct isert_conn *isert_conn,
-		struct isert_comp *comp,
-		struct rdma_cm_id *cma_id)
+isert_create_qp(struct isert_conn *isert_conn, struct rdma_cm_id *cma_id)
 {
 	struct isert_device *device = isert_conn->device;
 	struct ib_qp_init_attr attr;
 	int ret;
 
-	memset(&attr, 0, sizeof(struct ib_qp_init_attr));
+	memset(&attr, 0, sizeof(attr));
+	attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	attr.event_handler = isert_qp_event_callback;
 	attr.qp_context = isert_conn;
-	attr.send_cq = comp->cq;
-	attr.recv_cq = comp->cq;
+	attr.poll_ctx = IB_POLL_WORKQUEUE;
+
 	attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
-	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
 	attr.cap.max_rdma_ctxs = ISCSI_DEF_XMIT_CMDS_MAX;
 	attr.cap.max_send_sge = device->ib_device->attrs.max_sge;
+
+	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
 	attr.cap.max_recv_sge = 1;
+
 	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	attr.qp_type = IB_QPT_RC;
 	if (device->pi_capable)
@@ -153,25 +122,6 @@ isert_create_qp(struct isert_conn *isert_conn,
 }
 
 static int
-isert_conn_setup_qp(struct isert_conn *isert_conn, struct rdma_cm_id *cma_id)
-{
-	struct isert_comp *comp;
-	int ret;
-
-	comp = isert_comp_get(isert_conn);
-	isert_conn->qp = isert_create_qp(isert_conn, comp, cma_id);
-	if (IS_ERR(isert_conn->qp)) {
-		ret = PTR_ERR(isert_conn->qp);
-		goto err;
-	}
-
-	return 0;
-err:
-	isert_comp_put(comp);
-	return ret;
-}
-
-static int
 isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 {
 	struct isert_device *device = isert_conn->device;
@@ -239,63 +189,6 @@ isert_free_rx_descriptors(struct isert_conn *isert_conn)
 	isert_conn->rx_descs = NULL;
 }
 
-static void
-isert_free_comps(struct isert_device *device)
-{
-	int i;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct isert_comp *comp = &device->comps[i];
-
-		if (comp->cq)
-			ib_free_cq(comp->cq);
-	}
-	kfree(device->comps);
-}
-
-static int
-isert_alloc_comps(struct isert_device *device)
-{
-	int i, max_cqe, ret = 0;
-
-	device->comps_used = min(ISERT_MAX_CQ, min_t(int, num_online_cpus(),
-				 device->ib_device->num_comp_vectors));
-
-	isert_info("Using %d CQs, %s supports %d vectors support "
-		   "pi_capable %d\n",
-		   device->comps_used, device->ib_device->name,
-		   device->ib_device->num_comp_vectors,
-		   device->pi_capable);
-
-	device->comps = kcalloc(device->comps_used, sizeof(struct isert_comp),
-				GFP_KERNEL);
-	if (!device->comps) {
-		isert_err("Unable to allocate completion contexts\n");
-		return -ENOMEM;
-	}
-
-	max_cqe = min(ISER_MAX_CQ_LEN, device->ib_device->attrs.max_cqe);
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct isert_comp *comp = &device->comps[i];
-
-		comp->device = device;
-		comp->cq = ib_alloc_cq(device->ib_device, comp, max_cqe, i,
-				IB_POLL_WORKQUEUE);
-		if (IS_ERR(comp->cq)) {
-			isert_err("Unable to allocate cq\n");
-			ret = PTR_ERR(comp->cq);
-			comp->cq = NULL;
-			goto out_cq;
-		}
-	}
-
-	return 0;
-out_cq:
-	isert_free_comps(device);
-	return ret;
-}
-
 static int
 isert_create_device_ib_res(struct isert_device *device)
 {
@@ -305,16 +198,12 @@ isert_create_device_ib_res(struct isert_device *device)
 	isert_dbg("devattr->max_sge: %d\n", ib_dev->attrs.max_sge);
 	isert_dbg("devattr->max_sge_rd: %d\n", ib_dev->attrs.max_sge_rd);
 
-	ret = isert_alloc_comps(device);
-	if (ret)
-		goto out;
-
 	device->pd = ib_alloc_pd(ib_dev, 0);
 	if (IS_ERR(device->pd)) {
 		ret = PTR_ERR(device->pd);
 		isert_err("failed to allocate pd, device %p, ret=%d\n",
 			  device, ret);
-		goto out_cq;
+		return ret;
 	}
 
 	/* Check signature cap */
@@ -322,22 +210,6 @@ isert_create_device_ib_res(struct isert_device *device)
 			     IB_DEVICE_SIGNATURE_HANDOVER ? true : false;
 
 	return 0;
-
-out_cq:
-	isert_free_comps(device);
-out:
-	if (ret > 0)
-		ret = -EINVAL;
-	return ret;
-}
-
-static void
-isert_free_device_ib_res(struct isert_device *device)
-{
-	isert_info("device %p\n", device);
-
-	ib_dealloc_pd(device->pd);
-	isert_free_comps(device);
 }
 
 static void
@@ -347,7 +219,7 @@ isert_device_put(struct isert_device *device)
 	device->refcount--;
 	isert_info("device %p refcount %d\n", device, device->refcount);
 	if (!device->refcount) {
-		isert_free_device_ib_res(device);
+		ib_dealloc_pd(device->pd);
 		list_del(&device->dev_node);
 		kfree(device);
 	}
@@ -540,13 +412,15 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 
 	isert_set_nego_params(isert_conn, &event->param.conn);
 
-	ret = isert_conn_setup_qp(isert_conn, cma_id);
-	if (ret)
+	isert_conn->qp = isert_create_qp(isert_conn, cma_id);
+	if (IS_ERR(isert_conn->qp)) {
+		ret = PTR_ERR(isert_conn->qp);
 		goto out_conn_dev;
+	}
 
 	ret = isert_login_post_recv(isert_conn);
 	if (ret)
-		goto out_conn_dev;
+		goto out_conn_qp;
 
 	ret = isert_rdma_accept(isert_conn);
 	if (ret)
@@ -558,6 +432,8 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 
 	return 0;
 
+out_conn_qp:
+	ib_destroy_qp(isert_conn->qp);
 out_conn_dev:
 	isert_device_put(device);
 out_rsp_dma_map:
@@ -581,12 +457,8 @@ isert_connect_release(struct isert_conn *isert_conn)
 	if (isert_conn->cm_id)
 		rdma_destroy_id(isert_conn->cm_id);
 
-	if (isert_conn->qp) {
-		struct isert_comp *comp = isert_conn->qp->recv_cq->cq_context;
-
-		isert_comp_put(comp);
+	if (isert_conn->qp)
 		ib_destroy_qp(isert_conn->qp);
-	}
 
 	if (isert_conn->login_req_buf)
 		isert_free_login_buf(isert_conn);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index fc791ef..2e85eed 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -162,27 +162,11 @@ struct isert_conn {
 
 #define ISERT_MAX_CQ 64
 
-/**
- * struct isert_comp - iSER completion context
- *
- * @device:     pointer to device handle
- * @cq:         completion queue
- * @active_qps: Number of active QPs attached
- *              to completion context
- */
-struct isert_comp {
-	struct isert_device     *device;
-	struct ib_cq		*cq;
-	int                      active_qps;
-};
-
 struct isert_device {
 	bool			pi_capable;
 	int			refcount;
 	struct ib_device	*ib_device;
 	struct ib_pd		*pd;
-	struct isert_comp	*comps;
-	int                     comps_used;
 	struct list_head	dev_node;
 };
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] IB/srpt: use implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (2 preceding siblings ...)
  2016-09-09 12:36   ` [PATCH 3/6] IB/isert: " Christoph Hellwig
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 5/6] IB/iser: " Christoph Hellwig
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

From: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 46 ++++++++++++-----------------------
 drivers/infiniband/ulp/srpt/ib_srpt.h |  1 -
 2 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 48a44af..7c1110a 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -785,7 +785,7 @@ static int srpt_zerolength_write(struct srpt_rdma_ch *ch)
 
 static void srpt_zerolength_write_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 
 	if (wc->status == IB_WC_SUCCESS) {
 		srpt_process_wait_list(ch);
@@ -1188,7 +1188,7 @@ static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
  */
 static void srpt_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_send_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_send_ioctx, rdma_cqe);
 
@@ -1513,7 +1513,7 @@ out_wait:
 
 static void srpt_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_recv_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_recv_ioctx, ioctx.cqe);
 
@@ -1567,7 +1567,7 @@ static void srpt_process_wait_list(struct srpt_rdma_ch *ch)
  */
 static void srpt_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_send_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_send_ioctx, ioctx.cqe);
 	enum srpt_command_state state;
@@ -1613,23 +1613,14 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 		goto out;
 
 retry:
-	ch->cq = ib_alloc_cq(sdev->device, ch, ch->rq_size + srp_sq_size,
-			0 /* XXX: spread CQs */, IB_POLL_WORKQUEUE);
-	if (IS_ERR(ch->cq)) {
-		ret = PTR_ERR(ch->cq);
-		pr_err("failed to create CQ cqe= %d ret= %d\n",
-		       ch->rq_size + srp_sq_size, ret);
-		goto out;
-	}
-
+	qp_init->create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	qp_init->qp_context = (void *)ch;
 	qp_init->event_handler
 		= (void(*)(struct ib_event *, void*))srpt_qp_event;
-	qp_init->send_cq = ch->cq;
-	qp_init->recv_cq = ch->cq;
 	qp_init->srq = sdev->srq;
 	qp_init->sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_init->qp_type = IB_QPT_RC;
+	qp_init->poll_ctx = IB_POLL_WORKQUEUE;
 	/*
 	 * We divide up our send queue size into half SEND WRs to send the
 	 * completions, and half R/W contexts to actually do the RDMA
@@ -1640,6 +1631,9 @@ retry:
 	qp_init->cap.max_send_wr = srp_sq_size / 2;
 	qp_init->cap.max_rdma_ctxs = srp_sq_size / 2;
 	qp_init->cap.max_send_sge = min(attrs->max_sge, SRPT_MAX_SG_PER_WQE);
+
+	qp_init->cap.max_recv_wr = ch->rq_size;
+
 	qp_init->port_num = ch->sport->port;
 
 	ch->qp = ib_create_qp(sdev->pd, qp_init);
@@ -1647,19 +1641,17 @@ retry:
 		ret = PTR_ERR(ch->qp);
 		if (ret == -ENOMEM) {
 			srp_sq_size /= 2;
-			if (srp_sq_size >= MIN_SRPT_SQ_SIZE) {
-				ib_destroy_cq(ch->cq);
+			if (srp_sq_size >= MIN_SRPT_SQ_SIZE)
 				goto retry;
-			}
 		}
 		pr_err("failed to create_qp ret= %d\n", ret);
-		goto err_destroy_cq;
+		goto out;
 	}
 
 	atomic_set(&ch->sq_wr_avail, qp_init->cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d cm_id= %p\n",
-		 __func__, ch->cq->cqe, qp_init->cap.max_send_sge,
+	pr_debug("%s: max_sge= %d sq_size = %d cm_id= %p\n",
+		 __func__, qp_init->cap.max_send_sge,
 		 qp_init->cap.max_send_wr, ch->cm_id);
 
 	ret = srpt_init_ch_qp(ch, ch->qp);
@@ -1672,17 +1664,9 @@ out:
 
 err_destroy_qp:
 	ib_destroy_qp(ch->qp);
-err_destroy_cq:
-	ib_free_cq(ch->cq);
 	goto out;
 }
 
-static void srpt_destroy_ch_ib(struct srpt_rdma_ch *ch)
-{
-	ib_destroy_qp(ch->qp);
-	ib_free_cq(ch->cq);
-}
-
 /**
  * srpt_close_ch() - Close an RDMA channel.
  *
@@ -1799,7 +1783,7 @@ static void srpt_release_channel_work(struct work_struct *w)
 
 	ib_destroy_cm_id(ch->cm_id);
 
-	srpt_destroy_ch_ib(ch);
+	ib_destroy_qp(ch->qp);
 
 	srpt_free_ioctx_ring((struct srpt_ioctx **)ch->ioctx_ring,
 			     ch->sport->sdev, ch->rq_size,
@@ -2056,7 +2040,7 @@ release_channel:
 	ch->sess = NULL;
 
 destroy_ib:
-	srpt_destroy_ch_ib(ch);
+	ib_destroy_qp(ch->qp);
 
 free_ring:
 	srpt_free_ioctx_ring((struct srpt_ioctx **)ch->ioctx_ring,
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index 5818787..162eb11 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -264,7 +264,6 @@ enum rdma_ch_state {
 struct srpt_rdma_ch {
 	struct ib_cm_id		*cm_id;
 	struct ib_qp		*qp;
-	struct ib_cq		*cq;
 	struct ib_cqe		zw_cqe;
 	struct kref		kref;
 	int			rq_size;
-- 
2.1.4



* [PATCH 5/6] IB/iser: use implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (3 preceding siblings ...)
  2016-09-09 12:36   ` [PATCH 4/6] IB/srpt: " Christoph Hellwig
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-09 12:36   ` [PATCH 6/6] nvme-rdma: " Christoph Hellwig
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

From: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/ulp/iser/iscsi_iser.h | 19 -------
 drivers/infiniband/ulp/iser/iser_verbs.c | 90 ++++++--------------------------
 2 files changed, 16 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 0be6a7c..d3cc83a 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -318,18 +318,6 @@ struct ib_conn;
 struct iscsi_iser_task;
 
 /**
- * struct iser_comp - iSER completion context
- *
- * @cq:         completion queue
- * @active_qps: Number of active QPs attached
- *              to completion context
- */
-struct iser_comp {
-	struct ib_cq		*cq;
-	int                      active_qps;
-};
-
-/**
  * struct iser_device - Memory registration operations
  *     per-device registration schemes
  *
@@ -365,9 +353,6 @@ struct iser_reg_ops {
  * @event_handler: IB events handle routine
  * @ig_list:	   entry in devices list
  * @refcount:      Reference counter, dominated by open iser connections
- * @comps_used:    Number of completion contexts used, Min between online
- *                 cpus and device max completion vectors
- * @comps:         Dinamically allocated array of completion handlers
  * @reg_ops:       Registration ops
  * @remote_inv_sup: Remote invalidate is supported on this device
  */
@@ -377,8 +362,6 @@ struct iser_device {
 	struct ib_event_handler      event_handler;
 	struct list_head             ig_list;
 	int                          refcount;
-	int			     comps_used;
-	struct iser_comp	     *comps;
 	const struct iser_reg_ops    *reg_ops;
 	bool                         remote_inv_sup;
 };
@@ -454,7 +437,6 @@ struct iser_fr_pool {
  * @sig_count:           send work request signal count
  * @rx_wr:               receive work request for batch posts
  * @device:              reference to iser device
- * @comp:                iser completion context
  * @fr_pool:             connection fast registration poool
  * @pi_support:          Indicate device T10-PI support
  */
@@ -465,7 +447,6 @@ struct ib_conn {
 	u8                           sig_count;
 	struct ib_recv_wr	     rx_wr[ISER_MIN_POSTED_RX];
 	struct iser_device          *device;
-	struct iser_comp	    *comp;
 	struct iser_fr_pool          fr_pool;
 	bool			     pi_support;
 	struct ib_cqe		     reg_cqe;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index a4b791d..901d56a 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -68,62 +68,32 @@ static void iser_event_handler(struct ib_event_handler *handler,
 static int iser_create_device_ib_res(struct iser_device *device)
 {
 	struct ib_device *ib_dev = device->ib_device;
-	int ret, i, max_cqe;
+	int ret;
 
 	ret = iser_assign_reg_ops(device);
 	if (ret)
 		return ret;
 
-	device->comps_used = min_t(int, num_online_cpus(),
-				 ib_dev->num_comp_vectors);
-
-	device->comps = kcalloc(device->comps_used, sizeof(*device->comps),
-				GFP_KERNEL);
-	if (!device->comps)
-		goto comps_err;
-
-	max_cqe = min(ISER_MAX_CQ_LEN, ib_dev->attrs.max_cqe);
-
-	iser_info("using %d CQs, device %s supports %d vectors max_cqe %d\n",
-		  device->comps_used, ib_dev->name,
-		  ib_dev->num_comp_vectors, max_cqe);
-
 	device->pd = ib_alloc_pd(ib_dev,
 		iser_always_reg ? 0 : IB_PD_UNSAFE_GLOBAL_RKEY);
-	if (IS_ERR(device->pd))
-		goto pd_err;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		comp->cq = ib_alloc_cq(ib_dev, comp, max_cqe, i,
-				       IB_POLL_SOFTIRQ);
-		if (IS_ERR(comp->cq)) {
-			comp->cq = NULL;
-			goto cq_err;
-		}
+	if (IS_ERR(device->pd)) {
+		ret = PTR_ERR(device->pd);
+		goto out;
 	}
 
 	INIT_IB_EVENT_HANDLER(&device->event_handler, ib_dev,
 			      iser_event_handler);
-	if (ib_register_event_handler(&device->event_handler))
-		goto cq_err;
+	ret = ib_register_event_handler(&device->event_handler);
+	if (ret)
+		goto dealloc_pd;
 
 	return 0;
 
-cq_err:
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		if (comp->cq)
-			ib_free_cq(comp->cq);
-	}
+dealloc_pd:
 	ib_dealloc_pd(device->pd);
-pd_err:
-	kfree(device->comps);
-comps_err:
+out:
 	iser_err("failed to allocate an IB resource\n");
-	return -1;
+	return ret;
 }
 
 /**
@@ -132,20 +102,8 @@ comps_err:
  */
 static void iser_free_device_ib_res(struct iser_device *device)
 {
-	int i;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		ib_free_cq(comp->cq);
-		comp->cq = NULL;
-	}
-
 	(void)ib_unregister_event_handler(&device->event_handler);
 	ib_dealloc_pd(device->pd);
-
-	kfree(device->comps);
-	device->comps = NULL;
 	device->pd = NULL;
 }
 
@@ -423,7 +381,6 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	struct ib_device	*ib_dev;
 	struct ib_qp_init_attr	init_attr;
 	int			ret = -ENOMEM;
-	int index, min_index = 0;
 
 	BUG_ON(ib_conn->device == NULL);
 
@@ -431,26 +388,16 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	ib_dev = device->ib_device;
 
 	memset(&init_attr, 0, sizeof init_attr);
-
-	mutex_lock(&ig.connlist_mutex);
-	/* select the CQ with the minimal number of usages */
-	for (index = 0; index < device->comps_used; index++) {
-		if (device->comps[index].active_qps <
-		    device->comps[min_index].active_qps)
-			min_index = index;
-	}
-	ib_conn->comp = &device->comps[min_index];
-	ib_conn->comp->active_qps++;
-	mutex_unlock(&ig.connlist_mutex);
-	iser_info("cq index %d used for ib_conn %p\n", min_index, ib_conn);
-
+	init_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	init_attr.event_handler = iser_qp_event_callback;
 	init_attr.qp_context	= (void *)ib_conn;
-	init_attr.send_cq	= ib_conn->comp->cq;
-	init_attr.recv_cq	= ib_conn->comp->cq;
+	init_attr.poll_ctx = IB_POLL_SOFTIRQ;
+
 	init_attr.cap.max_recv_wr  = ISER_QP_MAX_RECV_DTOS;
-	init_attr.cap.max_send_sge = 2;
 	init_attr.cap.max_recv_sge = 1;
+
+	init_attr.cap.max_send_sge = 2;
+
 	init_attr.sq_sig_type	= IB_SIGNAL_REQ_WR;
 	init_attr.qp_type	= IB_QPT_RC;
 	if (ib_conn->pi_support) {
@@ -483,11 +430,7 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	return ret;
 
 out_err:
-	mutex_lock(&ig.connlist_mutex);
-	ib_conn->comp->active_qps--;
-	mutex_unlock(&ig.connlist_mutex);
 	iser_err("unable to alloc mem or create resource, err %d\n", ret);
-
 	return ret;
 }
 
@@ -597,7 +540,6 @@ static void iser_free_ib_conn_res(struct iser_conn *iser_conn,
 		  iser_conn, ib_conn->cma_id, ib_conn->qp);
 
 	if (ib_conn->qp != NULL) {
-		ib_conn->comp->active_qps--;
 		rdma_destroy_qp(ib_conn->cma_id);
 		ib_conn->qp = NULL;
 	}
-- 
2.1.4



* [PATCH 6/6] nvme-rdma: use implicit CQ allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (4 preceding siblings ...)
  2016-09-09 12:36   ` [PATCH 5/6] IB/iser: " Christoph Hellwig
@ 2016-09-09 12:36   ` Christoph Hellwig
  2016-09-11  6:39   ` RFC: CQ pools and implicit CQ resource allocation Sagi Grimberg
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-09-09 12:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 70 +++++++++++++++++++-----------------------------
 1 file changed, 28 insertions(+), 42 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 1a18547..e8f343f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -95,7 +95,6 @@ struct nvme_rdma_queue {
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
 	struct nvme_rdma_device	*device;
-	struct ib_cq		*ib_cq;
 	struct ib_qp		*qp;
 
 	unsigned long		flags;
@@ -253,24 +252,38 @@ static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
 	return queue->cm_error;
 }
 
-static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
+static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue)
 {
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_qp_init_attr init_attr;
-	int ret;
+	int ret, idx;
+	const int send_wr_factor = 3;		/* MR, SEND, INV */
 
 	memset(&init_attr, 0, sizeof(init_attr));
+	init_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	init_attr.event_handler = nvme_rdma_qp_event;
+	init_attr.qp_context = queue;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.poll_ctx = IB_POLL_SOFTIRQ;
+
 	/* +1 for drain */
-	init_attr.cap.max_send_wr = factor * queue->queue_size + 1;
+	init_attr.cap.max_send_wr = send_wr_factor * queue->queue_size + 1;
+	init_attr.cap.max_send_sge = 1 + NVME_RDMA_MAX_INLINE_SEGMENTS;
+
 	/* +1 for drain */
 	init_attr.cap.max_recv_wr = queue->queue_size + 1;
 	init_attr.cap.max_recv_sge = 1;
-	init_attr.cap.max_send_sge = 1 + NVME_RDMA_MAX_INLINE_SEGMENTS;
-	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
-	init_attr.qp_type = IB_QPT_RC;
-	init_attr.send_cq = queue->ib_cq;
-	init_attr.recv_cq = queue->ib_cq;
+
+	/*
+	 * The admin queue is barely used once the controller is live, so don't
+	 * bother to spread it out.
+	 */
+	idx = nvme_rdma_queue_idx(queue);
+	if (idx > 0) {
+		init_attr.comp_vector = idx;
+		init_attr.create_flags |= IB_QP_CREATE_COMP_VECTOR;
+	}
 
 	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
 
@@ -474,7 +487,6 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 	struct ib_device *ibdev = dev->dev;
 
 	rdma_destroy_qp(queue->cm_id);
-	ib_free_cq(queue->ib_cq);
 
 	nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
 			sizeof(struct nvme_completion), DMA_FROM_DEVICE);
@@ -485,39 +497,15 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_device *dev)
 {
-	struct ib_device *ibdev = dev->dev;
-	const int send_wr_factor = 3;			/* MR, SEND, INV */
-	const int cq_factor = send_wr_factor + 1;	/* + RECV */
-	int comp_vector, idx = nvme_rdma_queue_idx(queue);
-
 	int ret;
 
 	queue->device = dev;
 
-	/*
-	 * The admin queue is barely used once the controller is live, so don't
-	 * bother to spread it out.
-	 */
-	if (idx == 0)
-		comp_vector = 0;
-	else
-		comp_vector = idx % ibdev->num_comp_vectors;
-
-
-	/* +1 for ib_stop_cq */
-	queue->ib_cq = ib_alloc_cq(dev->dev, queue,
-				cq_factor * queue->queue_size + 1, comp_vector,
-				IB_POLL_SOFTIRQ);
-	if (IS_ERR(queue->ib_cq)) {
-		ret = PTR_ERR(queue->ib_cq);
-		goto out;
-	}
-
-	ret = nvme_rdma_create_qp(queue, send_wr_factor);
+	ret = nvme_rdma_create_qp(queue);
 	if (ret)
-		goto out_destroy_ib_cq;
+		goto out;
 
-	queue->rsp_ring = nvme_rdma_alloc_ring(ibdev, queue->queue_size,
+	queue->rsp_ring = nvme_rdma_alloc_ring(dev->dev, queue->queue_size,
 			sizeof(struct nvme_completion), DMA_FROM_DEVICE);
 	if (!queue->rsp_ring) {
 		ret = -ENOMEM;
@@ -528,8 +516,6 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
 
 out_destroy_qp:
 	ib_destroy_qp(queue->qp);
-out_destroy_ib_cq:
-	ib_free_cq(queue->ib_cq);
 out:
 	return ret;
 }
@@ -781,7 +767,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
 		const char *op)
 {
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
@@ -1141,7 +1127,7 @@ static int __nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc, int tag)
 {
 	struct nvme_rdma_qe *qe =
 		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct ib_device *ibdev = queue->device->dev;
 	struct nvme_completion *cqe = qe->data;
 	const size_t len = sizeof(struct nvme_completion);
@@ -1461,7 +1447,7 @@ err:
 static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_rdma_queue *queue = hctx->driver_data;
-	struct ib_cq *cq = queue->ib_cq;
+	struct ib_cq *cq = queue->cm_id->qp->recv_cq;
 	struct ib_wc wc;
 	int found = 0;
 
-- 
2.1.4



* Re: RFC: CQ pools and implicit CQ resource allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (5 preceding siblings ...)
  2016-09-09 12:36   ` [PATCH 6/6] nvme-rdma: " Christoph Hellwig
@ 2016-09-11  6:39   ` Sagi Grimberg
  2016-09-11  6:44   ` Sagi Grimberg
  2016-09-16  6:02   ` Bart Van Assche
  8 siblings, 0 replies; 17+ messages in thread
From: Sagi Grimberg @ 2016-09-11  6:39 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA


> One concern Sagi had when reviewing an earlier version of my edits was that
> the over allocation as part of using a pool might be harmful for client side
> driver that only have one (or a few) connections. I played around with iSER
> a bit and could not measure any slowdown, but that doesn't mean there aren't
> any down sides - I'd really like some feedback from the hardware driver
> people on what CQ sizing / allocation preferences their hardware has.

Just to clarify,

I don't expect this to introduce slowdowns on the host/client side.
In fact, it will help the client side for the same reason it helps
the target/server side.

What I was talking about is that we can end up over-allocating CQs if a
host/client has a single connection (or a few connections) to the target
and may not need the pool (target/server mode drivers typically have far
more connections/channels than host/client mode drivers). But given the
benefit, I think we can tolerate it (we do that in the iSER initiator
anyhow).


* Re: RFC: CQ pools and implicit CQ resource allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (6 preceding siblings ...)
  2016-09-11  6:39   ` RFC: CQ pools and implicit CQ resource allocation Sagi Grimberg
@ 2016-09-11  6:44   ` Sagi Grimberg
       [not found]     ` <d0c645eb-3674-2841-bdb7-8b9e6fd46473-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2016-09-16  6:02   ` Bart Van Assche
  8 siblings, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2016-09-11  6:44 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA


> This series adds support to the RDMA core to implicitly allocate the required
> CQEs when creating a QP.  The primary driver for that was to implement a
> common scheme for CQ pooling, which helps with better resource usage for
> server / target style drivers that have many outstanding connections.  In
> fact the first version of this code from Sagi did just that: add a CQ pool
> API, and convert drivers that were using some form of pooling (iSER initiator
> & target, NVMe target) to that API.  But looking at the API I felt that
> there was still way too much logic in the individual ULPs, and looked into a
> way to make that boilerplate code go away.  It turns out that we can simply
> create CQs underneath if we know the poll context that the ULP requires, so
> this series shows an approach that makes CQs mostly invisible to ULPs.

One other note that I wanted to raise for the folks interested in this
is that with the RDMA core owning the completion queue pools, different
ULPs can easily share the same completion queue (given that it uses
the same poll context). For example, nvme-rdma host, iser and srp
initiators can end up using the same completion queues (if running
simultaneously on the same machine).

Up until now, I couldn't think of anything that can introduce a problem
with that but maybe someone else will...


* RE: RFC: CQ pools and implicit CQ resource allocation
       [not found]     ` <d0c645eb-3674-2841-bdb7-8b9e6fd46473-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-09-12 14:42       ` Steve Wise
  2016-09-12 20:07         ` Sagi Grimberg
  2016-09-14  3:29         ` Bart Van Assche
  2016-09-12 15:03       ` Chuck Lever
  1 sibling, 2 replies; 17+ messages in thread
From: Steve Wise @ 2016-09-12 14:42 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> > This series adds support to the RDMA core to implicitly allocate the required
> > CQEs when creating a QP.  The primary driver for that was to implement a
> > common scheme for CQ pooling, which helps with better resource usage for
> > server / target style drivers that have many outstanding connections.  In
> > fact the first version of this code from Sagi did just that: add a CQ pool
> > API, and convert drivers that were using some form of pooling (iSER initiator
> > & target, NVMe target) to that API.  But looking at the API I felt that
> > there was still way too much logic in the individual ULPs, and looked into a
> > way to make that boilerplate code go away.  It turns out that we can simply
> > create CQs underneath if we know the poll context that the ULP requires, so
> > this series shows an approach that makes CQs mostly invisible to ULPs.
> 
> One other note that I wanted to raise for the folks interested in this
> is that with the RDMA core owning the completion queue pools, different
> ULPs can easily share the same completion queue (given that it uses
> the same poll context). For example, nvme-rdma host, iser and srp
> initiators can end up using the same completion queues (if running
> simultaneously on the same machine).
> 
> Up until now, I couldn't think of anything that can introduce a problem
> with that but maybe someone else will...

It would be useful to provide details on how many CQs get created and of what
size for an uber iSER/NVMF/SRP initiator/host and target.   

One concern I have is that cxgb4 CQs require contiguous memory, so a scheme
like CQ pooling might cause resource problems on large core systems.



* Re: RFC: CQ pools and implicit CQ resource allocation
       [not found]     ` <d0c645eb-3674-2841-bdb7-8b9e6fd46473-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2016-09-12 14:42       ` Steve Wise
@ 2016-09-12 15:03       ` Chuck Lever
       [not found]         ` <BCF7E723-FC50-4497-9E0B-1157CF2D4185-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Chuck Lever @ 2016-09-12 15:03 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA


> On Sep 11, 2016, at 2:44 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>> This series adds support to the RDMA core to implicitly allocate the required
>> CQEs when creating a QP.  The primary driver for that was to implement a
>> common scheme for CQ pooling, which helps with better resource usage for
>> server / target style drivers that have many outstanding connections.  In
>> fact the first version of this code from Sagi did just that: add a CQ pool
>> API, and convert drivers that were using some form of pooling (iSER initiator
>> & target, NVMe target) to that API.  But looking at the API I felt that
>> there was still way too much logic in the individual ULPs, and looked into a
>> way to make that boilerplate code go away.  It turns out that we can simply
>> create CQs underneath if we know the poll context that the ULP requires, so
>> this series shows an approach that makes CQs mostly invisible to ULPs.
> 
> One other note that I wanted to raise for the folks interested in this
> is that with the RDMA core owning the completion queue pools, different
> ULPs can easily share the same completion queue (given that it uses
> the same poll context). For example, nvme-rdma host, iser and srp
> initiators can end up using the same completion queues (if running
> simultaneously on the same machine).

I've browsed the patches a little. Do you have a sense of how much
lock / memory contention this sharing scheme introduces on multi-socket
machines using multiple protocols and multiple QPs?

Would it make sense for a ULP to indicate that it wants an unshared set
of resources?


--
Chuck Lever





* Re: RFC: CQ pools and implicit CQ resource allocation
  2016-09-12 14:42       ` Steve Wise
@ 2016-09-12 20:07         ` Sagi Grimberg
       [not found]           ` <6dd3ecbd-7109-95ff-9c86-dfea9e515538-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2016-09-14  3:29         ` Bart Van Assche
  1 sibling, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2016-09-12 20:07 UTC (permalink / raw)
  To: Steve Wise, 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


>> One other note that I wanted to raise for the folks interested in this
>> is that with the RDMA core owning the completion queue pools, different
>> ULPs can easily share the same completion queue (given that it uses
>> the same poll context). For example, nvme-rdma host, iser and srp
>> initiators can end up using the same completion queues (if running
>> simultaneously on the same machine).
>>
>> Up until now, I couldn't think of anything that can introduce a problem
>> with that but maybe someone else will...
>
> It would be useful to provide details on how many CQs get created and of what
> size for an uber iSER/NVMF/SRP initiator/host and target.

Are you talking about some debugfs layout?

> One concern I have is that cxgb4 CQs require contiguous memory,  So a scheme
> like CQ pooling might cause resource problems on large core systems.

Note that the CQ allocation will never exceed the device max_cqe cap.


* RE: RFC: CQ pools and implicit CQ resource allocation
       [not found]           ` <6dd3ecbd-7109-95ff-9c86-dfea9e515538-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-09-12 20:22             ` Steve Wise
  2016-09-12 21:12               ` Sagi Grimberg
  0 siblings, 1 reply; 17+ messages in thread
From: Steve Wise @ 2016-09-12 20:22 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org]
> Sent: Monday, September 12, 2016 3:08 PM
> To: Steve Wise; 'Christoph Hellwig'; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: RFC: CQ pools and implicit CQ resource allocation
> 
> 
> >> One other note that I wanted to raise for the folks interested in this
> >> is that with the RDMA core owning the completion queue pools, different
> >> ULPs can easily share the same completion queue (given that it uses
> >> the same poll context). For example, nvme-rdma host, iser and srp
> >> initiators can end up using the same completion queues (if running
> >> simultaneously on the same machine).
> >>
> >> Up until now, I couldn't think of anything that can introduce a problem
> >> with that but maybe someone else will...
> >
> > It would be useful to provide details on how many CQs get created and of what
> > size for an uber iSER/NVMF/SRP initiator/host and target.
> 
> Are you talking about some debugfs layout?
>

No, just a matrix showing how the CQs scale out when shared among these three
ULPs on a machine with X cores, for example.  Just to visualize whether the
number of CQs and their sizes are reduced or increased by this new series...
 
> > One concern I have is that cxgb4 CQs require contiguous memory,  So a scheme
> > like CQ pooling might cause resource problems on large core systems.
> 
> Note that the CQ allocation will never exceed the device max_cqe cap.

If this series causes, say, 2X the amount of memory needed for CQs vs the
existing private CQ approach, then that impacts how many CQs can be allocated,
due to limits on the amount of memory that can be allocated system-wide via
dma_alloc_coherent(), which is what cxgb4 uses to allocate queue memory.

So I'm just voicing the concern that this design could possibly reduce the
overall number of CQs available on a given system.  It is probably not a big
deal, but I don't have a good picture of how much more memory this proposed
series would incur...

Stevo



* Re: RFC: CQ pools and implicit CQ resource allocation
       [not found]         ` <BCF7E723-FC50-4497-9E0B-1157CF2D4185-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-09-12 20:25           ` Sagi Grimberg
  0 siblings, 0 replies; 17+ messages in thread
From: Sagi Grimberg @ 2016-09-12 20:25 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hey Chuck,

>> One other note that I wanted to raise for the folks interested in this
>> is that with the RDMA core owning the completion queue pools, different
>> ULPs can easily share the same completion queue (given that it uses
>> the same poll context). For example, nvme-rdma host, iser and srp
>> initiators can end up using the same completion queues (if running
>> simultaneously on the same machine).
>
> I've browsed the patches a little. Do you have a sense of how much
> lock / memory contention this sharing scheme introduces on multi-socket
> machines using multiple protocols and multiple QPs?

What we noticed is that target/server mode drivers will usually want
to spread CQs across cores in order to achieve better parallelism when
serving multiple targets.  The benefit of this is better completion
aggregation and fewer completion interrupts, as we have fewer CQs in
total and each CQ completes multiple QPs (send and/or recv).

Note that the CQ API uses either irq-poll or a workqueue context, both
guarantee that a single context will poll the completion queue at any
point in time so no lock contention should ever exist regardless of
how many queue-pairs are attached to it.

As for memory, I don't see how the scheme can introduce any inefficient
memory access (it's just allocating larger CQs and attaching more QPs to
them).  But that is a good question.  I have noticed recently that
having the RDMA queues (QP, CQ, SRQ) allocated on the "correct" NUMA
socket makes a real difference (the queue lock becomes a lot cheaper,
and so do the work-queue entry fills).

I have a patch set in the works that allows RDMA queues to accept
numa_node (which can accept NUMA_NO_NODE for unaware ULPs).

> Would it make sense for a ULP to indicate that it wants an unshared set
> of resources?

I thought about this before, but it would only make sense if there were a
good reason not to share the completion queue.  Note that when one allocates
a CQ with the same completion vector assignment as someone else, it
practically shares the assigned core anyhow; we'll just see interrupts
from both CQs, so I still can't see how that is better.


* Re: RFC: CQ pools and implicit CQ resource allocation
  2016-09-12 20:22             ` Steve Wise
@ 2016-09-12 21:12               ` Sagi Grimberg
  0 siblings, 0 replies; 17+ messages in thread
From: Sagi Grimberg @ 2016-09-12 21:12 UTC (permalink / raw)
  To: Steve Wise, 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


>>>> One other note that I wanted to raise for the folks interested in this
>>>> is that with the RDMA core owning the completion queue pools, different
>>>> ULPs can easily share the same completion queue (given that it uses
>>>> the same poll context). For example, nvme-rdma host, iser and srp
>>>> initiators can end up using the same completion queues (if running
>>>> simultaneously on the same machine).
>>>>
>>>> Up until now, I couldn't think of anything that can introduce a problem
>>>> with that but maybe someone else will...
>>>
>>> It would be useful to provide details on how many CQs get created and of what
>>> size for an uber iSER/NVMF/SRP initiator/host and target.
>>
>> Are you talking about some debugfs layout?
>>
>
> No, just a matrix showing how the CQs scale out when shared among these three
> ULPs on a machine with X cores, for example.  Just to visualize whether the
> number of CQs and their sizes are reduced or increased by this new series...

Umm, it sort of depends on the workload.

But the rule of thumb is that fewer CQs would be allocated in the system
(because they are shared), although they would probably be larger.

One downside is that we might have some unused cqes in each CQ.
Say we create a CQ with 1024 cqes, and then we have 7 QPs of size
128 attached to it, so 896 cqes are occupied serving the 7 QPs.  If
a QP of size 129 comes along, it cannot be attached to this CQ as we
might overrun it, so another CQ of size 1024 will be created and the QP
will be attached to it.  The old CQ will keep 128 free cqes until some
QP comes along that can fit in them.

The current algorithm takes the least-used CQ that accommodates the
caller's needs (poll_ctx, and completion vector if one is specified).
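The fitting behaviour described above can be modelled with a short sketch
(a toy model: the class, helper name, and fixed 1024-cqe pool size are
illustrative assumptions, not the actual kernel code):

```python
CQ_SIZE = 1024  # assumed per-pool CQ allocation size


class CQ:
    def __init__(self, size):
        self.size = size
        self.used = 0  # cqes reserved by attached QPs

    @property
    def free(self):
        return self.size - self.used


def attach_qp(pool, qp_cqes):
    """Attach a QP needing qp_cqes to the least-used CQ that can fit it."""
    candidates = [cq for cq in pool if cq.free >= qp_cqes]
    if candidates:
        cq = min(candidates, key=lambda c: c.used)  # least-used fit
    else:
        cq = CQ(CQ_SIZE)  # no existing CQ can take it: grow the pool
        pool.append(cq)
    cq.used += qp_cqes
    return cq


pool = []
for _ in range(7):
    attach_qp(pool, 128)        # 7 QPs x 128 cqes -> 896 used in one CQ
attach_qp(pool, 129)            # doesn't fit the 128 free cqes -> new CQ
print(len(pool), pool[0].free)  # prints: 2 128
```

Here the 129-cqe QP forces a second 1024-cqe CQ even though 128 cqes are
still free in the first one, which is exactly the unused-cqe downside
mentioned above.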

> If this series causes, say, 2X the amount of memory needed for CQs vs the
> existing private CQ approach, then that impacts how many CQs can be allocated,
> due to limits on the amount of memory that can be allocated system-wide via
> dma_alloc_coherent(), which is what cxgb4 uses to allocate queue memory.
>
> So I'm just voicing the concern that this design could reduce the overall
> number of CQs available on a given system.  It is probably not a big deal,
> but I don't have a good sense of how much more memory this proposed series
> would incur...

I see.  So I don't think this would allocate way more completion queues
than, say, iser/srp/nvmf alone (note that today iser alone uses crazy
over-allocations for CQs to aggressively aggregate them).

The main motivation for this is to aggregate completions as much
as possible (and reasonable). It is possible that we will sacrifice
some memory for that...


* Re: RFC: CQ pools and implicit CQ resource allocation
  2016-09-12 14:42       ` Steve Wise
  2016-09-12 20:07         ` Sagi Grimberg
@ 2016-09-14  3:29         ` Bart Van Assche
  1 sibling, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2016-09-14  3:29 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg', 'Christoph Hellwig',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche

On 09/12/2016 04:42 PM, Steve Wise wrote:
> One concern I have is that cxgb4 CQs require contiguous memory, so a scheme
> like CQ pooling might cause resource problems on large core systems.

Hello Steve,

Had you already noticed the following code in the first patch of this
series?  I think it guarantees that no more than
dev->num_comp_vectors CQs are ever allocated at once:

     nr_cqs = min_t(int, dev->num_comp_vectors, num_possible_cpus())

Bart.


* Re: RFC: CQ pools and implicit CQ resource allocation
       [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
                     ` (7 preceding siblings ...)
  2016-09-11  6:44   ` Sagi Grimberg
@ 2016-09-16  6:02   ` Bart Van Assche
  8 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2016-09-16  6:02 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: sagi-NQWnxTmZq1alnMjI0IkVqw

On 09/09/2016 02:36 PM, Christoph Hellwig wrote:
> This series adds support to the RDMA core to implicitly allocate the required
> CQEs when creating a QP.

Hello Christoph,

Have you already had the chance to test the ib_srpt changes included
in this patch series? I'm asking because running srp-test on top of
a kernel that includes this patch series triggers an OOM message I
had not yet encountered while testing ib_srp or ib_srpt:

ion-dev-ib-ini:~ # ~bart/software/infiniband/srp-test/run_tests -f xfs
SRP LUN /sys/class/scsi_device/19:0:0:1 / sdt: removing /dev/dm-0: done
SRP LUN /sys/class/scsi_device/19:0:0:2 / sdv: removing /dev/dm-1: done
Unloaded the ib_srp kernel module
Unloaded the ib_srpt kernel module
Configured SRP target driver
Running test /home/bart/software/infiniband/srp-test/tests/01 ...
Connection to 10.60.180.187 closed by remote host.
Connection to 10.60.180.187 closed.


The ssh connection was closed by the OOM killer.


From the netconsole output:

[  286.878805] ------------[ cut here ]------------
[  286.878901] WARNING: CPU: 7 PID: 53 at drivers/infiniband/core/cq.c:196 ib_free_cq+0xc0/0xf0 [ib_core]
[  286.886153] CPU: 7 PID: 53 Comm: kworker/7:0 Not tainted 4.8.0-rc6-dbg+ #1
[  286.886206] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
[  286.886259] Workqueue: srp_remove srp_remove_work [ib_srp]
[  286.887924] Call Trace:
[  286.887988]  [<ffffffff81322945>] dump_stack+0x68/0x93
[  286.888046]  [<ffffffff81062e76>] __warn+0xc6/0xe0
[  286.888105]  [<ffffffff81062f48>] warn_slowpath_null+0x18/0x20
[  286.888177]  [<ffffffffa04e5e90>] ib_free_cq+0xc0/0xf0 [ib_core]
[  286.888237]  [<ffffffffa027b787>] srp_free_ch_ib.isra.39+0xa7/0x1a0 [ib_srp]
[  286.888301]  [<ffffffffa027bc04>] srp_remove_work+0xd4/0x1f0 [ib_srp]
[  286.888362]  [<ffffffff81080425>] process_one_work+0x1f5/0x690
[  286.888472]  [<ffffffff81080909>] worker_thread+0x49/0x490
[  286.888591]  [<ffffffff810870ea>] kthread+0xea/0x100
[  286.888652]  [<ffffffff8162b5bf>] ret_from_fork+0x1f/0x40
[  286.889383] ---[ end trace 57bd282839f289df ]---
[ ... ]
[  295.458538] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0xe41d2d03000a85b1, t_port_id 0xe41d2d03000a85b0:0xe41d2d03000a85b0 and it_iu_len 4148 on port 1 (guid=0xfe80000000000000:0xe41d2d03000a85b1)
[  295.658698] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x516d0a00032d1de4:0x2c90300a02dc1, t_port_id 0xe41d2d03000a85b0:0xe41d2d03000a85b0 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0xe41d2d03000a6d51)
[  295.660820] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x516d0a00032d1de4:0x2c90300a02dc2, t_port_id 0xe41d2d03000a85b0:0xe41d2d03000a85b0 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0xe41d2d03000a6d51)
[  305.335833] systemd-udevd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=-1000
[  305.335904] systemd-udevd cpuset=/ mems_allowed=0
[  305.336037] CPU: 10 PID: 567 Comm: systemd-udevd Tainted: G        W       4.8.0-rc6-dbg+ #1
[  305.336088] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
[  305.336789] Call Trace:
[  305.336847]  [<ffffffff81322945>] dump_stack+0x68/0x93
[  305.336907]  [<ffffffff811d47e9>] dump_header+0x57/0x229
[  305.337022]  [<ffffffff811591bd>] oom_kill_process+0x20d/0x3e0
[  305.337078]  [<ffffffff811598f2>] out_of_memory+0x512/0x530
[  305.337192]  [<ffffffff8115eeab>] __alloc_pages_nodemask+0xe5b/0x1000
[  305.337252]  [<ffffffff81060312>] copy_process.part.45+0x102/0x1ac0
[  305.337418]  [<ffffffff81061eb1>] _do_fork+0xe1/0x6d0
[  305.337588]  [<ffffffff81062524>] SyS_clone+0x14/0x20
[  305.337642]  [<ffffffff81002c93>] do_syscall_64+0x53/0x120
[  305.337697]  [<ffffffff8162b45a>] entry_SYSCALL64_slow_path+0x25/0x25

Thanks,

Bart.


end of thread (newest: 2016-09-16  6:02 UTC)

Thread overview: 17+ messages
2016-09-09 12:36 RFC: CQ pools and implicit CQ resource allocation Christoph Hellwig
     [not found] ` <1473424587-13818-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
2016-09-09 12:36   ` [PATCH 1/6] IB/core: add implicit CQ allocation Christoph Hellwig
2016-09-09 12:36   ` [PATCH 2/6] nvmet-rdma: use " Christoph Hellwig
2016-09-09 12:36   ` [PATCH 3/6] IB/isert: " Christoph Hellwig
2016-09-09 12:36   ` [PATCH 4/6] IB/srpt: " Christoph Hellwig
2016-09-09 12:36   ` [PATCH 5/6] IB/iser: " Christoph Hellwig
2016-09-09 12:36   ` [PATCH 6/6] nvme-rdma: " Christoph Hellwig
2016-09-11  6:39   ` RFC: CQ pools and implicit CQ resource allocation Sagi Grimberg
2016-09-11  6:44   ` Sagi Grimberg
     [not found]     ` <d0c645eb-3674-2841-bdb7-8b9e6fd46473-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-09-12 14:42       ` Steve Wise
2016-09-12 20:07         ` Sagi Grimberg
     [not found]           ` <6dd3ecbd-7109-95ff-9c86-dfea9e515538-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-09-12 20:22             ` Steve Wise
2016-09-12 21:12               ` Sagi Grimberg
2016-09-14  3:29         ` Bart Van Assche
2016-09-12 15:03       ` Chuck Lever
     [not found]         ` <BCF7E723-FC50-4497-9E0B-1157CF2D4185-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-09-12 20:25           ` Sagi Grimberg
2016-09-16  6:02   ` Bart Van Assche
