public inbox for linux-rdma@vger.kernel.org
* [PATCH 0/8] IB/srp bug fixes
@ 2017-02-10 23:56 Bart Van Assche
  2017-02-10 23:56 ` [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug Bart Van Assche
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche

Hello Doug,

The patches in this series are the initiator patches I came up with while
testing the SRP initiator and target drivers. Please consider these patches
for inclusion in the upstream kernel.

Sorry for sending these patches so close to the merge window. If that makes
it too late to include them in the first kernel v4.11 pull request, that's
fine with me.

Bart Van Assche (8):
  IB/srp: Avoid that duplicate responses trigger a kernel bug
  IB/srp: Fix race conditions related to task management
  IB/srp: Document locking conventions
  IB/srp: Make a diagnostic message more informative
  IB/srp: Improve an error path
  IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
  IB/core: Add support for draining IB_POLL_DIRECT completion queues
  IB/srp: Drain the send queue before destroying a QP

 drivers/infiniband/core/verbs.c     |  35 +++++-----
 drivers/infiniband/ulp/srp/ib_srp.c | 129 ++++++++++++++++++++++++------------
 drivers/infiniband/ulp/srp/ib_srp.h |   1 +
 3 files changed, 103 insertions(+), 62 deletions(-)

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
  2017-02-10 23:56 [PATCH 0/8] IB/srp bug fixes Bart Van Assche
@ 2017-02-10 23:56 ` Bart Van Assche
  2017-02-12 17:05   ` Leon Romanovsky
  2017-02-10 23:56 ` [PATCH 2/8] IB/srp: Fix race conditions related to task management Bart Van Assche
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy,
	Laurence Oberman, Steve Feeley, stable

After srp_process_rsp() returns there is a short window during which
scsi_host_find_tag() still returns a pointer to the SCSI command that
is being completed. If a duplicate response is received during that
window, scmnd->host_scribble is NULL. Check for that case so that the
following kernel oops is no longer triggered:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: srp_recv_done+0x450/0x6b0 [ib_srp]
Oops: 0000 [#1] SMP
CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.10.0-rc7-dbg+ #1
Call Trace:
 <IRQ>
 __ib_process_cq+0x4b/0xd0 [ib_core]
 ib_poll_handler+0x1d/0x70 [ib_core]
 irq_poll_softirq+0xba/0x120
 __do_softirq+0xba/0x4c0
 irq_exit+0xbe/0xd0
 smp_apic_timer_interrupt+0x38/0x50
 apic_timer_interrupt+0x90/0xa0
 </IRQ>
 cpuidle_enter_state+0xf2/0x370
 cpuidle_enter+0x12/0x20
 call_cpuidle+0x1e/0x40
 do_idle+0xe3/0x1c0
 cpu_startup_entry+0x18/0x20
 start_secondary+0x103/0x130
 start_cpu+0x14/0x14
RIP: srp_recv_done+0x450/0x6b0 [ib_srp] RSP: ffff88046f483e20

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Steve Feeley <Steve.Feeley@sandisk.com>
Cc: <stable@vger.kernel.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf48477ddb..4068d34f5427 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
 		if (scmnd) {
 			req = (void *)scmnd->host_scribble;
-			scmnd = srp_claim_req(ch, req, NULL, scmnd);
+			if (req) {
+				scmnd = srp_claim_req(ch, req, NULL, scmnd);
+			} else {
+				shost_printk(KERN_ERR, target->scsi_host,
+					     "NULL host_scribble for response with tag %#llx\n",
+					     rsp->tag);
+				scmnd = NULL;
+			}
 		}
 		if (!scmnd) {
 			shost_printk(KERN_ERR, target->scsi_host,
-- 
2.11.0
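
For readers following the thread, the guard added by this patch can be
modelled in user space. The sketch below is not the driver code: the
model_* types and names are hypothetical stand-ins, and it assumes that
completing a command clears host_scribble, so that a second response for
the same tag finds a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct srp_request and struct scsi_cmnd. */
struct model_req {
	int in_flight;
};

struct model_cmd {
	struct model_req *host_scribble;
};

/*
 * Return the request to complete, or NULL if the tag lookup failed or
 * the command has already been claimed by an earlier response.
 */
static struct model_req *model_claim(struct model_cmd *cmd)
{
	struct model_req *req;

	if (!cmd)
		return NULL;		/* models a failed tag lookup */

	req = cmd->host_scribble;
	if (!req) {
		/* Duplicate response: report it instead of dereferencing. */
		fprintf(stderr, "NULL host_scribble, response ignored\n");
		return NULL;
	}

	cmd->host_scribble = NULL;	/* assumption: completion clears it */
	req->in_flight = 0;
	return req;
}
```

Calling model_claim() twice on the same command returns the request the
first time and NULL the second time, which is the behaviour the patch is
aiming for when a target sends a duplicate response.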

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/8] IB/srp: Fix race conditions related to task management
  2017-02-10 23:56 [PATCH 0/8] IB/srp bug fixes Bart Van Assche
  2017-02-10 23:56 ` [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug Bart Van Assche
@ 2017-02-10 23:56 ` Bart Van Assche
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy,
	Laurence Oberman, Steve Feeley, stable

Prevent srp_process_rsp() from overwriting the status information
in ch if the SRP target response timed out and processing of
another task management function has already started. Also prevent
list corruption when multiple task management functions are issued
concurrently. This patch prevents the following warning from
appearing in the system log:

WARNING: CPU: 8 PID: 9269 at lib/list_debug.c:52 __list_del_entry_valid+0xbc/0xc0
list_del corruption. prev->next should be ffffc90004bb7b00, but was ffff8804052ecc68
CPU: 8 PID: 9269 Comm: sg_reset Tainted: G        W       4.10.0-rc7-dbg+ #3
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xc6/0xe0
 warn_slowpath_fmt+0x4a/0x50
 __list_del_entry_valid+0xbc/0xc0
 wait_for_completion_timeout+0x12e/0x170
 srp_send_tsk_mgmt+0x1ef/0x2d0 [ib_srp]
 srp_reset_device+0x5b/0x110 [ib_srp]
 scsi_ioctl_reset+0x1c7/0x290
 scsi_ioctl+0x12a/0x420
 sd_ioctl+0x9d/0x100
 blkdev_ioctl+0x51e/0x9f0
 block_ioctl+0x38/0x40
 do_vfs_ioctl+0x8f/0x700
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x18/0xad

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Steve Feeley <Steve.Feeley@sandisk.com>
Cc: <stable@vger.kernel.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 45 ++++++++++++++++++++++++-------------
 drivers/infiniband/ulp/srp/ib_srp.h |  1 +
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 4068d34f5427..511eb4b2e6e0 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1889,12 +1889,17 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 	if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) {
 		spin_lock_irqsave(&ch->lock, flags);
 		ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
+		if (rsp->tag == ch->tsk_mgmt_tag) {
+			ch->tsk_mgmt_status = -1;
+			if (be32_to_cpu(rsp->resp_data_len) >= 4)
+				ch->tsk_mgmt_status = rsp->data[3];
+			complete(&ch->tsk_mgmt_done);
+		} else {
+			shost_printk(KERN_ERR, target->scsi_host,
+				     "Received tsk mgmt response too late for tag %#llx\n",
+				     rsp->tag);
+		}
 		spin_unlock_irqrestore(&ch->lock, flags);
-
-		ch->tsk_mgmt_status = -1;
-		if (be32_to_cpu(rsp->resp_data_len) >= 4)
-			ch->tsk_mgmt_status = rsp->data[3];
-		complete(&ch->tsk_mgmt_done);
 	} else {
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
 		if (scmnd) {
@@ -2538,19 +2543,18 @@ srp_change_queue_depth(struct scsi_device *sdev, int qdepth)
 }
 
 static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
-			     u8 func)
+			     u8 func, u8 *status)
 {
 	struct srp_target_port *target = ch->target;
 	struct srp_rport *rport = target->rport;
 	struct ib_device *dev = target->srp_host->srp_dev->dev;
 	struct srp_iu *iu;
 	struct srp_tsk_mgmt *tsk_mgmt;
+	int res;
 
 	if (!ch->connected || target->qp_in_error)
 		return -1;
 
-	init_completion(&ch->tsk_mgmt_done);
-
 	/*
 	 * Lock the rport mutex to avoid that srp_create_ch_ib() is
 	 * invoked while a task management function is being sent.
@@ -2573,10 +2577,16 @@ static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
 
 	tsk_mgmt->opcode 	= SRP_TSK_MGMT;
 	int_to_scsilun(lun, &tsk_mgmt->lun);
-	tsk_mgmt->tag		= req_tag | SRP_TAG_TSK_MGMT;
 	tsk_mgmt->tsk_mgmt_func = func;
 	tsk_mgmt->task_tag	= req_tag;
 
+	spin_lock_irq(&ch->lock);
+	ch->tsk_mgmt_tag = (ch->tsk_mgmt_tag + 1) | SRP_TAG_TSK_MGMT;
+	tsk_mgmt->tag = ch->tsk_mgmt_tag;
+	spin_unlock_irq(&ch->lock);
+
+	init_completion(&ch->tsk_mgmt_done);
+
 	ib_dma_sync_single_for_device(dev, iu->dma, sizeof *tsk_mgmt,
 				      DMA_TO_DEVICE);
 	if (srp_post_send(ch, iu, sizeof(*tsk_mgmt))) {
@@ -2585,13 +2595,15 @@ static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
 
 		return -1;
 	}
+	res = wait_for_completion_timeout(&ch->tsk_mgmt_done,
+					msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS));
+	if (res > 0 && status)
+		*status = ch->tsk_mgmt_status;
 	mutex_unlock(&rport->mutex);
 
-	if (!wait_for_completion_timeout(&ch->tsk_mgmt_done,
-					 msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)))
-		return -1;
+	WARN_ON_ONCE(res < 0);
 
-	return 0;
+	return res > 0 ? 0 : -1;
 }
 
 static int srp_abort(struct scsi_cmnd *scmnd)
@@ -2617,7 +2629,7 @@ static int srp_abort(struct scsi_cmnd *scmnd)
 	shost_printk(KERN_ERR, target->scsi_host,
 		     "Sending SRP abort for tag %#x\n", tag);
 	if (srp_send_tsk_mgmt(ch, tag, scmnd->device->lun,
-			      SRP_TSK_ABORT_TASK) == 0)
+			      SRP_TSK_ABORT_TASK, NULL) == 0)
 		ret = SUCCESS;
 	else if (target->rport->state == SRP_RPORT_LOST)
 		ret = FAST_IO_FAIL;
@@ -2635,14 +2647,15 @@ static int srp_reset_device(struct scsi_cmnd *scmnd)
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
 	struct srp_rdma_ch *ch;
 	int i;
+	u8 status;
 
 	shost_printk(KERN_ERR, target->scsi_host, "SRP reset_device called\n");
 
 	ch = &target->ch[0];
 	if (srp_send_tsk_mgmt(ch, SRP_TAG_NO_REQ, scmnd->device->lun,
-			      SRP_TSK_LUN_RESET))
+			      SRP_TSK_LUN_RESET, &status))
 		return FAILED;
-	if (ch->tsk_mgmt_status)
+	if (status)
 		return FAILED;
 
 	for (i = 0; i < target->ch_count; i++) {
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 21c69695f9d4..32ed40db3ca2 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -163,6 +163,7 @@ struct srp_rdma_ch {
 	int			max_ti_iu_len;
 	int			comp_vector;
 
+	u64			tsk_mgmt_tag;
 	struct completion	tsk_mgmt_done;
 	u8			tsk_mgmt_status;
 	bool			connected;
-- 
2.11.0
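
The tag matching introduced above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the
model_* names and the MODEL_TAG_TSK_MGMT value are hypothetical, and the
locking that the driver does with ch->lock is left out because the model
is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit that marks a tag as a task management tag. */
#define MODEL_TAG_TSK_MGMT (1ULL << 31)

struct model_ch {
	uint64_t tsk_mgmt_tag;	/* tag of the request currently awaited */
	int tsk_mgmt_status;
	bool done;		/* stands in for the completion object */
};

/* Pick a fresh tag for the next task management request. */
static uint64_t model_next_tag(struct model_ch *ch)
{
	ch->tsk_mgmt_tag = (ch->tsk_mgmt_tag + 1) | MODEL_TAG_TSK_MGMT;
	ch->done = false;
	return ch->tsk_mgmt_tag;
}

/*
 * Handle a task management response. Returns true if the response
 * belongs to the request that is currently awaited, false if it
 * arrived too late and must not touch the channel state.
 */
static bool model_process_rsp(struct model_ch *ch, uint64_t tag, int status)
{
	if (tag != ch->tsk_mgmt_tag)
		return false;	/* stale response: leave status alone */

	ch->tsk_mgmt_status = status;
	ch->done = true;
	return true;
}
```

In this model a response carrying the tag of an earlier, timed-out
request is rejected, so it can neither overwrite the status nor signal
completion for the request that is now outstanding.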

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 3/8] IB/srp: Document locking conventions
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-10 23:56   ` Bart Van Assche
  2017-02-10 23:56   ` [PATCH 4/8] IB/srp: Make a diagnostic message more informative Bart Van Assche
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Israel Rukshin, Max Gurtovoy, Laurence Oberman

Use lockdep_assert_held() statements to verify at run-time
whether the proper locks are held.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 511eb4b2e6e0..a43db9d6b399 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -40,6 +40,7 @@
 #include <linux/parser.h>
 #include <linux/random.h>
 #include <linux/jiffies.h>
+#include <linux/lockdep.h>
 #include <rdma/ib_cache.h>
 
 #include <linux/atomic.h>
@@ -1804,6 +1805,8 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch,
 	s32 rsv = (iu_type == SRP_IU_TSK_MGMT) ? 0 : SRP_TSK_MGMT_SQ_SIZE;
 	struct srp_iu *iu;
 
+	lockdep_assert_held(&ch->lock);
+
 	ib_process_cq_direct(ch->send_cq, -1);
 
 	if (list_empty(&ch->free_tx))
@@ -1834,6 +1837,8 @@ static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	lockdep_assert_held(&ch->lock);
+
 	list_add(&iu->list, &ch->free_tx);
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 4/8] IB/srp: Make a diagnostic message more informative
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2017-02-10 23:56   ` [PATCH 3/8] IB/srp: Document locking conventions Bart Van Assche
@ 2017-02-10 23:56   ` Bart Van Assche
  2017-02-10 23:56   ` [PATCH 5/8] IB/srp: Improve an error path Bart Van Assche
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Israel Rukshin, Max Gurtovoy, Laurence Oberman

Report the destination port GID if connecting fails.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index a43db9d6b399..d21611a4e90f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3446,9 +3446,10 @@ static ssize_t srp_create_target(struct device *dev,
 			ret = srp_connect_ch(ch, multich);
 			if (ret) {
 				shost_printk(KERN_ERR, target->scsi_host,
-					     PFX "Connection %d/%d failed\n",
+					     PFX "Connection %d/%d to %pI6 failed\n",
 					     ch_start + cpu_idx,
-					     target->ch_count);
+					     target->ch_count,
+					     ch->target->orig_dgid.raw);
 				if (node_idx == 0 && cpu_idx == 0) {
 					goto err_disconnect;
 				} else {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 5/8] IB/srp: Improve an error path
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2017-02-10 23:56   ` [PATCH 3/8] IB/srp: Document locking conventions Bart Van Assche
  2017-02-10 23:56   ` [PATCH 4/8] IB/srp: Make a diagnostic message more informative Bart Van Assche
@ 2017-02-10 23:56   ` Bart Van Assche
  2017-02-10 23:56   ` [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported Bart Van Assche
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Israel Rukshin, Max Gurtovoy, Laurence Oberman

Do not print the following message if login fails:

scsi host0: ib_srp: Sending CM DREQ failed

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index d21611a4e90f..87efb702b1c6 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3451,7 +3451,7 @@ static ssize_t srp_create_target(struct device *dev,
 					     target->ch_count,
 					     ch->target->orig_dgid.raw);
 				if (node_idx == 0 && cpu_idx == 0) {
-					goto err_disconnect;
+					goto free_ch;
 				} else {
 					srp_free_ch_ib(target, ch);
 					srp_free_req_data(target, ch);
@@ -3498,6 +3498,7 @@ static ssize_t srp_create_target(struct device *dev,
 err_disconnect:
 	srp_disconnect_target(target);
 
+free_ch:
 	for (i = 0; i < target->ch_count; i++) {
 		ch = &target->ch[i];
 		srp_free_ch_ib(target, ch);
-- 
2.11.0
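
The effect of the extra label can be shown with a stand-alone model of
the unwind path. The sketch below is an illustration only: the
model_target structure and its fields are hypothetical, and dreq_sent
merely stands in for the disconnect step that srp_disconnect_target()
performs in the driver.

```c
#include <assert.h>

/* Hypothetical model of the state touched by the error path. */
struct model_target {
	int channels_allocated;
	int connected;
	int dreq_sent;		/* set when the disconnect step runs */
};

static int model_create_target(struct model_target *t, int connect_ok,
			       int later_step_ok)
{
	t->channels_allocated = 1;

	if (!connect_ok)
		goto free_ch;		/* never connected: skip disconnect */
	t->connected = 1;

	if (!later_step_ok)
		goto err_disconnect;	/* connected: disconnect first */

	return 0;

err_disconnect:
	t->dreq_sent = 1;
	t->connected = 0;

free_ch:
	t->channels_allocated = 0;
	return -1;
}
```

A failed connect attempt now releases the channel resources without
running the disconnect step, while failures after a successful connect
still go through both labels.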

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-02-10 23:56   ` [PATCH 5/8] IB/srp: Improve an error path Bart Van Assche
@ 2017-02-10 23:56   ` Bart Van Assche
  2017-02-10 23:56   ` [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues Bart Van Assche
  2017-02-10 23:56   ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
  5 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Israel Rukshin, Max Gurtovoy, Laurence Oberman

If a HCA supports the SG_GAPS_REG feature then a single memory
region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
reduces the number of memory regions that are allocated per SRP
session.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 43 ++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 87efb702b1c6..2f85255d2aca 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3356,25 +3356,34 @@ static ssize_t srp_create_target(struct device *dev,
 	}
 
 	if (srp_dev->use_fast_reg || srp_dev->use_fmr) {
-		/*
-		 * FR and FMR can only map one HCA page per entry. If the
-		 * start address is not aligned on a HCA page boundary two
-		 * entries will be used for the head and the tail although
-		 * these two entries combined contain at most one HCA page of
-		 * data. Hence the "+ 1" in the calculation below.
-		 *
-		 * The indirect data buffer descriptor is contiguous so the
-		 * memory for that buffer will only be registered if
-		 * register_always is true. Hence add one to mr_per_cmd if
-		 * register_always has been set.
-		 */
+		bool gaps_reg = (ibdev->attrs.device_cap_flags &
+				 IB_DEVICE_SG_GAPS_REG);
+
 		max_sectors_per_mr = srp_dev->max_pages_per_mr <<
 				  (ilog2(srp_dev->mr_page_size) - 9);
-		mr_per_cmd = register_always +
-			(target->scsi_host->max_sectors + 1 +
-			 max_sectors_per_mr - 1) / max_sectors_per_mr;
-		pr_debug("max_sectors = %u; max_pages_per_mr = %u; mr_page_size = %u; max_sectors_per_mr = %u; mr_per_cmd = %u\n",
-			 target->scsi_host->max_sectors,
+		if (!gaps_reg) {
+			/*
+			 * FR and FMR can only map one HCA page per entry. If
+			 * the start address is not aligned on a HCA page
+			 * boundary two entries will be used for the head and
+			 * the tail although these two entries combined
+			 * contain at most one HCA page of data. Hence the "+
+			 * 1" in the calculation below.
+			 *
+			 * The indirect data buffer descriptor is contiguous
+			 * so the memory for that buffer will only be
+			 * registered if register_always is true. Hence add
+			 * one to mr_per_cmd if register_always has been set.
+			 */
+			mr_per_cmd = register_always +
+				(target->scsi_host->max_sectors + 1 +
+				 max_sectors_per_mr - 1) / max_sectors_per_mr;
+			mr_per_cmd = max(2U, mr_per_cmd);
+		} else {
+			mr_per_cmd = 1;
+		}
+		pr_debug("IB_DEVICE_SG_GAPS_REG = %d; max_sectors = %u; max_pages_per_mr = %u; mr_page_size = %u; max_sectors_per_mr = %u; mr_per_cmd = %u\n",
+			 gaps_reg, target->scsi_host->max_sectors,
 			 srp_dev->max_pages_per_mr, srp_dev->mr_page_size,
 			 max_sectors_per_mr, mr_per_cmd);
 	}
-- 
2.11.0
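
The mr_per_cmd arithmetic from the hunk above can be checked in user
space. The following is a sketch that mirrors the calculation in the
diff; model_ilog2() and model_mr_per_cmd() are hypothetical helper names,
and the parameter values in the note below are example inputs rather than
values reported by any particular HCA.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-in for ilog2() on a power-of-two value. */
static unsigned int model_ilog2(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Number of memory regions needed per command, as computed in the diff. */
static unsigned int model_mr_per_cmd(bool gaps_reg, bool register_always,
				     unsigned int max_sectors,
				     unsigned int max_pages_per_mr,
				     unsigned int mr_page_size)
{
	unsigned int max_sectors_per_mr, mr_per_cmd;

	/* A sector is 512 bytes, hence the "- 9" in the shift below. */
	assert(mr_page_size >= 512 && max_pages_per_mr > 0);

	if (gaps_reg)
		return 1;	/* one SG_GAPS region covers the command */

	max_sectors_per_mr = max_pages_per_mr <<
			     (model_ilog2(mr_page_size) - 9);
	mr_per_cmd = register_always +
		(max_sectors + 1 + max_sectors_per_mr - 1) /
		max_sectors_per_mr;

	return mr_per_cmd < 2 ? 2 : mr_per_cmd;
}
```

With 4096-byte pages and 256 pages per MR, one MR covers 2048 sectors.
For max_sectors = 4096 and register_always set, the formula yields
1 + 6144 / 2048 = 4 regions without SG_GAPS_REG and 1 region with it.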

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-02-10 23:56   ` [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported Bart Van Assche
@ 2017-02-10 23:56   ` Bart Van Assche
  2017-02-10 23:56   ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
  5 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche, Steve Wise,
	Chuck Lever, Christoph Hellwig, Max Gurtovoy

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Cc: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/verbs.c | 35 +++++++++++++++--------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 71580cc28c9e..42f8927b542c 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1949,17 +1949,12 @@ static void ib_drain_qp_done(struct ib_cq *cq, struct ib_wc *wc)
  */
 static void __ib_drain_sq(struct ib_qp *qp)
 {
+	struct ib_cq *cq = qp->send_cq;
 	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
 	struct ib_drain_cqe sdrain;
 	struct ib_send_wr swr = {}, *bad_swr;
 	int ret;
 
-	if (qp->send_cq->poll_ctx == IB_POLL_DIRECT) {
-		WARN_ONCE(qp->send_cq->poll_ctx == IB_POLL_DIRECT,
-			  "IB_POLL_DIRECT poll_ctx not supported for drain\n");
-		return;
-	}
-
 	swr.wr_cqe = &sdrain.cqe;
 	sdrain.cqe.done = ib_drain_qp_done;
 	init_completion(&sdrain.done);
@@ -1976,7 +1971,11 @@ static void __ib_drain_sq(struct ib_qp *qp)
 		return;
 	}
 
-	wait_for_completion(&sdrain.done);
+	if (cq->poll_ctx == IB_POLL_DIRECT)
+		while (wait_for_completion_timeout(&sdrain.done, HZ / 10) <= 0)
+			ib_process_cq_direct(cq, -1);
+	else
+		wait_for_completion(&sdrain.done);
 }
 
 /*
@@ -1984,17 +1983,12 @@ static void __ib_drain_sq(struct ib_qp *qp)
  */
 static void __ib_drain_rq(struct ib_qp *qp)
 {
+	struct ib_cq *cq = qp->recv_cq;
 	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
 	struct ib_drain_cqe rdrain;
 	struct ib_recv_wr rwr = {}, *bad_rwr;
 	int ret;
 
-	if (qp->recv_cq->poll_ctx == IB_POLL_DIRECT) {
-		WARN_ONCE(qp->recv_cq->poll_ctx == IB_POLL_DIRECT,
-			  "IB_POLL_DIRECT poll_ctx not supported for drain\n");
-		return;
-	}
-
 	rwr.wr_cqe = &rdrain.cqe;
 	rdrain.cqe.done = ib_drain_qp_done;
 	init_completion(&rdrain.done);
@@ -2011,7 +2005,11 @@ static void __ib_drain_rq(struct ib_qp *qp)
 		return;
 	}
 
-	wait_for_completion(&rdrain.done);
+	if (cq->poll_ctx == IB_POLL_DIRECT)
+		while (wait_for_completion_timeout(&rdrain.done, HZ / 10) <= 0)
+			ib_process_cq_direct(cq, -1);
+	else
+		wait_for_completion(&rdrain.done);
 }
 
 /**
@@ -2028,8 +2026,7 @@ static void __ib_drain_rq(struct ib_qp *qp)
  * ensure there is room in the CQ and SQ for the drain work request and
  * completion.
  *
- * allocate the CQ using ib_alloc_cq() and the CQ poll context cannot be
- * IB_POLL_DIRECT.
+ * allocate the CQ using ib_alloc_cq().
  *
  * ensure that there are no other contexts that are posting WRs concurrently.
  * Otherwise the drain is not guaranteed.
@@ -2057,8 +2054,7 @@ EXPORT_SYMBOL(ib_drain_sq);
  * ensure there is room in the CQ and RQ for the drain work request and
  * completion.
  *
- * allocate the CQ using ib_alloc_cq() and the CQ poll context cannot be
- * IB_POLL_DIRECT.
+ * allocate the CQ using ib_alloc_cq().
  *
  * ensure that there are no other contexts that are posting WRs concurrently.
  * Otherwise the drain is not guaranteed.
@@ -2082,8 +2078,7 @@ EXPORT_SYMBOL(ib_drain_rq);
  * ensure there is room in the CQ(s), SQ, and RQ for drain work requests
  * and completions.
  *
- * allocate the CQs using ib_alloc_cq() and the CQ poll context cannot be
- * IB_POLL_DIRECT.
+ * allocate the CQs using ib_alloc_cq().
  *
  * ensure that there are no other contexts that are posting WRs concurrently.
  * Otherwise the drain is not guaranteed.
-- 
2.11.0
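
Since this patch has no description, the polling loop it adds for
IB_POLL_DIRECT completion queues can be pictured with a single-threaded
model. The sketch below makes several assumptions: all model_* names are
hypothetical, the timed wait is reduced to a flag check, and only the
direct-poll case is modelled, in which no other context processes the CQ
and the waiter has to poll it itself.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CQ_DEPTH 8

struct model_cq {
	int pending[MODEL_CQ_DEPTH];	/* queued completion ids */
	int head, tail;
	int drain_id;			/* id of the drain work request */
	bool drain_done;
};

static void model_post(struct model_cq *cq, int id)
{
	assert(cq->tail < MODEL_CQ_DEPTH);
	cq->pending[cq->tail++] = id;
}

/* Stand-in for ib_process_cq_direct(): handle all queued completions. */
static int model_process_cq_direct(struct model_cq *cq)
{
	int n = 0;

	while (cq->head < cq->tail) {
		if (cq->pending[cq->head++] == cq->drain_id)
			cq->drain_done = true;
		n++;
	}
	return n;
}

/* Stand-in for the timed wait: positive if done, 0 on timeout. */
static long model_wait_timeout(const struct model_cq *cq)
{
	return cq->drain_done ? 1 : 0;
}

/* Post the drain request, then alternate between waiting and polling. */
static int model_drain(struct model_cq *cq, int drain_id)
{
	int polls = 0;

	cq->drain_id = drain_id;
	cq->drain_done = false;
	model_post(cq, drain_id);

	while (model_wait_timeout(cq) <= 0) {
		model_process_cq_direct(cq);
		polls++;
	}
	return polls;
}
```

In the model the first wait times out because nothing has polled the CQ
yet; one polling pass then consumes the earlier completions and the drain
completion, and the next wait succeeds.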

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-02-10 23:56   ` [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues Bart Van Assche
@ 2017-02-10 23:56   ` Bart Van Assche
       [not found]     ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  5 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Christoph Hellwig, Israel Rukshin, Max Gurtovoy, Laurence Oberman

A quote from the IB spec:

However, if the Consumer does not wait for the Affiliated Asynchronous
Last WQE Reached Event, then WQE and Data Segment leakage may occur.
Therefore, it is good programming practice to tear down a QP that is
associated with an SRQ by using the following process:
* Put the QP in the Error State;
* wait for the Affiliated Asynchronous Last WQE Reached Event;
* either:
  * drain the CQ by invoking the Poll CQ verb and either wait for CQ
    to be empty or the number of Poll CQ operations has exceeded CQ
    capacity size; or
  * post another WR that completes on the same CQ and wait for this WR to return as a WC;
* and then invoke a Destroy QP or Reset QP.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 2f85255d2aca..b50733910f7e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
  * completion handler can access the queue pair while it is
  * being destroyed.
  */
-static void srp_destroy_qp(struct ib_qp *qp)
+static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
 {
-	ib_drain_rq(qp);
+	spin_lock_irq(&ch->lock);
+	ib_process_cq_direct(ch->send_cq, -1);
+	spin_unlock_irq(&ch->lock);
+
+	ib_drain_qp(qp);
 	ib_destroy_qp(qp);
 }
 
@@ -547,7 +551,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
 	}
 
 	if (ch->qp)
-		srp_destroy_qp(ch->qp);
+		srp_destroy_qp(ch, ch->qp);
 	if (ch->recv_cq)
 		ib_free_cq(ch->recv_cq);
 	if (ch->send_cq)
@@ -571,7 +575,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
 	return 0;
 
 err_qp:
-	srp_destroy_qp(qp);
+	srp_destroy_qp(ch, qp);
 
 err_send_cq:
 	ib_free_cq(send_cq);
@@ -614,7 +618,7 @@ static void srp_free_ch_ib(struct srp_target_port *target,
 			ib_destroy_fmr_pool(ch->fmr_pool);
 	}
 
-	srp_destroy_qp(ch->qp);
+	srp_destroy_qp(ch, ch->qp);
 	ib_free_cq(ch->send_cq);
 	ib_free_cq(ch->recv_cq);
 
@@ -1827,6 +1831,11 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch,
 	return iu;
 }
 
+/*
+ * Note: if this function is called from inside ib_drain_sq() then it will
+ * be called without ch->lock being held. If ib_drain_sq() dequeues a WQE
+ * with status IB_WC_SUCCESS then that's a bug.
+ */
 static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]     ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-11  0:07       ` Robert LeBlanc
       [not found]         ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-12 17:19       ` Leon Romanovsky
  1 sibling, 1 reply; 47+ messages in thread
From: Robert LeBlanc @ 2017-02-11  0:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, linux-rdma, Christoph Hellwig, Israel Rukshin,
	Max Gurtovoy, Laurence Oberman

On Fri, Feb 10, 2017 at 4:56 PM, Bart Van Assche
<bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> A quote from the IB spec:
>
> However, if the Consumer does not wait for the Affiliated Asynchronous
> Last WQE Reached Event, then WQE and Data Segment leakage may occur.
> Therefore, it is good programming practice to tear down a QP that is
> associated with an SRQ by using the following process:
> * Put the QP in the Error State;
> * wait for the Affiliated Asynchronous Last WQE Reached Event;
> * either:
>   * drain the CQ by invoking the Poll CQ verb and either wait for CQ
>     to be empty or the number of Poll CQ operations has exceeded CQ
>     capacity size; or
>   * post another WR that completes on the same CQ and wait for this WR to return as a WC;
> * and then invoke a Destroy QP or Reset QP.
>
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 2f85255d2aca..b50733910f7e 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
>   * completion handler can access the queue pair while it is
>   * being destroyed.
>   */
> -static void srp_destroy_qp(struct ib_qp *qp)
> +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
>  {
> -       ib_drain_rq(qp);
> +       spin_lock_irq(&ch->lock);
> +       ib_process_cq_direct(ch->send_cq, -1);
> +       spin_unlock_irq(&ch->lock);
> +
> +       ib_drain_qp(qp);
>         ib_destroy_qp(qp);
>  }
>
> @@ -547,7 +551,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
>         }
>
>         if (ch->qp)
> -               srp_destroy_qp(ch->qp);
> +               srp_destroy_qp(ch, ch->qp);
>         if (ch->recv_cq)
>                 ib_free_cq(ch->recv_cq);
>         if (ch->send_cq)
> @@ -571,7 +575,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
>         return 0;
>
>  err_qp:
> -       srp_destroy_qp(qp);
> +       srp_destroy_qp(ch, qp);
>
>  err_send_cq:
>         ib_free_cq(send_cq);
> @@ -614,7 +618,7 @@ static void srp_free_ch_ib(struct srp_target_port *target,
>                         ib_destroy_fmr_pool(ch->fmr_pool);
>         }
>
> -       srp_destroy_qp(ch->qp);
> +       srp_destroy_qp(ch, ch->qp);
>         ib_free_cq(ch->send_cq);
>         ib_free_cq(ch->recv_cq);
>
> @@ -1827,6 +1831,11 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch,
>         return iu;
>  }
>
> +/*
> + * Note: if this function is called from inside ib_drain_sq() then it will

Don't you mean outside of ib_drain_sq?

> + * be called without ch->lock being held. If ib_drain_sq() dequeues a WQE
> + * with status IB_WC_SUCCESS then that's a bug.
> + */
>  static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
>  {
>         struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe);
> --
> 2.11.0

Sagi,

Does something like this need to happen for iSER as well? Maybe it
could help with the D state problem?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]         ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-11  0:13           ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-11  0:13 UTC (permalink / raw)
  To: robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Fri, 2017-02-10 at 17:07 -0700, Robert LeBlanc wrote:
> > +/*
> > + * Note: if this function is called from inside ib_drain_sq() then it will
> 
> Don't you mean outside of ib_drain_sq?

I meant inside.

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
  2017-02-10 23:56 ` [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug Bart Van Assche
@ 2017-02-12 17:05   ` Leon Romanovsky
  2017-02-12 20:07     ` Bart Van Assche
  0 siblings, 1 reply; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-12 17:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, linux-rdma, Israel Rukshin, Max Gurtovoy,
	Laurence Oberman, Steve Feeley, stable

On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> After srp_process_rsp() returns there is a short time during which
> the scsi_host_find_tag() call will return a pointer to the SCSI
> command that is being completed. If during that time a duplicate
> response is received, avoid that the following call stack appears:
>
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: srp_recv_done+0x450/0x6b0 [ib_srp]
> Oops: 0000 [#1] SMP
> CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.10.0-rc7-dbg+ #1
> Call Trace:
>  <IRQ>
>  __ib_process_cq+0x4b/0xd0 [ib_core]
>  ib_poll_handler+0x1d/0x70 [ib_core]
>  irq_poll_softirq+0xba/0x120
>  __do_softirq+0xba/0x4c0
>  irq_exit+0xbe/0xd0
>  smp_apic_timer_interrupt+0x38/0x50
>  apic_timer_interrupt+0x90/0xa0
>  </IRQ>
>  cpuidle_enter_state+0xf2/0x370
>  cpuidle_enter+0x12/0x20
>  call_cpuidle+0x1e/0x40
>  do_idle+0xe3/0x1c0
>  cpu_startup_entry+0x18/0x20
>  start_secondary+0x103/0x130
>  start_cpu+0x14/0x14
> RIP: srp_recv_done+0x450/0x6b0 [ib_srp] RSP: ffff88046f483e20
>
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Israel Rukshin <israelr@mellanox.com>
> Cc: Max Gurtovoy <maxg@mellanox.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Steve Feeley <Steve.Feeley@sandisk.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 79bf48477ddb..4068d34f5427 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
>  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
>  		if (scmnd) {
>  			req = (void *)scmnd->host_scribble;
> -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> +			if (req) {
> +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> +			} else {
> +				shost_printk(KERN_ERR, target->scsi_host,
> +					     "NULL host_scribble for response with tag %#llx\n",
> +					     rsp->tag);
> +				scmnd = NULL;
> +			}
>  		}
>  		if (!scmnd) {
>  			shost_printk(KERN_ERR, target->scsi_host,

Your new message and the message below will both be printed in this case,
because scmnd will be NULL.

How about checking "if (scmnd && scmnd->host_scribble)" instead of your
proposed patch?
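
To illustrate the combined check, here is a minimal userspace sketch. The
struct scsi_cmnd_stub type and the claim_req_for_rsp() helper are hypothetical
stand-ins invented for this example; they are not the ib_srp code, and the
real driver would still have to call srp_claim_req() on the result.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct scsi_cmnd; only the field used here. */
struct scsi_cmnd_stub {
	void *host_scribble;	/* assumed NULL once the command has completed */
};

/*
 * Return the request attached to a command, or NULL if the tag lookup
 * failed or the command carries no request (e.g. a duplicate response).
 * Both failure cases share one test, so only one message is printed.
 */
void *claim_req_for_rsp(struct scsi_cmnd_stub *scmnd)
{
	if (scmnd && scmnd->host_scribble)
		return scmnd->host_scribble;

	fprintf(stderr, "Null scmnd or host_scribble for response\n");
	return NULL;
}
```

With a single test, the caller has one error path and one log line instead of
two.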

Thanks

> --
> 2.11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]     ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2017-02-11  0:07       ` Robert LeBlanc
@ 2017-02-12 17:19       ` Leon Romanovsky
       [not found]         ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-12 17:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Israel Rukshin, Max Gurtovoy, Laurence Oberman

On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> A quote from the IB spec:
>
> However, if the Consumer does not wait for the Affiliated Asynchronous
> Last WQE Reached Event, then WQE and Data Segment leakage may occur.
> Therefore, it is good programming practice to tear down a QP that is
> associated with an SRQ by using the following process:
> * Put the QP in the Error State;
> * wait for the Affiliated Asynchronous Last WQE Reached Event;
> * either:
>   * drain the CQ by invoking the Poll CQ verb and either wait for CQ
>     to be empty or the number of Poll CQ operations has exceeded CQ
>     capacity size; or
>   * post another WR that completes on the same CQ and wait for this WR to return as a WC;
> * and then invoke a Destroy QP or Reset QP.
>
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 2f85255d2aca..b50733910f7e 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
>   * completion handler can access the queue pair while it is
>   * being destroyed.
>   */
> -static void srp_destroy_qp(struct ib_qp *qp)
> +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
>  {
> -	ib_drain_rq(qp);
> +	spin_lock_irq(&ch->lock);
> +	ib_process_cq_direct(ch->send_cq, -1);

I see that you are already using "-1" in your code, but the comment above
ib_process_cq_direct() states that no new code should use "-1".

 * Note: for compatibility reasons -1 can be passed in %budget for unlimited
 * polling.  Do not use this feature in new code, it will be removed soon.
 */
int ib_process_cq_direct(struct ib_cq *cq, int budget)
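
One possible way to avoid the unlimited budget is to poll in bounded batches
until a batch comes back short. The sketch below is a userspace model only:
struct cq_stub, process_cq_direct_stub() and drain_cq_bounded() are
hypothetical names, and the stub assumes that the polling function returns the
number of completions it processed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a completion queue: a count of pending entries. */
struct cq_stub {
	int pending;
};

/*
 * Stand-in for a budgeted poll: process at most budget completions and
 * return how many were processed.
 */
int process_cq_direct_stub(struct cq_stub *cq, int budget)
{
	int n = cq->pending < budget ? cq->pending : budget;

	cq->pending -= n;
	return n;
}

/*
 * Poll in batches of at most batch entries. A short batch means the queue
 * was empty at that moment, so the loop stops. Returns the total processed.
 */
int drain_cq_bounded(struct cq_stub *cq, int batch)
{
	int total = 0, n;

	if (batch <= 0)
		return 0;	/* a non-positive batch would never terminate */

	do {
		n = process_cq_direct_stub(cq, batch);
		total += n;
	} while (n == batch);

	return total;
}
```

In the driver the batch size would be a design choice, and the loop would have
to run under whatever lock protects the send CQ.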

Thanks

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]         ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-02-12 18:02           ` Laurence Oberman
       [not found]             ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-02-12 20:11           ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
  1 sibling, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-12 18:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Israel Rukshin, Max Gurtovoy

----- Original Message -----
> From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> To: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel
> Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Sent: Sunday, February 12, 2017 12:19:28 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > A quote from the IB spec:
> >
> > However, if the Consumer does not wait for the Affiliated Asynchronous
> > Last WQE Reached Event, then WQE and Data Segment leakage may occur.
> > Therefore, it is good programming practice to tear down a QP that is
> > associated with an SRQ by using the following process:
> > * Put the QP in the Error State;
> > * wait for the Affiliated Asynchronous Last WQE Reached Event;
> > * either:
> >   * drain the CQ by invoking the Poll CQ verb and either wait for CQ
> >     to be empty or the number of Poll CQ operations has exceeded CQ
> >     capacity size; or
> >   * post another WR that completes on the same CQ and wait for this WR to
> >   return as a WC;
> > * and then invoke a Destroy QP or Reset QP.
> >
> > Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
> >  1 file changed, 14 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> > b/drivers/infiniband/ulp/srp/ib_srp.c
> > index 2f85255d2aca..b50733910f7e 100644
> > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct
> > srp_target_port *target)
> >   * completion handler can access the queue pair while it is
> >   * being destroyed.
> >   */
> > -static void srp_destroy_qp(struct ib_qp *qp)
> > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> >  {
> > -	ib_drain_rq(qp);
> > +	spin_lock_irq(&ch->lock);
> > +	ib_process_cq_direct(ch->send_cq, -1);
> 
> I see that you are already using "-1" in your code, but the comment above
> ib_process_cq_direct() states that no new code should use "-1".
> 
>  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
>  * polling.  Do not use this feature in new code, it will be removed soon.
>  */
> int ib_process_cq_direct(struct ib_cq *cq, int budget)
> 
> Thanks
> 

Hello Bart

I took the latest for-next from your git tree and started the first set of tests.

I bumped into this very quickly, but I am only running the new code on the client.
The server has not been updated.

On the client I see this after starting a single write thread to an XFS file system on one of the mpaths.
Given it's in the ib_drain code, I figured I would let you know now.


[  850.862430] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  850.865203] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f3d94a30
[  850.941454] scsi host1: ib_srp: Failed to map data (-12)
[  860.990411] mlx5_0:dump_cqe:262:(pid 1103): dump error cqe
[  861.019162] 00000000 00000000 00000000 00000000
[  861.042085] 00000000 00000000 00000000 00000000
[  861.066567] 00000000 00000000 00000000 00000000
[  861.092164] 00000000 0f007806 2500002a cefe87d1
[  861.117091] ------------[ cut here ]------------
[  861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
[  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
[  861.235179] Modules linked in: dm_service_time xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat rpcrdma nf_conntrack ib_isert iscsi_target_mod iptable_mangle iptable_security iptable_raw ebtable_filter ib_iser ebtables libiscsi ip6table_filter ip6_tables scsi_transport_iscsi iptable_filter target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
[  861.646587]  pcbc aesni_intel crypto_simd ipmi_ssif glue_helper ipmi_si cryptd iTCO_wdt gpio_ich ipmi_devintf iTCO_vendor_support pcspkr hpwdt hpilo pcc_cpufreq sg ipmi_msghandler acpi_power_meter i7core_edac acpi_cpufreq shpchp edac_core lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sd_mod sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm ptp fjes hpsa crc32c_intel serio_raw i2c_core pps_core bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[  861.943997] CPU: 27 PID: 1103 Comm: kworker/27:2 Tainted: G          I     4.10.0-rc7+ #1
[  861.989476] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[  862.024833] Workqueue: events_long srp_reconnect_work [scsi_transport_srp]
[  862.063004] Call Trace:
[  862.076516]  dump_stack+0x63/0x87
[  862.094841]  __warn+0xd1/0xf0
[  862.112164]  warn_slowpath_fmt+0x5f/0x80
[  862.134013]  ? mlx5_poll_one+0x59/0xa40 [mlx5_ib]
[  862.161124]  __ib_drain_sq+0x1bb/0x1c0 [ib_core]
[  862.187702]  ib_drain_sq+0x25/0x30 [ib_core]
[  862.212168]  ib_drain_qp+0x12/0x30 [ib_core]
[  862.238138]  srp_destroy_qp+0x47/0x60 [ib_srp]
[  862.264155]  srp_create_ch_ib+0x26f/0x5f0 [ib_srp]
[  862.291646]  ? scsi_done+0x21/0x70
[  862.312392]  ? srp_finish_req+0x93/0xb0 [ib_srp]
[  862.338654]  srp_rport_reconnect+0xf0/0x1f0 [ib_srp]
[  862.366274]  srp_reconnect_rport+0xca/0x220 [scsi_transport_srp]
[  862.400756]  srp_reconnect_work+0x44/0xd1 [scsi_transport_srp]
[  862.434277]  process_one_work+0x165/0x410
[  862.456198]  worker_thread+0x137/0x4c0
[  862.476973]  kthread+0x101/0x140
[  862.493935]  ? rescuer_thread+0x3b0/0x3b0
[  862.516800]  ? kthread_park+0x90/0x90
[  862.537396]  ? do_syscall_64+0x67/0x180
[  862.558477]  ret_from_fork+0x2c/0x40
[  862.578161] ---[ end trace 2a6c2779f0a2d28f ]---
[  864.274137] scsi host1: ib_srp: reconnect succeeded
[  864.306836] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  864.310916] mlx5_0:dump_cqe:262:(pid 13776): dump error cqe
[  864.310917] 00000000 00000000 00000000 00000000
[  864.310921] 00000000 00000000 00000000 00000000
[  864.310922] 00000000 00000000 00000000 00000000
[  864.310922] 00000000 0f007806 25000032 00044cd0
[  864.310928] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b94268078
[  864.527890] scsi host1: ib_srp: Failed to map data (-12)
[  876.101124] scsi host1: ib_srp: reconnect succeeded
[  876.133923] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  876.135014] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  876.210311] scsi host1: ib_srp: Failed to map data (-12)
[  876.239985] mlx5_0:dump_cqe:262:(pid 5945): dump error cqe
[  876.270855] 00000000 00000000 00000000 00000000
[  876.296525] 00000000 00000000 00000000 00000000
[  876.322500] 00000000 00000000 00000000 00000000
[  876.348519] 00000000 0f007806 2500003a 0080e1d0
[  887.784981] scsi host1: ib_srp: reconnect succeeded
[  887.819808] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  887.851777] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  887.898850] scsi host1: ib_srp: Failed to map data (-12)
[  887.928647] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[  887.959938] 00000000 00000000 00000000 00000000
[  887.985041] 00000000 00000000 00000000 00000000
[  888.010619] 00000000 00000000 00000000 00000000
[  888.035601] 00000000 0f007806 25000042 008099d0
[  899.546781] scsi host1: ib_srp: reconnect succeeded
[  899.580758] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  899.611289] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  899.658289] scsi host1: ib_srp: Failed to map data (-12)
[  899.687219] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[  899.718736] 00000000 00000000 00000000 00000000
[  899.744137] 00000000 00000000 00000000 00000000
[  899.769206] 00000000 00000000 00000000 00000000
[  899.795217] 00000000 0f007806 2500004a 008091d0
[  911.343869] scsi host1: ib_srp: reconnect succeeded
[  911.376684] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  911.407755] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  911.454474] scsi host1: ib_srp: Failed to map data (-12)
[  911.484279] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[  911.514784] 00000000 00000000 00000000 00000000
[  911.540251] 00000000 00000000 00000000 00000000
[  911.564841] 00000000 00000000 00000000 00000000
[  911.590743] 00000000 0f007806 25000052 008089d0
[  923.066748] scsi host1: ib_srp: reconnect succeeded
[  923.099656] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  923.131825] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  923.179514] scsi host1: ib_srp: Failed to map data (-12)
[  923.209307] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[  923.239986] 00000000 00000000 00000000 00000000
[  923.265419] 00000000 00000000 00000000 00000000
[  923.290102] 00000000 00000000 00000000 00000000
[  923.315120] 00000000 0f007806 2500005a 00c4d4d0
[  934.839336] scsi host1: ib_srp: reconnect succeeded
[  934.874582] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  934.906298] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[  934.953712] scsi host1: ib_srp: Failed to map data (-12)
[  934.983829] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[  935.015371] 00000000 00000000 00000000 00000000
[  935.041544] 00000000 00000000 00000000 00000000
[  935.066883] 00000000 00000000 00000000 00000000
[  935.092755] 00000000 0f007806 25000062 00c4ecd0
[  946.610744] scsi host1: ib_srp: reconnect succeeded
[  946.644528] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[  946.647935] mlx5_0:dump_cqe:262:(pid 752): dump error cqe
[  946.647936] 00000000 00000000 00000000 00000000
[  946.647937] 00000000 00000000 00000000 00000000
[  946.647937] 00000000 00000000 00000000 00000000
[  946.647938] 00000000 0f007806 2500006a 00c4e4d0
[  946.647940] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b94268c78
[  946.869439] scsi host1: ib_srp: Failed to map data (-12)

I will reset and restart to make sure this issue is repeatable.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]             ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-12 18:06               ` Laurence Oberman
       [not found]                 ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-02-12 20:05               ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
  1 sibling, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-12 18:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Israel Rukshin, Max Gurtovoy



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Sent: Sunday, February 12, 2017 1:02:53 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> ----- Original Message -----
> > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > To: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > Cc: "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel
> > Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
> > "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Sent: Sunday, February 12, 2017 12:19:28 PM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > > A quote from the IB spec:
> > >
> > > However, if the Consumer does not wait for the Affiliated Asynchronous
> > > Last WQE Reached Event, then WQE and Data Segment leakage may occur.
> > > Therefore, it is good programming practice to tear down a QP that is
> > > associated with an SRQ by using the following process:
> > > * Put the QP in the Error State;
> > > * wait for the Affiliated Asynchronous Last WQE Reached Event;
> > > * either:
> > >   * drain the CQ by invoking the Poll CQ verb and either wait for CQ
> > >     to be empty or the number of Poll CQ operations has exceeded CQ
> > >     capacity size; or
> > >   * post another WR that completes on the same CQ and wait for this WR to
> > >   return as a WC;
> > > * and then invoke a Destroy QP or Reset QP.
> > >
> > > Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > > Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
> > >  1 file changed, 14 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> > > b/drivers/infiniband/ulp/srp/ib_srp.c
> > > index 2f85255d2aca..b50733910f7e 100644
> > > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > > @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct
> > > srp_target_port *target)
> > >   * completion handler can access the queue pair while it is
> > >   * being destroyed.
> > >   */
> > > -static void srp_destroy_qp(struct ib_qp *qp)
> > > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> > >  {
> > > -	ib_drain_rq(qp);
> > > +	spin_lock_irq(&ch->lock);
> > > +	ib_process_cq_direct(ch->send_cq, -1);
> > 
> > I see that you are already using "-1" in your code, but the comment above
> > ib_process_cq_direct() states that no new code should use "-1".
> > 
> >  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
> >  * polling.  Do not use this feature in new code, it will be removed soon.
> >  */
> > int ib_process_cq_direct(struct ib_cq *cq, int budget)
> > 
> > Thanks
> > 
> 
> Hello Bart
> 
> I took the latest for-next from your git tree and started the first set of tests.
> 
> I bumped into this very quickly, but I am only running the new code on the
> client.
> The server has not been updated.
> 
> On the client I see this after starting a single write thread to an XFS
> file system on one of the mpaths.
> Given it's in the ib_drain code, I figured I would let you know now.
> 
> 
> [  850.862430] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  850.865203] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f3d94a30
> [  850.941454] scsi host1: ib_srp: Failed to map data (-12)
> [  860.990411] mlx5_0:dump_cqe:262:(pid 1103): dump error cqe
> [  861.019162] 00000000 00000000 00000000 00000000
> [  861.042085] 00000000 00000000 00000000 00000000
> [  861.066567] 00000000 00000000 00000000 00000000
> [  861.092164] 00000000 0f007806 2500002a cefe87d1
> [  861.117091] ------------[ cut here ]------------
> [  861.143141] WARNING: CPU: 27 PID: 1103 at
> drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> [  861.235179] Modules linked in: dm_service_time xt_CHECKSUM ipt_MASQUERADE
> nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4
> ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat rpcrdma nf_conntrack
> ib_isert iscsi_target_mod iptable_mangle iptable_security iptable_raw
> ebtable_filter ib_iser ebtables libiscsi ip6table_filter ip6_tables
> scsi_transport_iscsi iptable_filter target_core_mod ib_srp
> scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm
> iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> [  861.646587]  pcbc aesni_intel crypto_simd ipmi_ssif glue_helper ipmi_si
> cryptd iTCO_wdt gpio_ich ipmi_devintf iTCO_vendor_support pcspkr hpwdt hpilo
> pcc_cpufreq sg ipmi_msghandler acpi_power_meter i7core_edac acpi_cpufreq
> shpchp edac_core lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> dm_multipath ip_tables xfs libcrc32c amdkfd amd_iommu_v2 radeon i2c_algo_bit
> drm_kms_helper syscopyarea sd_mod sysfillrect sysimgblt fb_sys_fops ttm
> mlx5_core drm ptp fjes hpsa crc32c_intel serio_raw i2c_core pps_core bnx2
> devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last
> unloaded: ib_srpt]
> [  861.943997] CPU: 27 PID: 1103 Comm: kworker/27:2 Tainted: G          I
> 4.10.0-rc7+ #1
> [  861.989476] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> [  862.024833] Workqueue: events_long srp_reconnect_work [scsi_transport_srp]
> [  862.063004] Call Trace:
> [  862.076516]  dump_stack+0x63/0x87
> [  862.094841]  __warn+0xd1/0xf0
> [  862.112164]  warn_slowpath_fmt+0x5f/0x80
> [  862.134013]  ? mlx5_poll_one+0x59/0xa40 [mlx5_ib]
> [  862.161124]  __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> [  862.187702]  ib_drain_sq+0x25/0x30 [ib_core]
> [  862.212168]  ib_drain_qp+0x12/0x30 [ib_core]
> [  862.238138]  srp_destroy_qp+0x47/0x60 [ib_srp]
> [  862.264155]  srp_create_ch_ib+0x26f/0x5f0 [ib_srp]
> [  862.291646]  ? scsi_done+0x21/0x70
> [  862.312392]  ? srp_finish_req+0x93/0xb0 [ib_srp]
> [  862.338654]  srp_rport_reconnect+0xf0/0x1f0 [ib_srp]
> [  862.366274]  srp_reconnect_rport+0xca/0x220 [scsi_transport_srp]
> [  862.400756]  srp_reconnect_work+0x44/0xd1 [scsi_transport_srp]
> [  862.434277]  process_one_work+0x165/0x410
> [  862.456198]  worker_thread+0x137/0x4c0
> [  862.476973]  kthread+0x101/0x140
> [  862.493935]  ? rescuer_thread+0x3b0/0x3b0
> [  862.516800]  ? kthread_park+0x90/0x90
> [  862.537396]  ? do_syscall_64+0x67/0x180
> [  862.558477]  ret_from_fork+0x2c/0x40
> [  862.578161] ---[ end trace 2a6c2779f0a2d28f ]---
> [  864.274137] scsi host1: ib_srp: reconnect succeeded
> [  864.306836] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  864.310916] mlx5_0:dump_cqe:262:(pid 13776): dump error cqe
> [  864.310917] 00000000 00000000 00000000 00000000
> [  864.310921] 00000000 00000000 00000000 00000000
> [  864.310922] 00000000 00000000 00000000 00000000
> [  864.310922] 00000000 0f007806 25000032 00044cd0
> [  864.310928] scsi host1: ib_srp: failed FAST REG status memory management
> operation error (6) for CQE ffff880b94268078
> [  864.527890] scsi host1: ib_srp: Failed to map data (-12)
> [  876.101124] scsi host1: ib_srp: reconnect succeeded
> [  876.133923] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  876.135014] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  876.210311] scsi host1: ib_srp: Failed to map data (-12)
> [  876.239985] mlx5_0:dump_cqe:262:(pid 5945): dump error cqe
> [  876.270855] 00000000 00000000 00000000 00000000
> [  876.296525] 00000000 00000000 00000000 00000000
> [  876.322500] 00000000 00000000 00000000 00000000
> [  876.348519] 00000000 0f007806 2500003a 0080e1d0
> [  887.784981] scsi host1: ib_srp: reconnect succeeded
> [  887.819808] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  887.851777] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  887.898850] scsi host1: ib_srp: Failed to map data (-12)
> [  887.928647] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
> [  887.959938] 00000000 00000000 00000000 00000000
> [  887.985041] 00000000 00000000 00000000 00000000
> [  888.010619] 00000000 00000000 00000000 00000000
> [  888.035601] 00000000 0f007806 25000042 008099d0
> [  899.546781] scsi host1: ib_srp: reconnect succeeded
> [  899.580758] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  899.611289] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  899.658289] scsi host1: ib_srp: Failed to map data (-12)
> [  899.687219] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
> [  899.718736] 00000000 00000000 00000000 00000000
> [  899.744137] 00000000 00000000 00000000 00000000
> [  899.769206] 00000000 00000000 00000000 00000000
> [  899.795217] 00000000 0f007806 2500004a 008091d0
> [  911.343869] scsi host1: ib_srp: reconnect succeeded
> [  911.376684] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  911.407755] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  911.454474] scsi host1: ib_srp: Failed to map data (-12)
> [  911.484279] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
> [  911.514784] 00000000 00000000 00000000 00000000
> [  911.540251] 00000000 00000000 00000000 00000000
> [  911.564841] 00000000 00000000 00000000 00000000
> [  911.590743] 00000000 0f007806 25000052 008089d0
> [  923.066748] scsi host1: ib_srp: reconnect succeeded
> [  923.099656] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  923.131825] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  923.179514] scsi host1: ib_srp: Failed to map data (-12)
> [  923.209307] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
> [  923.239986] 00000000 00000000 00000000 00000000
> [  923.265419] 00000000 00000000 00000000 00000000
> [  923.290102] 00000000 00000000 00000000 00000000
> [  923.315120] 00000000 0f007806 2500005a 00c4d4d0
> [  934.839336] scsi host1: ib_srp: reconnect succeeded
> [  934.874582] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  934.906298] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bf1939130
> [  934.953712] scsi host1: ib_srp: Failed to map data (-12)
> [  934.983829] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
> [  935.015371] 00000000 00000000 00000000 00000000
> [  935.041544] 00000000 00000000 00000000 00000000
> [  935.066883] 00000000 00000000 00000000 00000000
> [  935.092755] 00000000 0f007806 25000062 00c4ecd0
> [  946.610744] scsi host1: ib_srp: reconnect succeeded
> [  946.644528] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
> [  946.647935] mlx5_0:dump_cqe:262:(pid 752): dump error cqe
> [  946.647936] 00000000 00000000 00000000 00000000
> [  946.647937] 00000000 00000000 00000000 00000000
> [  946.647937] 00000000 00000000 00000000 00000000
> [  946.647938] 00000000 0f007806 2500006a 00c4e4d0
> [  946.647940] scsi host1: ib_srp: failed FAST REG status memory management
> operation error (6) for CQE ffff880b94268c78
> [  946.869439] scsi host1: ib_srp: Failed to map data (-12)
> 
> I will reset and restart to make sure this issue is repeatable.
> 
> Thanks
> Laurence

Sorry for the typos; that should have read:

On the client I see this after starting a single write thread to an XFS file system on one of the mpaths.
Given that it's in ib_drain_cq, I figured I would let you know now.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]             ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-02-12 18:06               ` Laurence Oberman
@ 2017-02-12 20:05               ` Bart Van Assche
       [not found]                 ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-12 20:05 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> [  861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain

Hello Laurence,

That warning has been removed by patch 7/8 of this series. Please double check
whether all eight patches have been applied properly.

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
  2017-02-12 17:05   ` Leon Romanovsky
@ 2017-02-12 20:07     ` Bart Van Assche
       [not found]       ` <1486930017.2918.3.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-12 20:07 UTC (permalink / raw)
  To: leon@kernel.org
  Cc: maxg@mellanox.com, israelr@mellanox.com,
	linux-rdma@vger.kernel.org, Steve Feeley, dledford@redhat.com,
	loberman@redhat.com, stable@vger.kernel.org

On Sun, 2017-02-12 at 19:05 +0200, Leon Romanovsky wrote:
> On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > index 79bf48477ddb..4068d34f5427 100644
> > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
> >  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> >  		if (scmnd) {
> >  			req = (void *)scmnd->host_scribble;
> > -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > +			if (req) {
> > +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > +			} else {
> > +				shost_printk(KERN_ERR, target->scsi_host,
> > +					     "NULL host_scribble for response with tag %#llx\n",
> > +					     rsp->tag);
> > +				scmnd = NULL;
> > +			}
> >  		}
> >  		if (!scmnd) {
> >  			shost_printk(KERN_ERR, target->scsi_host,
> 
> You have the chance to print the message below together with your new
> print, because scmd will be NULL.
> 
> What about to do the following check "if (scmd && scmd->host_scribble)"
> instead of your proposed patch?

That approach would still trigger a kernel oops if a duplicate response is
received because the second argument of srp_claim_req() must not be NULL.
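
The guard added by patch 1/8 can be sketched in user space roughly as follows. This is only an illustration: struct req, struct cmnd, claim_req(), process_rsp() and complete_cmnd() are hypothetical stand-ins for the driver code, and the assumption that a completed command carries a NULL host_scribble is inferred from the error message in the patch, not verified against the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's request and SCSI command types. */
struct req  { int claimed; };
struct cmnd { struct req *host_scribble; };	/* assumed NULL once completed */

/* Stand-in for srp_claim_req(): it uses req, so req must not be NULL. */
struct cmnd *claim_req(struct req *req, struct cmnd *cmnd)
{
	assert(req != NULL);
	if (req->claimed)
		return NULL;
	req->claimed = 1;
	return cmnd;
}

/* Returns the command to complete, or NULL if the response must be dropped. */
struct cmnd *process_rsp(struct cmnd *cmnd)
{
	struct req *req;

	if (!cmnd)
		return NULL;
	req = cmnd->host_scribble;
	if (!req) {
		/* Duplicate response: report it instead of claiming. */
		fprintf(stderr, "NULL host_scribble for response\n");
		return NULL;
	}
	return claim_req(req, cmnd);
}

/* Model of command completion: the private pointer is cleared. */
void complete_cmnd(struct cmnd *cmnd)
{
	cmnd->host_scribble = NULL;
}
```

In this model the first response for a command is claimed normally, and a second response for the same command is logged and dropped without ever reaching claim_req().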

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]         ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2017-02-12 18:02           ` Laurence Oberman
@ 2017-02-12 20:11           ` Bart Van Assche
       [not found]             ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-12 20:11 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Sun, 2017-02-12 at 19:19 +0200, Leon Romanovsky wrote:
> On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > -static void srp_destroy_qp(struct ib_qp *qp)
> > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> >  {
> > -	ib_drain_rq(qp);
> > +	spin_lock_irq(&ch->lock);
> > +	ib_process_cq_direct(ch->send_cq, -1);
> 
> I see that you are already using "-1" in your code, but the comments in the
> ib_process_cq_direct states that no new code should use "-1".
> 
>  61  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
>  62  * polling.  Do not use this feature in new code, it will be removed soon.
>  63  */
>  64 int ib_process_cq_direct(struct ib_cq *cq, int budget)

Although it is possible to avoid passing -1 as 'budget' by passing a number
that is at least as large as the number of expected completions, it would
make it harder to verify the SRP initiator driver. So I propose to modify
the comment above ib_process_cq_direct().
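
To make the two budget styles concrete, here is a small user-space sketch of budgeted polling. It is a toy model under stated assumptions, not the ib_core implementation: struct cq, poll_cq(), process_cq() and the POLL_BATCH value are all hypothetical. A budget of -1 means "poll until the queue is empty"; any positive budget caps the number of completions processed.

```c
#include <assert.h>

#define POLL_BATCH 16	/* hypothetical batch size */

/* Hypothetical completion queue: only a count of pending completions. */
struct cq { int pending; };

/* Reap up to max completions; returns how many were reaped. */
int poll_cq(struct cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/*
 * Process completions until the queue is empty or the budget is used up.
 * A budget of -1 means no limit.
 */
int process_cq(struct cq *cq, int budget)
{
	int completed = 0, n, want;

	assert(budget == -1 || budget > 0);
	for (;;) {
		want = POLL_BATCH;
		if (budget != -1 && budget - completed < want)
			want = budget - completed;
		n = poll_cq(cq, want);
		completed += n;
		if (n == 0 || (budget != -1 && completed >= budget))
			break;
	}
	return completed;
}
```

With a finite budget the caller has to know an upper bound on the outstanding completions; with -1 the loop simply runs until nothing is left.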

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                 ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13  2:07                   ` Laurence Oberman
       [not found]                     ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13  2:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Sunday, February 12, 2017 3:05:16 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> 
> Hello Laurence,
> 
> That warning has been removed by patch 7/8 of this series. Please double
> check
> whether all eight patches have been applied properly.
> 
> Bart.

Hello,
Just a heads up: I am working with Bart on this patch series.
We have stability issues with my tests in my MLX5 EDR-100 test bed.
Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                     ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13  3:14                       ` Laurence Oberman
       [not found]                         ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13  3:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Sunday, February 12, 2017 9:07:16 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Sunday, February 12, 2017 3:05:16 PM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > 
> > Hello Laurence,
> > 
> > That warning has been removed by patch 7/8 of this series. Please double
> > check
> > whether all eight patches have been applied properly.
> > 
> > Bart.
> 
> Hello
> Just a heads up, working with Bart on this patch series.
> We have stability issues with my tests in my MLX5 EDR-100 test bed.
> Thanks
> Laurence
> 

I went back to Linus's latest tree for a baseline and we fail the same way.
This has none of the latest 8 patches applied, so we will
have to figure out what broke this.

Don't forget that I tested all this recently with Bart's DMA patch series
and it was solid.

I will come back to this tomorrow and see what recently made it into Linus's tree by
checking back with Doug.

[  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
[  183.853047] 00000000 00000000 00000000 00000000
[  183.878425] 00000000 00000000 00000000 00000000
[  183.903243] 00000000 00000000 00000000 00000000
[  183.928518] 00000000 0f007806 2500002a ad9fafd1
[  198.538593] scsi host1: ib_srp: reconnect succeeded
[  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
[  198.603037] 00000000 00000000 00000000 00000000
[  198.628884] 00000000 00000000 00000000 00000000
[  198.653961] 00000000 00000000 00000000 00000000
[  198.680021] 00000000 0f007806 25000032 00105dd0
[  198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
[  213.532848] scsi host1: ib_srp: reconnect succeeded
[  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  227.579684] scsi host1: ib_srp: reconnect succeeded
[  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  242.633925] scsi host1: ib_srp: reconnect succeeded
[  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  257.127715] scsi host1: ib_srp: reconnect succeeded
[  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  272.225762] scsi host1: ib_srp: reconnect succeeded
[  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  286.350226] scsi host1: ib_srp: reconnect succeeded
[  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  301.109365] scsi host1: ib_srp: reconnect succeeded
[  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  315.910860] scsi host1: ib_srp: reconnect succeeded
[  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  330.551052] scsi host1: ib_srp: reconnect succeeded
[  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  344.998448] scsi host1: ib_srp: reconnect succeeded
[  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  359.866731] scsi host1: ib_srp: reconnect succeeded
[  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
..
..
[  373.113045] scsi host1: ib_srp: reconnect succeeded
[  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
[  388.589517] scsi host1: ib_srp: reconnect succeeded
[  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  403.086893] scsi host1: ib_srp: reconnect succeeded
[  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
[  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
[  403.140402] 00000000 00000000 00000000 00000000
[  403.140402] 00000000 00000000 00000000 00000000
[  403.140403] 00000000 00000000 00000000 00000000
[  403.140403] 00


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
       [not found]       ` <1486930017.2918.3.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13  5:54         ` Leon Romanovsky
       [not found]           ` <20170213055432.GM14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-13  5:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Steve Feeley,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[-- Attachment #1: Type: text/plain, Size: 2986 bytes --]

On Sun, Feb 12, 2017 at 08:07:13PM +0000, Bart Van Assche wrote:
> On Sun, 2017-02-12 at 19:05 +0200, Leon Romanovsky wrote:
> > On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > > index 79bf48477ddb..4068d34f5427 100644
> > > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > > @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
> > >  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> > >  		if (scmnd) {
> > >  			req = (void *)scmnd->host_scribble;
> > > -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > > +			if (req) {
> > > +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > > +			} else {
> > > +				shost_printk(KERN_ERR, target->scsi_host,
> > > +					     "NULL host_scribble for response with tag %#llx\n",
> > > +					     rsp->tag);
> > > +				scmnd = NULL;
> > > +			}
> > >  		}
> > >  		if (!scmnd) {
> > >  			shost_printk(KERN_ERR, target->scsi_host,
> >
> > You have the chance to print the message below together with your new
> > print, because scmd will be NULL.
> >
> > What about to do the following check "if (scmd && scmd->host_scribble)"
> > instead of your proposed patch?
>
> That approach would still trigger a kernel oops if a duplicate response is
> received because the second argument of srp_claim_req() must not be NULL.

I'm sure that I'm missing something, but how would it be triggered?
With the change below we only reach the srp_claim_req() call if "req" is
not NULL.

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf48477ddb..40e7f27c40bf 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1897,10 +1897,12 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 		complete(&ch->tsk_mgmt_done);
 	} else {
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
-		if (scmnd) {
+		if (scmnd && scmnd->host_scribble) {
 			req = (void *)scmnd->host_scribble;
 			scmnd = srp_claim_req(ch, req, NULL, scmnd);
 		}
+		else
+			scmnd = NULL;
 		if (!scmnd) {
 			shost_printk(KERN_ERR, target->scsi_host,
 				     "Null scmnd for RSP w/tag %#016llx received on ch %td / QP %#x\n",

>
> Bart.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]             ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13  6:07               ` Leon Romanovsky
  0 siblings, 0 replies; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-13  6:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

On Sun, Feb 12, 2017 at 08:11:53PM +0000, Bart Van Assche wrote:
> On Sun, 2017-02-12 at 19:19 +0200, Leon Romanovsky wrote:
> > On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > > -static void srp_destroy_qp(struct ib_qp *qp)
> > > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> > >  {
> > > -	ib_drain_rq(qp);
> > > +	spin_lock_irq(&ch->lock);
> > > +	ib_process_cq_direct(ch->send_cq, -1);
> >
> > I see that you are already using "-1" in your code, but the comments in the
> > ib_process_cq_direct states that no new code should use "-1".
> >
> >  61  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
> >  62  * polling.  Do not use this feature in new code, it will be removed soon.
> >  63  */
> >  64 int ib_process_cq_direct(struct ib_cq *cq, int budget)
>
> Although it is possible to avoid passing -1 as 'budget' by passing a number
> that is at least as large as the number of expected completions, it would
> make it harder to verify the SRP initiator driver. So I propose to modify
> the comment above ib_process_cq_direct().

I don't know.
Changing the comment seems like the easiest approach, especially while
SRP is the only user of this call. However, the ability to properly calculate
the number of expected completions and compare it while doing destroy_qp is
valuable for correctness too.
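
A rough user-space sketch of that kind of accounting is shown below. Everything in it is hypothetical (struct chan and its fields, post_send(), reap() and drain_before_destroy() do not exist in ib_srp); it only illustrates counting posted sends, using the outstanding count as the budget, and checking that the drain reaped exactly that many completions.

```c
#include <assert.h>

/* Hypothetical channel bookkeeping; none of these fields exist in ib_srp. */
struct chan {
	int posted;	/* send work requests posted so far */
	int completed;	/* send completions reaped so far */
	int in_cq;	/* completions waiting in the modelled send CQ */
};

/* Model of posting a send: it will produce exactly one completion. */
void post_send(struct chan *ch)
{
	ch->posted++;
	ch->in_cq++;
}

/* Reap at most budget completions; returns the number reaped. */
int reap(struct chan *ch, int budget)
{
	int n = ch->in_cq < budget ? ch->in_cq : budget;

	ch->in_cq -= n;
	ch->completed += n;
	return n;
}

/*
 * Drain before destroying the QP, using the number of outstanding sends
 * as the budget. Returns 0 if every expected completion was seen and -1
 * if the counts do not match.
 */
int drain_before_destroy(struct chan *ch)
{
	int expected = ch->posted - ch->completed;

	assert(expected >= 0);
	if (expected == 0)
		return 0;
	return reap(ch, expected) == expected ? 0 : -1;
}
```

The mismatch case is the interesting one: a nonzero return would point at a send whose completion was lost or never generated, which an unlimited budget would silently hide.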

Thanks

>
> Bart.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                         ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 13:54                           ` Laurence Oberman
       [not found]                             ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 13:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Sunday, February 12, 2017 10:14:53 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Sunday, February 12, 2017 9:07:16 PM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > > 
> > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0
> > > > [ib_core]
> > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > 
> > > Hello Laurence,
> > > 
> > > That warning has been removed by patch 7/8 of this series. Please double
> > > check
> > > whether all eight patches have been applied properly.
> > > 
> > > Bart.
> > 
> > Hello
> > Just a heads up, working with Bart on this patch series.
> > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > Thanks
> > Laurence
> > 
> 
> I went back to Linus' latest tree for a baseline and we fail the same way.
> This has none of the latest 8 patches applied so we will
> have to figure out what broke this.
> 
> Dont forget that I tested all this recently with Bart's dma patch series
> and its solid.
> 
> Will come back to this tomorrow and see what recently made it into Linus's
> tree by
> checking back with Doug.
> 
> [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff880bd4270eb0
> [  183.853047] 00000000 00000000 00000000 00000000
> [  183.878425] 00000000 00000000 00000000 00000000
> [  183.903243] 00000000 00000000 00000000 00000000
> [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> [  198.538593] scsi host1: ib_srp: reconnect succeeded
> [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> [  198.603037] 00000000 00000000 00000000 00000000
> [  198.628884] 00000000 00000000 00000000 00000000
> [  198.653961] 00000000 00000000 00000000 00000000
> [  198.680021] 00000000 0f007806 25000032 00105dd0
> [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management
> operation error (6) for CQE ffff880b92860138
> [  213.532848] scsi host1: ib_srp: reconnect succeeded
> [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  227.579684] scsi host1: ib_srp: reconnect succeeded
> [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  242.633925] scsi host1: ib_srp: reconnect succeeded
> [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  257.127715] scsi host1: ib_srp: reconnect succeeded
> [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  272.225762] scsi host1: ib_srp: reconnect succeeded
> [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  286.350226] scsi host1: ib_srp: reconnect succeeded
> [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  301.109365] scsi host1: ib_srp: reconnect succeeded
> [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  315.910860] scsi host1: ib_srp: reconnect succeeded
> [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  330.551052] scsi host1: ib_srp: reconnect succeeded
> [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  344.998448] scsi host1: ib_srp: reconnect succeeded
> [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  359.866731] scsi host1: ib_srp: reconnect succeeded
> [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> ..
> ..
> [  373.113045] scsi host1: ib_srp: reconnect succeeded
> [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> [  388.589517] scsi host1: ib_srp: reconnect succeeded
> [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  403.086893] scsi host1: ib_srp: reconnect succeeded
> [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> ffff8817f2234c30
> [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> [  403.140402] 00000000 00000000 00000000 00000000
> [  403.140402] 00000000 00000000 00000000 00000000
> [  403.140403] 00000000 00000000 00000000 00000000
> [  403.140403] 00
> 
> 
Hello

Let me summarize where we are and how we got here.

The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Bart's DMA patches.
All tests passed.

I pulled Linus's tree and applied all 8 patches of the above series and we failed in the 
"failed FAST REG status memory management" area.

I applied only 7 of the 8 patches to Linus's tree because Bart and I thought patch 6 of the series 
may have been the catalyst.

This also failed.

Building from Bart's tree, which is based on 4.10.0-rc7, failed again.

This made me decide to baseline Linus's tree at 4.10.0-rc7, and that fails too.

So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                             ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 14:17                               ` Leon Romanovsky
       [not found]                                 ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-13 14:17 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 9162 bytes --]

On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
>
>
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzXQFizaE/u3fw@public.gmane.orgm, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Sunday, February 12, 2017 10:14:53 PM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> >
> >
> >
> > ----- Original Message -----
> > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr@mellanox.com,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > > QP
> > > >
> > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0
> > > > > [ib_core]
> > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > >
> > > > Hello Laurence,
> > > >
> > > > That warning has been removed by patch 7/8 of this series. Please double
> > > > check
> > > > whether all eight patches have been applied properly.
> > > >
> > > > Bart.
> > >
> > > Hello
> > > Just a heads up, working with Bart on this patch series.
> > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > Thanks
> > > Laurence
> > >
> >
> > I went back to Linus' latest tree for a baseline and we fail the same way.
> > This has none of the latest 8 patches applied so we will
> > have to figure out what broke this.
> >
> > Dont forget that I tested all this recently with Bart's dma patch series
> > and its solid.
> >
> > Will come back to this tomorrow and see what recently made it into Linus's
> > tree by
> > checking back with Doug.
> >
> > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff880bd4270eb0
> > [  183.853047] 00000000 00000000 00000000 00000000
> > [  183.878425] 00000000 00000000 00000000 00000000
> > [  183.903243] 00000000 00000000 00000000 00000000
> > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > [  198.603037] 00000000 00000000 00000000 00000000
> > [  198.628884] 00000000 00000000 00000000 00000000
> > [  198.653961] 00000000 00000000 00000000 00000000
> > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management
> > operation error (6) for CQE ffff880b92860138
> > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > ..
> > ..
> > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE
> > ffff8817f2234c30
> > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > [  403.140402] 00000000 00000000 00000000 00000000
> > [  403.140402] 00000000 00000000 00000000 00000000
> > [  403.140403] 00000000 00000000 00000000 00000000
> > [  403.140403] 00
> >
> >
> Hello
>
> Let me summarize where we are and how we got here.
>
> The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Bart's DMA patches.
> All tests passed.
>
> I pulled Linus's tree and applied all 8 patches of the above series and we failed in the
> "failed FAST REG status memory management" area.
>
> I applied only 7 of the 8 patches to Linus's tree because Bart and I thought patch 6 of the series
> may have been the catalyst.
>
> This also failed.
>
> Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
>
> This made me decide to baseline Linus's tree at 4.10.0-rc7, and that fails too.
>
> So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp.

From infiniband side:
➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
      0       0       0

Nothing suspicious on the Ethernet side either:
➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
d15118af2683 net/mlx5e: Check ets capability before ets query FW command
a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
ad05df399f33 net/mlx5e: Remove unused variable
639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning


>
> Thanks
> Laurence

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                                 ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-02-13 14:24                                   ` Laurence Oberman
       [not found]                                     ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 14:24 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 9:17:24 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> >
> >
> > ----- Original Message -----
> > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying
> > > > a
> > > > QP
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > destroying a
> > > > > QP
> > > > >
> > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0
> > > > > > [ib_core]
> > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > >
> > > > > Hello Laurence,
> > > > >
> > > > > That warning has been removed by patch 7/8 of this series. Please
> > > > > double
> > > > > check
> > > > > whether all eight patches have been applied properly.
> > > > >
> > > > > Bart.
> > > >
> > > > Hello
> > > > Just a heads up, working with Bart on this patch series.
> > > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > Thanks
> > > > Laurence
> > > >
> > >
> > > I went back to Linus' latest tree for a baseline and we fail the same
> > > way.
> > > This has none of the latest 8 patches applied so we will
> > > have to figure out what broke this.
> > >
> > > Dont forget that I tested all this recently with Bart's dma patch series
> > > and its solid.
> > >
> > > Will come back to this tomorrow and see what recently made it into
> > > Linus's
> > > tree by
> > > checking back with Doug.
> > >
> > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff880bd4270eb0
> > > [  183.853047] 00000000 00000000 00000000 00000000
> > > [  183.878425] 00000000 00000000 00000000 00000000
> > > [  183.903243] 00000000 00000000 00000000 00000000
> > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > [  198.603037] 00000000 00000000 00000000 00000000
> > > [  198.628884] 00000000 00000000 00000000 00000000
> > > [  198.653961] 00000000 00000000 00000000 00000000
> > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory
> > > management
> > > operation error (6) for CQE ffff880b92860138
> > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > ..
> > > ..
> > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE
> > > ffff8817f2234c30
> > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > [  403.140402] 00000000 00000000 00000000 00000000
> > > [  403.140402] 00000000 00000000 00000000 00000000
> > > [  403.140403] 00000000 00000000 00000000 00000000
> > > [  403.140403] 00
> > >
> > >
> > Hello
> >
> > Let me summarize where we are and how we got here.
> >
> > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with
> > Bart's DMA patches.
> > All tests passed.
> >
> > I pulled Linus's tree and applied all 8 patches of the above series and we
> > failed in the
> > "failed FAST REG status memory management" area.
> >
> > I applied only 7 of the 8 patches to Linus's tree because Bart and I
> > thought patch 6 of the series
> > may have been the catalyst.
> >
> > This also failed.
> >
> > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> >
> > This made me decide to baseline Linus's tree at 4.10.0-rc7, and that fails too.
> >
> > So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp.
> 
> From infiniband side:
> ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 --
> drivers/inifiniband |wc
>       0       0       0
> 
> Nothing suspicious on the Ethernet side either:
> ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 --
> drivers/net/ethernet/mellanox/mlx5
> d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB
> destroy
> 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space
> fails
> eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering
> name-space
> 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> ad05df399f33 net/mlx5e: Remove unused variable
> 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> 
> 
> >
> > Thanks
> > Laurence
> 

Hi Leon, 
Yep, I also looked for outliers here that may look suspicious and did not see any.

I guess I will have to start bisecting.
I will start with rc5; if that fails, I will bisect between rc4 and rc5, as we know rc4 was fine.

I did re-run the tests on rc4 last night and it was stable.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
       [not found]           ` <20170213055432.GM14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-02-13 16:02             ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-13 16:02 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Steve Feeley,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, 2017-02-13 at 07:54 +0200, Leon Romanovsky wrote:
> I'm sure that I'm missing something, but how would it be triggered?
> We will enter to call second srp_claim_req() function only if "req" is
> not NULL.
> 
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 79bf48477ddb..40e7f27c40bf 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -1897,10 +1897,12 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
>  		complete(&ch->tsk_mgmt_done);
>  	} else {
>  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> -		if (scmnd) {
> +		if (scmnd && scmnd->host_scribble) {
>  			req = (void *)scmnd->host_scribble;
>  			scmnd = srp_claim_req(ch, req, NULL, scmnd);
>  		}
> +		else
> +			scmnd = NULL;
>  		if (!scmnd) {
>  			shost_printk(KERN_ERR, target->scsi_host,
>  				     "Null scmnd for RSP w/tag %#016llx received on ch %td / QP %#x\n",

Hello Leon,

Sorry but I had misread your previous e-mail. I agree that the above should
work fine.
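
For anyone following along, the guard in the diff above can be modelled in
user space. This is only a rough sketch, not the ib_srp code: mock_cmnd,
mock_req, mock_claim_req(), mock_process_rsp() and mock_complete() are
hypothetical stand-ins, and it assumes that completing a command clears
host_scribble so that a duplicate response finds nothing to claim.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scsi_cmnd (only the fields used here). */
struct mock_cmnd {
	unsigned long long tag;
	void *host_scribble;	/* points at the request while in flight */
};

/* Hypothetical stand-in for struct srp_request. */
struct mock_req {
	struct mock_cmnd *scmnd;	/* owning command, NULL once claimed */
};

/* Hand the command back exactly once; later attempts return NULL. */
static struct mock_cmnd *mock_claim_req(struct mock_req *req,
					struct mock_cmnd *scmnd)
{
	assert(req != NULL);	/* callers must check before claiming */
	if (scmnd == NULL || req->scmnd != scmnd)
		return NULL;
	req->scmnd = NULL;
	return scmnd;
}

/* Same shape as the quoted diff: touch host_scribble only if it is set. */
static struct mock_cmnd *mock_process_rsp(struct mock_cmnd *found)
{
	struct mock_cmnd *scmnd = found;

	if (scmnd && scmnd->host_scribble) {
		struct mock_req *req = scmnd->host_scribble;

		scmnd = mock_claim_req(req, scmnd);
	} else {
		scmnd = NULL;
	}
	return scmnd;
}

/* Assumption: completion clears host_scribble before the command is done. */
static void mock_complete(struct mock_cmnd *scmnd)
{
	scmnd->host_scribble = NULL;
}
```

With this model the first response for a tag claims the command, and a
second response for the same tag takes the else branch and ends up with a
NULL scmnd instead of dereferencing a stale pointer.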

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                                     ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 16:12                                       ` Laurence Oberman
       [not found]                                         ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 16:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 9:24:01 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Monday, February 13, 2017 9:17:24 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying
> > > > a
> > > > QP
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > destroying
> > > > > a
> > > > > QP
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > destroying a
> > > > > > QP
> > > > > >
> > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0
> > > > > > > [ib_core]
> > > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > > >
> > > > > > Hello Laurence,
> > > > > >
> > > > > > That warning has been removed by patch 7/8 of this series. Please
> > > > > > double
> > > > > > check
> > > > > > whether all eight patches have been applied properly.
> > > > > >
> > > > > > Bart.
> > > > >
> > > > > Hello
> > > > > Just a heads up, working with Bart on this patch series.
> > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > > Thanks
> > > > > Laurence
> > > > >
> > > >
> > > > I went back to Linus' latest tree for a baseline and we fail the same
> > > > way.
> > > > This has none of the latest 8 patches applied so we will
> > > > have to figure out what broke this.
> > > >
> > > > Dont forget that I tested all this recently with Bart's dma patch
> > > > series
> > > > and its solid.
> > > >
> > > > Will come back to this tomorrow and see what recently made it into
> > > > Linus's
> > > > tree by
> > > > checking back with Doug.
> > > >
> > > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > [  183.853047] 00000000 00000000 00000000 00000000
> > > > [  183.878425] 00000000 00000000 00000000 00000000
> > > > [  183.903243] 00000000 00000000 00000000 00000000
> > > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > [  198.603037] 00000000 00000000 00000000 00000000
> > > > [  198.628884] 00000000 00000000 00000000 00000000
> > > > [  198.653961] 00000000 00000000 00000000 00000000
> > > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > ..
> > > > ..
> > > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > [  403.140403] 00000000 00000000 00000000 00000000
> > > > [  403.140403] 00
> > > >
> > > >
> > > Hello
> > >
> > > Let me summarize where we are and how we got here.
> > >
> > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with
> > > Bart's dma patches.
> > > All tests passed.
> > >
> > > I pulled Linus's tree and applied all 8 patches of the above series and
> > > we failed in the
> > > "failed FAST REG status memory management" area.
> > >
> > > I applied only 7 of the 8 patches to Linus's tree because Bart and I
> > > thought patch 6 of the series may have been the catalyst.
> > >
> > > This also failed.
> > >
> > > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> > >
> > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail.
> > >
> > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and
> > > ib_srp.
> > 
> > From infiniband side:
> > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
> >       0       0       0
> > 
> > From eth nothing suspicious too:
> > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
> > d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
> > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
> > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
> > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > ad05df399f33 net/mlx5e: Remove unused variable
> > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > 
> > 
> > >
> > > Thanks
> > > Laurence
> > 
> 
> Hi Leon,
> Yep, I also looked for outliers here that may look suspicious and did not see
> any.
> 
> I guess I will have to start bisecting.
> I will start with rc5; if that fails I will bisect between rc4 and rc5, as we
> know rc4 was fine.
> 
> I did re-run tests on rc4 last night and it was stable.
> 
> Thanks
> Laurence
> 

OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting,
unless one of you thinks you know what may be causing this in rc6.
This will take time, so I will come back to the list once I have it isolated.
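
For readers following along, bisecting between a known-good and a known-bad
kernel is a binary search over the commits in between. The sketch below is
only an illustration under stated assumptions: the commit names c1..c200 are
placeholders, the culprit position (137) is made up, and is_bad() stands in
for "build this kernel and run the SRP test", which is not shown here.

```python
def first_bad(commits, is_bad):
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, that the last one is bad,
    and that every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1      # hi always points at a known-bad commit
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                  # culprit is mid or earlier
        else:
            lo = mid + 1              # culprit is after mid
    return commits[lo]


# Hypothetical history: 200 placeholder commits, culprit chosen arbitrarily.
commits = [f"c{i}" for i in range(1, 201)]
tested = []                           # records every kernel that was "built"


def is_bad(commit):
    tested.append(commit)
    return int(commit[1:]) >= 137     # stand-in for a real build-and-test run


culprit = first_bad(commits, is_bad)
print(culprit, "found after", len(tested), "test builds")
```

Each loop iteration corresponds to one kernel build and test run, which is
why the process takes time even though the number of builds stays small.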

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                                         ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 16:47                                           ` Laurence Oberman
       [not found]                                             ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 16:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 11:12:55 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Monday, February 13, 2017 9:24:01 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Monday, February 13, 2017 9:17:24 AM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > > 
> > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > destroying
> > > > > a
> > > > > QP
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > destroying
> > > > > > a
> > > > > > QP
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > destroying a
> > > > > > > QP
> > > > > > >
> > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0
> > > > > > > > [ib_core]
> > > > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > > > >
> > > > > > > Hello Laurence,
> > > > > > >
> > > > > > > That warning has been removed by patch 7/8 of this series. Please
> > > > > > > double
> > > > > > > check
> > > > > > > whether all eight patches have been applied properly.
> > > > > > >
> > > > > > > Bart.N�����r��y���b�X��ǧv�^�)޺{.n�+����{��ٚ�{ay�ʇڙ�,j��f���h���z��w������j:+v���w�j�m��������zZ+��ݢj"��
> > > > > >
> > > > > > Hello
> > > > > > Just a heads up, working with Bart on this patch series.
> > > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > > > Thanks
> > > > > > Laurence
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > linux-rdma"
> > > > > > in
> > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > >
> > > > >
> > > > > I went back to Linus' latest tree for a baseline and we fail the same way.
> > > > > This has none of the latest 8 patches applied, so we will
> > > > > have to figure out what broke this.
> > > > >
> > > > > Don't forget that I tested all this recently with Bart's dma patch series
> > > > > and it's solid.
> > > > >
> > > > > Will come back to this tomorrow and see what recently made it into Linus's
> > > > > tree by checking back with Doug.
> > > > >
> > > > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > > [  183.853047] 00000000 00000000 00000000 00000000
> > > > > [  183.878425] 00000000 00000000 00000000 00000000
> > > > > [  183.903243] 00000000 00000000 00000000 00000000
> > > > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > [  198.603037] 00000000 00000000 00000000 00000000
> > > > > [  198.628884] 00000000 00000000 00000000 00000000
> > > > > [  198.653961] 00000000 00000000 00000000 00000000
> > > > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > ..
> > > > > ..
> > > > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > [  403.140403] 00000000 00000000 00000000 00000000
> > > > > [  403.140403] 00
> > > > >
> > > > >
> > > > Hello
> > > >
> > > > Let me summarize where we are and how we got here.
> > > >
> > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4
> > > > with Bart's dma patches.
> > > > All tests passed.
> > > >
> > > > I pulled Linus's tree and applied all 8 patches of the above series and
> > > > we failed in the
> > > > "failed FAST REG status memory management" area.
> > > >
> > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I
> > > > thought patch 6 of the series may have been the catalyst.
> > > >
> > > > This also failed.
> > > >
> > > > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> > > >
> > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail.
> > > >
> > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and
> > > > ib_srp.
> > > 
> > > From infiniband side:
> > > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
> > >       0       0       0
> > > 
> > > From eth nothing suspicious too:
> > > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
> > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
> > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
> > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
> > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > ad05df399f33 net/mlx5e: Remove unused variable
> > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > 
> > > 
> > > >
> > > > Thanks
> > > > Laurence
> > > 
> > 
> > Hi Leon,
> > Yep, I also looked for outliers here that may look suspicious and did not
> > see any.
> > 
> > I guess I will have to start bisecting.
> > I will start with rc5; if that fails I will bisect between rc4 and rc5, as we
> > know rc4 was fine.
> > 
> > I did re-run tests on rc4 last night and it was stable.
> > 
> > Thanks
> > Laurence
> > 
> 
> OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting,
> unless one of you thinks you know what may be causing this in rc6.
> This will take time, so I will come back to the list once I have it isolated.
> 
> Thanks
> Laurence
> 
The bisect has 8 possible kernel builds covering 200+ changes; I have started the first one.
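
As a rough sanity check on that figure: a bisection halves the candidate
range with every build, so the build count grows with the base-2 logarithm of
the number of changes. The helper below is a hypothetical illustration, not
part of any tool, and it assumes roughly 200 candidate changes, taken from
the "200 +" figure above.

```python
import math


def bisect_builds(n_changes):
    """Estimate the builds needed to bisect n_changes candidate changes."""
    if n_changes <= 1:
        return 0                      # nothing left to narrow down
    return math.ceil(math.log2(n_changes))


# Assumed input: about 200 changes between the good and the bad kernel.
builds = bisect_builds(200)
print("estimated builds:", builds)
```

The actual count reported by the bisect tooling can differ slightly from this
estimate, depending on the exact number of changes and on merge history.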

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                                             ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 21:34                                               ` Laurence Oberman
       [not found]                                                 ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 21:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 11:47:31 AM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Monday, February 13, 2017 11:12:55 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Monday, February 13, 2017 9:24:01 AM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Sent: Monday, February 13, 2017 9:17:24 AM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying
> > > > a
> > > > QP
> > > > 
> > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > destroying
> > > > > > a
> > > > > > QP
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > destroying
> > > > > > > a
> > > > > > > QP
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > > destroying a
> > > > > > > > QP
> > > > > > > >
> > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > > > > > drivers/infiniband/core/verbs.c:1959
> > > > > > > > > __ib_drain_sq+0x1bb/0x1c0
> > > > > > > > > [ib_core]
> > > > > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for
> > > > > > > > > drain
> > > > > > > >
> > > > > > > > Hello Laurence,
> > > > > > > >
> > > > > > > > That warning has been removed by patch 7/8 of this series.
> > > > > > > > Please
> > > > > > > > double
> > > > > > > > check
> > > > > > > > whether all eight patches have been applied properly.
> > > > > > > >
> > > > > > > > Bart.N�����r��y���b�X��ǧv�^�)޺{.n�+����{��ٚ�{ay�ʇڙ�,j��f���h���z��w������j:+v���w�j�m��������zZ+��ݢj"��
> > > > > > >
> > > > > > > Hello
> > > > > > > Just a heads up, working with Bart on this patch series.
> > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test
> > > > > > > bed.
> > > > > > > Thanks
> > > > > > > Laurence
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > linux-rdma"
> > > > > > > in
> > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > > More majordomo info at
> > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > >
> > > > > >
> > > > > > I went back to Linus' latest tree for a baseline and we fail the same way.
> > > > > > This has none of the latest 8 patches applied, so we will
> > > > > > have to figure out what broke this.
> > > > > >
> > > > > > Don't forget that I tested all this recently with Bart's dma patch series
> > > > > > and it's solid.
> > > > > >
> > > > > > Will come back to this tomorrow and see what recently made it into Linus's
> > > > > > tree by checking back with Doug.
> > > > > >
> > > > > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > > > [  183.853047] 00000000 00000000 00000000 00000000
> > > > > > [  183.878425] 00000000 00000000 00000000 00000000
> > > > > > [  183.903243] 00000000 00000000 00000000 00000000
> > > > > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > > [  198.603037] 00000000 00000000 00000000 00000000
> > > > > > [  198.628884] 00000000 00000000 00000000 00000000
> > > > > > [  198.653961] 00000000 00000000 00000000 00000000
> > > > > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > ..
> > > > > > ..
> > > > > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > > [  403.140403] 00000000 00000000 00000000 00000000
> > > > > > [  403.140403] 00
> > > > > >
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > linux-rdma"
> > > > > > in
> > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > >
> > > > > Hello
> > > > >
> > > > > Let summarize where we are and how we got here.
> > > > >
> > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4
> > > > > with
> > > > > Barts dma patches.
> > > > > All tests passed.
> > > > >
> > > > > I pulled Linus's tree and applied all 8 patches of the above series
> > > > > and
> > > > > we
> > > > > failed in the
> > > > > "failed FAST REG status memory management" area.
> > > > >
> > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I
> > > > > thought patch 6 of the series
> > > > > may have been the catalyst.
> > > > >
> > > > > This also failed.
> > > > >
> > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again.
> > > > >
> > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail.
> > > > >
> > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and
> > > > > ib_srp.
> > > > 
> > > > From infiniband side:
> > > > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 --
> > > > drivers/inifiniband |wc
> > > >       0       0       0
> > > > 
> > > > From eth nothing suspicious too:
> > > > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 --
> > > > drivers/net/ethernet/mellanox/mlx5
> > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW
> > > > command
> > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper
> > > > devices
> > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only
> > > > after
> > > > FDB
> > > > destroy
> > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering
> > > > name-space
> > > > fails
> > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering
> > > > name-space
> > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > > ad05df399f33 net/mlx5e: Remove unused variable
> > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num
> > > > channels
> > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > > 
> > > > 
> > > > >
> > > > > Thanks
> > > > > Laurence
> > > > 
> > > 
> > > Hi Leon,
> > > Yep, I also looked for outliers here that may look suspicious and did not
> > > see
> > > any.
> > > 
> > > I guess I will have to start bisecting.
> > > I will start with rc5, if that fails will bisect between rc4 and rc5, as
> > > we
> > > know rc4 was fine.
> > > 
> > > I did re-run tests on rc4 last night and I was stable.
> > > 
> > > Thanks
> > > Laurence
> > > 
> > 
> > OK, so 4.10.0-rc5 is fine, 4.10.0-rc6 fails, so will start bisecting.
> > Unless one of you think you know what may be causing this in rc6.
> > This will take time so will come back to the list once I have it isolated.
> > 
> > Thanks
> > Laurence
> > 
> Bisect has 8 possible kernel builds, 200 + changes, started the first one.
> 
> Thanks
> Laurence
> 

Hello

Bisecting led me to the commit below. I had reviewed it at some point, looking for an explanation.
At the time I did not understand the need for the change, but I accepted it after it was explained.
With it reverted we are good again, but from reading the code I do not see how it affects us.

It makes no sense that this could be the issue.

Nevertheless, we will need to revert this, please.

I will now apply the 8 patches from Bart to Linus's tree with this commit reverted and test again.

Bisect run

git bisect start
git bisect bad  566cf877a1fcb6d6dc0126b076aad062054c2637
git bisect good 7a308bb3016f57e5be11a677d15b821536419d36
git bisect good
git bisect good
git bisect bad
git bisect bad
git bisect bad
git bisect bad
git bisect good

Bisecting: 0 revisions left to test after this (roughly 1 step)
[0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid indirect_sg_entries parameter value
[loberman@ibclient linux-torvalds]$ git show 0a475ef4226e305bdcffe12b401ca1eab06c4913
commit 0a475ef4226e305bdcffe12b401ca1eab06c4913
Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date:   Wed Jan 4 15:59:37 2017 +0200

    IB/srp: fix invalid indirect_sg_entries parameter value
    
    After setting indirect_sg_entries module_param to huge value (e.g 500,000),
    srp_alloc_req_data() fails to allocate indirect descriptors for the request
    ring (kmalloc fails). This commit enforces the maximum value of
    indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param
    description.
    
    Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
    Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't fit in SRP_CMD)
    Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 4.7+
    Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>--
    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0f67cf9..79bf484 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
                indirect_sg_entries = cmd_sg_entries;
        }
 
+       if (indirect_sg_entries > SG_MAX_SEGMENTS) {
+               pr_warn("Clamping indirect_sg_entries to %u\n",
+                       SG_MAX_SEGMENTS);
+               indirect_sg_entries = SG_MAX_SEGMENTS;
+       }
+
        srp_remove_wq = create_workqueue("srp_remove");
        if (!srp_remove_wq) {
                ret = -ENOMEM;




^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
       [not found]                                                 ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 21:46                                                   ` Laurence Oberman
       [not found]                                                     ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 21:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 4:34:04 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Monday, February 13, 2017 11:47:31 AM
> > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > QP
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Sent: Monday, February 13, 2017 11:12:55 AM
> > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a
> > > QP
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > Sent: Monday, February 13, 2017 9:24:01 AM
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying
> > > > a
> > > > QP
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > Sent: Monday, February 13, 2017 9:17:24 AM
> > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > destroying
> > > > > a
> > > > > QP
> > > > > 
> > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM
> > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > destroying
> > > > > > > a
> > > > > > > QP
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM
> > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > > destroying
> > > > > > > > a
> > > > > > > > QP
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > > > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> > > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM
> > > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before
> > > > > > > > > destroying a
> > > > > > > > > QP
> > > > > > > > >
> > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > > > [  861.143141] WARNING: CPU: 27 PID: 1103 at
> > > > > > > > > > drivers/infiniband/core/verbs.c:1959
> > > > > > > > > > __ib_drain_sq+0x1bb/0x1c0
> > > > > > > > > > [ib_core]
> > > > > > > > > > [  861.202208] IB_POLL_DIRECT poll_ctx not supported for
> > > > > > > > > > drain
> > > > > > > > >
> > > > > > > > > Hello Laurence,
> > > > > > > > >
> > > > > > > > > That warning has been removed by patch 7/8 of this series.
> > > > > > > > > Please
> > > > > > > > > double
> > > > > > > > > check
> > > > > > > > > whether all eight patches have been applied properly.
> > > > > > > > >
> > > > > > > > > > Bart.
> > > > > > > >
> > > > > > > > Hello
> > > > > > > > Just a heads up, working with Bart on this patch series.
> > > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test
> > > > > > > > bed.
> > > > > > > > Thanks
> > > > > > > > Laurence
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > linux-rdma"
> > > > > > > > in
> > > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > > > More majordomo info at
> > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > >
> > > > > > >
> > > > > > > I went back to Linus' latest tree for a baseline and we fail the
> > > > > > > same
> > > > > > > way.
> > > > > > > This has none of the latest 8 patches applied so we will
> > > > > > > have to figure out what broke this.
> > > > > > >
> > > > > > > Dont forget that I tested all this recently with Bart's dma patch
> > > > > > > series
> > > > > > > and its solid.
> > > > > > >
> > > > > > > Will come back to this tomorrow and see what recently made it
> > > > > > > into
> > > > > > > Linus's
> > > > > > > tree by
> > > > > > > checking back with Doug.
> > > > > > >
> > > > > > > [  183.779175] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff880bd4270eb0
> > > > > > > [  183.853047] 00000000 00000000 00000000 00000000
> > > > > > > [  183.878425] 00000000 00000000 00000000 00000000
> > > > > > > [  183.903243] 00000000 00000000 00000000 00000000
> > > > > > > [  183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > > > [  198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > > > [  198.603037] 00000000 00000000 00000000 00000000
> > > > > > > [  198.628884] 00000000 00000000 00000000 00000000
> > > > > > > [  198.653961] 00000000 00000000 00000000 00000000
> > > > > > > [  198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > > > [  198.705985] scsi host1: ib_srp: failed FAST REG status memory
> > > > > > > management
> > > > > > > operation error (6) for CQE ffff880b92860138
> > > > > > > [  213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  213.568828] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  227.616175] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  242.668160] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  257.165623] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  272.262570] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  286.386160] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  301.144930] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  315.944594] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  330.584552] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  345.032115] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  359.902114] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > ..
> > > > > > > ..
> > > > > > > [  373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  373.149511] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > > > [  388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  388.623462] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [  403.120876] scsi host1: ib_srp: failed RECV status WR flushed
> > > > > > > (5)
> > > > > > > for
> > > > > > > CQE
> > > > > > > ffff8817f2234c30
> > > > > > > [  403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > > > [  403.140402] 00000000 00000000 00000000 00000000
> > > > > > > [  403.140403] 00000000 00000000 00000000 00000000
> > > > > > > [  403.140403] 00
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > linux-rdma"
> > > > > > > in
> > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > > More majordomo info at
> > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > >
> > > > > > Hello
> > > > > >
> > > > > > Let summarize where we are and how we got here.
> > > > > >
> > > > > > The last kernel I tested with mlx5 and ib_srp was
> > > > > > vmlinuz-4.10.0-rc4
> > > > > > with
> > > > > > Barts dma patches.
> > > > > > All tests passed.
> > > > > >
> > > > > > I pulled Linus's tree and applied all 8 patches of the above series
> > > > > > and
> > > > > > we
> > > > > > failed in the
> > > > > > "failed FAST REG status memory management" area.
> > > > > >
> > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and
> > > > > > I
> > > > > > thought patch 6 of the series
> > > > > > may have been the catalyst.
> > > > > >
> > > > > > This also failed.
> > > > > >
> > > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again.
> > > > > >
> > > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we
> > > > > > fail.
> > > > > >
> > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and
> > > > > > ib_srp.
> > > > > 
> > > > > From infiniband side:
> > > > > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 --
> > > > > drivers/inifiniband |wc
> > > > >       0       0       0
> > > > > 
> > > > > From eth nothing suspicious too:
> > > > > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 --
> > > > > drivers/net/ethernet/mellanox/mlx5
> > > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW
> > > > > command
> > > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper
> > > > > devices
> > > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only
> > > > > after
> > > > > FDB
> > > > > destroy
> > > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering
> > > > > name-space
> > > > > fails
> > > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering
> > > > > name-space
> > > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > > > ad05df399f33 net/mlx5e: Remove unused variable
> > > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num
> > > > > channels
> > > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > > > 
> > > > > 
> > > > > >
> > > > > > Thanks
> > > > > > Laurence
> > > > > 
> > > > 
> > > > Hi Leon,
> > > > Yep, I also looked for outliers here that may look suspicious and did
> > > > not
> > > > see
> > > > any.
> > > > 
> > > > I guess I will have to start bisecting.
> > > > I will start with rc5, if that fails will bisect between rc4 and rc5,
> > > > as
> > > > we
> > > > know rc4 was fine.
> > > > 
> > > > I did re-run tests on rc4 last night and I was stable.
> > > > 
> > > > Thanks
> > > > Laurence
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> > > > in
> > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > 
> > > OK, so 4.10.0-rc5 is fine, 4.10.0-rc6 fails, so will start bisecting.
> > > Unless one of you think you know what may be causing this in rc6.
> > > This will take time so will come back to the list once I have it
> > > isolated.
> > > 
> > > Thanks
> > > Laurence
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > Bisect has 8 possible kernel builds, 200 + changes, started the first one.
> > 
> > Thanks
> > Laurence
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Hello
> 
> Bisecting got me to this commit, I had reviewed this looking for an
> explanation at some point.
> At the time, I did not understand the need for the change but after
> explanation I accepted it.
> I reverted this and we are good again but reading the code, not seeing how
> this is affecting us.
>  
> Makes no sense how this can be the issue.
> 
> Nevertheless we will need to revert this please.
> 
> I will now apply the 8 patches from Bart to Linus's tree with this reverted
> and test again.
> 
> Bisect run
> 
> git bisect start
> git bisect bad  566cf877a1fcb6d6dc0126b076aad062054c2637
> git bisect good 7a308bb3016f57e5be11a677d15b821536419d36
> git bisect good
> git bisect good
> git bisect bad
> git bisect bad
> git bisect bad
> git bisect bad
> git bisect good
> 
> Bisecting: 0 revisions left to test after this (roughly 1 step)
> [0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid
> indirect_sg_entries parameter value
> [loberman@ibclient linux-torvalds]$ git show
> 0a475ef4226e305bdcffe12b401ca1eab06c4913
> commit 0a475ef4226e305bdcffe12b401ca1eab06c4913
> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date:   Wed Jan 4 15:59:37 2017 +0200
> 
>     IB/srp: fix invalid indirect_sg_entries parameter value
>     
>     After setting indirect_sg_entries module_param to huge value (e.g
>     500,000),
>     srp_alloc_req_data() fails to allocate indirect descriptors for the
>     request
>     ring (kmalloc fails). This commit enforces the maximum value of
>     indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param
>     description.
>     
>     Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
>     Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't
>     fit in SRP_CMD)
>     Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 4.7+
>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>--
>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> b/drivers/infiniband/ulp/srp/ib_srp.c
> index 0f67cf9..79bf484 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
>                 indirect_sg_entries = cmd_sg_entries;
>         }
>  
> +       if (indirect_sg_entries > SG_MAX_SEGMENTS) {
> +               pr_warn("Clamping indirect_sg_entries to %u\n",
> +                       SG_MAX_SEGMENTS);
> +               indirect_sg_entries = SG_MAX_SEGMENTS;
> +       }
> +
>         srp_remove_wq = create_workqueue("srp_remove");
>         if (!srp_remove_wq) {
>                 ret = -ENOMEM;
> 
> 
> 
> 

Hello

The revert does not actually help; the test failed again after a while.

The previous mail was sitting in my drafts while I was testing and was sent by mistake.
I am glad the revert does not help, because that commit being the cause made no sense.

So I am not sure how the bisect led me here, but it did.

I will have to run through this again and see where the bisect went wrong.

Thanks
Laurence


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                     ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 21:52                                                       ` Bart Van Assche
       [not found]                                                         ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-13 21:52 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> I will have to run through this again and see where the bisect went wrong.

Hello Laurence,

If you are considering repeating the bisect: did you know that a bisect
can be sped up by specifying the names of the files and/or directories that
are suspected? An example:

git bisect start */infiniband */net

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                         ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13 21:56                                                           ` Laurence Oberman
  2017-02-14  2:19                                                           ` Laurence Oberman
  1 sibling, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 21:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 4:52:28 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
> 
> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > I will have to run through this again and see where the bisect went wrong.
> 
> Hello Laurence,
> 
> If you would be considering to repeat the bisect, did you know that a bisect
> can be sped up by specifying the names of the files and/or directories that
> are suspected? An example:
> 
> git bisect start */infiniband */net
> 
> Bart.
> 
Hello Bart

I will try that. I knew it was possible but had not used it before, so I wanted to be careful.
Even being careful, something went wrong :)
I was very careful and waited in between tests to give it long enough.
Perhaps I said good when it was bad, or something like that.

I will use your method and by tomorrow I should have this figured out for you.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                         ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2017-02-13 21:56                                                           ` Laurence Oberman
@ 2017-02-14  2:19                                                           ` Laurence Oberman
       [not found]                                                             ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14  2:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 4:52:28 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
> 
> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > I will have to run through this again and see where the bisect went wrong.
> 
> Hello Laurence,
> 
> If you would be considering to repeat the bisect, did you know that a bisect
> can be sped up by specifying the names of the files and/or directories that
> are suspected? An example:
> 
> git bisect start */infiniband */net
> 
> Bart.
> 

Hello Bart, 

Much better news this time :). I worked late on this but got it figured out.

OK, so the bisect landed on the commit below, which makes a lot more sense and is right in the area where we are having issues.
I must have given the wrong answer at one of the steps the first time I did the bisect.

I reverted this commit in the master tree of rc8 and rebuilt the kernel.
Now all tests pass on Linus's tree - 4.10.0_rc8+

The interesting point here is that this commit is already in rc5, but rc5 was not failing, so it looks like we have an interoperability issue with this commit.


[loberman@ibclient linux]$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps

[loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361
commit ad8e66b4a80182174f73487ed25fd2140cf43361
Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date:   Wed Dec 28 12:48:28 2016 +0200

    IB/srp: fix mr allocation when the device supports sg gaps
    
    If the device support arbitrary sg list mapping (device cap
    IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
    IB_MR_TYPE_SG_GAPS.
    
    Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
    Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
    Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
    Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 8ddc071..0f67cf9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
        struct srp_fr_desc *d;
        struct ib_mr *mr;
        int i, ret = -EINVAL;
+       enum ib_mr_type mr_type;
 
        if (pool_size <= 0)
                goto err;
@@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
        spin_lock_init(&pool->lock);
        INIT_LIST_HEAD(&pool->free_list);
 
+       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+               mr_type = IB_MR_TYPE_SG_GAPS;
+       else
+               mr_type = IB_MR_TYPE_MEM_REG;
+
        for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
-                                max_page_list_len);
+               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
                if (IS_ERR(mr)) {
                        ret = PTR_ERR(mr);
                        if (ret == -ENOMEM)


So here is the revert patch, but you need to decide how you want to deal with this.

    Revert "IB/srp: fix mr allocation when the device supports sg gaps"
    Laurence Oberman
    Traced after bisection to a cause for this failure

Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

commit 90d169d312a173d5350c1bb36d6daab04c592127
Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Date:   Mon Feb 13 20:33:32 2017 -0500

    Revert "IB/srp: fix mr allocation when the device supports sg gaps"
    Laurence Oberman
    Traced after bisection to a cause for this failure
    
    [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
    [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
    [  130.510899] 00000000 00000000 00000000 00000000
    [  130.536455] 00000000 00000000 00000000 00000000
    [  130.561878] 00000000 00000000 00000000 00000000
    [  130.585904] 00000000 0f007806 2500002a db0ec4d0
    [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
    [  146.530439] scsi host1: ib_srp: reconnect succeeded
    [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
    [  146.597635] 00000000 00000000 00000000 00000000
    [  146.623545] 00000000 00000000 00000000 00000000
    [  146.649599] 00000000 00000000 00000000 00000000
    [  146.673938] 00000000 0f007806 25000032 000c46d0
    [  146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
    [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
    [  162.256337] scsi host1: ib_srp: reconnect succeeded
    [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0
    
    This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf484..01338c8 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
        struct srp_fr_desc *d;
        struct ib_mr *mr;
        int i, ret = -EINVAL;
-       enum ib_mr_type mr_type;
 
        if (pool_size <= 0)
                goto err;
@@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
        spin_lock_init(&pool->lock);
        INIT_LIST_HEAD(&pool->free_list);
 
-       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
-               mr_type = IB_MR_TYPE_SG_GAPS;
-       else
-               mr_type = IB_MR_TYPE_MEM_REG;
-
        for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
+               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
+                                max_page_list_len);
                if (IS_ERR(mr)) {
                        ret = PTR_ERR(mr);
                        if (ret == -ENOMEM)



Now moving on to what got me here in the first place.
Bart, let me know whether the other 7 of the 8 patches in your most recent series are all still valid after this revert.
Otherwise let me know which ones you want me to apply.

Patch 6 - I am thinking it is no longer valid.
"
If a HCA supports the SG_GAPS_REG feature then a single memory
region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
reduces the number of memory regions that is allocated per SRP
session.
"
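
For anyone following along, the capability check at the centre of this
revert can be sketched in plain C roughly as below. This is only a
user-space illustration: every demo_ name and the flag value are
assumptions made up for the example, not the kernel's definitions. In
the driver the chosen type is what gets handed to ib_alloc_mr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IB_DEVICE_SG_GAPS_REG; the value is illustrative. */
#define DEMO_DEVICE_SG_GAPS_REG (1ULL << 32)

/* Hypothetical stand-ins for IB_MR_TYPE_MEM_REG and IB_MR_TYPE_SG_GAPS. */
enum demo_mr_type {
	DEMO_MR_TYPE_MEM_REG,
	DEMO_MR_TYPE_SG_GAPS,
};

/* Hypothetical stand-in for the device attributes that carry the cap flags. */
struct demo_device_attr {
	uint64_t device_cap_flags;
};

/*
 * Same decision as the hunk added by ad8e66b4a801: pick the SG_GAPS
 * memory region type only when the HCA advertises the capability,
 * otherwise fall back to the plain fast-registration type.
 */
static enum demo_mr_type demo_pick_mr_type(const struct demo_device_attr *attrs)
{
	assert(attrs != NULL);

	if (attrs->device_cap_flags & DEMO_DEVICE_SG_GAPS_REG)
		return DEMO_MR_TYPE_SG_GAPS;
	return DEMO_MR_TYPE_MEM_REG;
}
```

With the revert applied, the driver behaves as if this helper always
returned the MEM_REG type, whatever the HCA reports.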

Thanks
Laurence

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re:   [PATCH 0/8] IB/srp bug fixes
       [not found]                 ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14  3:02                   ` Laurence Oberman
       [not found]                     ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14  3:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Israel Rukshin, Max Gurtovoy

Hello Bart

The following 7 of 8 patches were applied to Linus's latest tree.

However this required first reverting 

commit ad8e66b4a80182174f73487ed25fd2140cf43361
Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date:   Wed Dec 28 12:48:28 2016 +0200

See my other email regarding why the above needed to be reverted.

All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests.

4.10.0-rc8.bart+

The revert of the above meant I did not apply and test patch 6 of the series
IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported

  IB/srp: Avoid that duplicate responses trigger a kernel bug
  IB/srp: Fix race conditions related to task management
  IB/srp: Document locking conventions
  IB/srp: Make a diagnostic message more informative
  IB/srp: Improve an error path
  *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
  IB/core: Add support for draining IB_POLL_DIRECT completion queues
  IB/srp: Drain the send queue before destroying a QP

For the series except patch 6

Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                             ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14  6:39                                                               ` Leon Romanovsky
       [not found]                                                                 ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-14  6:39 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 8421 bytes --]

On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote:
>
>
> ----- Original Message -----
> > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Monday, February 13, 2017 4:52:28 PM
> > Subject: Re: v4.10-rc SRP + mlx5 regression
> >
> > On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > > I will have to run through this again and see where the bisect went wrong.
> >
> > Hello Laurence,
> >
> > If you would be considering to repeat the bisect, did you know that a bisect
> > can be sped up by specifying the names of the files and/or directories that
> > are suspected? An example:
> >
> > git bisect start */infiniband */net
> >
> > Bart.
> >
>
> Hello Bart,
>
> Much better news this time :), worked late on this but got it figured out.
>
> OK, so we got to this one, which makes a lot more sense and is right in the area where we are having issues.
> I must have answered wrong to one of the steps the first time I did the bisect.
>
> Reverted this in the master tree of rc8 and rebuilt the kernel
> Now all tests pass on Linus's tree - 4.10.0_rc8+
>
> The interesting point here is that this commit is in rc5 but rc5 was not failing so we have an interoperability issue with this commit
>
>
> [loberman@ibclient linux]$ git bisect good
> Bisecting: 0 revisions left to test after this (roughly 1 step)
> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps
>
> [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361
> commit ad8e66b4a80182174f73487ed25fd2140cf43361
> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date:   Wed Dec 28 12:48:28 2016 +0200
>
>     IB/srp: fix mr allocation when the device supports sg gaps
>
>     If the device support arbitrary sg list mapping (device cap
>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
>     IB_MR_TYPE_SG_GAPS.
>
>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 8ddc071..0f67cf9 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>         struct srp_fr_desc *d;
>         struct ib_mr *mr;
>         int i, ret = -EINVAL;
> +       enum ib_mr_type mr_type;
>
>         if (pool_size <= 0)
>                 goto err;
> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>         spin_lock_init(&pool->lock);
>         INIT_LIST_HEAD(&pool->free_list);
>
> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> +               mr_type = IB_MR_TYPE_SG_GAPS;
> +       else
> +               mr_type = IB_MR_TYPE_MEM_REG;
> +
>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> -                                max_page_list_len);
> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);

First, ib_alloc_mr takes a u32 as its third parameter, but an int is
supplied here. Second (I may be wrong here), shouldn't max_page_list_len be
replaced with max_fast_reg_page_list_len?

Thanks

>                 if (IS_ERR(mr)) {
>                         ret = PTR_ERR(mr);
>                         if (ret == -ENOMEM)
> (END)
>
>
> So here is the revert patch, but you need to decide how you want to deal with this.
>
>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>     Laurence Oberman
>     Traced after bisection to a cause for this failure
>
> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> commit 90d169d312a173d5350c1bb36d6daab04c592127
> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date:   Mon Feb 13 20:33:32 2017 -0500
>
>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>     Laurence Oberman
>     Traced after bisection to a cause for this failure
>
>     [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
>     [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
>     [  130.510899] 00000000 00000000 00000000 00000000
>     [  130.536455] 00000000 00000000 00000000 00000000
>     [  130.561878] 00000000 00000000 00000000 00000000
>     [  130.585904] 00000000 0f007806 2500002a db0ec4d0
>     [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>     [  146.530439] scsi host1: ib_srp: reconnect succeeded
>     [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
>     [  146.597635] 00000000 00000000 00000000 00000000
>     [  146.623545] 00000000 00000000 00000000 00000000
>     [  146.649599] 00000000 00000000 00000000 00000000
>     [  146.673938] 00000000 0f007806 25000032 000c46d0
>     [  146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
>     [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>     [  162.256337] scsi host1: ib_srp: reconnect succeeded
>     [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0`
>
>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 79bf484..01338c8 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>         struct srp_fr_desc *d;
>         struct ib_mr *mr;
>         int i, ret = -EINVAL;
> -       enum ib_mr_type mr_type;
>
>         if (pool_size <= 0)
>                 goto err;
> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>         spin_lock_init(&pool->lock);
>         INIT_LIST_HEAD(&pool->free_list);
>
> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> -               mr_type = IB_MR_TYPE_SG_GAPS;
> -       else
> -               mr_type = IB_MR_TYPE_MEM_REG;
> -
>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> +                                max_page_list_len);
>                 if (IS_ERR(mr)) {
>                         ret = PTR_ERR(mr);
>                         if (ret == -ENOMEM)
>
>
>
> Now moving on to what got me here in the first place.
> Bart, let me know if the 7 of the 8 patches in your most recent series are all still valid after this revert
> Otherwise let me know which ones you want me to apply.
>
> patch 6 - I am thinking i sno longer valid.
> "
> If a HCA supports the SG_GAPS_REG feature then a single memory
> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
> reduces the number of memory regions that is allocated per SRP
> session.
> "
>
> Thanks
> Laurence

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                 ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-02-14 10:00                                                                   ` Max Gurtovoy
       [not found]                                                                     ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Max Gurtovoy @ 2017-02-14 10:00 UTC (permalink / raw)
  To: Leon Romanovsky, Laurence Oberman
  Cc: Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

Hi Laurence,
Can you specify the test that reproduces these failures?
Have you tried running with a CX5 HCA, or only with CX4?
I think this commit is right and we have issues in other places.


On 2/14/2017 8:39 AM, Leon Romanovsky wrote:
> On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>> Sent: Monday, February 13, 2017 4:52:28 PM
>>> Subject: Re: v4.10-rc SRP + mlx5 regression
>>>
>>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
>>>> I will have to run through this again and see where the bisect went wrong.
>>>
>>> Hello Laurence,
>>>
>>> If you would be considering to repeat the bisect, did you know that a bisect
>>> can be sped up by specifying the names of the files and/or directories that
>>> are suspected? An example:
>>>
>>> git bisect start */infiniband */net
>>>
>>> Bart.
>>>
>>
>> Hello Bart,
>>
>> Much better news this time :), worked late on this but got it figured out.
>>
>> OK, so we got to this one, which makes a lot more sense and is right in the area where we are having issues.
>> I must have answered wrong to one of the steps the first time I did the bisect.
>>
>> Reverted this in the master tree of rc8 and rebuilt the kernel
>> Now all tests pass on Linus's tree - 4.10.0_rc8+
>>
>> The interesting point here is that this commit is in rc5 but rc5 was not failing so we have an interoperability issue with this commit
>>
>>
>> [loberman@ibclient linux]$ git bisect good
>> Bisecting: 0 revisions left to test after this (roughly 1 step)
>> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps
>>
>> [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361
>> commit ad8e66b4a80182174f73487ed25fd2140cf43361
>> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Date:   Wed Dec 28 12:48:28 2016 +0200
>>
>>     IB/srp: fix mr allocation when the device supports sg gaps
>>
>>     If the device support arbitrary sg list mapping (device cap
>>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
>>     IB_MR_TYPE_SG_GAPS.
>>
>>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
>>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
>>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
>> index 8ddc071..0f67cf9 100644
>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>>         struct srp_fr_desc *d;
>>         struct ib_mr *mr;
>>         int i, ret = -EINVAL;
>> +       enum ib_mr_type mr_type;
>>
>>         if (pool_size <= 0)
>>                 goto err;
>> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>>         spin_lock_init(&pool->lock);
>>         INIT_LIST_HEAD(&pool->free_list);
>>
>> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
>> +               mr_type = IB_MR_TYPE_SG_GAPS;
>> +       else
>> +               mr_type = IB_MR_TYPE_MEM_REG;
>> +
>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
>> -                                max_page_list_len);
>> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
>
> First, ib_alloc_mr receives u32 as a third parameter, but int was
> supplied. Second (I can be wrong here), shouldn't max_page_list_len be
> replaced with max_fast_reg_page_list_len?
>
> Thanks

There is a statement that:

	if (srp_dev->use_fast_reg) {
		srp_dev->max_pages_per_mr =
			min_t(u32, srp_dev->max_pages_per_mr,
			      attr->max_fast_reg_page_list_len);
	}

so max_pages_per_mr is already capped at max_fast_reg_page_list_len in this case.
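
To make the effect of that clamp concrete, here is a small user-space
sketch. It assumes the page list length passed down to ib_alloc_mr() is
derived from max_pages_per_mr and that the value is non-negative. The
demo_ structs and helpers are hypothetical stand-ins written for this
example, and demo_min_u32() only imitates what min_t(u32, ...) does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SRP device fields used in the clamp. */
struct demo_srp_device {
	int use_fast_reg;
	int max_pages_per_mr;	/* kept as int, as in the snippet quoted above */
};

/* Hypothetical stand-in for the HCA attributes. */
struct demo_hca_attr {
	uint32_t max_fast_reg_page_list_len;
};

/* Imitates min_t(u32, a, b): both operands are treated as u32 before comparing. */
static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Cap max_pages_per_mr at what the HCA supports for fast registration.
 * Nothing changes when fast registration is not in use.
 */
static void demo_clamp_pages_per_mr(struct demo_srp_device *dev,
				    const struct demo_hca_attr *attr)
{
	assert(dev != NULL && attr != NULL);

	if (dev->use_fast_reg)
		dev->max_pages_per_mr =
			(int)demo_min_u32((uint32_t)dev->max_pages_per_mr,
					  attr->max_fast_reg_page_list_len);
}
```

After the clamp the value is no larger than it was before, so as long as
it started out non-negative it can be converted to u32 without changing
its meaning.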

>
>>                 if (IS_ERR(mr)) {
>>                         ret = PTR_ERR(mr);
>>                         if (ret == -ENOMEM)
>> (END)
>>
>>
>> So here is the revert patch, but you need to decide how you want to deal with this.
>>
>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>>     Laurence Oberman
>>     Traced after bisection to a cause for this failure
>>
>> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> commit 90d169d312a173d5350c1bb36d6daab04c592127
>> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Date:   Mon Feb 13 20:33:32 2017 -0500
>>
>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>>     Laurence Oberman
>>     Traced after bisection to a cause for this failure
>>
>>     [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
>>     [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
>>     [  130.510899] 00000000 00000000 00000000 00000000
>>     [  130.536455] 00000000 00000000 00000000 00000000
>>     [  130.561878] 00000000 00000000 00000000 00000000
>>     [  130.585904] 00000000 0f007806 2500002a db0ec4d0
>>     [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>     [  146.530439] scsi host1: ib_srp: reconnect succeeded
>>     [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
>>     [  146.597635] 00000000 00000000 00000000 00000000
>>     [  146.623545] 00000000 00000000 00000000 00000000
>>     [  146.649599] 00000000 00000000 00000000 00000000
>>     [  146.673938] 00000000 0f007806 25000032 000c46d0
>>     [  146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
>>     [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>     [  162.256337] scsi host1: ib_srp: reconnect succeeded
>>     [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0`
>>
>>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
>>
>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
>> index 79bf484..01338c8 100644
>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>>         struct srp_fr_desc *d;
>>         struct ib_mr *mr;
>>         int i, ret = -EINVAL;
>> -       enum ib_mr_type mr_type;
>>
>>         if (pool_size <= 0)
>>                 goto err;
>> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>>         spin_lock_init(&pool->lock);
>>         INIT_LIST_HEAD(&pool->free_list);
>>
>> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
>> -               mr_type = IB_MR_TYPE_SG_GAPS;
>> -       else
>> -               mr_type = IB_MR_TYPE_MEM_REG;
>> -
>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
>> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
>> +                                max_page_list_len);
>>                 if (IS_ERR(mr)) {
>>                         ret = PTR_ERR(mr);
>>                         if (ret == -ENOMEM)
>>
>>
>>
>> Now moving on to what got me here in the first place.
>> Bart, let me know if the 7 of the 8 patches in your most recent series are all still valid after this revert
>> Otherwise let me know which ones you want me to apply.
>>
>> patch 6 - I am thinking i sno longer valid.
>> "
>> If a HCA supports the SG_GAPS_REG feature then a single memory
>> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
>> reduces the number of memory regions that is allocated per SRP
>> session.
>> "
>>
>> Thanks
>> Laurence

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                     ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-14 13:31                                                                       ` Laurence Oberman
       [not found]                                                                         ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-02-14 16:53                                                                       ` Bart Van Assche
  1 sibling, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 13:31 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Tuesday, February 14, 2017 5:00:04 AM
> Subject: Re: v4.10-rc SRP + mlx5 regression
> 
> Hi Laurence,
> can you specify the test that repro these failures ?
> have you tried running with CX5 HCA or only CX4 ?
> I think this commit is right and we have issues in other places.
> 
> 
> On 2/14/2017 8:39 AM, Leon Romanovsky wrote:
> > On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote:
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> >>> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >>> Sent: Monday, February 13, 2017 4:52:28 PM
> >>> Subject: Re: v4.10-rc SRP + mlx5 regression
> >>>
> >>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> >>>> I will have to run through this again and see where the bisect went
> >>>> wrong.
> >>>
> >>> Hello Laurence,
> >>>
> >>> If you would be considering to repeat the bisect, did you know that a
> >>> bisect
> >>> can be sped up by specifying the names of the files and/or directories
> >>> that
> >>> are suspected? An example:
> >>>
> >>> git bisect start */infiniband */net
> >>>
> >>> Bart.
> >>>
> >>
> >> Hello Bart,
> >>
> >> Much better news this time :), worked late on this but got it figured out.
> >>
> >> OK, so we got to this one, which makes a lot more sense and is right in
> >> the area where we are having issues.
> >> I must have answered wrong to one of the steps the first time I did the
> >> bisect.
> >>
> >> Reverted this in the master tree of rc8 and rebuilt the kernel
> >> Now all tests pass on Linus's tree - 4.10.0_rc8+
> >>
> >> The interesting point here is that this commit is in rc5 but rc5 was not
> >> failing so we have an interoperability issue with this commit
> >>
> >>
> >> [loberman@ibclient linux]$ git bisect good
> >> Bisecting: 0 revisions left to test after this (roughly 1 step)
> >> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when
> >> the device supports sg gaps
> >>
> >> [loberman@ibclient linux]$ git show
> >> ad8e66b4a80182174f73487ed25fd2140cf43361
> >> commit ad8e66b4a80182174f73487ed25fd2140cf43361
> >> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> Date:   Wed Dec 28 12:48:28 2016 +0200
> >>
> >>     IB/srp: fix mr allocation when the device supports sg gaps
> >>
> >>     If the device support arbitrary sg list mapping (device cap
> >>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
> >>     IB_MR_TYPE_SG_GAPS.
> >>
> >>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
> >>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
> >>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>
> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> >> b/drivers/infiniband/ulp/srp/ib_srp.c
> >> index 8ddc071..0f67cf9 100644
> >> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >> ib_device *device,
> >>         struct srp_fr_desc *d;
> >>         struct ib_mr *mr;
> >>         int i, ret = -EINVAL;
> >> +       enum ib_mr_type mr_type;
> >>
> >>         if (pool_size <= 0)
> >>                 goto err;
> >> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >> ib_device *device,
> >>         spin_lock_init(&pool->lock);
> >>         INIT_LIST_HEAD(&pool->free_list);
> >>
> >> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >> +               mr_type = IB_MR_TYPE_SG_GAPS;
> >> +       else
> >> +               mr_type = IB_MR_TYPE_MEM_REG;
> >> +
> >>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >> -                                max_page_list_len);
> >> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >
> > First, ib_alloc_mr receives u32 as a third parameter, but int was
> > supplied. Second (I can be wrong here), shouldn't max_page_list_len be
> > replaced with max_fast_reg_page_list_len?
> >
> > Thanks
> 
> there is a statement that:
> 
> 	if (srp_dev->use_fast_reg) {
>                  srp_dev->max_pages_per_mr =
>                          min_t(u32, srp_dev->max_pages_per_mr,
>                                attr->max_fast_reg_page_list_len);
>          }
> 
> so we take the max_fast_reg_page_list_len in this case.
> 
> >
> >>                 if (IS_ERR(mr)) {
> >>                         ret = PTR_ERR(mr);
> >>                         if (ret == -ENOMEM)
> >> (END)
> >>
> >>
> >> So here is the revert patch, but you need to decide how you want to deal
> >> with this.
> >>
> >>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >>     Laurence Oberman
> >>     Traced after bisection to a cause for this failure
> >>
> >> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>
> >> commit 90d169d312a173d5350c1bb36d6daab04c592127
> >> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Date:   Mon Feb 13 20:33:32 2017 -0500
> >>
> >>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >>     Laurence Oberman
> >>     Traced after bisection to a cause for this failure
> >>
> >>     [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
> >>     [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5)
> >>     for CQE ffff8817f0edbfb0
> >>     [  130.510899] 00000000 00000000 00000000 00000000
> >>     [  130.536455] 00000000 00000000 00000000 00000000
> >>     [  130.561878] 00000000 00000000 00000000 00000000
> >>     [  130.585904] 00000000 0f007806 2500002a db0ec4d0
> >>     [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>     [  146.530439] scsi host1: ib_srp: reconnect succeeded
> >>     [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
> >>     [  146.597635] 00000000 00000000 00000000 00000000
> >>     [  146.623545] 00000000 00000000 00000000 00000000
> >>     [  146.649599] 00000000 00000000 00000000 00000000
> >>     [  146.673938] 00000000 0f007806 25000032 000c46d0
> >>     [  146.697969] scsi host1: ib_srp: failed FAST REG status memory
> >>     management operation error (6) for CQE ffff88
> >>     [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>     [  162.256337] scsi host1: ib_srp: reconnect succeeded
> >>     [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > >>     for CQE ffff8817f0412ef0
> >>
> >>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
> >>
> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> >> b/drivers/infiniband/ulp/srp/ib_srp.c
> >> index 79bf484..01338c8 100644
> >> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >> ib_device *device,
> >>         struct srp_fr_desc *d;
> >>         struct ib_mr *mr;
> >>         int i, ret = -EINVAL;
> >> -       enum ib_mr_type mr_type;
> >>
> >>         if (pool_size <= 0)
> >>                 goto err;
> >> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >> ib_device *device,
> >>         spin_lock_init(&pool->lock);
> >>         INIT_LIST_HEAD(&pool->free_list);
> >>
> >> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >> -               mr_type = IB_MR_TYPE_SG_GAPS;
> >> -       else
> >> -               mr_type = IB_MR_TYPE_MEM_REG;
> >> -
> >>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >> +                                max_page_list_len);
> >>                 if (IS_ERR(mr)) {
> >>                         ret = PTR_ERR(mr);
> >>                         if (ret == -ENOMEM)
> >>
> >>
> >>
> >> Now moving on to what got me here in the first place.
> >> Bart, let me know if the 7 of the 8 patches in your most recent series are
> >> all still valid after this revert
> >> Otherwise let me know which ones you want me to apply.
> >>
> > >> patch 6 - I am thinking is no longer valid.
> >> "
> >> If a HCA supports the SG_GAPS_REG feature then a single memory
> >> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
> >> reduces the number of memory regions that is allocated per SRP
> >> session.
> >> "
> >>
> >> Thanks
> >> Laurence
> 
Hello Max,

I only have CX4 and CX3 in my lab; this test bed only has CX4.

CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.14.2036
	Hardware version: 0
	Node GUID: 0x7cfe900300726ed2
	System image GUID: 0x7cfe900300726ed2
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 3
		LMC: 0
		SM lid: 3
		Capability mask: 0x2651e84a
		Port GUID: 0x7cfe900300726ed2
		Link layer: InfiniBand

The test is simple. It is the same one I start with every time, because it
reliably exposes problems with mapping large I/O sizes and with memory
registration if such problems exist.

I have a server running LIO with memory-backed LUNs.
These are served via a dual-port mlx5 (CX4) over ib_srpt.

The client mounts these LUNs via ib_srp (mlx5) and device-mapper-multipath,
and I run a simple dd on the XFS file system.

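#!/bin/bash
# $1 selects the file system under test, mounted at /data-$1.
while true
do
	# Write 900 blocks of 4 MiB (3600 MiB) of zeroes, flush, then delete.
	dd if=/dev/zero of="/data-$1/bigfile" bs=4096k count=900
	sync
	rm -f "/data-$1/bigfile"
done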

Once this passes, I run a suite of other tests: read/write, direct and buffered.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                         ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14 16:21                                                                           ` Laurence Oberman
  2017-02-14 17:15                                                                           ` Max Gurtovoy
  2017-02-14 17:15                                                                           ` Max Gurtovoy
  2 siblings, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 16:21 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



Max, Leon, Israel, Bart and Doug

We should consider reverting commit ad8e66b4a801 ("IB/srp: fix mr allocation
when the device supports sg gaps") for now, until we figure out what
specifically triggers this failure, unless a quick fix is forthcoming.
I have been running since last night with that commit reverted and 7 of Bart's
latest patches applied, and it has been rock-solid stable.
It has also shown no performance issues.

Tests included reads and writes, large and small I/O sizes, buffered and
unbuffered I/O, the XFS file system and direct I/O.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                     ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2017-02-14 13:31                                                                       ` Laurence Oberman
@ 2017-02-14 16:53                                                                       ` Bart Van Assche
  1 sibling, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 16:53 UTC (permalink / raw)
  To: maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Tue, 2017-02-14 at 12:00 +0200, Max Gurtovoy wrote:
> can you specify the test that repro these failures ?
> have you tried running with CX5 HCA or only CX4 ?
> I think this commit is right and we have issues in other places.

My proposal is to proceed as Laurence proposed: modify the SRP initiator
driver such that it doesn't use gaps registration anymore. However, one
change is needed on top of the patch Laurence proposed, namely to call
blk_queue_virt_boundary() unconditionally. I'm currently testing this
approach against mlx4.
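
For readers unfamiliar with the virt boundary: once every memory region is
of type IB_MR_TYPE_MEM_REG, a request presumably has to map onto a plain
page list, so every element except the first has to start on the MR page
boundary and every element except the last has to end on it. The following
user-space C sketch models that check only. It is not kernel code, and
struct sg_elem and sg_list_has_no_gaps() are hypothetical names made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one scatter/gather element; not a kernel type. */
struct sg_elem {
	uint64_t addr;	/* start address of the element */
	uint32_t len;	/* length in bytes */
};

/*
 * Return true if the list has no gaps with respect to boundary_mask
 * (for example 4095 for a 4 KiB MR page size), i.e. if it could be
 * described by a single page-list style memory region.
 */
bool sg_list_has_no_gaps(const struct sg_elem *sg, size_t n,
			 uint64_t boundary_mask)
{
	size_t i;

	/* The mask must be of the form 2^k - 1. */
	assert((boundary_mask & (boundary_mask + 1)) == 0);

	for (i = 1; i < n; i++) {
		/* The previous element must end on the boundary ... */
		if ((sg[i - 1].addr + sg[i - 1].len) & boundary_mask)
			return false;
		/* ... and the next element must start on it. */
		if (sg[i].addr & boundary_mask)
			return false;
	}
	return true;
}
```

With a mask of 4095, the list {0x1200, 0xe00}, {0x5000, 0x1000} passes this
check, while {0x1000, 0x800}, {0x3000, 0x1000} does not, because its first
element ends in the middle of a page.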

Bart.

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                         ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-02-14 16:21                                                                           ` Laurence Oberman
@ 2017-02-14 17:15                                                                           ` Max Gurtovoy
       [not found]                                                                             ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2017-02-14 17:15                                                                           ` Max Gurtovoy
  2 siblings, 1 reply; 47+ messages in thread
From: Max Gurtovoy @ 2017-02-14 17:15 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



On 2/14/2017 3:31 PM, Laurence Oberman wrote:
>
>
> ----- Original Message -----
>> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>> Sent: Tuesday, February 14, 2017 5:00:04 AM
>> Subject: Re: v4.10-rc SRP + mlx5 regression
>>
>> Hi Laurence,
>> can you specify the test that repro these failures ?
>> have you tried running with CX5 HCA or only CX4 ?
>> I think this commit is right and we have issues in other places.
>>
>>
>> On 2/14/2017 8:39 AM, Leon Romanovsky wrote:
>>> On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>>>>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>>>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
>>>>> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>>>> Sent: Monday, February 13, 2017 4:52:28 PM
>>>>> Subject: Re: v4.10-rc SRP + mlx5 regression
>>>>>
>>>>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
>>>>>> I will have to run through this again and see where the bisect went
>>>>>> wrong.
>>>>>
>>>>> Hello Laurence,
>>>>>
>>>>> If you are considering repeating the bisect, did you know that a
>>>>> bisect can be sped up by specifying the names of the files and/or
>>>>> directories that are suspected? An example:
>>>>>
>>>>> git bisect start */infiniband */net
>>>>>
>>>>> Bart.
>>>>>
>>>>
>>>> Hello Bart,
>>>>
>>>> Much better news this time :), worked late on this but got it figured out.
>>>>
>>>> OK, so we got to this one, which makes a lot more sense and is right in
>>>> the area where we are having issues.
>>>> I must have answered wrong to one of the steps the first time I did the
>>>> bisect.
>>>>
>>>> Reverted this in the master tree of rc8 and rebuilt the kernel
>>>> Now all tests pass on Linus's tree - 4.10.0_rc8+
>>>>
>>>> The interesting point here is that this commit is in rc5 but rc5 was not
>>>> failing so we have an interoperability issue with this commit
>>>>
>>>>
>>>> [loberman@ibclient linux]$ git bisect good
>>>> Bisecting: 0 revisions left to test after this (roughly 1 step)
>>>> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when
>>>> the device supports sg gaps
>>>>
>>>> [loberman@ibclient linux]$ git show
>>>> ad8e66b4a80182174f73487ed25fd2140cf43361
>>>> commit ad8e66b4a80182174f73487ed25fd2140cf43361
>>>> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> Date:   Wed Dec 28 12:48:28 2016 +0200
>>>>
>>>>     IB/srp: fix mr allocation when the device supports sg gaps
>>>>
>>>>     If the device support arbitrary sg list mapping (device cap
>>>>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
>>>>     IB_MR_TYPE_SG_GAPS.
>>>>
>>>>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
>>>>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
>>>>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>>>>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>>>>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>>
>>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
>>>> b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> index 8ddc071..0f67cf9 100644
>>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
>>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
>>>> ib_device *device,
>>>>         struct srp_fr_desc *d;
>>>>         struct ib_mr *mr;
>>>>         int i, ret = -EINVAL;
>>>> +       enum ib_mr_type mr_type;
>>>>
>>>>         if (pool_size <= 0)
>>>>                 goto err;
>>>> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
>>>> ib_device *device,
>>>>         spin_lock_init(&pool->lock);
>>>>         INIT_LIST_HEAD(&pool->free_list);
>>>>
>>>> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
>>>> +               mr_type = IB_MR_TYPE_SG_GAPS;
>>>> +       else
>>>> +               mr_type = IB_MR_TYPE_MEM_REG;
>>>> +
>>>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>>>> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
>>>> -                                max_page_list_len);
>>>> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
>>>
>>> First, ib_alloc_mr receives u32 as a third parameter, but int was
>>> supplied. Second (I can be wrong here), shouldn't max_page_list_len be
>>> replaced with max_fast_reg_page_list_len?
>>>
>>> Thanks
>>
>> there is a statement that:
>>
>> 	if (srp_dev->use_fast_reg) {
>>                  srp_dev->max_pages_per_mr =
>>                          min_t(u32, srp_dev->max_pages_per_mr,
>>                                attr->max_fast_reg_page_list_len);
>>          }
>>
>> so we take the max_fast_reg_page_list_len in this case.
>>
>>>
>>>>                 if (IS_ERR(mr)) {
>>>>                         ret = PTR_ERR(mr);
>>>>                         if (ret == -ENOMEM)
>>>> (END)
>>>>
>>>>
>>>> So here is the revert patch, but you need to decide how you want to deal
>>>> with this.
>>>>
>>>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>>>>     Laurence Oberman
>>>>     Traced after bisection to a cause for this failure
>>>>
>>>> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>>
>>>> commit 90d169d312a173d5350c1bb36d6daab04c592127
>>>> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>> Date:   Mon Feb 13 20:33:32 2017 -0500
>>>>
>>>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>>>>     Laurence Oberman
>>>>     Traced after bisection to a cause for this failure
>>>>
>>>>     [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
>>>>     [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5)
>>>>     for CQE ffff8817f0edbfb0
>>>>     [  130.510899] 00000000 00000000 00000000 00000000
>>>>     [  130.536455] 00000000 00000000 00000000 00000000
>>>>     [  130.561878] 00000000 00000000 00000000 00000000
>>>>     [  130.585904] 00000000 0f007806 2500002a db0ec4d0
>>>>     [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>>>     [  146.530439] scsi host1: ib_srp: reconnect succeeded
>>>>     [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
>>>>     [  146.597635] 00000000 00000000 00000000 00000000
>>>>     [  146.623545] 00000000 00000000 00000000 00000000
>>>>     [  146.649599] 00000000 00000000 00000000 00000000
>>>>     [  146.673938] 00000000 0f007806 25000032 000c46d0
>>>>     [  146.697969] scsi host1: ib_srp: failed FAST REG status memory
>>>>     management operation error (6) for CQE ffff88
>>>>     [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>>>     [  162.256337] scsi host1: ib_srp: reconnect succeeded
>>>>     [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5)
>>>>     for CQE ffff8817f0412ef0
>>>>
>>>>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
>>>>
>>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
>>>> b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> index 79bf484..01338c8 100644
>>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
>>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
>>>> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
>>>> ib_device *device,
>>>>         struct srp_fr_desc *d;
>>>>         struct ib_mr *mr;
>>>>         int i, ret = -EINVAL;
>>>> -       enum ib_mr_type mr_type;
>>>>
>>>>         if (pool_size <= 0)
>>>>                 goto err;
>>>> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
>>>> ib_device *device,
>>>>         spin_lock_init(&pool->lock);
>>>>         INIT_LIST_HEAD(&pool->free_list);
>>>>
>>>> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
>>>> -               mr_type = IB_MR_TYPE_SG_GAPS;
>>>> -       else
>>>> -               mr_type = IB_MR_TYPE_MEM_REG;
>>>> -
>>>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>>>> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
>>>> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
>>>> +                                max_page_list_len);
>>>>                 if (IS_ERR(mr)) {
>>>>                         ret = PTR_ERR(mr);
>>>>                         if (ret == -ENOMEM)
>>>>
>>>>
>>>>
>>>> Now moving on to what got me here in the first place.
>>>> Bart, let me know if the 7 of the 8 patches in your most recent series are
>>>> all still valid after this revert
>>>> Otherwise let me know which ones you want me to apply.
>>>>
>>>> patch 6 - I am thinking is no longer valid.
>>>> "
>>>> If a HCA supports the SG_GAPS_REG feature then a single memory
>>>> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
>>>> reduces the number of memory regions that is allocated per SRP
>>>> session.
>>>> "
>>>>
>>>> Thanks
>>>> Laurence
>>
> Hello Max,
>
> I only have CX4 and CX3 in my lab, this test bed only has CX4.
>
> CA 'mlx5_0'
> 	CA type: MT4115
> 	Number of ports: 1
> 	Firmware version: 12.14.2036
> 	Hardware version: 0
> 	Node GUID: 0x7cfe900300726ed2
> 	System image GUID: 0x7cfe900300726ed2
> 	Port 1:
> 		State: Active
> 		Physical state: LinkUp
> 		Rate: 100
> 		Base lid: 3
> 		LMC: 0
> 		SM lid: 3
> 		Capability mask: 0x2651e84a
> 		Port GUID: 0x7cfe900300726ed2
> 		Link layer: InfiniBand
>
> The test is simple, it's the same one I start with every time because it always
> brings out issues with mapping for large I/O sizes and mem registration if such issues exist.
>
> I have a server running LIO with memory backed LUNS.
> These are served via a dual port mlx5 (CX4) over ib_srpt
>
> The client mounts these LUNS via ib_srp (mlx5) and device-mapper-multipath
> and I run a simple dd on the XFS file system.
>
> #!/bin/bash
> while true
> do
> 	dd if=/dev/zero of=/data-$1/bigfile bs=4096k count=900
> 	sync;
> 	rm -rf /data-$1/bigfile
> done
>
> Once this passes I run a suite of other tests read/write, direct and buffered.

Laurence,
these are 4MB transactions. Can you increase cmd_sg_entries to the
maximum and run the test again?


>
> Thanks
> Laurence
>
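
For reference, the arithmetic behind the "4MB transactions" remark can be
sketched as follows. This is only an illustration: it assumes 4 KiB pages
and the worst case of one scatter/gather entry per page, neither of which
is stated in the thread. The sysfs path is read only if the ib_srp module
is loaded and exposes it.

```bash
#!/bin/bash
# Sketch: express one dd block from the test script in pages.
# Assumption: 4 KiB pages; check the real value with `getconf PAGESIZE`.
block_kib=4096        # bs=4096k from the dd command line
count=900             # count=900 from the dd command line
page_kib=4            # assumed page size in KiB

pages_per_block=$(( block_kib / page_kib ))
total_mib=$(( block_kib * count / 1024 ))

echo "pages per 4 MiB block: ${pages_per_block}"
echo "data written per pass: ${total_mib} MiB"

# Show the current ib_srp setting when the module is loaded.
param=/sys/module/ib_srp/parameters/cmd_sg_entries
if [ -r "$param" ]; then
	echo "cmd_sg_entries: $(cat "$param")"
fi
```

Under these assumptions a fully fragmented 4 MiB block would need far more
entries than a small cmd_sg_entries value provides, which is presumably
why the question about raising it was asked.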

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/8] IB/srp bug fixes
       [not found]                     ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14 17:18                       ` Bart Van Assche
       [not found]                         ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 17:18 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote:
> The following 7 of 8 patches were applied to Linus's latest tree.
> 
> However this required first reverting 
> 
> commit ad8e66b4a80182174f73487ed25fd2140cf43361
> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date:   Wed Dec 28 12:48:28 2016 +0200
> 
> See my other email regarding why the above needed to be reverted.
> 
> All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests.
> 
> 4.10.0-rc8.bart+
> 
> The revert of the above meant I did not apply and test patch 6 of the series
> IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
> 
>   IB/srp: Avoid that duplicate responses trigger a kernel bug
>   IB/srp: Fix race conditions related to task management
>   IB/srp: Document locking conventions
>   IB/srp: Make a diagnostic message more informative
>   IB/srp: Improve an error path
>   *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
>   IB/core: Add support for draining IB_POLL_DIRECT completion queues
>   IB/srp: Drain the send queue before destroying a QP
> 
> For the series except patch 6
> 
> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Hello Laurence,

Thank you for the testing. However, reverting commit ad8e66b4a801 without
making any further changes is not acceptable because it would reintroduce
the SG-list mapping problem addressed by that patch. Can you test the
srp-initiator-for-next branch from my github repository against mlx5 (commit
8dca762deab6)? It passes my tests against mlx4. The patches on that branch
are:

Bart Van Assche (8):
      IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS
      IB/srp: Avoid that duplicate responses trigger a kernel bug
      IB/srp: Fix race conditions related to task management
      IB/srp: Document locking conventions
      IB/srp: Make a diagnostic message more informative
      IB/srp: Improve an error path
      IB/core: Add support for draining IB_POLL_DIRECT completion queues
      IB/srp: Drain the send queue before destroying a QP

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/8] IB/srp bug fixes
       [not found]                         ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-14 17:22                           ` Laurence Oberman
  2017-02-14 18:47                           ` Laurence Oberman
  1 sibling, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 17:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Tuesday, February 14, 2017 12:18:11 PM
> Subject: Re:   [PATCH 0/8] IB/srp bug fixes
> 
> On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote:
> > The following 7 of 8 patches were applied to Linus's latest tree.
> > 
> > However this required first reverting
> > 
> > commit ad8e66b4a80182174f73487ed25fd2140cf43361
> > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Date:   Wed Dec 28 12:48:28 2016 +0200
> > 
> > See my other email regarding why the above needed to be reverted.
> > 
> > All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests.
> > 
> > 4.10.0-rc8.bart+
> > 
> > The revert of the above meant I did not apply and test patch 6 of the
> > series
> > IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
> > 
> >   IB/srp: Avoid that duplicate responses trigger a kernel bug
> >   IB/srp: Fix race conditions related to task management
> >   IB/srp: Document locking conventions
> >   IB/srp: Make a diagnostic message more informative
> >   IB/srp: Improve an error path
> >   *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA
> >   feature if supported
> >   IB/core: Add support for draining IB_POLL_DIRECT completion queues
> >   IB/srp: Drain the send queue before destroying a QP
> > 
> > For the series except patch 6
> > 
> > Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> Hello Laurence,
> 
> Thank you for the testing. However, reverting commit ad8e66b4a801 without
> making any further changes is not acceptable because it would reintroduce
> the SG-list mapping problem addressed by that patch. Can you test the
> srp-initiator-for-next branch from my github repository against mlx5 (commit
> 8dca762deab6)? It passes my tests against mlx4. The patches on that branch
> are:
> 
> Bart Van Assche (8):
>       IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS
>       IB/srp: Avoid that duplicate responses trigger a kernel bug
>       IB/srp: Fix race conditions related to task management
>       IB/srp: Document locking conventions
>       IB/srp: Make a diagnostic message more informative
>       IB/srp: Improve an error path
>       IB/core: Add support for draining IB_POLL_DIRECT completion queues
>       IB/srp: Drain the send queue before destroying a QP
> 
> Thanks,
> 
> Bart.

Hello Bart,

Understood, I will pull and test this today.
Thank you for your assistance.

Regards
Laurence

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                             ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-14 17:29                                                                               ` Bart Van Assche
  2017-02-14 17:31                                                                               ` Laurence Oberman
  1 sibling, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 17:29 UTC (permalink / raw)
  To: Max Gurtovoy, Laurence Oberman
  Cc: Leon Romanovsky, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On 02/14/2017 09:15 AM, Max Gurtovoy wrote:
> these are 4MB transactions. Can you increase cmd_sg_entries to the
> maximum and run the test again?

How could that affect the error message Laurence reported? If
cmd_sg_entries is too low then the block layer refuses direct I/O
requests that are too large. From __scsi_init_queue():

	blk_queue_max_segments(q, min_t(unsigned short,
					shost->sg_tablesize,
					SG_MAX_SEGMENTS));

Bart.
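
The clamp quoted above can be illustrated with a small shell sketch. It
only mimics the "take the smaller value" behaviour of min_t() and ignores
the cast to unsigned short that the kernel macro performs. The helper name
clamp_segments and the numbers 12, 255 and 128 are placeholders chosen for
illustration, not values taken from this thread.

```bash
#!/bin/bash
# min_t()-style clamp: the block layer keeps the smaller of the two limits.
clamp_segments() {
	local sg_tablesize=$1 sg_max_segments=$2
	if [ "$sg_tablesize" -lt "$sg_max_segments" ]; then
		echo "$sg_tablesize"
	else
		echo "$sg_max_segments"
	fi
}

# Illustrative values only.
clamp_segments 12 128     # prints 12
clamp_segments 255 128    # prints 128

# On a running system the resulting per-queue limit is visible in sysfs.
for f in /sys/block/*/queue/max_segments; do
	[ -r "$f" ] || continue
	dev=${f#/sys/block/}
	echo "${dev%%/*}: max_segments=$(cat "$f")"
done
```

Comparing max_segments for the SRP-backed devices before and after
changing cmd_sg_entries would be one way to check whether the limit
reached the block layer.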

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: v4.10-rc SRP + mlx5 regression
       [not found]                                                                             ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2017-02-14 17:29                                                                               ` Bart Van Assche
@ 2017-02-14 17:31                                                                               ` Laurence Oberman
  1 sibling, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 17:31 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g,
	israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Tuesday, February 14, 2017 12:15:20 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
> 
> 
> 
> On 2/14/2017 3:31 PM, Laurence Oberman wrote:
> >
> >
> > ----- Original Message -----
> >> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman"
> >> <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> >> israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> >> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >> Sent: Tuesday, February 14, 2017 5:00:04 AM
> >> Subject: Re: v4.10-rc SRP + mlx5 regression
> >>
> >> Hi Laurence,
> >> can you specify the test that repro these failures ?
> >> have you tried running with CX5 HCA or only CX4 ?
> >> I think this commit is right and we have issues in other places.
> >>
> >>
> >> On 2/14/2017 8:39 AM, Leon Romanovsky wrote:
> >>> On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote:
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>>>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >>>>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
> >>>>> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >>>>> Sent: Monday, February 13, 2017 4:52:28 PM
> >>>>> Subject: Re: v4.10-rc SRP + mlx5 regression
> >>>>>
> >>>>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> >>>>>> I will have to run through this again and see where the bisect went
> >>>>>> wrong.
> >>>>>
> >>>>> Hello Laurence,
> >>>>>
> >>>>> If you would be considering to repeat the bisect, did you know that a
> >>>>> bisect
> >>>>> can be sped up by specifying the names of the files and/or directories
> >>>>> that
> >>>>> are suspected? An example:
> >>>>>
> >>>>> git bisect start */infiniband */net
> >>>>>
> >>>>> Bart.
> >>>>>
> >>>>
> >>>> Hello Bart,
> >>>>
> >>>> Much better news this time :), worked late on this but got it figured
> >>>> out.
> >>>>
> >>>> OK, so we got to this one, which makes a lot more sense and is right in
> >>>> the area where we are having issues.
> >>>> I must have answered wrong to one of the steps the first time I did the
> >>>> bisect.
> >>>>
> >>>> Reverted this in the master tree of rc8 and rebuilt the kernel
> >>>> Now all tests pass on Linus's tree - 4.10.0_rc8+
> >>>>
> >>>> The interesting point here is that this commit is in rc5 but rc5 was not
> >>>> failing so we have an interoperability issue with this commit
> >>>>
> >>>>
> >>>> [loberman@ibclient linux]$ git bisect good
> >>>> Bisecting: 0 revisions left to test after this (roughly 1 step)
> >>>> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation
> >>>> when
> >>>> the device supports sg gaps
> >>>>
> >>>> [loberman@ibclient linux]$ git show
> >>>> ad8e66b4a80182174f73487ed25fd2140cf43361
> >>>> commit ad8e66b4a80182174f73487ed25fd2140cf43361
> >>>> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>> Date:   Wed Dec 28 12:48:28 2016 +0200
> >>>>
> >>>>     IB/srp: fix mr allocation when the device supports sg gaps
> >>>>
> >>>>     If the device support arbitrary sg list mapping (device cap
> >>>>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
> >>>>     IB_MR_TYPE_SG_GAPS.
> >>>>
> >>>>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
> >>>>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
> >>>>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >>>>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>>>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>>
> >>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> b/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> index 8ddc071..0f67cf9 100644
> >>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >>>> ib_device *device,
> >>>>         struct srp_fr_desc *d;
> >>>>         struct ib_mr *mr;
> >>>>         int i, ret = -EINVAL;
> >>>> +       enum ib_mr_type mr_type;
> >>>>
> >>>>         if (pool_size <= 0)
> >>>>                 goto err;
> >>>> @@ -384,9 +385,13 @@ static struct srp_fr_pool
> >>>> *srp_create_fr_pool(struct
> >>>> ib_device *device,
> >>>>         spin_lock_init(&pool->lock);
> >>>>         INIT_LIST_HEAD(&pool->free_list);
> >>>>
> >>>> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >>>> +               mr_type = IB_MR_TYPE_SG_GAPS;
> >>>> +       else
> >>>> +               mr_type = IB_MR_TYPE_MEM_REG;
> >>>> +
> >>>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >>>> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >>>> -                                max_page_list_len);
> >>>> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >>>
> >>> First, ib_alloc_mr receives u32 as a third parameter, but int was
> >>> supplied. Second (I can be wrong here), shouldn't max_page_list_len be
> >>> replaced with max_fast_reg_page_list_len?
> >>>
> >>> Thanks
> >>
> >> there is a statement that:
> >>
> >> 	if (srp_dev->use_fast_reg) {
> >>                  srp_dev->max_pages_per_mr =
> >>                          min_t(u32, srp_dev->max_pages_per_mr,
> >>                                attr->max_fast_reg_page_list_len);
> >>          }
> >>
> >> so we take the max_fast_reg_page_list_len in this case.
> >>
> >>>
> >>>>                 if (IS_ERR(mr)) {
> >>>>                         ret = PTR_ERR(mr);
> >>>>                         if (ret == -ENOMEM)
> >>>> (END)
> >>>>
> >>>>
> >>>> So here is the revert patch, but you need to decide how you want to deal
> >>>> with this.
> >>>>
> >>>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >>>>     Laurence Oberman
> >>>>     Traced after bisection to a cause for this failure
> >>>>
> >>>> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>>
> >>>> commit 90d169d312a173d5350c1bb36d6daab04c592127
> >>>> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>> Date:   Mon Feb 13 20:33:32 2017 -0500
> >>>>
> >>>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >>>>     Laurence Oberman
> >>>>     Traced after bisection to a cause for this failure
> >>>>
> >>>>     [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
> >>>>     [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5)
> >>>>     for CQE ffff8817f0edbfb0
> >>>>     [  130.510899] 00000000 00000000 00000000 00000000
> >>>>     [  130.536455] 00000000 00000000 00000000 00000000
> >>>>     [  130.561878] 00000000 00000000 00000000 00000000
> >>>>     [  130.585904] 00000000 0f007806 2500002a db0ec4d0
> >>>>     [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>>>     [  146.530439] scsi host1: ib_srp: reconnect succeeded
> >>>>     [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
> >>>>     [  146.597635] 00000000 00000000 00000000 00000000
> >>>>     [  146.623545] 00000000 00000000 00000000 00000000
> >>>>     [  146.649599] 00000000 00000000 00000000 00000000
> >>>>     [  146.673938] 00000000 0f007806 25000032 000c46d0
> >>>>     [  146.697969] scsi host1: ib_srp: failed FAST REG status memory
> >>>>     management operation error (6) for CQE ffff88
> >>>>     [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>>>     [  162.256337] scsi host1: ib_srp: reconnect succeeded
> >>>>     [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5)
> >>>>     for CQE ffff8817f0412ef0
> >>>>
> >>>>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
> >>>>
> >>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> b/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> index 79bf484..01338c8 100644
> >>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >>>> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct
> >>>> ib_device *device,
> >>>>         struct srp_fr_desc *d;
> >>>>         struct ib_mr *mr;
> >>>>         int i, ret = -EINVAL;
> >>>> -       enum ib_mr_type mr_type;
> >>>>
> >>>>         if (pool_size <= 0)
> >>>>                 goto err;
> >>>> @@ -385,13 +384,9 @@ static struct srp_fr_pool
> >>>> *srp_create_fr_pool(struct
> >>>> ib_device *device,
> >>>>         spin_lock_init(&pool->lock);
> >>>>         INIT_LIST_HEAD(&pool->free_list);
> >>>>
> >>>> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >>>> -               mr_type = IB_MR_TYPE_SG_GAPS;
> >>>> -       else
> >>>> -               mr_type = IB_MR_TYPE_MEM_REG;
> >>>> -
> >>>>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >>>> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >>>> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >>>> +                                max_page_list_len);
> >>>>                 if (IS_ERR(mr)) {
> >>>>                         ret = PTR_ERR(mr);
> >>>>                         if (ret == -ENOMEM)
> >>>>
> >>>>
> >>>>
> >>>> Now moving on to what got me here in the first place.
> >>>> Bart, let me know if the 7 of the 8 patches in your most recent series
> >>>> are
> >>>> all still valid after this revert
> >>>> Otherwise let me know which ones you want me to apply.
> >>>>
> >>>> patch 6 - I am thinking is no longer valid.
> >>>> "
> >>>> If a HCA supports the SG_GAPS_REG feature then a single memory
> >>>> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
> >>>> reduces the number of memory regions that is allocated per SRP
> >>>> session.
> >>>> "
> >>>>
> >>>> Thanks
> >>>> Laurence
> >>
> > Hello Max,
> >
> > I only have CX4 and CX3 in my lab, this test bed only has CX4.
> >
> > CA 'mlx5_0'
> > 	CA type: MT4115
> > 	Number of ports: 1
> > 	Firmware version: 12.14.2036
> > 	Hardware version: 0
> > 	Node GUID: 0x7cfe900300726ed2
> > 	System image GUID: 0x7cfe900300726ed2
> > 	Port 1:
> > 		State: Active
> > 		Physical state: LinkUp
> > 		Rate: 100
> > 		Base lid: 3
> > 		LMC: 0
> > 		SM lid: 3
> > 		Capability mask: 0x2651e84a
> > 		Port GUID: 0x7cfe900300726ed2
> > 		Link layer: InfiniBand
> >
> > The test is simple, it's the same one I start with every time because it
> > always
> > brings out issues with mapping for large I/O sizes and mem registration if
> > such issues exist.
> >
> > I have a server running LIO with memory backed LUNS.
> > These are served via a dual port mlx5 (CX4) over ib_srpt
> >
> > The client mounts these LUNS via ib_srp (mlx5) and device-mapper-multipath
> > and I run a simple dd on the XFS file system.
> >
> > #!/bin/bash
> > while true
> > do
> > 	dd if=/dev/zero of=/data-$1/bigfile bs=4096k count=900
> > 	sync;
> > 	rm -rf /data-$1/bigfile
> > done
> >
> > Once this passes I run a suite of other tests read/write, direct and
> > buffered.
> 
> Laurence,
> this is 4MB transactions. can you increase the cmd_sg_entries to the
> maximum and run the test again ?
> 
> 
> >
> > Thanks
> > Laurence
> >
> 

Hello Max,

Yes, 4MB is very important for one of our biggest RHEL customers, and I worked
many hours with Bart last year to stabilize large 4MB buffered and direct I/O
for ib_srp/ib_srpt.

I am already running with:
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048

Regards and thanks for your assistance

Laurence

* Re: [PATCH 0/8] IB/srp bug fixes
       [not found]                         ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2017-02-14 17:22                           ` Laurence Oberman
@ 2017-02-14 18:47                           ` Laurence Oberman
       [not found]                             ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 18:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g,
	maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	dledford-H+wXaHxf7aLQT0dZR+AlfA



----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Tuesday, February 14, 2017 12:18:11 PM
> Subject: Re:   [PATCH 0/8] IB/srp bug fixes
> 
> On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote:
> > The following 7 of 8 patches were applied to Linus's latest tree.
> > 
> > However this required first reverting
> > 
> > commit ad8e66b4a80182174f73487ed25fd2140cf43361
> > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Date:   Wed Dec 28 12:48:28 2016 +0200
> > 
> > See my other email regarding why the above needed to be reverted.
> > 
> > All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests.
> > 
> > 4.10.0-rc8.bart+
> > 
> > The revert of the above meant I did not apply and test patch 6 of the
> > series
> > IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
> > 
> >   IB/srp: Avoid that duplicate responses trigger a kernel bug
> >   IB/srp: Fix race conditions related to task management
> >   IB/srp: Document locking conventions
> >   IB/srp: Make a diagnostic message more informative
> >   IB/srp: Improve an error path
> >   *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA
> >   feature if supported
> >   IB/core: Add support for draining IB_POLL_DIRECT completion queues
> >   IB/srp: Drain the send queue before destroying a QP
> > 
> > For the series except patch 6
> > 
> > Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> Hello Laurence,
> 
> Thank you for the testing. However, reverting commit ad8e66b4a801 without
> making any further changes is not acceptable because it would reintroduce
> the SG-list mapping problem addressed by that patch. Can you test the
> srp-initiator-for-next branch from my github repository against mlx5 (commit
> 8dca762deab6)? It passes my tests against mlx4. The patches on that branch
> are:
> 
> Bart Van Assche (8):
>       IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS
>       IB/srp: Avoid that duplicate responses trigger a kernel bug
>       IB/srp: Fix race conditions related to task management
>       IB/srp: Document locking conventions
>       IB/srp: Make a diagnostic message more informative
>       IB/srp: Improve an error path
>       IB/core: Add support for draining IB_POLL_DIRECT completion queues
>       IB/srp: Drain the send queue before destroying a QP
> 
> Thanks,
> 
> Bart.
> 

Hello Bart

4.10.0-rc8.bart_latest+

Built from branch srp-initiator-for-next after pull of your repository.

I focused on the large I/O testing, but all tests are passing:
small/large I/O, direct and buffered I/O, file-system and direct to mpath devices.

This is a snapshot of 4 simultaneous 4MB I/O read tasks and 1 buffered write
task (that will sporadically exceed 4MB).

### RECORD    7 >>> ibclient <<< (1487097890.001) (Tue Feb 14 13:44:50 2017) ###
# DISK STATISTICS (/sec)
#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
13:44:50 dm-11       192512    141   47 4096    20       0      0    0    0     0    4096     1    20     21   99
13:44:50 dm-17       184320    135   45 4096    20       0      0    0    0     0    4096     1    20     22   99
13:44:50 dm-21       163840    120   40 4096    21  1236928   1984  153 8084   319    7257    91   257      5   99
13:44:50 dm-24       786432    576  192 4096     5       0      0    0    0     0    4096     1     5      5   99
13:44:50 dm-30       790528    579  193 4096     5       0      0    0    0     0    4096     1     5      5   99
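
The Size and RWSize columns above can be cross-checked with a few lines of
C. This is only a sketch: it assumes that Size is KBytes divided by IOs and
that RWSize is total KBytes divided by total IOs, both truncated to an
integer. That is an assumption about the tool's output, not something stated
in this thread, and the demo_* names are hypothetical.

```c
/* Average request size in KB for one direction; 0 when there was no I/O. */
static unsigned long demo_avg_kb(unsigned long kbytes, unsigned long ios)
{
	return ios ? kbytes / ios : 0;
}

/* Combined average request size in KB over reads and writes. */
static unsigned long demo_rw_size_kb(unsigned long rd_kbytes,
				     unsigned long rd_ios,
				     unsigned long wr_kbytes,
				     unsigned long wr_ios)
{
	return demo_avg_kb(rd_kbytes + wr_kbytes, rd_ios + wr_ios);
}
```

With the dm-21 row, demo_avg_kb(163840, 40) gives 4096 for reads,
demo_avg_kb(1236928, 153) gives 8084 for writes, and
demo_rw_size_kb(163840, 40, 1236928, 153) gives 7257, which matches the
reported columns under the assumption above.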

It looks good Bart

For branch srp-initiator-for-next, all tests are passing.
Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Thanks
Laurence

* Re: [PATCH 0/8] IB/srp bug fixes
       [not found]                             ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14 18:49                               ` Bart Van Assche
  0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 18:49 UTC (permalink / raw)
  To: loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Tue, 2017-02-14 at 13:47 -0500, Laurence Oberman wrote:
> For branch srp-initiator-for-next, all tests are passing.
> Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Thank you! I will post these patches as a v2 of this series.

Bart.

end of thread, other threads:[~2017-02-14 18:49 UTC | newest]

Thread overview: 47+ messages
2017-02-10 23:56 [PATCH 0/8] IB/srp bug fixes Bart Van Assche
2017-02-10 23:56 ` [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug Bart Van Assche
2017-02-12 17:05   ` Leon Romanovsky
2017-02-12 20:07     ` Bart Van Assche
     [not found]       ` <1486930017.2918.3.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13  5:54         ` Leon Romanovsky
     [not found]           ` <20170213055432.GM14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-13 16:02             ` Bart Van Assche
2017-02-10 23:56 ` [PATCH 2/8] IB/srp: Fix race conditions related to task management Bart Van Assche
     [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-10 23:56   ` [PATCH 3/8] IB/srp: Document locking conventions Bart Van Assche
2017-02-10 23:56   ` [PATCH 4/8] IB/srp: Make a diagnostic message more informative Bart Van Assche
2017-02-10 23:56   ` [PATCH 5/8] IB/srp: Improve an error path Bart Van Assche
2017-02-10 23:56   ` [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported Bart Van Assche
2017-02-10 23:56   ` [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues Bart Van Assche
2017-02-10 23:56   ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
     [not found]     ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-11  0:07       ` Robert LeBlanc
     [not found]         ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-11  0:13           ` Bart Van Assche
2017-02-12 17:19       ` Leon Romanovsky
     [not found]         ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-12 18:02           ` Laurence Oberman
     [not found]             ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-12 18:06               ` Laurence Oberman
     [not found]                 ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14  3:02                   ` [PATCH 0/8] IB/srp bug fixes Laurence Oberman
     [not found]                     ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 17:18                       ` Bart Van Assche
     [not found]                         ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-14 17:22                           ` Laurence Oberman
2017-02-14 18:47                           ` Laurence Oberman
     [not found]                             ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 18:49                               ` Bart Van Assche
2017-02-12 20:05               ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
     [not found]                 ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13  2:07                   ` Laurence Oberman
     [not found]                     ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13  3:14                       ` Laurence Oberman
     [not found]                         ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 13:54                           ` Laurence Oberman
     [not found]                             ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 14:17                               ` Leon Romanovsky
     [not found]                                 ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-13 14:24                                   ` Laurence Oberman
     [not found]                                     ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 16:12                                       ` Laurence Oberman
     [not found]                                         ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 16:47                                           ` Laurence Oberman
     [not found]                                             ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:34                                               ` Laurence Oberman
     [not found]                                                 ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:46                                                   ` Laurence Oberman
     [not found]                                                     ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:52                                                       ` v4.10-rc SRP + mlx5 regression Bart Van Assche
     [not found]                                                         ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13 21:56                                                           ` Laurence Oberman
2017-02-14  2:19                                                           ` Laurence Oberman
     [not found]                                                             ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14  6:39                                                               ` Leon Romanovsky
     [not found]                                                                 ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-14 10:00                                                                   ` Max Gurtovoy
     [not found]                                                                     ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 13:31                                                                       ` Laurence Oberman
     [not found]                                                                         ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 16:21                                                                           ` Laurence Oberman
2017-02-14 17:15                                                                           ` Max Gurtovoy
     [not found]                                                                             ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 17:29                                                                               ` Bart Van Assche
2017-02-14 17:31                                                                               ` Laurence Oberman
2017-02-14 17:15                                                                           ` Max Gurtovoy
2017-02-14 16:53                                                                       ` Bart Van Assche
2017-02-12 20:11           ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
     [not found]             ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13  6:07               ` Leon Romanovsky
