* [PATCH 0/8] IB/srp bug fixes
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford; +Cc: linux-rdma, Bart Van Assche
Hello Doug,
The patches in this series are the initiator patches I came up with while
testing the SRP initiator and target drivers. Please consider these patches
for inclusion in the upstream kernel.
Sorry for sending these patches so close to the merge window. If that means
it is too late to include them in the first kernel v4.11 pull request,
that's fine with me.
Bart Van Assche (8):
  IB/srp: Avoid that duplicate responses trigger a kernel bug
  IB/srp: Fix race conditions related to task management
  IB/srp: Document locking conventions
  IB/srp: Make a diagnostic message more informative
  IB/srp: Improve an error path
  IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
  IB/core: Add support for draining IB_POLL_DIRECT completion queues
  IB/srp: Drain the send queue before destroying a QP

 drivers/infiniband/core/verbs.c     |  35 +++++-----
 drivers/infiniband/ulp/srp/ib_srp.c | 129 ++++++++++++++++++++++++------------
 drivers/infiniband/ulp/srp/ib_srp.h |   1 +
 3 files changed, 103 insertions(+), 62 deletions(-)
--
2.11.0
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

* [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman, Steve Feeley, stable

After srp_process_rsp() returns there is a short time during which
the scsi_host_find_tag() call will return a pointer to the SCSI
command that is being completed. If during that time a duplicate
response is received, avoid that the following call stack appears:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: srp_recv_done+0x450/0x6b0 [ib_srp]
Oops: 0000 [#1] SMP
CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.10.0-rc7-dbg+ #1
Call Trace:
 <IRQ>
 __ib_process_cq+0x4b/0xd0 [ib_core]
 ib_poll_handler+0x1d/0x70 [ib_core]
 irq_poll_softirq+0xba/0x120
 __do_softirq+0xba/0x4c0
 irq_exit+0xbe/0xd0
 smp_apic_timer_interrupt+0x38/0x50
 apic_timer_interrupt+0x90/0xa0
 </IRQ>
 cpuidle_enter_state+0xf2/0x370
 cpuidle_enter+0x12/0x20
 call_cpuidle+0x1e/0x40
 do_idle+0xe3/0x1c0
 cpu_startup_entry+0x18/0x20
 start_secondary+0x103/0x130
 start_cpu+0x14/0x14
RIP: srp_recv_done+0x450/0x6b0 [ib_srp] RSP: ffff88046f483e20

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Steve Feeley <Steve.Feeley@sandisk.com>
Cc: <stable@vger.kernel.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf48477ddb..4068d34f5427 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
 		if (scmnd) {
 			req = (void *)scmnd->host_scribble;
-			scmnd = srp_claim_req(ch, req, NULL, scmnd);
+			if (req) {
+				scmnd = srp_claim_req(ch, req, NULL, scmnd);
+			} else {
+				shost_printk(KERN_ERR, target->scsi_host,
+					     "NULL host_scribble for response with tag %#llx\n",
+					     rsp->tag);
+				scmnd = NULL;
+			}
 		}
 		if (!scmnd) {
 			shost_printk(KERN_ERR, target->scsi_host,
--
2.11.0
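
The guard added by this patch can be illustrated outside the kernel. What follows is only a user-space sketch, not the driver code: the types model_cmd and model_req and the functions model_claim() and model_process_rsp() are hypothetical stand-ins, and the sketch assumes that completing a command clears its host_scribble pointer, which is what makes a duplicate response observable as a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct srp_request and struct scsi_cmnd. */
struct model_req { int claimed; };
struct model_cmd { struct model_req *host_scribble; };

/* Models the assumed contract of srp_claim_req(): req must not be NULL. */
static struct model_cmd *model_claim(struct model_req *req,
				     struct model_cmd *cmd)
{
	assert(req != NULL);
	if (req->claimed)
		return NULL;
	req->claimed = 1;
	return cmd;
}

/*
 * Returns the command to complete, or NULL if the response must be
 * dropped (unknown tag or duplicate response).
 */
static struct model_cmd *model_process_rsp(struct model_cmd *cmd)
{
	struct model_req *req;

	if (!cmd)
		return NULL;
	req = cmd->host_scribble;
	if (!req)
		return NULL;	/* duplicate response: nothing left to claim */
	cmd = model_claim(req, cmd);
	if (cmd)
		cmd->host_scribble = NULL;	/* assumed effect of completion */
	return cmd;
}
```

Calling model_process_rsp() twice on the same command returns the command the first time and NULL the second time, instead of passing a NULL request to the claim function.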

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
From: Leon Romanovsky @ 2017-02-12 17:05 UTC
To: Bart Van Assche
Cc: Doug Ledford, linux-rdma, Israel Rukshin, Max Gurtovoy, Laurence Oberman, Steve Feeley, stable

On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> After srp_process_rsp() returns there is a short time during which
> the scsi_host_find_tag() call will return a pointer to the SCSI
> command that is being completed. If during that time a duplicate
> response is received, avoid that the following call stack appears:
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: srp_recv_done+0x450/0x6b0 [ib_srp]
> Oops: 0000 [#1] SMP
> CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.10.0-rc7-dbg+ #1
> Call Trace:
>  <IRQ>
>  __ib_process_cq+0x4b/0xd0 [ib_core]
>  ib_poll_handler+0x1d/0x70 [ib_core]
>  irq_poll_softirq+0xba/0x120
>  __do_softirq+0xba/0x4c0
>  irq_exit+0xbe/0xd0
>  smp_apic_timer_interrupt+0x38/0x50
>  apic_timer_interrupt+0x90/0xa0
>  </IRQ>
>  cpuidle_enter_state+0xf2/0x370
>  cpuidle_enter+0x12/0x20
>  call_cpuidle+0x1e/0x40
>  do_idle+0xe3/0x1c0
>  cpu_startup_entry+0x18/0x20
>  start_secondary+0x103/0x130
>  start_cpu+0x14/0x14
> RIP: srp_recv_done+0x450/0x6b0 [ib_srp] RSP: ffff88046f483e20
>
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Israel Rukshin <israelr@mellanox.com>
> Cc: Max Gurtovoy <maxg@mellanox.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Steve Feeley <Steve.Feeley@sandisk.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 79bf48477ddb..4068d34f5427 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
>  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
>  		if (scmnd) {
>  			req = (void *)scmnd->host_scribble;
> -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> +			if (req) {
> +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> +			} else {
> +				shost_printk(KERN_ERR, target->scsi_host,
> +					     "NULL host_scribble for response with tag %#llx\n",
> +					     rsp->tag);
> +				scmnd = NULL;
> +			}
>  		}
>  		if (!scmnd) {
>  			shost_printk(KERN_ERR, target->scsi_host,

You have the chance to print the message below together with your new
print, because scmnd will be NULL.

What about doing the following check "if (scmnd && scmnd->host_scribble)"
instead of your proposed patch?

Thanks

> --
> 2.11.0

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
From: Bart Van Assche @ 2017-02-12 20:07 UTC
To: leon@kernel.org
Cc: maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, Steve Feeley, dledford@redhat.com, loberman@redhat.com, stable@vger.kernel.org

On Sun, 2017-02-12 at 19:05 +0200, Leon Romanovsky wrote:
> On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > index 79bf48477ddb..4068d34f5427 100644
> > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
> >  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> >  		if (scmnd) {
> >  			req = (void *)scmnd->host_scribble;
> > -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > +			if (req) {
> > +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > +			} else {
> > +				shost_printk(KERN_ERR, target->scsi_host,
> > +					     "NULL host_scribble for response with tag %#llx\n",
> > +					     rsp->tag);
> > +				scmnd = NULL;
> > +			}
> >  		}
> >  		if (!scmnd) {
> >  			shost_printk(KERN_ERR, target->scsi_host,
>
> You have the chance to print the message below together with your new
> print, because scmnd will be NULL.
>
> What about doing the following check "if (scmnd && scmnd->host_scribble)"
> instead of your proposed patch?

That approach would still trigger a kernel oops if a duplicate response is
received because the second argument of srp_claim_req() must not be NULL.

Bart.

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
From: Leon Romanovsky @ 2017-02-13 5:54 UTC
To: Bart Van Assche
Cc: maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, Steve Feeley, dledford@redhat.com, loberman@redhat.com, stable@vger.kernel.org

On Sun, Feb 12, 2017 at 08:07:13PM +0000, Bart Van Assche wrote:
> On Sun, 2017-02-12 at 19:05 +0200, Leon Romanovsky wrote:
> > On Fri, Feb 10, 2017 at 03:56:04PM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > > index 79bf48477ddb..4068d34f5427 100644
> > > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > > @@ -1899,7 +1899,14 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
> > >  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> > >  		if (scmnd) {
> > >  			req = (void *)scmnd->host_scribble;
> > > -			scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > > +			if (req) {
> > > +				scmnd = srp_claim_req(ch, req, NULL, scmnd);
> > > +			} else {
> > > +				shost_printk(KERN_ERR, target->scsi_host,
> > > +					     "NULL host_scribble for response with tag %#llx\n",
> > > +					     rsp->tag);
> > > +				scmnd = NULL;
> > > +			}
> > >  		}
> > >  		if (!scmnd) {
> > >  			shost_printk(KERN_ERR, target->scsi_host,
> >
> > You have the chance to print the message below together with your new
> > print, because scmnd will be NULL.
> >
> > What about doing the following check "if (scmnd && scmnd->host_scribble)"
> > instead of your proposed patch?
>
> That approach would still trigger a kernel oops if a duplicate response is
> received because the second argument of srp_claim_req() must not be NULL.

I'm sure that I'm missing something, but how would it be triggered?
We will call srp_claim_req() only if "req" is not NULL.

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf48477ddb..40e7f27c40bf 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1897,10 +1897,12 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 		complete(&ch->tsk_mgmt_done);
 	} else {
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
-		if (scmnd) {
+		if (scmnd && scmnd->host_scribble) {
 			req = (void *)scmnd->host_scribble;
 			scmnd = srp_claim_req(ch, req, NULL, scmnd);
 		}
+		else
+			scmnd = NULL;
 		if (!scmnd) {
 			shost_printk(KERN_ERR, target->scsi_host,
 				     "Null scmnd for RSP w/tag %#016llx received on ch %td / QP %#x\n",

>
> Bart.

* Re: [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug
From: Bart Van Assche @ 2017-02-13 16:02 UTC
To: leon@kernel.org
Cc: maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, Steve Feeley, dledford@redhat.com, loberman@redhat.com, stable@vger.kernel.org

On Mon, 2017-02-13 at 07:54 +0200, Leon Romanovsky wrote:
> I'm sure that I'm missing something, but how would it be triggered?
> We will call srp_claim_req() only if "req" is not NULL.
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 79bf48477ddb..40e7f27c40bf 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -1897,10 +1897,12 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
>  		complete(&ch->tsk_mgmt_done);
>  	} else {
>  		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
> -		if (scmnd) {
> +		if (scmnd && scmnd->host_scribble) {
>  			req = (void *)scmnd->host_scribble;
>  			scmnd = srp_claim_req(ch, req, NULL, scmnd);
>  		}
> +		else
> +			scmnd = NULL;
>  		if (!scmnd) {
>  			shost_printk(KERN_ERR, target->scsi_host,
>  				     "Null scmnd for RSP w/tag %#016llx received on ch %td / QP %#x\n",

Hello Leon,

Sorry but I had misread your previous e-mail. I agree that the above should
work fine.

Bart.

* [PATCH 2/8] IB/srp: Fix race conditions related to task management
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman, Steve Feeley, stable

Avoid that srp_process_rsp() overwrites the status information
in ch if the SRP target response timed out and processing of
another task management function has already started. Avoid that
issuing multiple task management functions concurrently triggers
list corruption. This patch prevents that the following stack
trace appears in the system log:

WARNING: CPU: 8 PID: 9269 at lib/list_debug.c:52 __list_del_entry_valid+0xbc/0xc0
list_del corruption. prev->next should be ffffc90004bb7b00, but was ffff8804052ecc68
CPU: 8 PID: 9269 Comm: sg_reset Tainted: G W 4.10.0-rc7-dbg+ #3
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xc6/0xe0
 warn_slowpath_fmt+0x4a/0x50
 __list_del_entry_valid+0xbc/0xc0
 wait_for_completion_timeout+0x12e/0x170
 srp_send_tsk_mgmt+0x1ef/0x2d0 [ib_srp]
 srp_reset_device+0x5b/0x110 [ib_srp]
 scsi_ioctl_reset+0x1c7/0x290
 scsi_ioctl+0x12a/0x420
 sd_ioctl+0x9d/0x100
 blkdev_ioctl+0x51e/0x9f0
 block_ioctl+0x38/0x40
 do_vfs_ioctl+0x8f/0x700
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x18/0xad

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Steve Feeley <Steve.Feeley@sandisk.com>
Cc: <stable@vger.kernel.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 45 ++++++++++++++++++++++++-------------
 drivers/infiniband/ulp/srp/ib_srp.h |  1 +
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 4068d34f5427..511eb4b2e6e0 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1889,12 +1889,17 @@ static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp)
 	if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) {
 		spin_lock_irqsave(&ch->lock, flags);
 		ch->req_lim += be32_to_cpu(rsp->req_lim_delta);
+		if (rsp->tag == ch->tsk_mgmt_tag) {
+			ch->tsk_mgmt_status = -1;
+			if (be32_to_cpu(rsp->resp_data_len) >= 4)
+				ch->tsk_mgmt_status = rsp->data[3];
+			complete(&ch->tsk_mgmt_done);
+		} else {
+			shost_printk(KERN_ERR, target->scsi_host,
+				     "Received tsk mgmt response too late for tag %#llx\n",
+				     rsp->tag);
+		}
 		spin_unlock_irqrestore(&ch->lock, flags);
-
-		ch->tsk_mgmt_status = -1;
-		if (be32_to_cpu(rsp->resp_data_len) >= 4)
-			ch->tsk_mgmt_status = rsp->data[3];
-		complete(&ch->tsk_mgmt_done);
 	} else {
 		scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag);
 		if (scmnd) {
@@ -2538,19 +2543,18 @@ srp_change_queue_depth(struct scsi_device *sdev, int qdepth)
 }
 
 static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
-			     u8 func)
+			     u8 func, u8 *status)
 {
 	struct srp_target_port *target = ch->target;
 	struct srp_rport *rport = target->rport;
 	struct ib_device *dev = target->srp_host->srp_dev->dev;
 	struct srp_iu *iu;
 	struct srp_tsk_mgmt *tsk_mgmt;
+	int res;
 
 	if (!ch->connected || target->qp_in_error)
 		return -1;
 
-	init_completion(&ch->tsk_mgmt_done);
-
 	/*
 	 * Lock the rport mutex to avoid that srp_create_ch_ib() is
 	 * invoked while a task management function is being sent.
@@ -2573,10 +2577,16 @@ static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
 
 	tsk_mgmt->opcode = SRP_TSK_MGMT;
 	int_to_scsilun(lun, &tsk_mgmt->lun);
-	tsk_mgmt->tag = req_tag | SRP_TAG_TSK_MGMT;
 	tsk_mgmt->tsk_mgmt_func = func;
 	tsk_mgmt->task_tag = req_tag;
 
+	spin_lock_irq(&ch->lock);
+	ch->tsk_mgmt_tag = (ch->tsk_mgmt_tag + 1) | SRP_TAG_TSK_MGMT;
+	tsk_mgmt->tag = ch->tsk_mgmt_tag;
+	spin_unlock_irq(&ch->lock);
+
+	init_completion(&ch->tsk_mgmt_done);
+
 	ib_dma_sync_single_for_device(dev, iu->dma, sizeof *tsk_mgmt,
 				      DMA_TO_DEVICE);
 	if (srp_post_send(ch, iu, sizeof(*tsk_mgmt))) {
@@ -2585,13 +2595,15 @@ static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun,
 
 		return -1;
 	}
+	res = wait_for_completion_timeout(&ch->tsk_mgmt_done,
+					msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS));
+	if (res > 0 && status)
+		*status = ch->tsk_mgmt_status;
 	mutex_unlock(&rport->mutex);
 
-	if (!wait_for_completion_timeout(&ch->tsk_mgmt_done,
-					 msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)))
-		return -1;
+	WARN_ON_ONCE(res < 0);
 
-	return 0;
+	return res > 0 ? 0 : -1;
 }
 
 static int srp_abort(struct scsi_cmnd *scmnd)
@@ -2617,7 +2629,7 @@ static int srp_abort(struct scsi_cmnd *scmnd)
 	shost_printk(KERN_ERR, target->scsi_host,
 		     "Sending SRP abort for tag %#x\n", tag);
 	if (srp_send_tsk_mgmt(ch, tag, scmnd->device->lun,
-			      SRP_TSK_ABORT_TASK) == 0)
+			      SRP_TSK_ABORT_TASK, NULL) == 0)
 		ret = SUCCESS;
 	else if (target->rport->state == SRP_RPORT_LOST)
 		ret = FAST_IO_FAIL;
@@ -2635,14 +2647,15 @@ static int srp_reset_device(struct scsi_cmnd *scmnd)
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
 	struct srp_rdma_ch *ch;
 	int i;
+	u8 status;
 
 	shost_printk(KERN_ERR, target->scsi_host, "SRP reset_device called\n");
 
 	ch = &target->ch[0];
 	if (srp_send_tsk_mgmt(ch, SRP_TAG_NO_REQ, scmnd->device->lun,
-			      SRP_TSK_LUN_RESET))
+			      SRP_TSK_LUN_RESET, &status))
 		return FAILED;
-	if (ch->tsk_mgmt_status)
+	if (status)
 		return FAILED;
 
 	for (i = 0; i < target->ch_count; i++) {
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 21c69695f9d4..32ed40db3ca2 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -163,6 +163,7 @@ struct srp_rdma_ch {
 	int max_ti_iu_len;
 	int comp_vector;
 
+	u64 tsk_mgmt_tag;
 	struct completion tsk_mgmt_done;
 	u8 tsk_mgmt_status;
 	bool connected;
--
2.11.0
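
The tag handling in this patch can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the driver code: struct model_ch, model_next_tag() and model_handle_rsp() are hypothetical names, the flag value MODEL_TAG_TSK_MGMT (bit 31) is an assumption made for the example, and the locking and the completion object are reduced to a plain "done" flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bit marking task management tags (illustrative value). */
#define MODEL_TAG_TSK_MGMT (UINT64_C(1) << 31)

/* Hypothetical stand-in for the per-channel task management state. */
struct model_ch {
	uint64_t tsk_mgmt_tag;		/* tag of the request being waited for */
	int	 tsk_mgmt_status;
	int	 done;			/* stands in for the completion */
};

/* Generate a fresh tag for each request, as the patch does. */
static uint64_t model_next_tag(struct model_ch *ch)
{
	assert(ch != NULL);
	ch->tsk_mgmt_tag = (ch->tsk_mgmt_tag + 1) | MODEL_TAG_TSK_MGMT;
	ch->done = 0;
	return ch->tsk_mgmt_tag;
}

/*
 * Accept a response only if its tag matches the outstanding request.
 * Returns 1 if the response was accepted and 0 if it arrived too late.
 */
static int model_handle_rsp(struct model_ch *ch, uint64_t tag, int status)
{
	assert(ch != NULL);
	if (tag != ch->tsk_mgmt_tag)
		return 0;	/* late response: leave status untouched */
	ch->tsk_mgmt_status = status;
	ch->done = 1;
	return 1;
}
```

With two requests issued back to back, a response carrying the first tag is ignored once the second request is outstanding, so it can neither overwrite the status nor complete the wrong waiter.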

* [PATCH 3/8] IB/srp: Document locking conventions
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman

Use lockdep_assert_held() statements to verify at run-time whether
the proper locks are held.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 511eb4b2e6e0..a43db9d6b399 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -40,6 +40,7 @@
 #include <linux/parser.h>
 #include <linux/random.h>
 #include <linux/jiffies.h>
+#include <linux/lockdep.h>
 #include <rdma/ib_cache.h>
 
 #include <linux/atomic.h>
@@ -1804,6 +1805,8 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch,
 	s32 rsv = (iu_type == SRP_IU_TSK_MGMT) ? 0 : SRP_TSK_MGMT_SQ_SIZE;
 	struct srp_iu *iu;
 
+	lockdep_assert_held(&ch->lock);
+
 	ib_process_cq_direct(ch->send_cq, -1);
 
 	if (list_empty(&ch->free_tx))
@@ -1834,6 +1837,8 @@ static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	lockdep_assert_held(&ch->lock);
+
 	list_add(&iu->list, &ch->free_tx);
 }
 
--
2.11.0

* [PATCH 4/8] IB/srp: Make a diagnostic message more informative
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman

Report the destination port GID if connecting fails.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index a43db9d6b399..d21611a4e90f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3446,9 +3446,10 @@ static ssize_t srp_create_target(struct device *dev,
 			ret = srp_connect_ch(ch, multich);
 			if (ret) {
 				shost_printk(KERN_ERR, target->scsi_host,
-					     PFX "Connection %d/%d failed\n",
+					     PFX "Connection %d/%d to %pI6 failed\n",
 					     ch_start + cpu_idx,
-					     target->ch_count);
+					     target->ch_count,
+					     ch->target->orig_dgid.raw);
 				if (node_idx == 0 && cpu_idx == 0) {
 					goto err_disconnect;
 				} else {
--
2.11.0

* [PATCH 5/8] IB/srp: Improve an error path
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman

Avoid that the following message is printed if login fails:

scsi host0: ib_srp: Sending CM DREQ failed

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index d21611a4e90f..87efb702b1c6 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3451,7 +3451,7 @@ static ssize_t srp_create_target(struct device *dev,
 					     target->ch_count,
 					     ch->target->orig_dgid.raw);
 				if (node_idx == 0 && cpu_idx == 0) {
-					goto err_disconnect;
+					goto free_ch;
 				} else {
 					srp_free_ch_ib(target, ch);
 					srp_free_req_data(target, ch);
@@ -3498,6 +3498,7 @@ static ssize_t srp_create_target(struct device *dev,
 err_disconnect:
 	srp_disconnect_target(target);
 
+free_ch:
 	for (i = 0; i < target->ch_count; i++) {
 		ch = &target->ch[i];
 		srp_free_ch_ib(target, ch);
--
2.11.0
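
The change above relies on the usual kernel idiom of stacked cleanup labels: a failure jumps to the label that undoes only what has been set up so far. A minimal user-space sketch of that idiom follows. The names model_target, model_disconnect(), model_free() and model_create_target() are hypothetical, and the counters only record which cleanup steps ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in that records which cleanup steps were executed. */
struct model_target {
	int connected;
	int dreq_sent;		/* incremented by the disconnect step */
	int channels_freed;	/* incremented by the free step */
};

static void model_disconnect(struct model_target *t) { t->dreq_sent++; }
static void model_free(struct model_target *t) { t->channels_freed++; }

/*
 * connect_ok: whether the first connection attempt succeeds.
 * later_ok:   whether the steps after connecting succeed.
 */
static int model_create_target(struct model_target *t, int connect_ok,
			       int later_ok)
{
	assert(t != NULL);
	if (!connect_ok)
		goto free_ch;		/* never connected: skip the disconnect */
	t->connected = 1;
	if (!later_ok)
		goto err_disconnect;	/* connected: disconnect, then free */
	return 0;

err_disconnect:
	model_disconnect(t);
free_ch:
	model_free(t);
	return -1;
}
```

In this model a failed first connection frees the channel without ever attempting a disconnect, which corresponds to the message the patch wants to avoid.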

* [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
From: Bart Van Assche @ 2017-02-10 23:56 UTC
To: Doug Ledford
Cc: linux-rdma, Bart Van Assche, Israel Rukshin, Max Gurtovoy, Laurence Oberman

If a HCA supports the SG_GAPS_REG feature then a single memory
region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
reduces the number of memory regions that is allocated per SRP
session.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 43 ++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 87efb702b1c6..2f85255d2aca 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3356,25 +3356,34 @@ static ssize_t srp_create_target(struct device *dev,
 	}
 
 	if (srp_dev->use_fast_reg || srp_dev->use_fmr) {
-		/*
-		 * FR and FMR can only map one HCA page per entry. If the
-		 * start address is not aligned on a HCA page boundary two
-		 * entries will be used for the head and the tail although
-		 * these two entries combined contain at most one HCA page of
-		 * data. Hence the "+ 1" in the calculation below.
-		 *
-		 * The indirect data buffer descriptor is contiguous so the
-		 * memory for that buffer will only be registered if
-		 * register_always is true. Hence add one to mr_per_cmd if
-		 * register_always has been set.
-		 */
+		bool gaps_reg = (ibdev->attrs.device_cap_flags &
+				 IB_DEVICE_SG_GAPS_REG);
+
 		max_sectors_per_mr = srp_dev->max_pages_per_mr <<
 				  (ilog2(srp_dev->mr_page_size) - 9);
-		mr_per_cmd = register_always +
-			(target->scsi_host->max_sectors + 1 +
-			 max_sectors_per_mr - 1) / max_sectors_per_mr;
-		pr_debug("max_sectors = %u; max_pages_per_mr = %u; mr_page_size = %u; max_sectors_per_mr = %u; mr_per_cmd = %u\n",
-			 target->scsi_host->max_sectors,
+		if (!gaps_reg) {
+			/*
+			 * FR and FMR can only map one HCA page per entry. If
+			 * the start address is not aligned on a HCA page
+			 * boundary two entries will be used for the head and
+			 * the tail although these two entries combined
+			 * contain at most one HCA page of data. Hence the "+
+			 * 1" in the calculation below.
+			 *
+			 * The indirect data buffer descriptor is contiguous
+			 * so the memory for that buffer will only be
+			 * registered if register_always is true. Hence add
+			 * one to mr_per_cmd if register_always has been set.
+			 */
+			mr_per_cmd = register_always +
+				(target->scsi_host->max_sectors + 1 +
+				 max_sectors_per_mr - 1) / max_sectors_per_mr;
+			mr_per_cmd = max(2U, mr_per_cmd);
+		} else {
+			mr_per_cmd = 1;
+		}
+		pr_debug("IB_DEVICE_SG_GAPS_REG = %d; max_sectors = %u; max_pages_per_mr = %u; mr_page_size = %u; max_sectors_per_mr = %u; mr_per_cmd = %u\n",
+			 gaps_reg, target->scsi_host->max_sectors,
 			 srp_dev->max_pages_per_mr, srp_dev->mr_page_size,
 			 max_sectors_per_mr, mr_per_cmd);
 	}
--
2.11.0
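
To make the arithmetic in this hunk concrete, here is a small user-space sketch of the mr_per_cmd computation. It mirrors the formula in the patch but is not the driver code: model_ilog2() and model_mr_per_cmd() are hypothetical helpers, and the input values used in the note below are chosen for illustration only.

```c
#include <assert.h>

/* Integer log2 for a power-of-two page size (stand-in for ilog2()). */
static unsigned int model_ilog2(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Number of memory regions per command, following the formula in the
 * patch. Sectors are 512 bytes, hence the "- 9" in the shift.
 */
static unsigned int model_mr_per_cmd(int gaps_reg,
				     unsigned int register_always,
				     unsigned int max_sectors,
				     unsigned int max_pages_per_mr,
				     unsigned int mr_page_size)
{
	unsigned int max_sectors_per_mr, n;

	assert(mr_page_size >= 512 && max_pages_per_mr > 0);
	if (gaps_reg)
		return 1;	/* one SG_GAPS region covers the whole command */

	max_sectors_per_mr = max_pages_per_mr << (model_ilog2(mr_page_size) - 9);
	/* "+ 1" covers an unaligned head/tail; the rest rounds up. */
	n = register_always +
	    (max_sectors + 1 + max_sectors_per_mr - 1) / max_sectors_per_mr;
	return n < 2 ? 2 : n;	/* same lower bound as max(2U, mr_per_cmd) */
}
```

For example, with 4096-byte pages and 256 pages per region, one region spans 2048 sectors. A max_sectors value of 4096 with register_always set then yields 1 + 6144 / 2048 = 4 regions per command, whereas the gaps_reg case always yields 1.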
* [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> ` (3 preceding siblings ...) 2017-02-10 23:56 ` [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported Bart Van Assche @ 2017-02-10 23:56 ` Bart Van Assche 2017-02-10 23:56 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche 5 siblings, 0 replies; 47+ messages in thread From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche, Steve Wise, Chuck Lever, Christoph Hellwig, Max Gurtovoy Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> Cc: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> Cc: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/core/verbs.c | 35 +++++++++++++++-------------------- 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 71580cc28c9e..42f8927b542c 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -1949,17 +1949,12 @@ static void ib_drain_qp_done(struct ib_cq *cq, struct ib_wc *wc) */ static void __ib_drain_sq(struct ib_qp *qp) { + struct ib_cq *cq = qp->send_cq; struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR }; struct ib_drain_cqe sdrain; struct ib_send_wr swr = {}, *bad_swr; int ret; - if (qp->send_cq->poll_ctx == IB_POLL_DIRECT) { - WARN_ONCE(qp->send_cq->poll_ctx == IB_POLL_DIRECT, - "IB_POLL_DIRECT poll_ctx not supported for drain\n"); - return; - } - swr.wr_cqe = &sdrain.cqe; sdrain.cqe.done = ib_drain_qp_done; init_completion(&sdrain.done); @@ -1976,7 +1971,11 @@ static void __ib_drain_sq(struct ib_qp *qp) 
return; } - wait_for_completion(&sdrain.done); + if (cq->poll_ctx == IB_POLL_DIRECT) + while (wait_for_completion_timeout(&sdrain.done, HZ / 10) <= 0) + ib_process_cq_direct(cq, -1); + else + wait_for_completion(&sdrain.done); } /* @@ -1984,17 +1983,12 @@ static void __ib_drain_sq(struct ib_qp *qp) */ static void __ib_drain_rq(struct ib_qp *qp) { + struct ib_cq *cq = qp->recv_cq; struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR }; struct ib_drain_cqe rdrain; struct ib_recv_wr rwr = {}, *bad_rwr; int ret; - if (qp->recv_cq->poll_ctx == IB_POLL_DIRECT) { - WARN_ONCE(qp->recv_cq->poll_ctx == IB_POLL_DIRECT, - "IB_POLL_DIRECT poll_ctx not supported for drain\n"); - return; - } - rwr.wr_cqe = &rdrain.cqe; rdrain.cqe.done = ib_drain_qp_done; init_completion(&rdrain.done); @@ -2011,7 +2005,11 @@ static void __ib_drain_rq(struct ib_qp *qp) return; } - wait_for_completion(&rdrain.done); + if (cq->poll_ctx == IB_POLL_DIRECT) + while (wait_for_completion_timeout(&rdrain.done, HZ / 10) <= 0) + ib_process_cq_direct(cq, -1); + else + wait_for_completion(&rdrain.done); } /** @@ -2028,8 +2026,7 @@ static void __ib_drain_rq(struct ib_qp *qp) * ensure there is room in the CQ and SQ for the drain work request and * completion. * - * allocate the CQ using ib_alloc_cq() and the CQ poll context cannot be - * IB_POLL_DIRECT. + * allocate the CQ using ib_alloc_cq(). * * ensure that there are no other contexts that are posting WRs concurrently. * Otherwise the drain is not guaranteed. @@ -2057,8 +2054,7 @@ EXPORT_SYMBOL(ib_drain_sq); * ensure there is room in the CQ and RQ for the drain work request and * completion. * - * allocate the CQ using ib_alloc_cq() and the CQ poll context cannot be - * IB_POLL_DIRECT. + * allocate the CQ using ib_alloc_cq(). * * ensure that there are no other contexts that are posting WRs concurrently. * Otherwise the drain is not guaranteed. 
@@ -2082,8 +2078,7 @@ EXPORT_SYMBOL(ib_drain_rq); * ensure there is room in the CQ(s), SQ, and RQ for drain work requests * and completions. * - * allocate the CQs using ib_alloc_cq() and the CQ poll context cannot be - * IB_POLL_DIRECT. + * allocate the CQs using ib_alloc_cq(). * * ensure that there are no other contexts that are posting WRs concurrently. * Otherwise the drain is not guaranteed. -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 47+ messages in thread
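Patch 7/8 carries no description, so here is a rough user-space model of the idea behind its IB_POLL_DIRECT branch: nothing else polls such a CQ, so the drain has to poll it itself until its own completion shows up. The names toy_cq, toy_process_cq() and toy_drain_direct() are hypothetical. The model also leaves out the sleep: in the patch each round first waits up to HZ / 10 in wait_for_completion_timeout(). Since that function returns an unsigned long, the "<= 0" test in the patch behaves like "== 0".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy completion queue: `pending` entries sit ahead of the drain marker. */
struct toy_cq {
	unsigned int pending;	/* completions queued before the drain WR */
	bool drain_done;	/* set once the drain completion is handled */
};

/* Stand-in for a direct poll: handle at most `budget` entries. */
static unsigned int toy_process_cq(struct toy_cq *cq, unsigned int budget)
{
	unsigned int handled = 0;

	while (handled < budget && cq->pending > 0) {
		cq->pending--;
		handled++;
	}
	/* The drain marker is the last entry; handling it ends the drain. */
	if (handled < budget && !cq->drain_done) {
		cq->drain_done = true;
		handled++;
	}
	return handled;
}

/*
 * Stand-in for the IB_POLL_DIRECT branch: poll until the drain
 * completion has been seen. Returns the number of polling rounds.
 */
static unsigned int toy_drain_direct(struct toy_cq *cq, unsigned int budget)
{
	unsigned int rounds = 0;

	assert(budget > 0);
	while (!cq->drain_done) {
		toy_process_cq(cq, budget);
		rounds++;
	}
	return rounds;
}
```

With five pending entries and a budget of four, the first round handles four entries and the second handles the last entry plus the drain marker.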
* [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> ` (4 preceding siblings ...) 2017-02-10 23:56 ` [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues Bart Van Assche @ 2017-02-10 23:56 ` Bart Van Assche [not found] ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 5 siblings, 1 reply; 47+ messages in thread From: Bart Van Assche @ 2017-02-10 23:56 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche, Christoph Hellwig, Israel Rukshin, Max Gurtovoy, Laurence Oberman A quote from the IB spec: However, if the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment leakage may occur. Therefore, it is good programming practice to tear down a QP that is associated with an SRQ by using the following process: * Put the QP in the Error State; * wait for the Affiliated Asynchronous Last WQE Reached Event; * either: * drain the CQ by invoking the Poll CQ verb and either wait for CQ to be empty or the number of Poll CQ operations has exceeded CQ capacity size; or * post another WR that completes on the same CQ and wait for this WR to return as a WC; * and then invoke a Destroy QP or Reset QP. 
Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 2f85255d2aca..b50733910f7e 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target) * completion handler can access the queue pair while it is * being destroyed. */ -static void srp_destroy_qp(struct ib_qp *qp) +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp) { - ib_drain_rq(qp); + spin_lock_irq(&ch->lock); + ib_process_cq_direct(ch->send_cq, -1); + spin_unlock_irq(&ch->lock); + + ib_drain_qp(qp); ib_destroy_qp(qp); } @@ -547,7 +551,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch) } if (ch->qp) - srp_destroy_qp(ch->qp); + srp_destroy_qp(ch, ch->qp); if (ch->recv_cq) ib_free_cq(ch->recv_cq); if (ch->send_cq) @@ -571,7 +575,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch) return 0; err_qp: - srp_destroy_qp(qp); + srp_destroy_qp(ch, qp); err_send_cq: ib_free_cq(send_cq); @@ -614,7 +618,7 @@ static void srp_free_ch_ib(struct srp_target_port *target, ib_destroy_fmr_pool(ch->fmr_pool); } - srp_destroy_qp(ch->qp); + srp_destroy_qp(ch, ch->qp); ib_free_cq(ch->send_cq); ib_free_cq(ch->recv_cq); @@ -1827,6 +1831,11 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch, return iu; } +/* + * Note: if this function is called from inside ib_drain_sq() then it will + * be called without ch->lock being held. 
If ib_drain_sq() dequeues a WQE + * with status IB_WC_SUCCESS then that's a bug. + */ static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc) { struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe); -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> @ 2017-02-11 0:07 ` Robert LeBlanc [not found] ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-02-12 17:19 ` Leon Romanovsky 1 sibling, 1 reply; 47+ messages in thread From: Robert LeBlanc @ 2017-02-11 0:07 UTC (permalink / raw) To: Bart Van Assche Cc: Doug Ledford, linux-rdma, Christoph Hellwig, Israel Rukshin, Max Gurtovoy, Laurence Oberman On Fri, Feb 10, 2017 at 4:56 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote: > A quote from the IB spec: > > However, if the Consumer does not wait for the Affiliated Asynchronous > Last WQE Reached Event, then WQE and Data Segment leakage may occur. > Therefore, it is good programming practice to tear down a QP that is > associated with an SRQ by using the following process: > * Put the QP in the Error State; > * wait for the Affiliated Asynchronous Last WQE Reached Event; > * either: > * drain the CQ by invoking the Poll CQ verb and either wait for CQ > to be empty or the number of Poll CQ operations has exceeded CQ > capacity size; or > * post another WR that completes on the same CQ and wait for this WR to return as a WC; > * and then invoke a Destroy QP or Reset QP. 
> > Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++----- > 1 file changed, 14 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index 2f85255d2aca..b50733910f7e 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target) > * completion handler can access the queue pair while it is > * being destroyed. > */ > -static void srp_destroy_qp(struct ib_qp *qp) > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp) > { > - ib_drain_rq(qp); > + spin_lock_irq(&ch->lock); > + ib_process_cq_direct(ch->send_cq, -1); > + spin_unlock_irq(&ch->lock); > + > + ib_drain_qp(qp); > ib_destroy_qp(qp); > } > > @@ -547,7 +551,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch) > } > > if (ch->qp) > - srp_destroy_qp(ch->qp); > + srp_destroy_qp(ch, ch->qp); > if (ch->recv_cq) > ib_free_cq(ch->recv_cq); > if (ch->send_cq) > @@ -571,7 +575,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch) > return 0; > > err_qp: > - srp_destroy_qp(qp); > + srp_destroy_qp(ch, qp); > > err_send_cq: > ib_free_cq(send_cq); > @@ -614,7 +618,7 @@ static void srp_free_ch_ib(struct srp_target_port *target, > ib_destroy_fmr_pool(ch->fmr_pool); > } > > - srp_destroy_qp(ch->qp); > + srp_destroy_qp(ch, ch->qp); > ib_free_cq(ch->send_cq); > ib_free_cq(ch->recv_cq); > > @@ -1827,6 +1831,11 @@ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch, > return iu; > } > > +/* > + * Note: if this function is called from 
inside ib_drain_sq() then it will

Don't you mean outside of ib_drain_sq?

> + * be called without ch->lock being held. If ib_drain_sq() dequeues a WQE
> + * with status IB_WC_SUCCESS then that's a bug.
> + */
>  static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
>  {
>  	struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe);
> --
> 2.11.0

Sagi,

Does something like this need to happen for iSER as well? Maybe it could
help with the D state problem?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
[not found] ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-11 0:13 ` Bart Van Assche
0 siblings, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-11 0:13 UTC (permalink / raw)
To: robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org
Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Fri, 2017-02-10 at 17:07 -0700, Robert LeBlanc wrote:
> > +/*
> > + * Note: if this function is called from inside ib_drain_sq() then it will
>
> Don't you mean outside of ib_drain_sq?

I meant inside.

Bart.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 2017-02-11 0:07 ` Robert LeBlanc @ 2017-02-12 17:19 ` Leon Romanovsky [not found] ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> 1 sibling, 1 reply; 47+ messages in thread From: Leon Romanovsky @ 2017-02-12 17:19 UTC (permalink / raw) To: Bart Van Assche Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Israel Rukshin, Max Gurtovoy, Laurence Oberman [-- Attachment #1: Type: text/plain, Size: 2274 bytes --] On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote: > A quote from the IB spec: > > However, if the Consumer does not wait for the Affiliated Asynchronous > Last WQE Reached Event, then WQE and Data Segment leakage may occur. > Therefore, it is good programming practice to tear down a QP that is > associated with an SRQ by using the following process: > * Put the QP in the Error State; > * wait for the Affiliated Asynchronous Last WQE Reached Event; > * either: > * drain the CQ by invoking the Poll CQ verb and either wait for CQ > to be empty or the number of Poll CQ operations has exceeded CQ > capacity size; or > * post another WR that completes on the same CQ and wait for this WR to return as a WC; > * and then invoke a Destroy QP or Reset QP. 
>
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 2f85255d2aca..b50733910f7e 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
>   * completion handler can access the queue pair while it is
>   * being destroyed.
>   */
> -static void srp_destroy_qp(struct ib_qp *qp)
> +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
>  {
> -	ib_drain_rq(qp);
> +	spin_lock_irq(&ch->lock);
> +	ib_process_cq_direct(ch->send_cq, -1);

I see that you are already using "-1" in your code, but the comment in
ib_process_cq_direct() states that no new code should use "-1".

61 * Note: for compatibility reasons -1 can be passed in %budget for unlimited
62 * polling. Do not use this feature in new code, it will be removed soon.
63 */
64 int ib_process_cq_direct(struct ib_cq *cq, int budget)

Thanks
^ permalink raw reply [flat|nested] 47+ messages in thread
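The remark above about the "-1" budget suggests an alternative. One conceivable way to empty a queue without an unlimited budget is to poll in bounded batches and stop once a batch comes back short. The sketch below is a user-space illustration under that assumption only; toy_queue, toy_poll() and toy_poll_until_empty() are hypothetical names, and nothing in this thread establishes that the series should adopt this shape.

```c
#include <assert.h>

/* Toy queue: only counts how many completions are waiting. */
struct toy_queue {
	int pending;
};

/*
 * Stand-in for a poll call with a finite budget: handle at most
 * `budget` completions and return how many were handled.
 */
static int toy_poll(struct toy_queue *q, int budget)
{
	int handled = q->pending < budget ? q->pending : budget;

	q->pending -= handled;
	return handled;
}

/*
 * Empty the queue in bounded batches. A batch that returns fewer
 * entries than its budget means the queue was empty at that moment.
 * Returns the total number of completions handled.
 */
static int toy_poll_until_empty(struct toy_queue *q, int budget)
{
	int total = 0, n;

	assert(budget > 0);
	do {
		n = toy_poll(q, budget);
		total += n;
	} while (n == budget);
	return total;
}
```

Note that when the number of pending entries is an exact multiple of the budget, the loop makes one extra call that returns zero before it stops.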
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> @ 2017-02-12 18:02 ` Laurence Oberman [not found] ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2017-02-12 20:11 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche 1 sibling, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-12 18:02 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Israel Rukshin, Max Gurtovoy ----- Original Message ----- > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > To: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > Cc: "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel > Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Sent: Sunday, February 12, 2017 12:19:28 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote: > > A quote from the IB spec: > > > > However, if the Consumer does not wait for the Affiliated Asynchronous > > Last WQE Reached Event, then WQE and Data Segment leakage may occur. 
> > Therefore, it is good programming practice to tear down a QP that is > > associated with an SRQ by using the following process: > > * Put the QP in the Error State; > > * wait for the Affiliated Asynchronous Last WQE Reached Event; > > * either: > > * drain the CQ by invoking the Poll CQ verb and either wait for CQ > > to be empty or the number of Poll CQ operations has exceeded CQ > > capacity size; or > > * post another WR that completes on the same CQ and wait for this WR to > > return as a WC; > > * and then invoke a Destroy QP or Reset QP. > > > > Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > > Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > --- > > drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++----- > > 1 file changed, 14 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > > b/drivers/infiniband/ulp/srp/ib_srp.c > > index 2f85255d2aca..b50733910f7e 100644 > > --- a/drivers/infiniband/ulp/srp/ib_srp.c > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > > @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct > > srp_target_port *target) > > * completion handler can access the queue pair while it is > > * being destroyed. > > */ > > -static void srp_destroy_qp(struct ib_qp *qp) > > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp) > > { > > - ib_drain_rq(qp); > > + spin_lock_irq(&ch->lock); > > + ib_process_cq_direct(ch->send_cq, -1); > > I see that you are already using "-1" in your code, but the comments in the > ib_process_cq_direct states that no new code should use "-1". > > 61 * Note: for compatibility reasons -1 can be passed in %budget for > unlimited > 62 * polling. 
Do not use this feature in new code, it will be removed
> soon.
> 63 */
> 64 int ib_process_cq_direct(struct ib_cq *cq, int budget)
>
> Thanks
>

Hello Bart

I took the latest for-next from your git tree and started the first set of tests.

I bumped into this very quickly, but I am only running the new code on the client.
The server has not been updated.

On the client I see this after starting a single write thread to an XFS on one of the mpaths.
Given it's in ib_drain, I figured I would let you know now.

[ 850.862430] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 850.865203] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f3d94a30
[ 850.941454] scsi host1: ib_srp: Failed to map data (-12)
[ 860.990411] mlx5_0:dump_cqe:262:(pid 1103): dump error cqe
[ 861.019162] 00000000 00000000 00000000 00000000
[ 861.042085] 00000000 00000000 00000000 00000000
[ 861.066567] 00000000 00000000 00000000 00000000
[ 861.092164] 00000000 0f007806 2500002a cefe87d1
[ 861.117091] ------------[ cut here ]------------
[ 861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
[ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
[ 861.235179] Modules linked in: dm_service_time xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat rpcrdma nf_conntrack ib_isert iscsi_target_mod iptable_mangle iptable_security iptable_raw ebtable_filter ib_iser ebtables libiscsi ip6table_filter ip6_tables scsi_transport_iscsi iptable_filter target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
[ 861.646587] pcbc aesni_intel crypto_simd ipmi_ssif glue_helper ipmi_si cryptd iTCO_wdt gpio_ich ipmi_devintf iTCO_vendor_support pcspkr hpwdt hpilo pcc_cpufreq sg ipmi_msghandler acpi_power_meter i7core_edac acpi_cpufreq shpchp edac_core lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sd_mod sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm ptp fjes hpsa crc32c_intel serio_raw i2c_core pps_core bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 861.943997] CPU: 27 PID: 1103 Comm: kworker/27:2 Tainted: G I 4.10.0-rc7+ #1
[ 861.989476] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 862.024833] Workqueue: events_long srp_reconnect_work [scsi_transport_srp]
[ 862.063004] Call Trace:
[ 862.076516] dump_stack+0x63/0x87
[ 862.094841] __warn+0xd1/0xf0
[ 862.112164] warn_slowpath_fmt+0x5f/0x80
[ 862.134013] ? mlx5_poll_one+0x59/0xa40 [mlx5_ib]
[ 862.161124] __ib_drain_sq+0x1bb/0x1c0 [ib_core]
[ 862.187702] ib_drain_sq+0x25/0x30 [ib_core]
[ 862.212168] ib_drain_qp+0x12/0x30 [ib_core]
[ 862.238138] srp_destroy_qp+0x47/0x60 [ib_srp]
[ 862.264155] srp_create_ch_ib+0x26f/0x5f0 [ib_srp]
[ 862.291646] ? scsi_done+0x21/0x70
[ 862.312392] ? srp_finish_req+0x93/0xb0 [ib_srp]
[ 862.338654] srp_rport_reconnect+0xf0/0x1f0 [ib_srp]
[ 862.366274] srp_reconnect_rport+0xca/0x220 [scsi_transport_srp]
[ 862.400756] srp_reconnect_work+0x44/0xd1 [scsi_transport_srp]
[ 862.434277] process_one_work+0x165/0x410
[ 862.456198] worker_thread+0x137/0x4c0
[ 862.476973] kthread+0x101/0x140
[ 862.493935] ? rescuer_thread+0x3b0/0x3b0
[ 862.516800] ? kthread_park+0x90/0x90
[ 862.537396] ? do_syscall_64+0x67/0x180
[ 862.558477] ret_from_fork+0x2c/0x40
[ 862.578161] ---[ end trace 2a6c2779f0a2d28f ]---
[ 864.274137] scsi host1: ib_srp: reconnect succeeded
[ 864.306836] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 864.310916] mlx5_0:dump_cqe:262:(pid 13776): dump error cqe
[ 864.310917] 00000000 00000000 00000000 00000000
[ 864.310921] 00000000 00000000 00000000 00000000
[ 864.310922] 00000000 00000000 00000000 00000000
[ 864.310922] 00000000 0f007806 25000032 00044cd0
[ 864.310928] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b94268078
[ 864.527890] scsi host1: ib_srp: Failed to map data (-12)
[ 876.101124] scsi host1: ib_srp: reconnect succeeded
[ 876.133923] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 876.135014] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 876.210311] scsi host1: ib_srp: Failed to map data (-12)
[ 876.239985] mlx5_0:dump_cqe:262:(pid 5945): dump error cqe
[ 876.270855] 00000000 00000000 00000000 00000000
[ 876.296525] 00000000 00000000 00000000 00000000
[ 876.322500] 00000000 00000000 00000000 00000000
[ 876.348519] 00000000 0f007806 2500003a 0080e1d0
[ 887.784981] scsi host1: ib_srp: reconnect succeeded
[ 887.819808] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 887.851777] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 887.898850] scsi host1: ib_srp: Failed to map data (-12)
[ 887.928647] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[ 887.959938] 00000000 00000000 00000000 00000000
[ 887.985041] 00000000 00000000 00000000 00000000
[ 888.010619] 00000000 00000000 00000000 00000000
[ 888.035601] 00000000 0f007806 25000042 008099d0
[ 899.546781] scsi host1: ib_srp: reconnect succeeded
[ 899.580758] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 899.611289] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 899.658289] scsi host1: ib_srp: Failed to map data (-12)
[ 899.687219] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[ 899.718736] 00000000 00000000 00000000 00000000
[ 899.744137] 00000000 00000000 00000000 00000000
[ 899.769206] 00000000 00000000 00000000 00000000
[ 899.795217] 00000000 0f007806 2500004a 008091d0
[ 911.343869] scsi host1: ib_srp: reconnect succeeded
[ 911.376684] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 911.407755] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 911.454474] scsi host1: ib_srp: Failed to map data (-12)
[ 911.484279] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[ 911.514784] 00000000 00000000 00000000 00000000
[ 911.540251] 00000000 00000000 00000000 00000000
[ 911.564841] 00000000 00000000 00000000 00000000
[ 911.590743] 00000000 0f007806 25000052 008089d0
[ 923.066748] scsi host1: ib_srp: reconnect succeeded
[ 923.099656] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 923.131825] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 923.179514] scsi host1: ib_srp: Failed to map data (-12)
[ 923.209307] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[ 923.239986] 00000000 00000000 00000000 00000000
[ 923.265419] 00000000 00000000 00000000 00000000
[ 923.290102] 00000000 00000000 00000000 00000000
[ 923.315120] 00000000 0f007806 2500005a 00c4d4d0
[ 934.839336] scsi host1: ib_srp: reconnect succeeded
[ 934.874582] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 934.906298] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bf1939130
[ 934.953712] scsi host1: ib_srp: Failed to map data (-12)
[ 934.983829] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe
[ 935.015371] 00000000 00000000 00000000 00000000
[ 935.041544] 00000000 00000000 00000000 00000000
[ 935.066883] 00000000 00000000 00000000 00000000
[ 935.092755] 00000000 0f007806 25000062 00c4ecd0
[ 946.610744] scsi host1: ib_srp: reconnect succeeded
[ 946.644528] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1)
[ 946.647935] mlx5_0:dump_cqe:262:(pid 752): dump error cqe
[ 946.647936] 00000000 00000000 00000000 00000000
[ 946.647937] 00000000 00000000 00000000 00000000
[ 946.647937] 00000000 00000000 00000000 00000000
[ 946.647938] 00000000 0f007806 2500006a 00c4e4d0
[ 946.647940] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b94268c78
[ 946.869439] scsi host1: ib_srp: Failed to map data (-12)

I will reset and restart to make sure this issue is repeatable.

Thanks
Laurence
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-12 18:06 ` Laurence Oberman [not found] ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2017-02-12 20:05 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche 1 sibling, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-12 18:06 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Israel Rukshin, Max Gurtovoy ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Sent: Sunday, February 12, 2017 1:02:53 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > ----- Original Message ----- > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > To: "Bart Van Assche" <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > Cc: "Doug Ledford" <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Israel > > Rukshin" <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>, > > "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Sent: Sunday, February 12, 2017 12:19:28 PM > > 
Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote: > > > A quote from the IB spec: > > > > > > However, if the Consumer does not wait for the Affiliated Asynchronous > > > Last WQE Reached Event, then WQE and Data Segment leakage may occur. > > > Therefore, it is good programming practice to tear down a QP that is > > > associated with an SRQ by using the following process: > > > * Put the QP in the Error State; > > > * wait for the Affiliated Asynchronous Last WQE Reached Event; > > > * either: > > > * drain the CQ by invoking the Poll CQ verb and either wait for CQ > > > to be empty or the number of Poll CQ operations has exceeded CQ > > > capacity size; or > > > * post another WR that completes on the same CQ and wait for this WR to > > > return as a WC; > > > * and then invoke a Destroy QP or Reset QP. > > > > > > Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > > > Cc: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > > Cc: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > > Cc: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > --- > > > drivers/infiniband/ulp/srp/ib_srp.c | 19 ++++++++++++++----- > > > 1 file changed, 14 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > > > b/drivers/infiniband/ulp/srp/ib_srp.c > > > index 2f85255d2aca..b50733910f7e 100644 > > > --- a/drivers/infiniband/ulp/srp/ib_srp.c > > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > > > @@ -471,9 +471,13 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct > > > srp_target_port *target) > > > * completion handler can access the queue pair while it is > > > * being destroyed. 
> > > */ > > > -static void srp_destroy_qp(struct ib_qp *qp) > > > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp) > > > { > > > - ib_drain_rq(qp); > > > + spin_lock_irq(&ch->lock); > > > + ib_process_cq_direct(ch->send_cq, -1); > > > > I see that you are already using "-1" in your code, but the comments in the > > ib_process_cq_direct states that no new code should use "-1". > > > > 61 * Note: for compatibility reasons -1 can be passed in %budget for > > unlimited > > 62 * polling. Do not use this feature in new code, it will be removed > > soon. > > 63 */ > > 64 int ib_process_cq_direct(struct ib_cq *cq, int budget) > > > > Thanks > > > > Hello Bart > > I took latest for-next from your git tree and started the fist set of tests. > > I bumped into this very quickly, but I only am running the new code on the > client. > The server has not been updated. > > On the client I see this after starting a single write thread to and XFS on > on eof the mpaths. > Given its in ib_strain figured I would let you know now. 
> > > [ 850.862430] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 850.865203] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f3d94a30 > [ 850.941454] scsi host1: ib_srp: Failed to map data (-12) > [ 860.990411] mlx5_0:dump_cqe:262:(pid 1103): dump error cqe > [ 861.019162] 00000000 00000000 00000000 00000000 > [ 861.042085] 00000000 00000000 00000000 00000000 > [ 861.066567] 00000000 00000000 00000000 00000000 > [ 861.092164] 00000000 0f007806 2500002a cefe87d1 > [ 861.117091] ------------[ cut here ]------------ > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core] > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > [ 861.235179] Modules linked in: dm_service_time xt_CHECKSUM ipt_MASQUERADE > nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 > ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat > ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 > nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat > nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat rpcrdma nf_conntrack > ib_isert iscsi_target_mod iptable_mangle iptable_security iptable_raw > ebtable_filter ib_iser ebtables libiscsi ip6table_filter ip6_tables > scsi_transport_iscsi iptable_filter target_core_mod ib_srp > scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm > iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel > [ 861.646587] pcbc aesni_intel crypto_simd ipmi_ssif glue_helper ipmi_si > cryptd iTCO_wdt gpio_ich ipmi_devintf iTCO_vendor_support pcspkr hpwdt hpilo > pcc_cpufreq sg ipmi_msghandler acpi_power_meter i7core_edac acpi_cpufreq > shpchp edac_core lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc > dm_multipath ip_tables xfs libcrc32c amdkfd amd_iommu_v2 radeon i2c_algo_bit > drm_kms_helper syscopyarea 
sd_mod sysfillrect sysimgblt fb_sys_fops ttm > mlx5_core drm ptp fjes hpsa crc32c_intel serio_raw i2c_core pps_core bnx2 > devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last > unloaded: ib_srpt] > [ 861.943997] CPU: 27 PID: 1103 Comm: kworker/27:2 Tainted: G I > 4.10.0-rc7+ #1 > [ 861.989476] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015 > [ 862.024833] Workqueue: events_long srp_reconnect_work [scsi_transport_srp] > [ 862.063004] Call Trace: > [ 862.076516] dump_stack+0x63/0x87 > [ 862.094841] __warn+0xd1/0xf0 > [ 862.112164] warn_slowpath_fmt+0x5f/0x80 > [ 862.134013] ? mlx5_poll_one+0x59/0xa40 [mlx5_ib] > [ 862.161124] __ib_drain_sq+0x1bb/0x1c0 [ib_core] > [ 862.187702] ib_drain_sq+0x25/0x30 [ib_core] > [ 862.212168] ib_drain_qp+0x12/0x30 [ib_core] > [ 862.238138] srp_destroy_qp+0x47/0x60 [ib_srp] > [ 862.264155] srp_create_ch_ib+0x26f/0x5f0 [ib_srp] > [ 862.291646] ? scsi_done+0x21/0x70 > [ 862.312392] ? srp_finish_req+0x93/0xb0 [ib_srp] > [ 862.338654] srp_rport_reconnect+0xf0/0x1f0 [ib_srp] > [ 862.366274] srp_reconnect_rport+0xca/0x220 [scsi_transport_srp] > [ 862.400756] srp_reconnect_work+0x44/0xd1 [scsi_transport_srp] > [ 862.434277] process_one_work+0x165/0x410 > [ 862.456198] worker_thread+0x137/0x4c0 > [ 862.476973] kthread+0x101/0x140 > [ 862.493935] ? rescuer_thread+0x3b0/0x3b0 > [ 862.516800] ? kthread_park+0x90/0x90 > [ 862.537396] ? 
do_syscall_64+0x67/0x180 > [ 862.558477] ret_from_fork+0x2c/0x40 > [ 862.578161] ---[ end trace 2a6c2779f0a2d28f ]--- > [ 864.274137] scsi host1: ib_srp: reconnect succeeded > [ 864.306836] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 864.310916] mlx5_0:dump_cqe:262:(pid 13776): dump error cqe > [ 864.310917] 00000000 00000000 00000000 00000000 > [ 864.310921] 00000000 00000000 00000000 00000000 > [ 864.310922] 00000000 00000000 00000000 00000000 > [ 864.310922] 00000000 0f007806 25000032 00044cd0 > [ 864.310928] scsi host1: ib_srp: failed FAST REG status memory management > operation error (6) for CQE ffff880b94268078 > [ 864.527890] scsi host1: ib_srp: Failed to map data (-12) > [ 876.101124] scsi host1: ib_srp: reconnect succeeded > [ 876.133923] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 876.135014] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bf1939130 > [ 876.210311] scsi host1: ib_srp: Failed to map data (-12) > [ 876.239985] mlx5_0:dump_cqe:262:(pid 5945): dump error cqe > [ 876.270855] 00000000 00000000 00000000 00000000 > [ 876.296525] 00000000 00000000 00000000 00000000 > [ 876.322500] 00000000 00000000 00000000 00000000 > [ 876.348519] 00000000 0f007806 2500003a 0080e1d0 > [ 887.784981] scsi host1: ib_srp: reconnect succeeded > [ 887.819808] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 887.851777] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bf1939130 > [ 887.898850] scsi host1: ib_srp: Failed to map data (-12) > [ 887.928647] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe > [ 887.959938] 00000000 00000000 00000000 00000000 > [ 887.985041] 00000000 00000000 00000000 00000000 > [ 888.010619] 00000000 00000000 00000000 00000000 > [ 888.035601] 00000000 0f007806 25000042 008099d0 > [ 899.546781] scsi host1: ib_srp: reconnect succeeded > [ 899.580758] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 899.611289] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > 
ffff880bf1939130 > [ 899.658289] scsi host1: ib_srp: Failed to map data (-12) > [ 899.687219] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe > [ 899.718736] 00000000 00000000 00000000 00000000 > [ 899.744137] 00000000 00000000 00000000 00000000 > [ 899.769206] 00000000 00000000 00000000 00000000 > [ 899.795217] 00000000 0f007806 2500004a 008091d0 > [ 911.343869] scsi host1: ib_srp: reconnect succeeded > [ 911.376684] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 911.407755] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bf1939130 > [ 911.454474] scsi host1: ib_srp: Failed to map data (-12) > [ 911.484279] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe > [ 911.514784] 00000000 00000000 00000000 00000000 > [ 911.540251] 00000000 00000000 00000000 00000000 > [ 911.564841] 00000000 00000000 00000000 00000000 > [ 911.590743] 00000000 0f007806 25000052 008089d0 > [ 923.066748] scsi host1: ib_srp: reconnect succeeded > [ 923.099656] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 923.131825] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bf1939130 > [ 923.179514] scsi host1: ib_srp: Failed to map data (-12) > [ 923.209307] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe > [ 923.239986] 00000000 00000000 00000000 00000000 > [ 923.265419] 00000000 00000000 00000000 00000000 > [ 923.290102] 00000000 00000000 00000000 00000000 > [ 923.315120] 00000000 0f007806 2500005a 00c4d4d0 > [ 934.839336] scsi host1: ib_srp: reconnect succeeded > [ 934.874582] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 934.906298] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bf1939130 > [ 934.953712] scsi host1: ib_srp: Failed to map data (-12) > [ 934.983829] mlx5_0:dump_cqe:262:(pid 7327): dump error cqe > [ 935.015371] 00000000 00000000 00000000 00000000 > [ 935.041544] 00000000 00000000 00000000 00000000 > [ 935.066883] 00000000 00000000 00000000 00000000 > [ 935.092755] 00000000 0f007806 25000062 00c4ecd0 > [ 
946.610744] scsi host1: ib_srp: reconnect succeeded > [ 946.644528] scsi host1: ib_srp: Out of MRs (mr_per_cmd = 1) > [ 946.647935] mlx5_0:dump_cqe:262:(pid 752): dump error cqe > [ 946.647936] 00000000 00000000 00000000 00000000 > [ 946.647937] 00000000 00000000 00000000 00000000 > [ 946.647937] 00000000 00000000 00000000 00000000 > [ 946.647938] 00000000 0f007806 2500006a 00c4e4d0 > [ 946.647940] scsi host1: ib_srp: failed FAST REG status memory management > operation error (6) for CQE ffff880b94268c78 > [ 946.869439] scsi host1: ib_srp: Failed to map data (-12) > > I will reset and restart to make sure this issue is repeatable. > > Thanks > Laurence Sorry for typos, should have been On the client I see this after starting a single write thread to an XFS on one of the mpaths. Given its in ib_drain_cq figured I would let you know now. Thanks Laurence -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/8] IB/srp bug fixes [not found] ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-14 3:02 ` Laurence Oberman [not found] ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 3:02 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Israel Rukshin, Max Gurtovoy Hello Bart The following 7 of 8 patches were applied to Linus's latest tree. However this required first reverting commit ad8e66b4a80182174f73487ed25fd2140cf43361 Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Date: Wed Dec 28 12:48:28 2016 +0200 See my other email regarding why the above needed to be reverted. All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests. 4.10.0-rc8.bart+ The revert of the above meant I did not apply and test patch 6 of the series IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported IB/srp: Avoid that duplicate responses trigger a kernel bug IB/srp: Fix race conditions related to task management IB/srp: Document locking conventions IB/srp: Make a diagnostic message more informative IB/srp: Improve an error path *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported IB/core: Add support for draining IB_POLL_DIRECT completion queues IB/srp: Drain the send queue before destroying a QP For the series except patch 6 Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/8] IB/srp bug fixes [not found] ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-14 17:18 ` Bart Van Assche [not found] ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Bart Van Assche @ 2017-02-14 17:18 UTC (permalink / raw) To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote: > The following 7 of 8 patches were applied to Linus's latest tree. > > However this required first reverting > > commit ad8e66b4a80182174f73487ed25fd2140cf43361 > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Date: Wed Dec 28 12:48:28 2016 +0200 > > See my other email regarding why the above needed to be reverted. > > All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests. > > 4.10.0-rc8.bart+ > > The revert of the above meant I did not apply and test patch 6 of the series > IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported > > IB/srp: Avoid that duplicate responses trigger a kernel bug > IB/srp: Fix race conditions related to task management > IB/srp: Document locking conventions > IB/srp: Make a diagnostic message more informative > IB/srp: Improve an error path > *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported > IB/core: Add support for draining IB_POLL_DIRECT completion queues > IB/srp: Drain the send queue before destroying a QP > > For the series except patch 6 > > Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Hello Laurence, Thank you for the testing. 
However, reverting commit ad8e66b4a801 without making any further changes is not acceptable because it would reintroduce the SG-list mapping problem addressed by that patch. Can you test the srp-initiator-for-next branch from my github repository against mlx5 (commit 8dca762deab6)? It passes my tests against mlx4. The patches on that branch are: Bart Van Assche (8): IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS IB/srp: Avoid that duplicate responses trigger a kernel bug IB/srp: Fix race conditions related to task management IB/srp: Document locking conventions IB/srp: Make a diagnostic message more informative IB/srp: Improve an error path IB/core: Add support for draining IB_POLL_DIRECT completion queues IB/srp: Drain the send queue before destroying a QP Thanks, Bart. ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>]
* Re: [PATCH 0/8] IB/srp bug fixes [not found] ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> @ 2017-02-14 17:22 ` Laurence Oberman 2017-02-14 18:47 ` Laurence Oberman 1 sibling, 0 replies; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 17:22 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Tuesday, February 14, 2017 12:18:11 PM > Subject: Re: [PATCH 0/8] IB/srp bug fixes > > On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote: > > The following 7 of 8 patches were applied to Linus's latest tree. > > > > However this required first reverting > > > > commit ad8e66b4a80182174f73487ed25fd2140cf43361 > > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > Date: Wed Dec 28 12:48:28 2016 +0200 > > > > See my other email regarding why the above needed to be reverted. > > > > All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests. 
> > > > 4.10.0-rc8.bart+ > > > > The revert of the above meant I did not apply and test patch 6 of the > > series > > IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported > > > > IB/srp: Avoid that duplicate responses trigger a kernel bug > > IB/srp: Fix race conditions related to task management > > IB/srp: Document locking conventions > > IB/srp: Make a diagnostic message more informative > > IB/srp: Improve an error path > > *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA > > feature if supported > > IB/core: Add support for draining IB_POLL_DIRECT completion queues > > IB/srp: Drain the send queue before destroying a QP > > > > For the series except patch 6 > > > > Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Hello Laurence, > > Thank you for the testing. However, reverting commit ad8e66b4a801 without > making any further changes is not acceptable because it would reintroduce > the SG-list mapping problem addressed by that patch. Can you test the > srp-initiator-for-next branch from my github repository against mlx5 (commit > 8dca762deab6)? It passes my tests against mlx4. The patches on that branch > are: > > Bart Van Assche (8): > IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS > IB/srp: Avoid that duplicate responses trigger a kernel bug > IB/srp: Fix race conditions related to task management > IB/srp: Document locking conventions > IB/srp: Make a diagnostic message more informative > IB/srp: Improve an error path > IB/core: Add support for draining IB_POLL_DIRECT completion queues > IB/srp: Drain the send queue before destroying a QP > > Thanks, > > Bart. > Western Digital Corporation (and its subsidiaries) E-mail Confidentiality > Notice & Disclaimer: > > This e-mail and any files transmitted with it may contain confidential or > legally privileged information of WDC and/or its affiliates, and are > intended solely for the use of the individual or entity to which they are > addressed. 
If you are not the intended recipient, any disclosure, copying, > distribution or any action taken or omitted to be taken in reliance on it, > is prohibited. If you have received this e-mail in error, please notify the > sender immediately and delete the e-mail in its entirety from your system. > > Hello Bart, Understood, will pull and test this today. Thank you for your assistance. Regards Laurence ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 0/8] IB/srp bug fixes [not found] ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 2017-02-14 17:22 ` Laurence Oberman @ 2017-02-14 18:47 ` Laurence Oberman [not found] ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 18:47 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Tuesday, February 14, 2017 12:18:11 PM > Subject: Re: [PATCH 0/8] IB/srp bug fixes > > On Mon, 2017-02-13 at 22:02 -0500, Laurence Oberman wrote: > > The following 7 of 8 patches were applied to Linus's latest tree. > > > > However this required first reverting > > > > commit ad8e66b4a80182174f73487ed25fd2140cf43361 > > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > Date: Wed Dec 28 12:48:28 2016 +0200 > > > > See my other email regarding why the above needed to be reverted. > > > > All tests passed in my mlx5 EDR-100 test bed for the ib-srp/mlx5 tests. 
> > > > 4.10.0-rc8.bart+ > > > > The revert of the above meant I did not apply and test patch 6 of the > > series > > IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported > > > > IB/srp: Avoid that duplicate responses trigger a kernel bug > > IB/srp: Fix race conditions related to task management > > IB/srp: Document locking conventions > > IB/srp: Make a diagnostic message more informative > > IB/srp: Improve an error path > > *** Not applied and not tested IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA > > feature if supported > > IB/core: Add support for draining IB_POLL_DIRECT completion queues > > IB/srp: Drain the send queue before destroying a QP > > > > For the series except patch 6 > > > > Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Hello Laurence, > > Thank you for the testing. However, reverting commit ad8e66b4a801 without > making any further changes is not acceptable because it would reintroduce > the SG-list mapping problem addressed by that patch. Can you test the > srp-initiator-for-next branch from my github repository against mlx5 (commit > 8dca762deab6)? It passes my tests against mlx4. The patches on that branch > are: > > Bart Van Assche (8): > IB/SRP: Avoid using IB_MR_TYPE_SG_GAPS > IB/srp: Avoid that duplicate responses trigger a kernel bug > IB/srp: Fix race conditions related to task management > IB/srp: Document locking conventions > IB/srp: Make a diagnostic message more informative > IB/srp: Improve an error path > IB/core: Add support for draining IB_POLL_DIRECT completion queues > IB/srp: Drain the send queue before destroying a QP > > Thanks, > > Bart.-- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hello Bart 4.10.0-rc8.bart_latest+ Built from branch srp-initiator-for-next after pull of your repository. 
The large I/O testing is what I focused on but all tests are passing: small/large I/O, direct and buffered I/O, file-system and direct to mpath devices.
This is a snap of 4 simultaneous 4MB I/O read tasks and 1 buffered write task (that will sporadically exceed 4MB).
### RECORD 7 >>> ibclient <<< (1487097890.001) (Tue Feb 14 13:44:50 2017) ###
# DISK STATISTICS (/sec)
#                <-----------reads------------><------------writes------------><-------averages-------> Pct
#Time    Name    KBytes Merged  IOs Size Wait   KBytes Merged  IOs Size Wait RWSize QLen Wait SvcTim Util
13:44:50 dm-11   192512    141   47 4096   20        0      0    0    0    0   4096    1   20     21   99
13:44:50 dm-17   184320    135   45 4096   20        0      0    0    0    0   4096    1   20     22   99
13:44:50 dm-21   163840    120   40 4096   21  1236928   1984  153 8084  319   7257   91  257      5   99
13:44:50 dm-24   786432    576  192 4096    5        0      0    0    0    0   4096    1    5      5   99
13:44:50 dm-30   790528    579  193 4096    5        0      0    0    0    0   4096    1    5      5   99
It looks good Bart
For branch srp-initiator-for-next, all tests are passing.
Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Thanks
Laurence
^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/8] IB/srp bug fixes [not found] ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-14 18:49 ` Bart Van Assche 0 siblings, 0 replies; 47+ messages in thread From: Bart Van Assche @ 2017-02-14 18:49 UTC (permalink / raw) To: loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org On Tue, 2017-02-14 at 13:47 -0500, Laurence Oberman wrote: > For branch srp-initiator-for-next, all tests are passing. > Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Thank you! I will post these patches as a v2 of this series. Bart. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2017-02-12 18:06 ` Laurence Oberman @ 2017-02-12 20:05 ` Bart Van Assche [not found] ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 1 sibling, 1 reply; 47+ messages in thread From: Bart Van Assche @ 2017-02-12 20:05 UTC (permalink / raw) To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > [ 861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core] > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain Hello Laurence, That warning has been removed by patch 7/8 of this series. Please double check whether all eight patches have been applied properly. Bart. ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> @ 2017-02-13 2:07 ` Laurence Oberman [not found] ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 2:07 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Sunday, February 12, 2017 3:05:16 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core] > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > Hello Laurence, > > That warning has been removed by patch 7/8 of this series. Please double > check > whether all eight patches have been applied properly. > > Bart. Hello Just a heads up, working with Bart on this patch series. We have stability issues with my tests in my MLX5 EDR-100 test bed.
Thanks Laurence ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 3:14 ` Laurence Oberman [not found] ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 3:14 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Sunday, February 12, 2017 9:07:16 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Sunday, February 12, 2017 3:05:16 PM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > drivers/infiniband/core/verbs.c:1959 
__ib_drain_sq+0x1bb/0x1c0 [ib_core] > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > Hello Laurence, > > > > That warning has been removed by patch 7/8 of this series. Please double > > check > > whether all eight patches have been applied properly. > > > > Bart. > > Hello > Just a heads up, working with Bart on this patch series. > We have stability issues with my tests in my MLX5 EDR-100 test bed. > Thanks > Laurence > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > I went back to Linus' latest tree for a baseline and we fail the same way. This has none of the latest 8 patches applied so we will have to figure out what broke this. Don't forget that I tested all this recently with Bart's dma patch series and it's solid. Will come back to this tomorrow and see what recently made it into Linus's tree by checking back with Doug.
[ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0 [ 183.853047] 00000000 00000000 00000000 00000000 [ 183.878425] 00000000 00000000 00000000 00000000 [ 183.903243] 00000000 00000000 00000000 00000000 [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 [ 198.538593] scsi host1: ib_srp: reconnect succeeded [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe [ 198.603037] 00000000 00000000 00000000 00000000 [ 198.628884] 00000000 00000000 00000000 00000000 [ 198.653961] 00000000 00000000 00000000 00000000 [ 198.680021] 00000000 0f007806 25000032 00105dd0 [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138 [ 213.532848] scsi host1: ib_srp: reconnect succeeded [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 227.579684] scsi host1: ib_srp: reconnect succeeded [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 242.633925] scsi host1: ib_srp: reconnect succeeded [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 257.127715] scsi host1: ib_srp: reconnect succeeded [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 272.225762] scsi host1: ib_srp: reconnect succeeded [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 286.350226] scsi host1: ib_srp: reconnect succeeded [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 301.109365] scsi host1: ib_srp: reconnect succeeded [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 315.910860] scsi host1: ib_srp: reconnect succeeded [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 330.551052] scsi host1: ib_srp: reconnect succeeded [ 330.584552] scsi host1: ib_srp: failed 
RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 344.998448] scsi host1: ib_srp: reconnect succeeded [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 359.866731] scsi host1: ib_srp: reconnect succeeded [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 .. .. [ 373.113045] scsi host1: ib_srp: reconnect succeeded [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. [ 388.589517] scsi host1: ib_srp: reconnect succeeded [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 403.086893] scsi host1: ib_srp: reconnect succeeded [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30 [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe [ 403.140402] 00000000 00000000 00000000 00000000 [ 403.140402] 00000000 00000000 00000000 00000000 [ 403.140403] 00000000 00000000 00000000 00000000 [ 403.140403] 00
[parent not found: <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 13:54 ` Laurence Oberman [not found] ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 13:54 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Sunday, February 12, 2017 10:14:53 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Sunday, February 12, 2017 9:07:16 PM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > > > > > ----- Original Message ----- > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > To: 
leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 > > > > [ib_core] > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > > > Hello Laurence, > > > > > > That warning has been removed by patch 7/8 of this series. Please double > > > check > > > whether all eight patches have been applied properly. > > > > > > Bart. > > > > Hello > > Just a heads up, working with Bart on this patch series. > > We have stability issues with my tests in my MLX5 EDR-100 test bed. > > Thanks > > Laurence > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > I went back to Linus' latest tree for a baseline and we fail the same way. > This has none of the latest 8 patches applied so we will > have to figure out what broke this. > > Dont forget that I tested all this recently with Bart's dma patch series > and its solid. > > Will come back to this tomorrow and see what recently made it into Linus's > tree by > checking back with Doug.
> > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff880bd4270eb0 > [ 183.853047] 00000000 00000000 00000000 00000000 > [ 183.878425] 00000000 00000000 00000000 00000000 > [ 183.903243] 00000000 00000000 00000000 00000000 > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > [ 198.603037] 00000000 00000000 00000000 00000000 > [ 198.628884] 00000000 00000000 00000000 00000000 > [ 198.653961] 00000000 00000000 00000000 00000000 > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory management > operation error (6) for CQE ffff880b92860138 > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 330.551052] 
scsi host1: ib_srp: reconnect succeeded > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > .. > .. > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > ffff8817f2234c30 > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > [ 403.140402] 00000000 00000000 00000000 00000000 > [ 403.140402] 00000000 00000000 00000000 00000000 > [ 403.140403] 00000000 00000000 00000000 00000000 > [ 403.140403] 00 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hello Let summarize where we are and how we got here. The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Barts dma patches. All tests passed. I pulled Linus's tree and applied all 8 patches of the above series and we failed in the "failed FAST REG status memory management" area. I applied only 7 of the 8 patches to Linus's tree because Bart and I thought patch 6 of the series may have been the catalyst. This also failed. Building from Barts tree which is based on 4.10.0-rc7 failed again. 
This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp. Thanks Laurence
[parent not found: <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 14:17 ` Leon Romanovsky [not found] ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Leon Romanovsky @ 2017-02-13 14:17 UTC (permalink / raw) To: Laurence Oberman Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA [-- Attachment #1: Type: text/plain, Size: 9162 bytes --] On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > ----- Original Message ----- > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzXQFizaE/u3fw@public.gmane.orgm, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Sunday, February 12, 2017 10:14:53 PM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > > > > > ----- Original Message ----- > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr@mellanox.com, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > > > > > > > ----- Original Message ----- 
> > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > > QP > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 > > > > > [ib_core] > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > > > > > Hello Laurence, > > > > > > > > That warning has been removed by patch 7/8 of this series. Please double > > > > check > > > > whether all eight patches have been applied properly. > > > > > > > > Bart. > > > > > > Hello > > > Just a heads up, working with Bart on this patch series. > > > We have stability issues with my tests in my MLX5 EDR-100 test bed. > > > Thanks > > > Laurence > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > I went back to Linus' latest tree for a baseline and we fail the same way. > > This has none of the latest 8 patches applied so we will > > have to figure out what broke this. > > > > Dont forget that I tested all this recently with Bart's dma patch series > > and its solid.
> > > > Will come back to this tomorrow and see what recently made it into Linus's > > tree by > > checking back with Doug. > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff880bd4270eb0 > > [ 183.853047] 00000000 00000000 00000000 00000000 > > [ 183.878425] 00000000 00000000 00000000 00000000 > > [ 183.903243] 00000000 00000000 00000000 00000000 > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > [ 198.603037] 00000000 00000000 00000000 00000000 > > [ 198.628884] 00000000 00000000 00000000 00000000 > > [ 198.653961] 00000000 00000000 00000000 00000000 > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory management > > operation error (6) for CQE ffff880b92860138 > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for 
CQE > > ffff8817f2234c30 > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > .. > > .. > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE > > ffff8817f2234c30 > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > [ 403.140402] 00000000 00000000 00000000 00000000 > > [ 403.140402] 00000000 00000000 00000000 00000000 > > [ 403.140403] 00000000 00000000 00000000 00000000 > > [ 403.140403] 00 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Hello > > Let summarize where we are and how we got here. > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Barts dma patches. > All tests passed. 
> > I pulled Linus's tree and applied all 8 patches of the above series and we failed in the > "failed FAST REG status memory management" area. > > I applied only 7 of the 8 patches to Linus's tree because Bart and I thought patch 6 of the series > may have been the catalyst. > > This also failed. > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp. From infiniband side: ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc 0 0 0 From eth nothing suspicious too: ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5 d15118af2683 net/mlx5e: Check ets capability before ets query FW command a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve ad05df399f33 net/mlx5e: Remove unused variable 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > Thanks > Laurence [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 47+ messages in thread
[parent not found: <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> @ 2017-02-13 14:24 ` Laurence Oberman [not found] ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 14:24 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 9:17:24 AM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > > > > ----- Original Message ----- > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Sunday, February 12, 2017 10:14:53 PM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Laurence 
Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying > > > > a > > > > QP > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > destroying a > > > > > QP > > > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 > > > > > > [ib_core] > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > > > > > > > Hello Laurence, > > > > > > > > > > That warning has been removed by patch 7/8 of this series. Please > > > > > double > > > > > check > > > > > whether all eight patches have been applied properly. 
> > > > > > > > > > Bart. > > > > > > > > Hello > > > > Just a heads up, working with Bart on this patch series. > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed. > > > > Thanks > > > > Laurence > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > > > > in > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > I went back to Linus' latest tree for a baseline and we fail the same > > > way. > > > This has none of the latest 8 patches applied so we will > > > have to figure out what broke this. > > > > > > Dont forget that I tested all this recently with Bart's dma patch series > > > and its solid. > > > > > > Will come back to this tomorrow and see what recently made it into > > > Linus's > > > tree by > > > checking back with Doug.
> > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff880bd4270eb0 > > > [ 183.853047] 00000000 00000000 00000000 00000000 > > > [ 183.878425] 00000000 00000000 00000000 00000000 > > > [ 183.903243] 00000000 00000000 00000000 00000000 > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > > [ 198.603037] 00000000 00000000 00000000 00000000 > > > [ 198.628884] 00000000 00000000 00000000 00000000 > > > [ 198.653961] 00000000 00000000 00000000 00000000 > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory > > > management > > > operation error (6) for CQE ffff880b92860138 > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > 
CQE > > > ffff8817f2234c30 > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > .. > > > .. > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for > > > CQE > > > ffff8817f2234c30 > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > [ 403.140403] 00000000 00000000 00000000 00000000 > > > [ 403.140403] 00 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > Hello > > > > Let summarize where we are and how we got here. > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with > > Barts dma patches. 
> > All tests passed. > > > > I pulled Linus's tree and applied all 8 patches of the above series and we > > failed in the > > "failed FAST REG status memory management" area. > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I > > thought patch 6 of the series > > may have been the catalyst. > > > > This also failed. > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp. > > From infiniband side: > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- > drivers/inifiniband |wc > 0 0 0 > > From eth nothing suspicious too: > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- > drivers/net/ethernet/mellanox/mlx5 > d15118af2683 net/mlx5e: Check ets capability before ets query FW command > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB > destroy > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space > fails > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering > name-space > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve > ad05df399f33 net/mlx5e: Remove unused variable > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > > > > > Thanks > > Laurence > Hi Leon, Yep, I also looked for outliers here that may look suspicious and did not see any. I guess I will have to start bisecting. I will start with rc5, if that fails will bisect between rc4 and rc5, as we know rc4 was fine. 
I did re-run the tests on rc4 last night and it was stable. Thanks Laurence
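The bisection plan above (rc4 known good, rc7 known bad) can be sketched roughly as follows. This is only an illustration and was not posted in the thread: the `first_bad` helper, the tag list, and the pretend test outcome are all hypothetical, and in practice each check would be a build, boot, and ib_srp test run of that kernel (or `git bisect` at commit granularity).

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest candidate for which is_bad() is true.

    Assumptions (the same ones git bisect makes): candidates are ordered
    oldest to newest, the first one is known good, the last one is known
    bad, and once a candidate is bad every later one is bad too.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return candidates[hi]


# Hypothetical outcome: pretend the regression first shows up in rc6.
tags = ["v4.10-rc4", "v4.10-rc5", "v4.10-rc6", "v4.10-rc7"]
pretend_bad = {"v4.10-rc6", "v4.10-rc7"}
print(first_bad(tags, lambda tag: tag in pretend_bad))  # prints v4.10-rc6
```

Narrowing to a release candidate first keeps the number of kernel builds small; the same halving then applies to the commits between the last good and first bad tag.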
[parent not found: <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 16:12 ` Laurence Oberman [not found] ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 16:12 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 9:24:01 AM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Monday, February 13, 2017 9:17:24 AM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > > > > > > > ----- Original Message ----- > > > 
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Sent: Sunday, February 12, 2017 10:14:53 PM > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying > > > > a > > > > QP > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > destroying > > > > > a > > > > > QP > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the 
send queue before > > > > > > destroying a > > > > > > QP > > > > > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 > > > > > > > [ib_core] > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > > > > > > > > > Hello Laurence, > > > > > > > > > > > > That warning has been removed by patch 7/8 of this series. Please > > > > > > double > > > > > > check > > > > > > whether all eight patches have been applied properly. > > > > > > > > > > > > Bart. > > > > > > > > > > Hello > > > > > Just a heads up, working with Bart on this patch series. > > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed. > > > > > Thanks > > > > > Laurence > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > > > > > in > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > I went back to Linus' latest tree for a baseline and we fail the same > > > > way. > > > > This has none of the latest 8 patches applied so we will > > > > have to figure out what broke this. > > > > > > > > Dont forget that I tested all this recently with Bart's dma patch > > > > series > > > > and its solid. > > > > > > > > Will come back to this tomorrow and see what recently made it into > > > > Linus's > > > > tree by > > > > checking back with Doug.
> > > > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff880bd4270eb0 > > > > [ 183.853047] 00000000 00000000 00000000 00000000 > > > > [ 183.878425] 00000000 00000000 00000000 00000000 > > > > [ 183.903243] 00000000 00000000 00000000 00000000 > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > > > [ 198.603037] 00000000 00000000 00000000 00000000 > > > > [ 198.628884] 00000000 00000000 00000000 00000000 > > > > [ 198.653961] 00000000 00000000 00000000 00000000 > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory > > > > management > > > > operation error (6) for CQE ffff880b92860138 > > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > 
> [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > .. > > > > .. > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. 
> > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > for > > > > CQE > > > > ffff8817f2234c30 > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > [ 403.140403] 00000000 00000000 00000000 00000000 > > > > [ 403.140403] 00 > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > > > > in > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > Hello > > > > > > Let summarize where we are and how we got here. > > > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with > > > Barts dma patches. > > > All tests passed. > > > > > > I pulled Linus's tree and applied all 8 patches of the above series and > > > we > > > failed in the > > > "failed FAST REG status memory management" area. > > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I > > > thought patch 6 of the series > > > may have been the catalyst. > > > > > > This also failed. > > > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. > > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and > > > ib_srp. 
> > > > From infiniband side: > > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- > > drivers/inifiniband |wc > > 0 0 0 > > > > From eth nothing suspicious too: > > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- > > drivers/net/ethernet/mellanox/mlx5 > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper > > devices > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after > > FDB > > destroy > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space > > fails > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering > > name-space > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve > > ad05df399f33 net/mlx5e: Remove unused variable > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > > > > > > > > > Thanks > > > Laurence > > > > Hi Leon, > Yep, I also looked for outliers here that may look suspicious and did not see > any. > > I guess I will have to start bisecting. > I will start with rc5, if that fails will bisect between rc4 and rc5, as we > know rc4 was fine. > > I did re-run tests on rc4 last night and I was stable. > > Thanks > Laurence > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >

OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting, unless one of you thinks you know what may be causing this in rc6.
This will take time, so I will come back to the list once I have it isolated.
Thanks
Laurence
[parent not found: <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 16:47 ` Laurence Oberman [not found] ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 16:47 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 11:12:55 AM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Monday, February 13, 2017 9:24:01 AM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > > > > > ----- Original Message ----- > > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > 
> To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Monday, February 13, 2017 9:17:24 AM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > destroying > > > > > a > > > > > QP > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > > > > Subject: Re: 
[PATCH 8/8] IB/srp: Drain the send queue before > > > > > > destroying > > > > > > a > > > > > > QP > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > destroying a > > > > > > > QP > > > > > > > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > > > > drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 > > > > > > > > [ib_core] > > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain > > > > > > > > > > > > > > Hello Laurence, > > > > > > > > > > > > > > That warning has been removed by patch 7/8 of this series. Please > > > > > > > double > > > > > > > check > > > > > > > whether all eight patches have been applied properly. > > > > > > > > > > > > > > Bart. > > > > > > > > > > > > Hello > > > > > > Just a heads up, working with Bart on this patch series. > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > > > Thanks > > > > > > Laurence > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > linux-rdma" > > > > > > in > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > I went back to Linus' latest tree for a baseline and we fail the same > > > > > way. > > > > > This has none of the latest 8 patches applied so we will > > > > > have to figure out what broke this. > > > > > > > > > > Dont forget that I tested all this recently with Bart's dma patch > > > > > series > > > > > and its solid. > > > > > > > > > > Will come back to this tomorrow and see what recently made it into > > > > > Linus's > > > > > tree by > > > > > checking back with Doug. > > > > > > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff880bd4270eb0 > > > > > [ 183.853047] 00000000 00000000 00000000 00000000 > > > > > [ 183.878425] 00000000 00000000 00000000 00000000 > > > > > [ 183.903243] 00000000 00000000 00000000 00000000 > > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > > > > [ 198.603037] 00000000 00000000 00000000 00000000 > > > > > [ 198.628884] 00000000 00000000 00000000 00000000 > > > > > [ 198.653961] 00000000 00000000 00000000 00000000 > > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory > > > > > management > > > > > operation error (6) for CQE ffff880b92860138 > > > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 227.579684] scsi host1: ib_srp: 
reconnect succeeded > > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > .. > > > > > .. 
> > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. > > > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) > > > > > for > > > > > CQE > > > > > ffff8817f2234c30 > > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > [ 403.140403] 00000000 00000000 00000000 00000000 > > > > > [ 403.140403] 00 > > > > > > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > > > > > in > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > Hello > > > > > > > > Let summarize where we are and how we got here. > > > > > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 > > > > with > > > > Barts dma patches. > > > > All tests passed. > > > > > > > > I pulled Linus's tree and applied all 8 patches of the above series and > > > > we > > > > failed in the > > > > "failed FAST REG status memory management" area. > > > > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I > > > > thought patch 6 of the series > > > > may have been the catalyst. > > > > > > > > This also failed. > > > > > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > > > > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. 
> > > > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and > > > > ib_srp. > > > > > > From infiniband side: > > > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- > > > drivers/inifiniband |wc > > > 0 0 0 > > > > > > From eth nothing suspicious too: > > > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- > > > drivers/net/ethernet/mellanox/mlx5 > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper > > > devices > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after > > > FDB > > > destroy > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space > > > fails > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering > > > name-space > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve > > > ad05df399f33 net/mlx5e: Remove unused variable > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num > > > channels > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > > > > > > > > > > > > > Thanks > > > > Laurence > > > > > > > Hi Leon, > > Yep, I also looked for outliers here that may look suspicious and did not > > see > > any. > > > > I guess I will have to start bisecting. > > I will start with rc5, if that fails will bisect between rc4 and rc5, as we > > know rc4 was fine. > > > > I did re-run tests on rc4 last night and I was stable. 
> > > > Thanks > > Laurence > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > OK, so 4.10.0-rc5 is fine, 4.10.0-rc6 fails, so will start bisecting. > Unless one of you think you know what may be causing this in rc6. > This will take time so will come back to the list once I have it isolated. > > Thanks > Laurence > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >

The bisect needs roughly 8 kernel builds to cover 200+ changes; I have started the first one.

Thanks
Laurence
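As a rough illustration of the estimate in the message above: bisecting about 200 changes between a known-good kernel (rc5) and a known-bad one (rc6) halves the candidate range with every test build, so about log2(200), rounded up to 8, builds are enough. The sketch below simulates that on a linear list of commits. It is only an approximation, since git bisect works on the real commit graph rather than a flat list, and the function name `find_first_bad`, the `is_bad` callback, and the count of exactly 200 commits are assumptions made for this example, not something taken from the thread.

```python
import math


def find_first_bad(num_commits, is_bad):
    """Binary-search commits 1..num_commits for the first bad one.

    Commit 0 is assumed to be known-good (think rc5) and commit
    num_commits known-bad (think rc6). is_bad(i) stands in for
    "build kernel i, run the test, report whether it failed".
    Returns (index_of_first_bad_commit, number_of_builds_tested).
    """
    good, bad = 0, num_commits
    builds = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        builds += 1
        if is_bad(mid):
            bad = mid    # regression is at mid or earlier
        else:
            good = mid   # regression was introduced after mid
    return bad, builds


# Worst case over every possible culprit among 200 hypothetical commits.
worst = max(
    find_first_bad(200, lambda c, k=k: c >= k)[1] for k in range(1, 201)
)
print(worst)                        # 8
print(math.ceil(math.log2(200)))    # 8
```

In this simplified model every possible culprit is found after 7 or 8 test builds, which matches the "8 possible kernel builds" figure quoted for 200+ changes.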
[parent not found: <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 21:34 ` Laurence Oberman [not found] ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 21:34 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 11:47:31 AM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Monday, February 13, 2017 11:12:55 AM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > > > > > ----- Original Message ----- > > > From: "Laurence Oberman" 
<loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Monday, February 13, 2017 9:24:01 AM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Sent: Monday, February 13, 2017 9:17:24 AM > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying > > > > a > > > > QP > > > > > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM > > 
> > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > destroying > > > > > > a > > > > > > QP > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > destroying > > > > > > > a > > > > > > > QP > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > > destroying a > > > > > > > > QP > > > > > > > > > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > > > > > drivers/infiniband/core/verbs.c:1959 > > > > > > > > > __ib_drain_sq+0x1bb/0x1c0 > > > > > > > > > [ib_core] > > > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not 
supported for > > > > > > > > > drain > > > > > > > > > > > > > > > > Hello Laurence, > > > > > > > > > > > > > > > > That warning has been removed by patch 7/8 of this series. > > > > > > > > Please > > > > > > > > double > > > > > > > > check > > > > > > > > whether all eight patches have been applied properly. > > > > > > > > > > > > > > > > Bart. > > > > > > > > > > > > > > Hello > > > > > > > Just a heads up, working with Bart on this patch series. > > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test > > > > > > > bed. > > > > > > > Thanks > > > > > > > Laurence > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > > linux-rdma" > > > > > > > in > > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > > > More majordomo info at > > > > > > > http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > > > I went back to Linus' latest tree for a baseline and we fail the > > > > > > same > > > > > > way. > > > > > > This has none of the latest 8 patches applied so we will > > > > > > have to figure out what broke this. > > > > > > > > > > > > Dont forget that I tested all this recently with Bart's dma patch > > > > > > series > > > > > > and its solid. > > > > > > > > > > > > Will come back to this tomorrow and see what recently made it into > > > > > > Linus's > > > > > > tree by > > > > > > checking back with Doug.
> > > > > > > > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff880bd4270eb0 > > > > > > [ 183.853047] 00000000 00000000 00000000 00000000 > > > > > > [ 183.878425] 00000000 00000000 00000000 00000000 > > > > > > [ 183.903243] 00000000 00000000 00000000 00000000 > > > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > > > > > [ 198.603037] 00000000 00000000 00000000 00000000 > > > > > > [ 198.628884] 00000000 00000000 00000000 00000000 > > > > > > [ 198.653961] 00000000 00000000 00000000 00000000 > > > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory > > > > > > management > > > > > > operation error (6) for CQE ffff880b92860138 > > > > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > 
> for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > .. > > > > > > .. > > > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. 
> > > > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > (5) > > > > > > for > > > > > > CQE > > > > > > ffff8817f2234c30 > > > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > > [ 403.140403] 00000000 00000000 00000000 00000000 > > > > > > [ 403.140403] 00 > > > > > > > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > linux-rdma" > > > > > > in > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > Hello > > > > > > > > > > Let summarize where we are and how we got here. > > > > > > > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 > > > > > with > > > > > Barts dma patches. > > > > > All tests passed. > > > > > > > > > > I pulled Linus's tree and applied all 8 patches of the above series > > > > > and > > > > > we > > > > > failed in the > > > > > "failed FAST REG status memory management" area. > > > > > > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I > > > > > thought patch 6 of the series > > > > > may have been the catalyst. > > > > > > > > > > This also failed. > > > > > > > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > > > > > > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we fail. > > > > > > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and > > > > > ib_srp. 
> > > > > > > > From infiniband side: > > > > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- > > > > drivers/inifiniband |wc > > > > 0 0 0 > > > > > > > > From eth nothing suspicious too: > > > > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- > > > > drivers/net/ethernet/mellanox/mlx5 > > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW > > > > command > > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool > > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed > > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper > > > > devices > > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only > > > > after > > > > FDB > > > > destroy > > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering > > > > name-space > > > > fails > > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering > > > > name-space > > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP > > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve > > > > ad05df399f33 net/mlx5e: Remove unused variable > > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num > > > > channels > > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > > > > > > > > > > > > > > > > > Thanks > > > > > Laurence > > > > > > > > > > Hi Leon, > > > Yep, I also looked for outliers here that may look suspicious and did not > > > see > > > any. > > > > > > I guess I will have to start bisecting. > > > I will start with rc5, if that fails will bisect between rc4 and rc5, as > > > we > > > know rc4 was fine. > > > > > > I did re-run tests on rc4 last night and I was stable. 
> > >
> > > Thanks
> > > Laurence
> >
> > OK, so 4.10.0-rc5 is fine, 4.10.0-rc6 fails, so will start bisecting.
> > Unless one of you think you know what may be causing this in rc6.
> > This will take time so will come back to the list once I have it isolated.
> >
> > Thanks
> > Laurence
>
> Bisect has 8 possible kernel builds, 200 + changes, started the first one.
>
> Thanks
> Laurence

Hello

Bisecting got me to this commit, I had reviewed this looking for an explanation at some point.
At the time, I did not understand the need for the change but after explanation I accepted it.
I reverted this and we are good again but reading the code, not seeing how this is affecting us.

Makes no sense how this can be the issue.

Nevertheless we will need to revert this please.

I will now apply the 8 patches from Bart to Linus's tree with this reverted and test again.

Bisect run

git bisect start
git bisect bad 566cf877a1fcb6d6dc0126b076aad062054c2637
git bisect good 7a308bb3016f57e5be11a677d15b821536419d36
git bisect good
git bisect good
git bisect bad
git bisect bad
git bisect bad
git bisect bad
git bisect good

Bisecting: 0 revisions left to test after this (roughly 1 step)
[0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid indirect_sg_entries parameter value

[loberman@ibclient linux-torvalds]$ git show 0a475ef4226e305bdcffe12b401ca1eab06c4913
commit 0a475ef4226e305bdcffe12b401ca1eab06c4913
Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Wed Jan 4 15:59:37 2017 +0200

    IB/srp: fix invalid indirect_sg_entries parameter value

    After setting indirect_sg_entries module_param to huge value (e.g 500,000),
    srp_alloc_req_data() fails to allocate indirect descriptors for the request
    ring (kmalloc fails). This commit enforces the maximum value of
    indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param
    description.

    Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
    Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't fit in SRP_CMD)
    Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 4.7+
    Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>--
    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0f67cf9..79bf484 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
 		indirect_sg_entries = cmd_sg_entries;
 	}
 
+	if (indirect_sg_entries > SG_MAX_SEGMENTS) {
+		pr_warn("Clamping indirect_sg_entries to %u\n",
+			SG_MAX_SEGMENTS);
+		indirect_sg_entries = SG_MAX_SEGMENTS;
+	}
+
 	srp_remove_wq = create_workqueue("srp_remove");
 	if (!srp_remove_wq) {
 		ret = -ENOMEM;
[parent not found: <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP [not found] ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-13 21:46 ` Laurence Oberman [not found] ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-13 21:46 UTC (permalink / raw) To: Leon Romanovsky Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 4:34:04 PM > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP > > > > ----- Original Message ----- > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Monday, February 13, 2017 11:47:31 AM > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > QP > > > > > > > > ----- Original Message ----- > > > From: "Laurence Oberman" 
<loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > Sent: Monday, February 13, 2017 11:12:55 AM > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a > > > QP > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > Sent: Monday, February 13, 2017 9:24:01 AM > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying > > > > a > > > > QP > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > > > > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > > > > > maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > Sent: Monday, February 13, 2017 9:17:24 AM > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > 
destroying > > > > > a > > > > > QP > > > > > > > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote: > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > Sent: Sunday, February 12, 2017 10:14:53 PM > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > destroying > > > > > > > a > > > > > > > QP > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > > > To: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > > Cc: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > > > > > > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > > Sent: Sunday, February 12, 2017 9:07:16 PM > > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > > destroying > > > > > > > > a > > > > > > > > QP > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > > > > > > > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, 
loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > > > > > > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > > > > > > > > Sent: Sunday, February 12, 2017 3:05:16 PM > > > > > > > > > Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before > > > > > > > > > destroying a > > > > > > > > > QP > > > > > > > > > > > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote: > > > > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at > > > > > > > > > > drivers/infiniband/core/verbs.c:1959 > > > > > > > > > > __ib_drain_sq+0x1bb/0x1c0 > > > > > > > > > > [ib_core] > > > > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for > > > > > > > > > > drain > > > > > > > > > > > > > > > > > > Hello Laurence, > > > > > > > > > > > > > > > > > > That warning has been removed by patch 7/8 of this series. > > > > > > > > > Please > > > > > > > > > double > > > > > > > > > check > > > > > > > > > whether all eight patches have been applied properly. > > > > > > > > > > > > > > > > > > Bart.N�����r��y���b�X��ǧv�^�){.n�+����{��ٚ�{ay�ʇڙ�,j��f���h���z��w������j:+v���w�j�m��������zZ+��ݢj"�� > > > > > > > > > > > > > > > > Hello > > > > > > > > Just a heads up, working with Bart on this patch series. > > > > > > > > We have stability issues with my tests in my MLX5 EDR-100 test > > > > > > > > bed. 
> > > > > > > > Thanks > > > > > > > > Laurence > > > > > > > > -- > > > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > > > linux-rdma" > > > > > > > > in > > > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > > > > More majordomo info at > > > > > > > > http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > > > > > > I went back to Linus' latest tree for a baseline and we fail the > > > > > > > same > > > > > > > way. > > > > > > > This has none of the latest 8 patches applied so we will > > > > > > > have to figure out what broke this. > > > > > > > > > > > > > > Dont forget that I tested all this recently with Bart's dma patch > > > > > > > series > > > > > > > and its solid. > > > > > > > > > > > > > > Will come back to this tomorrow and see what recently made it > > > > > > > into > > > > > > > Linus's > > > > > > > tree by > > > > > > > checking back with Doug. > > > > > > > > > > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff880bd4270eb0 > > > > > > > [ 183.853047] 00000000 00000000 00000000 00000000 > > > > > > > [ 183.878425] 00000000 00000000 00000000 00000000 > > > > > > > [ 183.903243] 00000000 00000000 00000000 00000000 > > > > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1 > > > > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe > > > > > > > [ 198.603037] 00000000 00000000 00000000 00000000 > > > > > > > [ 198.628884] 00000000 00000000 00000000 00000000 > > > > > > > [ 198.653961] 00000000 00000000 00000000 00000000 > > > > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0 > > > > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory > > > > > > > management > > > > > > > operation error (6) for CQE ffff880b92860138 > > > > > > > [ 
213.532848] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > 
> > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > .. > > > > > > > .. > > > > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1. > > > > > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded > > > > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed > > > > > > > (5) > > > > > > > for > > > > > > > CQE > > > > > > > ffff8817f2234c30 > > > > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe > > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000 > > > > > > > [ 403.140403] 00000000 00000000 00000000 00000000 > > > > > > > [ 403.140403] 00 > > > > > > > > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > > linux-rdma" > > > > > > > in > > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > > > > More majordomo info at > > > > > > > http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > Hello > > > > > 
> > > > > > > Let summarize where we are and how we got here. > > > > > > > > > > > > The last kernel I tested with mlx5 and ib_srp was > > > > > > vmlinuz-4.10.0-rc4 > > > > > > with > > > > > > Barts dma patches. > > > > > > All tests passed. > > > > > > > > > > > > I pulled Linus's tree and applied all 8 patches of the above series > > > > > > and > > > > > > we > > > > > > failed in the > > > > > > "failed FAST REG status memory management" area. > > > > > > > > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and > > > > > > I > > > > > > thought patch 6 of the series > > > > > > may have been the catalyst. > > > > > > > > > > > > This also failed. > > > > > > > > > > > > Building from Barts tree which is based on 4.10.0-rc7 failed again. > > > > > > > > > > > > This made me decide to baseline Linus's tree 4.10.0-rc7 and we > > > > > > fail. > > > > > > > > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and > > > > > > ib_srp. > > > > > > > > > > From infiniband side: > > > > > ➜ linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- > > > > > drivers/inifiniband |wc > > > > > 0 0 0 > > > > > > > > > > From eth nothing suspicious too: > > > > > ➜ linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- > > > > > drivers/net/ethernet/mellanox/mlx5 > > > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW > > > > > command > > > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool > > > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed > > > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper > > > > > devices > > > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only > > > > > after > > > > > FDB > > > > > destroy > > > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering > > > > > name-space > > > > > fails > > > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering > > > > > 
name-space > > > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP > > > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve > > > > > ad05df399f33 net/mlx5e: Remove unused variable > > > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num > > > > > channels > > > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > Laurence > > > > > > > > > > > > > Hi Leon, > > > > Yep, I also looked for outliers here that may look suspicious and did > > > > not > > > > see > > > > any. > > > > > > > > I guess I will have to start bisecting. > > > > I will start with rc5, if that fails will bisect between rc4 and rc5, > > > > as > > > > we > > > > know rc4 was fine. > > > > > > > > I did re-run tests on rc4 last night and I was stable. > > > > > > > > Thanks > > > > Laurence > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > > > > in > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > OK, so 4.10.0-rc5 is fine, 4.10.0-rc6 fails, so will start bisecting. > > > Unless one of you think you know what may be causing this in rc6. > > > This will take time so will come back to the list once I have it > > > isolated. > > > > > > Thanks > > > Laurence > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > Bisect has 8 possible kernel builds, 200 + changes, started the first one. 
> > > > Thanks > > Laurence > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > Hello > > Bisecting got me to this commit, I had reviewed this looking for an > explanation at some point. > At the time, I did not understand the need for the change but after > explanation I accepted it. > I reverted this and we are good again but reading the code, not seeing how > this is affecting us. > > Makes no sense how this can be the issue. > > Nevertheless we will need to revert this please. > > I will now apply the 8 patches from Bart to Linus's tree with this reverted > and test again. > > Bisect run > > git bisect start > git bisect bad 566cf877a1fcb6d6dc0126b076aad062054c2637 > git bisect good 7a308bb3016f57e5be11a677d15b821536419d36 > git bisect good > git bisect good > git bisect bad > git bisect bad > git bisect bad > git bisect bad > git bisect good > > Bisecting: 0 revisions left to test after this (roughly 1 step) > [0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid > indirect_sg_entries parameter value > [loberman@ibclient linux-torvalds]$ git show > 0a475ef4226e305bdcffe12b401ca1eab06c4913 > commit 0a475ef4226e305bdcffe12b401ca1eab06c4913 > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Date: Wed Jan 4 15:59:37 2017 +0200 > > IB/srp: fix invalid indirect_sg_entries parameter value > > After setting indirect_sg_entries module_param to huge value (e.g > 500,000), > srp_alloc_req_data() fails to allocate indirect descriptors for the > request > ring (kmalloc fails). This commit enforces the maximum value of > indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param > description. 
> > Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS) > Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't > fit in SRP_CMD) > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 4.7+ > Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>-- > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > b/drivers/infiniband/ulp/srp/ib_srp.c > index 0f67cf9..79bf484 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -3699,6 +3699,12 @@ static int __init srp_init_module(void) > indirect_sg_entries = cmd_sg_entries; > } > > + if (indirect_sg_entries > SG_MAX_SEGMENTS) { > + pr_warn("Clamping indirect_sg_entries to %u\n", > + SG_MAX_SEGMENTS); > + indirect_sg_entries = SG_MAX_SEGMENTS; > + } > + > srp_remove_wq = create_workqueue("srp_remove"); > if (!srp_remove_wq) { > ret = -ENOMEM; > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hello The revert actually does not help. it failed after a while. This mail was in drafts while I was testing and it got sent and should not have been. The revert does not help which I am happy about because it made no sense. So not sure how the bisect got me here but it did. I will have to run through this again and see where the bisect went wrong. 
Thanks
Laurence
[parent not found: <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-13 21:52 ` Bart Van Assche
[not found] ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
0 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-13 21:52 UTC (permalink / raw)
To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> I will have to run through this again and see where the bisect went wrong.

Hello Laurence,

If you would be considering to repeat the bisect, did you know that a bisect
can be sped up by specifying the names of the files and/or directories that
are suspected? An example:

git bisect start */infiniband */net

Bart.
[parent not found: <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>]
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13 21:56 ` Laurence Oberman
2017-02-14 2:19 ` Laurence Oberman
1 sibling, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-13 21:56 UTC (permalink / raw)
To: Bart Van Assche
Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA

----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Monday, February 13, 2017 4:52:28 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
>
> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > I will have to run through this again and see where the bisect went wrong.
>
> Hello Laurence,
>
> If you would be considering to repeat the bisect, did you know that a bisect
> can be sped up by specifying the names of the files and/or directories that
> are suspected? An example:
>
> git bisect start */infiniband */net
>
> Bart.
>

Hello Bart

I will try that. I knew it was possible but had not used it before so wanted to be careful.
Even being careful something went wrong :)
I was very careful and I waited in between tests to give it long enough.
Perhaps I said good when bad or something like that.

I will use your method and by tomorrow I should have this figured out for you.

Thanks
Laurence
* Re: v4.10-rc SRP + mlx5 regression [not found] ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> 2017-02-13 21:56 ` Laurence Oberman @ 2017-02-14 2:19 ` Laurence Oberman [not found] ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 2:19 UTC (permalink / raw) To: Bart Van Assche Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Monday, February 13, 2017 4:52:28 PM > Subject: Re: v4.10-rc SRP + mlx5 regression > > On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote: > > I will have to run through this again and see where the bisect went wrong. > > Hello Laurence, > > If you would be considering to repeat the bisect, did you know that a bisect > can be sped up by specifying the names of the files and/or directories that > are suspected? An example: > > git bisect start */infiniband */net > > Bart.-- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hello Bart, Much better news this time :), worked late on this but got it figured out. OK, so we got to this one, which makes a lot more sense and is right in the area where we are having issues. 
I must have answered wrong to one of the steps the first time I did the bisect. Reverted this in the master tree of rc8 and rebuilt the kernel Now all tests pass on Linus's tree - 4.10.0_rc8+ The interesting point here is that this commit is in rc5 but rc5 was not failing so we have an interoperability issue with this commit [loberman@ibclient linux]$ git bisect good Bisecting: 0 revisions left to test after this (roughly 1 step) [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361 commit ad8e66b4a80182174f73487ed25fd2140cf43361 Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Date: Wed Dec 28 12:48:28 2016 +0200 IB/srp: fix mr allocation when the device supports sg gaps If the device support arbitrary sg list mapping (device cap IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with IB_MR_TYPE_SG_GAPS. Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures") Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+ Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 8ddc071..0f67cf9 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, struct srp_fr_desc *d; struct ib_mr *mr; int i, ret = -EINVAL; + 
enum ib_mr_type mr_type; if (pool_size <= 0) goto err; @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, spin_lock_init(&pool->lock); INIT_LIST_HEAD(&pool->free_list); + if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) + mr_type = IB_MR_TYPE_SG_GAPS; + else + mr_type = IB_MR_TYPE_MEM_REG; + for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { - mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, - max_page_list_len); + mr = ib_alloc_mr(pd, mr_type, max_page_list_len); if (IS_ERR(mr)) { ret = PTR_ERR(mr); if (ret == -ENOMEM) (END) So here is the revert patch, but you need to decide how you want to deal with this. Revert "IB/srp: fix mr allocation when the device supports sg gaps" Laurence Oberman Traced after bisection to a cause for this failure Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> commit 90d169d312a173d5350c1bb36d6daab04c592127 Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Date: Mon Feb 13 20:33:32 2017 -0500 Revert "IB/srp: fix mr allocation when the device supports sg gaps" Laurence Oberman Traced after bisection to a cause for this failure [ 130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe [ 130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0 [ 130.510899] 00000000 00000000 00000000 00000000 [ 130.536455] 00000000 00000000 00000000 00000000 [ 130.561878] 00000000 00000000 00000000 00000000 [ 130.585904] 00000000 0f007806 2500002a db0ec4d0 [ 145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1. 
    [ 146.530439] scsi host1: ib_srp: reconnect succeeded
    [ 146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
    [ 146.597635] 00000000 00000000 00000000 00000000
    [ 146.623545] 00000000 00000000 00000000 00000000
    [ 146.649599] 00000000 00000000 00000000 00000000
    [ 146.673938] 00000000 0f007806 25000032 000c46d0
    [ 146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
    [ 162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
    [ 162.256337] scsi host1: ib_srp: reconnect succeeded
    [ 162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0

    This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf484..01338c8 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	struct srp_fr_desc *d;
 	struct ib_mr *mr;
 	int i, ret = -EINVAL;
-	enum ib_mr_type mr_type;
 
 	if (pool_size <= 0)
 		goto err;
@@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	spin_lock_init(&pool->lock);
 	INIT_LIST_HEAD(&pool->free_list);
 
-	if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
-		mr_type = IB_MR_TYPE_SG_GAPS;
-	else
-		mr_type = IB_MR_TYPE_MEM_REG;
-
 	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-		mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
+		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
+				 max_page_list_len);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			if (ret == -ENOMEM)

Now moving on to what got me here in the first place.
Bart, let me know if the 7 of the 8 patches in your most recent series are all still valid after this revert
Otherwise let me know which ones you want me to apply.

patch 6 - I am thinking is no longer valid.
"
If a HCA supports the SG_GAPS_REG feature then a single memory
region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
reduces the number of memory regions that is allocated per SRP
session.
"

Thanks
Laurence
[parent not found: <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
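For readers following the bisect result: the decision that commit ad8e66b4a801 adds in front of the ib_alloc_mr() loop in srp_create_fr_pool() can be modelled in user space roughly as below. This is only a sketch under stated assumptions. The enum, the struct device_attrs and the flag value are stand-ins defined here for illustration rather than the kernel's own definitions, and pick_mr_type() is a hypothetical helper name that does not appear in ib_srp.c.

```c
#include <stdint.h>

/* Stand-ins for the kernel types; names mirror the patch, values are assumed. */
enum ib_mr_type {
	IB_MR_TYPE_MEM_REG,
	IB_MR_TYPE_SG_GAPS,
};

/* Placeholder bit for illustration only, not taken from the kernel headers. */
#define IB_DEVICE_SG_GAPS_REG (1ULL << 32)

/* Hypothetical, reduced model of the device attributes the patch reads. */
struct device_attrs {
	uint64_t device_cap_flags;
};

/*
 * Hypothetical helper: the same if/else the patch introduces. Devices that
 * advertise the SG_GAPS capability get SG_GAPS memory regions, all others
 * keep the plain MEM_REG type.
 */
enum ib_mr_type pick_mr_type(const struct device_attrs *attrs)
{
	if (attrs->device_cap_flags & IB_DEVICE_SG_GAPS_REG)
		return IB_MR_TYPE_SG_GAPS;
	return IB_MR_TYPE_MEM_REG;
}
```

In terms of this model, the revert posted above amounts to always taking the IB_MR_TYPE_MEM_REG branch, whatever the capability flag says.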
* Re: v4.10-rc SRP + mlx5 regression [not found] ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2017-02-14 6:39 ` Leon Romanovsky [not found] ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Leon Romanovsky @ 2017-02-14 6:39 UTC (permalink / raw) To: Laurence Oberman Cc: Bart Van Assche, hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA [-- Attachment #1: Type: text/plain, Size: 8421 bytes --] On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote: > > > ----- Original Message ----- > > From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > > Sent: Monday, February 13, 2017 4:52:28 PM > > Subject: Re: v4.10-rc SRP + mlx5 regression > > > > On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote: > > > I will have to run through this again and see where the bisect went wrong. > > > > Hello Laurence, > > > > If you would be considering to repeat the bisect, did you know that a bisect > > can be sped up by specifying the names of the files and/or directories that > > are suspected? An example: > > > > git bisect start */infiniband */net > > > > Bart.-- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > Hello Bart, > > Much better news this time :), worked late on this but got it figured out. 
> > OK, so we got to this one, which makes a lot more sense and is right in the area where we are having issues. > I must have answered wrong to one of the steps the first time I did the bisect. > > Reverted this in the master tree of rc8 and rebuilt the kernel > Now all tests pass on Linus's tree - 4.10.0_rc8+ > > The interesting point here is that this commit is in rc5 but rc5 was not failing so we have an interoperability issue with this commit > > > [loberman@ibclient linux]$ git bisect good > Bisecting: 0 revisions left to test after this (roughly 1 step) > [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps > > [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361 > commit ad8e66b4a80182174f73487ed25fd2140cf43361 > Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > Date: Wed Dec 28 12:48:28 2016 +0200 > > IB/srp: fix mr allocation when the device supports sg gaps > > If the device support arbitrary sg list mapping (device cap > IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with > IB_MR_TYPE_SG_GAPS. 
>
>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 8ddc071..0f67cf9 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>  	struct srp_fr_desc *d;
>  	struct ib_mr *mr;
>  	int i, ret = -EINVAL;
> +	enum ib_mr_type mr_type;
>
>  	if (pool_size <= 0)
>  		goto err;
> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
>  	spin_lock_init(&pool->lock);
>  	INIT_LIST_HEAD(&pool->free_list);
>
> +	if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> +		mr_type = IB_MR_TYPE_SG_GAPS;
> +	else
> +		mr_type = IB_MR_TYPE_MEM_REG;
> +
>  	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> -		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> -				 max_page_list_len);
> +		mr = ib_alloc_mr(pd, mr_type, max_page_list_len);

First, ib_alloc_mr receives u32 as a third parameter, but int was
supplied. Second (I can be wrong here), shouldn't max_page_list_len be
replaced with max_fast_reg_page_list_len?

Thanks

>  		if (IS_ERR(mr)) {
>  			ret = PTR_ERR(mr);
>  			if (ret == -ENOMEM)
> (END)
>
>
> So here is the revert patch, but you need to decide how you want to deal with this.
> > Revert "IB/srp: fix mr allocation when the device supports sg gaps" > Laurence Oberman > Traced after bisection to a cause for this failure > > Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > commit 90d169d312a173d5350c1bb36d6daab04c592127 > Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Date: Mon Feb 13 20:33:32 2017 -0500 > > Revert "IB/srp: fix mr allocation when the device supports sg gaps" > Laurence Oberman > Traced after bisection to a cause for this failure > > [ 130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe > [ 130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0 > [ 130.510899] 00000000 00000000 00000000 00000000 > [ 130.536455] 00000000 00000000 00000000 00000000 > [ 130.561878] 00000000 00000000 00000000 00000000 > [ 130.585904] 00000000 0f007806 2500002a db0ec4d0 > [ 145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1. > [ 146.530439] scsi host1: ib_srp: reconnect succeeded > [ 146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe > [ 146.597635] 00000000 00000000 00000000 00000000 > [ 146.623545] 00000000 00000000 00000000 00000000 > [ 146.649599] 00000000 00000000 00000000 00000000 > [ 146.673938] 00000000 0f007806 25000032 000c46d0 > [ 146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88 > [ 162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1. > [ 162.256337] scsi host1: ib_srp: reconnect succeeded > [ 162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0` > > This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361. 
> > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index 79bf484..01338c8 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, > struct srp_fr_desc *d; > struct ib_mr *mr; > int i, ret = -EINVAL; > - enum ib_mr_type mr_type; > > if (pool_size <= 0) > goto err; > @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, > spin_lock_init(&pool->lock); > INIT_LIST_HEAD(&pool->free_list); > > - if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) > - mr_type = IB_MR_TYPE_SG_GAPS; > - else > - mr_type = IB_MR_TYPE_MEM_REG; > - > for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { > - mr = ib_alloc_mr(pd, mr_type, max_page_list_len); > + mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, > + max_page_list_len); > if (IS_ERR(mr)) { > ret = PTR_ERR(mr); > if (ret == -ENOMEM) > > > > Now moving on to what got me here in the first place. > Bart, let me know if the 7 of the 8 patches in your most recent series are all still valid after this revert > Otherwise let me know which ones you want me to apply. > > patch 6 - I am thinking i sno longer valid. > " > If a HCA supports the SG_GAPS_REG feature then a single memory > region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch > reduces the number of memory regions that is allocated per SRP > session. > " > > Thanks > Laurence [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 47+ messages in thread
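Leon's first remark in the message above concerns an implicit signed-to-unsigned conversion: an int is handed to a parameter declared as u32. A minimal user-space sketch of why that can matter follows. Here takes_u32() is a hypothetical stand-in for any function with a u32 parameter, and checked_len() is an illustrative guard. Neither is something the SRP driver is known to contain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callee with an unsigned 32-bit length parameter. */
uint32_t takes_u32(uint32_t max_num_sg)
{
	return max_num_sg;
}

/*
 * Illustrative guard: reject negative lengths before they are converted.
 * Returns true and stores the converted value only for len >= 0.
 */
bool checked_len(int len, uint32_t *out)
{
	if (len < 0)
		return false;
	*out = (uint32_t)len;
	return true;
}
```

A negative int converted to a 32-bit unsigned type wraps to a large positive value (-1 becomes 4294967295), which is the kind of surprise such a type mismatch can hide. The quoted hunk shows a pool_size <= 0 check in srp_create_fr_pool() but no comparable check on max_page_list_len within the lines shown.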
[parent not found: <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>]
* Re: v4.10-rc SRP + mlx5 regression [not found] ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org> @ 2017-02-14 10:00 ` Max Gurtovoy [not found] ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 47+ messages in thread From: Max Gurtovoy @ 2017-02-14 10:00 UTC (permalink / raw) To: Leon Romanovsky, Laurence Oberman Cc: Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA Hi Laurence, can you specify the test that repro these failures ? have you tried running with CX5 HCA or only CX4 ? I think this commit is right and we have issues in other places. On 2/14/2017 8:39 AM, Leon Romanovsky wrote: > On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote: >> >> >> ----- Original Message ----- >>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> >>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org >>> Sent: Monday, February 13, 2017 4:52:28 PM >>> Subject: Re: v4.10-rc SRP + mlx5 regression >>> >>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote: >>>> I will have to run through this again and see where the bisect went wrong. >>> >>> Hello Laurence, >>> >>> If you would be considering to repeat the bisect, did you know that a bisect >>> can be sped up by specifying the names of the files and/or directories that >>> are suspected? 
An example: >>> >>> git bisect start */infiniband */net >>> >>> Bart.-- >>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in >>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> >> Hello Bart, >> >> Much better news this time :), worked late on this but got it figured out. >> >> OK, so we got to this one, which makes a lot more sense and is right in the area where we are having issues. >> I must have answered wrong to one of the steps the first time I did the bisect. >> >> Reverted this in the master tree of rc8 and rebuilt the kernel >> Now all tests pass on Linus's tree - 4.10.0_rc8+ >> >> The interesting point here is that this commit is in rc5 but rc5 was not failing so we have an interoperability issue with this commit >> >> >> [loberman@ibclient linux]$ git bisect good >> Bisecting: 0 revisions left to test after this (roughly 1 step) >> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps >> >> [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361 >> commit ad8e66b4a80182174f73487ed25fd2140cf43361 >> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> Date: Wed Dec 28 12:48:28 2016 +0200 >> >> IB/srp: fix mr allocation when the device supports sg gaps >> >> If the device support arbitrary sg list mapping (device cap >> IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with >> IB_MR_TYPE_SG_GAPS. 
>> >> Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures") >> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+ >> Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> >> Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> >> Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> >> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c >> index 8ddc071..0f67cf9 100644 >> --- a/drivers/infiniband/ulp/srp/ib_srp.c >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c >> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, >> struct srp_fr_desc *d; >> struct ib_mr *mr; >> int i, ret = -EINVAL; >> + enum ib_mr_type mr_type; >> >> if (pool_size <= 0) >> goto err; >> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, >> spin_lock_init(&pool->lock); >> INIT_LIST_HEAD(&pool->free_list); >> >> + if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) >> + mr_type = IB_MR_TYPE_SG_GAPS; >> + else >> + mr_type = IB_MR_TYPE_MEM_REG; >> + >> for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { >> - mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, >> - max_page_list_len); >> + mr = ib_alloc_mr(pd, mr_type, max_page_list_len); > > First, ib_alloc_mr receives u32 as a third parameter, but int was > supplied. Second (I can be wrong here), shouldn't max_page_list_len be > replaced with max_fast_reg_page_list_len? 
>
> Thanks

there is a statement that:

if (srp_dev->use_fast_reg) {
	srp_dev->max_pages_per_mr =
		min_t(u32, srp_dev->max_pages_per_mr,
		      attr->max_fast_reg_page_list_len);
}

so we take the max_fast_reg_page_list_len in this case.

>
>>  		if (IS_ERR(mr)) {
>>  			ret = PTR_ERR(mr);
>>  			if (ret == -ENOMEM)
>> (END)
>>
>>
>> So here is the revert patch, but you need to decide how you want to deal with this.
>>
>> Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>> Laurence Oberman
>> Traced after bisection to a cause for this failure
>>
>> Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> commit 90d169d312a173d5350c1bb36d6daab04c592127
>> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Date: Mon Feb 13 20:33:32 2017 -0500
>>
>>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
>>     Laurence Oberman
>>     Traced after bisection to a cause for this failure
>>
>>     [ 130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
>>     [ 130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
>>     [ 130.510899] 00000000 00000000 00000000 00000000
>>     [ 130.536455] 00000000 00000000 00000000 00000000
>>     [ 130.561878] 00000000 00000000 00000000 00000000
>>     [ 130.585904] 00000000 0f007806 2500002a db0ec4d0
>>     [ 145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>     [ 146.530439] scsi host1: ib_srp: reconnect succeeded
>>     [ 146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
>>     [ 146.597635] 00000000 00000000 00000000 00000000
>>     [ 146.623545] 00000000 00000000 00000000 00000000
>>     [ 146.649599] 00000000 00000000 00000000 00000000
>>     [ 146.673938] 00000000 0f007806 25000032 000c46d0
>>     [ 146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
>>     [ 162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>> [ 162.256337] scsi host1: ib_srp: reconnect succeeded >> [ 162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0` >> >> This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361. >> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c >> index 79bf484..01338c8 100644 >> --- a/drivers/infiniband/ulp/srp/ib_srp.c >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c >> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, >> struct srp_fr_desc *d; >> struct ib_mr *mr; >> int i, ret = -EINVAL; >> - enum ib_mr_type mr_type; >> >> if (pool_size <= 0) >> goto err; >> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, >> spin_lock_init(&pool->lock); >> INIT_LIST_HEAD(&pool->free_list); >> >> - if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) >> - mr_type = IB_MR_TYPE_SG_GAPS; >> - else >> - mr_type = IB_MR_TYPE_MEM_REG; >> - >> for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { >> - mr = ib_alloc_mr(pd, mr_type, max_page_list_len); >> + mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, >> + max_page_list_len); >> if (IS_ERR(mr)) { >> ret = PTR_ERR(mr); >> if (ret == -ENOMEM) >> >> >> >> Now moving on to what got me here in the first place. >> Bart, let me know if the 7 of the 8 patches in your most recent series are all still valid after this revert >> Otherwise let me know which ones you want me to apply. >> >> patch 6 - I am thinking i sno longer valid. >> " >> If a HCA supports the SG_GAPS_REG feature then a single memory >> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch >> reduces the number of memory regions that is allocated per SRP >> session. 
>> "
>>
>> Thanks
>> Laurence
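Max's answer to the page-list question can be illustrated with a small user-space model. The sketch below assumes a reduced struct holding the two fields he quotes and a plain min_u32() helper standing in for the kernel's min_t(u32, ...). The struct layout, the helper names and the sample numbers used with it are illustrative assumptions, not code from ib_srp.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the SRP device state referenced in the reply. */
struct srp_dev_model {
	bool use_fast_reg;
	uint32_t max_pages_per_mr;
};

/* User-space stand-in for min_t(u32, a, b). */
uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Mirrors the quoted statement: the per-MR page count is clamped to the
 * device's fast-registration limit, but only when fast registration is used.
 */
void clamp_pages_per_mr(struct srp_dev_model *dev,
			uint32_t max_fast_reg_page_list_len)
{
	if (dev->use_fast_reg)
		dev->max_pages_per_mr = min_u32(dev->max_pages_per_mr,
						max_fast_reg_page_list_len);
}
```

If the max_page_list_len handed to srp_create_fr_pool() is derived from max_pages_per_mr, as Max's reply implies, it cannot exceed the device limit once this clamp has run. For instance, a starting value of 512 with a device limit of 256 ends up as 256, and stays 512 when fast registration is not in use.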
[parent not found: <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: v4.10-rc SRP + mlx5 regression [not found] ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2017-02-14 13:31 ` Laurence Oberman [not found] ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2017-02-14 16:53 ` Bart Van Assche 1 sibling, 1 reply; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 13:31 UTC (permalink / raw) To: Max Gurtovoy Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Tuesday, February 14, 2017 5:00:04 AM > Subject: Re: v4.10-rc SRP + mlx5 regression > > Hi Laurence, > can you specify the test that repro these failures ? > have you tried running with CX5 HCA or only CX4 ? > I think this commit is right and we have issues in other places. 
> > > On 2/14/2017 8:39 AM, Leon Romanovsky wrote: > > On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote: > >> > >> > >> ----- Original Message ----- > >>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > >>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > >>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > >>> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > >>> Sent: Monday, February 13, 2017 4:52:28 PM > >>> Subject: Re: v4.10-rc SRP + mlx5 regression > >>> > >>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote: > >>>> I will have to run through this again and see where the bisect went > >>>> wrong. > >>> > >>> Hello Laurence, > >>> > >>> If you would be considering to repeat the bisect, did you know that a > >>> bisect > >>> can be sped up by specifying the names of the files and/or directories > >>> that > >>> are suspected? An example: > >>> > >>> git bisect start */infiniband */net > >>> > >>> Bart.-- > >>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > >>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> > >> > >> Hello Bart, > >> > >> Much better news this time :), worked late on this but got it figured out. > >> > >> OK, so we got to this one, which makes a lot more sense and is right in > >> the area where we are having issues. > >> I must have answered wrong to one of the steps the first time I did the > >> bisect. 
> >>
> >> Reverted this in the master tree of rc8 and rebuilt the kernel
> >> Now all tests pass on Linus's tree - 4.10.0_rc8+
> >>
> >> The interesting point here is that this commit is in rc5 but rc5 was not
> >> failing so we have an interoperability issue with this commit
> >>
> >>
> >> [loberman@ibclient linux]$ git bisect good
> >> Bisecting: 0 revisions left to test after this (roughly 1 step)
> >> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps
> >>
> >> [loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361
> >> commit ad8e66b4a80182174f73487ed25fd2140cf43361
> >> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> Date: Wed Dec 28 12:48:28 2016 +0200
> >>
> >>     IB/srp: fix mr allocation when the device supports sg gaps
> >>
> >>     If the device support arbitrary sg list mapping (device cap
> >>     IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
> >>     IB_MR_TYPE_SG_GAPS.
> >>
> >>     Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
> >>     Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+
> >>     Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>     Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >>     Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> >>     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>
> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> >> index 8ddc071..0f67cf9 100644
> >> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> >>         struct srp_fr_desc *d;
> >>         struct ib_mr *mr;
> >>         int i, ret = -EINVAL;
> >> +       enum ib_mr_type mr_type;
> >>
> >>         if (pool_size <= 0)
> >>                 goto err;
> >> @@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> >>         spin_lock_init(&pool->lock);
> >>         INIT_LIST_HEAD(&pool->free_list);
> >>
> >> +       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >> +               mr_type = IB_MR_TYPE_SG_GAPS;
> >> +       else
> >> +               mr_type = IB_MR_TYPE_MEM_REG;
> >> +
> >>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >> -               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >> -                                max_page_list_len);
> >> +               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >
> > First, ib_alloc_mr receives u32 as a third parameter, but int was
> > supplied. Second (I can be wrong here), shouldn't max_page_list_len be
> > replaced with max_fast_reg_page_list_len?
> >
> > Thanks
>
> there is a statement that:
>
> if (srp_dev->use_fast_reg) {
>         srp_dev->max_pages_per_mr =
>                 min_t(u32, srp_dev->max_pages_per_mr,
>                       attr->max_fast_reg_page_list_len);
> }
>
> so we take the max_fast_reg_page_list_len in this case.
>
> >
> >>                 if (IS_ERR(mr)) {
> >>                         ret = PTR_ERR(mr);
> >>                         if (ret == -ENOMEM)
> >> (END)
> >>
> >>
> >> So here is the revert patch, but you need to decide how you want to deal
> >> with this.
> >>
> >> Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >> Laurence Oberman
> >> Traced after bisection to a cause for this failure
> >>
> >> Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>
> >> commit 90d169d312a173d5350c1bb36d6daab04c592127
> >> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Date: Mon Feb 13 20:33:32 2017 -0500
> >>
> >>     Revert "IB/srp: fix mr allocation when the device supports sg gaps"
> >>     Laurence Oberman
> >>     Traced after bisection to a cause for this failure
> >>
> >>     [ 130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
> >>     [ 130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
> >>     [ 130.510899] 00000000 00000000 00000000 00000000
> >>     [ 130.536455] 00000000 00000000 00000000 00000000
> >>     [ 130.561878] 00000000 00000000 00000000 00000000
> >>     [ 130.585904] 00000000 0f007806 2500002a db0ec4d0
> >>     [ 145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>     [ 146.530439] scsi host1: ib_srp: reconnect succeeded
> >>     [ 146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
> >>     [ 146.597635] 00000000 00000000 00000000 00000000
> >>     [ 146.623545] 00000000 00000000 00000000 00000000
> >>     [ 146.649599] 00000000 00000000 00000000 00000000
> >>     [ 146.673938] 00000000 0f007806 25000032 000c46d0
> >>     [ 146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
> >>     [ 162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> >>     [ 162.256337] scsi host1: ib_srp: reconnect succeeded
> >>     [ 162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0
> >>
> >>     This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.
> >>
> >> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> >> index 79bf484..01338c8 100644
> >> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> >> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> >> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> >>         struct srp_fr_desc *d;
> >>         struct ib_mr *mr;
> >>         int i, ret = -EINVAL;
> >> -       enum ib_mr_type mr_type;
> >>
> >>         if (pool_size <= 0)
> >>                 goto err;
> >> @@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> >>         spin_lock_init(&pool->lock);
> >>         INIT_LIST_HEAD(&pool->free_list);
> >>
> >> -       if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
> >> -               mr_type = IB_MR_TYPE_SG_GAPS;
> >> -       else
> >> -               mr_type = IB_MR_TYPE_MEM_REG;
> >> -
> >>         for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> >> -               mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
> >> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
> >> +                                max_page_list_len);
> >>                 if (IS_ERR(mr)) {
> >>                         ret = PTR_ERR(mr);
> >>                         if (ret == -ENOMEM)
> >>
> >>
> >>
> >> Now moving on to what got me here in the first place.
> >> Bart, let me know if the 7 of the 8 patches in your most recent series are
> >> all still valid after this revert
> >> Otherwise let me know which ones you want me to apply.
> >>
> >> patch 6 - I am thinking is no longer valid.
> >> "
> >> If a HCA supports the SG_GAPS_REG feature then a single memory
> >> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch
> >> reduces the number of memory regions that is allocated per SRP
> >> session.
> >> "
> >>
> >> Thanks
> >> Laurence
>

Hello Max,

I only have CX4 and CX3 in my lab, this test bed only has CX4.

CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.14.2036
    Hardware version: 0
    Node GUID: 0x7cfe900300726ed2
    System image GUID: 0x7cfe900300726ed2
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 3
        LMC: 0
        SM lid: 3
        Capability mask: 0x2651e84a
        Port GUID: 0x7cfe900300726ed2
        Link layer: InfiniBand

The test is simple, it's the same one I start with every time because it always
brings out issues with mapping for large I/O sizes and mem registration if such
issues exist.

I have a server running LIO with memory backed LUNS.
These are served via a dual port mlx5 (CX4) over ib_srpt.

The client mounts these LUNS via ib_srp (mlx5) and device-mapper-multipath
and I run a simple dd on the XFS file system.

#!/bin/bash
while true
do
    dd if=/dev/zero of=/data-$1/bigfile bs=4096k count=900
    sync
    rm -rf /data-$1/bigfile
done

Once this passes I run a suite of other tests: read/write, direct and buffered.

Thanks
Laurence
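
For readers following the thread, the choice made by the commit quoted above, together with the clamp Max points to, can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the MODEL_* names, the flag bit value and all three functions are hypothetical stand-ins, not the kernel's definitions, and nothing here talks to a real HCA.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit; the kernel's IB_DEVICE_SG_GAPS_REG has its own value. */
#define MODEL_DEVICE_SG_GAPS_REG (1ULL << 0)

/* Stand-ins for IB_MR_TYPE_MEM_REG and IB_MR_TYPE_SG_GAPS. */
enum model_mr_type {
	MODEL_MR_TYPE_MEM_REG,
	MODEL_MR_TYPE_SG_GAPS,
};

/* Mirrors the if/else that the quoted commit adds to srp_create_fr_pool(). */
enum model_mr_type model_pick_mr_type(uint64_t device_cap_flags)
{
	if (device_cap_flags & MODEL_DEVICE_SG_GAPS_REG)
		return MODEL_MR_TYPE_SG_GAPS;
	return MODEL_MR_TYPE_MEM_REG;
}

/* Mirrors the min_t(u32, ...) clamp, which applies only with fast registration. */
uint32_t model_max_pages_per_mr(int use_fast_reg, uint32_t max_pages_per_mr,
				uint32_t max_fast_reg_page_list_len)
{
	if (use_fast_reg && max_fast_reg_page_list_len < max_pages_per_mr)
		return max_fast_reg_page_list_len;
	return max_pages_per_mr;
}

/* Small self-check showing how the two helpers behave. */
void model_mr_self_check(void)
{
	assert(model_pick_mr_type(0) == MODEL_MR_TYPE_MEM_REG);
	assert(model_pick_mr_type(MODEL_DEVICE_SG_GAPS_REG) == MODEL_MR_TYPE_SG_GAPS);
	assert(model_max_pages_per_mr(1, 512, 256) == 256);
	assert(model_max_pages_per_mr(0, 512, 256) == 512);
}
```

The model only captures which MR type is requested and how the page limit is clamped. It says nothing about why the SG_GAPS type fails on this test bed, which is the open question in the thread.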
[parent not found: <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-14 16:21 ` Laurence Oberman
2017-02-14 17:15 ` Max Gurtovoy
2017-02-14 17:15 ` Max Gurtovoy
2 siblings, 0 replies; 47+ messages in thread
From: Laurence Oberman @ 2017-02-14 16:21 UTC (permalink / raw)
To: Max Gurtovoy
Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA

----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Sent: Tuesday, February 14, 2017 8:31:02 AM
> Subject: Re: v4.10-rc SRP + mlx5 regression
>
>
>
> ----- Original Message -----
> > From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org,
> > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Sent: Tuesday, February 14, 2017 5:00:04 AM
> > Subject: Re: v4.10-rc SRP + mlx5 regression
> >
> > Hi Laurence,
> > can you specify the test that repro these failures ?
> > have you tried running with CX5 HCA or only CX4 ?
> > I think this commit is right and we have issues in other places.

Max, Leon, Israel, Bart and Doug

We should consider reverting that commit for now until we figure out what
specifically brings this out, unless a quick fix is forthcoming.

I have been running since last night with that commit reverted and 7 of Bart's
latest patches, and it's been rock solid stable.
It has also shown no issues performance-wise.

Tests included reads/writes, large/small I/O sizes, buffered and unbuffered,
XFS file system and direct I/O.

Thanks
Laurence
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 16:21 ` Laurence Oberman
@ 2017-02-14 17:15 ` Max Gurtovoy
[not found] ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 17:15 ` Max Gurtovoy
2 siblings, 1 reply; 47+ messages in thread
From: Max Gurtovoy @ 2017-02-14 17:15 UTC (permalink / raw)
To: Laurence Oberman
Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA

On 2/14/2017 3:31 PM, Laurence Oberman wrote:
>
>
> ----- Original Message -----
>> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>> Sent: Tuesday, February 14, 2017 5:00:04 AM
>> Subject: Re: v4.10-rc SRP + mlx5 regression
>>
>> Hi Laurence,
>> can you specify the test that repro these failures ?
>> have you tried running with CX5 HCA or only CX4 ?
>> I think this commit is right and we have issues in other places.
> [...]
> The test is simple, it's the same one I start with every time because it always
> brings out issues with mapping for large I/O sizes and mem registration if such issues exist.
>
> I have a server running LIO with memory backed LUNS.
> These are served via a dual port mlx5 (CX4) over ib_srpt
>
> The client mounts these LUNS via ib_srp (mlx5) and device-mapper-multipath
> and I run a simple dd on the XFS file system.
>
> #!/bin/bash
> while true
> do
>     dd if=/dev/zero of=/data-$1/bigfile bs=4096k count=900
>     sync;
>     rm -rf /data-$1/bigfile
> done
>
> Once this passes I run a suite of other tests read/write, direct and buffered.

Laurence,
this is 4MB transactions. can you increase the cmd_sg_entries to the
maximum and run the test again ?

>
> Thanks
> Laurence
>
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-14 17:29 ` Bart Van Assche
2017-02-14 17:31 ` Laurence Oberman
1 sibling, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 17:29 UTC (permalink / raw)
To: Max Gurtovoy, Laurence Oberman
Cc: Leon Romanovsky, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA

On 02/14/2017 09:15 AM, Max Gurtovoy wrote:
> this is 4MB transactions. can you increase the cmd_sg_entries to the
> maximum and run the test again ?

How could that affect the error message Laurence reported? If cmd_sg_entries
is too low then the block layer refuses direct I/O requests that are too large.
From __scsi_init_queue():

        blk_queue_max_segments(q, min_t(unsigned short, shost->sg_tablesize,
                                        SG_MAX_SEGMENTS));

Bart.
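For readers following along, the clamp quoted from __scsi_init_queue() can be modelled in user space. This is only a sketch, not kernel code: the helper names are made up for illustration, the value 2048 for SG_MAX_SEGMENTS is an assumption (the real value depends on kernel configuration), and sg_tablesize is treated as a plain input because how the SRP initiator derives it from its module parameters is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for the kernel's SG_MAX_SEGMENTS (configuration dependent). */
#define ASSUMED_SG_MAX_SEGMENTS 2048u

/* Models min_t(unsigned short, sg_tablesize, SG_MAX_SEGMENTS). */
static unsigned short clamp_max_segments(unsigned int sg_tablesize)
{
        unsigned short a = (unsigned short)sg_tablesize;
        unsigned short b = (unsigned short)ASSUMED_SG_MAX_SEGMENTS;

        return a < b ? a : b;
}

/*
 * Worst-case request size the segment limit allows, i.e. when every
 * segment covers exactly one page and no pages can be merged.
 */
static size_t worst_case_request_bytes(unsigned int sg_tablesize,
                                       size_t page_size)
{
        assert(page_size > 0);
        return (size_t)clamp_max_segments(sg_tablesize) * page_size;
}
```

Under these assumptions and 4 KiB pages, a table of 255 entries covers a little under 1 MiB in the worst case and 2048 entries cover 8 MiB. Requests whose pages are physically contiguous can be larger, since adjacent pages merge into a single segment.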
* Re: v4.10-rc SRP + mlx5 regression [not found] ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2017-02-14 17:29 ` Bart Van Assche @ 2017-02-14 17:31 ` Laurence Oberman 1 sibling, 0 replies; 47+ messages in thread From: Laurence Oberman @ 2017-02-14 17:31 UTC (permalink / raw) To: Max Gurtovoy Cc: Leon Romanovsky, Bart Van Assche, hch-jcswGhMUV9g, israelr-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, dledford-H+wXaHxf7aLQT0dZR+AlfA ----- Original Message ----- > From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > To: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > Sent: Tuesday, February 14, 2017 12:15:20 PM > Subject: Re: v4.10-rc SRP + mlx5 regression > > > > On 2/14/2017 3:31 PM, Laurence Oberman wrote: > > > > > > ----- Original Message ----- > >> From: "Max Gurtovoy" <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >> To: "Leon Romanovsky" <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, "Laurence Oberman" > >> <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> Cc: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>, hch-jcswGhMUV9g@public.gmane.org, > >> israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, > >> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > >> Sent: Tuesday, February 14, 2017 5:00:04 AM > >> Subject: Re: v4.10-rc SRP + mlx5 regression > >> > >> Hi Laurence, > >> can you specify the test that repro these failures ? > >> have you tried running with CX5 HCA or only CX4 ? > >> I think this commit is right and we have issues in other places. 
> >> > >> > >> On 2/14/2017 8:39 AM, Leon Romanovsky wrote: > >>> On Mon, Feb 13, 2017 at 09:19:54PM -0500, Laurence Oberman wrote: > >>>> > >>>> > >>>> ----- Original Message ----- > >>>>> From: "Bart Van Assche" <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > >>>>> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > >>>>> Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, > >>>>> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org > >>>>> Sent: Monday, February 13, 2017 4:52:28 PM > >>>>> Subject: Re: v4.10-rc SRP + mlx5 regression > >>>>> > >>>>> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote: > >>>>>> I will have to run through this again and see where the bisect went > >>>>>> wrong. > >>>>> > >>>>> Hello Laurence, > >>>>> > >>>>> If you would be considering to repeat the bisect, did you know that a > >>>>> bisect > >>>>> can be sped up by specifying the names of the files and/or directories > >>>>> that > >>>>> are suspected? An example: > >>>>> > >>>>> git bisect start */infiniband */net > >>>>> > >>>>> Bart.-- > >>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" > >>>>> in > >>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> > >>>> > >>>> Hello Bart, > >>>> > >>>> Much better news this time :), worked late on this but got it figured > >>>> out. > >>>> > >>>> OK, so we got to this one, which makes a lot more sense and is right in > >>>> the area where we are having issues. > >>>> I must have answered wrong to one of the steps the first time I did the > >>>> bisect. 
> >>>> > >>>> Reverted this in the master tree of rc8 and rebuilt the kernel > >>>> Now all tests pass on Linus's tree - 4.10.0_rc8+ > >>>> > >>>> The interesting point here is that this commit is in rc5 but rc5 was not > >>>> failing so we have an interoperability issue with this commit > >>>> > >>>> > >>>> [loberman@ibclient linux]$ git bisect good > >>>> Bisecting: 0 revisions left to test after this (roughly 1 step) > >>>> [ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation > >>>> when > >>>> the device supports sg gaps > >>>> > >>>> [loberman@ibclient linux]$ git show > >>>> ad8e66b4a80182174f73487ed25fd2140cf43361 > >>>> commit ad8e66b4a80182174f73487ed25fd2140cf43361 > >>>> Author: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >>>> Date: Wed Dec 28 12:48:28 2016 +0200 > >>>> > >>>> IB/srp: fix mr allocation when the device supports sg gaps > >>>> > >>>> If the device support arbitrary sg list mapping (device cap > >>>> IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with > >>>> IB_MR_TYPE_SG_GAPS. 
> >>>> > >>>> Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures") > >>>> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # 4.7+ > >>>> Signed-off-by: Israel Rukshin <israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >>>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >>>> Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >>>> Reviewed-by: Mark Bloch <markb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > >>>> Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > >>>> Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> > >>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >>>> > >>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > >>>> b/drivers/infiniband/ulp/srp/ib_srp.c > >>>> index 8ddc071..0f67cf9 100644 > >>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c > >>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c > >>>> @@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct > >>>> ib_device *device, > >>>> struct srp_fr_desc *d; > >>>> struct ib_mr *mr; > >>>> int i, ret = -EINVAL; > >>>> + enum ib_mr_type mr_type; > >>>> > >>>> if (pool_size <= 0) > >>>> goto err; > >>>> @@ -384,9 +385,13 @@ static struct srp_fr_pool > >>>> *srp_create_fr_pool(struct > >>>> ib_device *device, > >>>> spin_lock_init(&pool->lock); > >>>> INIT_LIST_HEAD(&pool->free_list); > >>>> > >>>> + if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) > >>>> + mr_type = IB_MR_TYPE_SG_GAPS; > >>>> + else > >>>> + mr_type = IB_MR_TYPE_MEM_REG; > >>>> + > >>>> for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { > >>>> - mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, > >>>> - max_page_list_len); > >>>> + mr = ib_alloc_mr(pd, mr_type, max_page_list_len); > >>> > >>> First, ib_alloc_mr receives u32 as a third parameter, but int was > >>> supplied. 
Second (I can be wrong here), shouldn't max_page_list_len be > >>> replaced with max_fast_reg_page_list_len? > >>> > >>> Thanks > >> > >> there is a statement that: > >> > >> if (srp_dev->use_fast_reg) { > >> srp_dev->max_pages_per_mr = > >> min_t(u32, srp_dev->max_pages_per_mr, > >> attr->max_fast_reg_page_list_len); > >> } > >> > >> so we take the max_fast_reg_page_list_len in this case. > >> > >>> > >>>> if (IS_ERR(mr)) { > >>>> ret = PTR_ERR(mr); > >>>> if (ret == -ENOMEM) > >>>> (END) > >>>> > >>>> > >>>> So here is the revert patch, but you need to decide how you want to deal > >>>> with this. > >>>> > >>>> Revert "IB/srp: fix mr allocation when the device supports sg gaps" > >>>> Laurence Oberman > >>>> Traced after bisection to a cause for this failure > >>>> > >>>> Tested-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >>>> Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >>>> > >>>> commit 90d169d312a173d5350c1bb36d6daab04c592127 > >>>> Author: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >>>> Date: Mon Feb 13 20:33:32 2017 -0500 > >>>> > >>>> Revert "IB/srp: fix mr allocation when the device supports sg gaps" > >>>> Laurence Oberman > >>>> Traced after bisection to a cause for this failure > >>>> > >>>> [ 130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe > >>>> [ 130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) > >>>> for CQE ffff8817f0edbfb0 > >>>> [ 130.510899] 00000000 00000000 00000000 00000000 > >>>> [ 130.536455] 00000000 00000000 00000000 00000000 > >>>> [ 130.561878] 00000000 00000000 00000000 00000000 > >>>> [ 130.585904] 00000000 0f007806 2500002a db0ec4d0 > >>>> [ 145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1. 
> >>>> [ 146.530439] scsi host1: ib_srp: reconnect succeeded > >>>> [ 146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe > >>>> [ 146.597635] 00000000 00000000 00000000 00000000 > >>>> [ 146.623545] 00000000 00000000 00000000 00000000 > >>>> [ 146.649599] 00000000 00000000 00000000 00000000 > >>>> [ 146.673938] 00000000 0f007806 25000032 000c46d0 > >>>> [ 146.697969] scsi host1: ib_srp: failed FAST REG status memory > >>>> management operation error (6) for CQE ffff88 > >>>> [ 162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1. > >>>> [ 162.256337] scsi host1: ib_srp: reconnect succeeded > >>>> [ 162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) > >>>> for CQE ffff8817f0412ef0` > >>>> > >>>> This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361. > >>>> > >>>> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > >>>> b/drivers/infiniband/ulp/srp/ib_srp.c > >>>> index 79bf484..01338c8 100644 > >>>> --- a/drivers/infiniband/ulp/srp/ib_srp.c > >>>> +++ b/drivers/infiniband/ulp/srp/ib_srp.c > >>>> @@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct > >>>> ib_device *device, > >>>> struct srp_fr_desc *d; > >>>> struct ib_mr *mr; > >>>> int i, ret = -EINVAL; > >>>> - enum ib_mr_type mr_type; > >>>> > >>>> if (pool_size <= 0) > >>>> goto err; > >>>> @@ -385,13 +384,9 @@ static struct srp_fr_pool > >>>> *srp_create_fr_pool(struct > >>>> ib_device *device, > >>>> spin_lock_init(&pool->lock); > >>>> INIT_LIST_HEAD(&pool->free_list); > >>>> > >>>> - if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG) > >>>> - mr_type = IB_MR_TYPE_SG_GAPS; > >>>> - else > >>>> - mr_type = IB_MR_TYPE_MEM_REG; > >>>> - > >>>> for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { > >>>> - mr = ib_alloc_mr(pd, mr_type, max_page_list_len); > >>>> + mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, > >>>> + max_page_list_len); > >>>> if (IS_ERR(mr)) { > >>>> ret = PTR_ERR(mr); > >>>> if (ret == -ENOMEM) > >>>> > >>>> > >>>> > >>>> Now 
moving on to what got me here in the first place. > >>>> Bart, let me know if the 7 of the 8 patches in your most recent series > >>>> are > >>>> all still valid after this revert > >>>> Otherwise let me know which ones you want me to apply. > >>>> > >>>> patch 6 - I am thinking i sno longer valid. > >>>> " > >>>> If a HCA supports the SG_GAPS_REG feature then a single memory > >>>> region of type IB_MR_TYPE_SG_GAPS is sufficient. This patch > >>>> reduces the number of memory regions that is allocated per SRP > >>>> session. > >>>> " > >>>> > >>>> Thanks > >>>> Laurence > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > > Hello Max, > > > > I only have CX4 and CX3 in my lab, this test bed only has CX4. > > > > CA 'mlx5_0' > > CA type: MT4115 > > Number of ports: 1 > > Firmware version: 12.14.2036 > > Hardware version: 0 > > Node GUID: 0x7cfe900300726ed2 > > System image GUID: 0x7cfe900300726ed2 > > Port 1: > > State: Active > > Physical state: LinkUp > > Rate: 100 > > Base lid: 3 > > LMC: 0 > > SM lid: 3 > > Capability mask: 0x2651e84a > > Port GUID: 0x7cfe900300726ed2 > > Link layer: InfiniBand > > > > The test is simple, it's the same one I start with every time because it > > always > > brings out issues with mapping for large I/O sizes and mem registration if > > such issues exist. > > > > I have a server running LIO with memory backed LUNS. > > These are served via a dual port mlx5 (CX4) over ib_srpt > > > > The client mounts these LUNS via ib_srp (mlx5) and device-mapper-multipath > > and I run a simple dd on the XFS file system. > > > > #!/bin/bash > > while true > > do > > dd if=/dev/zero of=/data-$1/bigfile bs=4096k count=900 > > sync; > > rm -rf /data-$1/bigfile > > done > > > > Once this passes I run a suite of other tests read/write, direct and > > buffered. 
> >
> Laurence,
> this is 4MB transactions. can you increase the cmd_sg_entries to the
> maximum and run the test again ?
>
> >
> > Thanks
> > Laurence
> >
>

Hello Max,

Yes 4MB is very important for one of our biggest RHEL customers and I
worked many hours with Bart last year to stabilize large 4MB buffered and
direct I/O for ib_srp/ib_srpt.

I am already running with:
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048

Regards and thanks for your assistance
Laurence
* Re: v4.10-rc SRP + mlx5 regression
[not found] ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 13:31 ` Laurence Oberman
@ 2017-02-14 16:53 ` Bart Van Assche
1 sibling, 0 replies; 47+ messages in thread
From: Bart Van Assche @ 2017-02-14 16:53 UTC (permalink / raw)
To: maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: hch-jcswGhMUV9g@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Tue, 2017-02-14 at 12:00 +0200, Max Gurtovoy wrote:
> can you specify the test that repro these failures ?
> have you tried running with CX5 HCA or only CX4 ?
> I think this commit is right and we have issues in other places.

My proposal is to proceed as Laurence proposed: modify the SRP initiator
driver such that it no longer uses gaps registration. However, one more
change is needed on top of the patch Laurence proposed, namely to call
blk_queue_virt_boundary() unconditionally. I'm currently testing this
approach against mlx4.

Bart.
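Why a virt boundary matters once gaps registration is dropped can be sketched in user space. Without gaps registration, a single page list can only describe a scatter list in which every element except the first starts on a page boundary and every element except the last ends on one. The check below is a simplified model of that rule; struct sg_elem and sg_list_has_gap() are hypothetical names for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather element. */
struct sg_elem {
        uint64_t addr;  /* start address of the element */
        uint32_t len;   /* length in bytes */
};

/*
 * Return true if the list has a gap with respect to the boundary, i.e.
 * some element other than the first does not start on the boundary, or
 * some element other than the last does not end on it.
 * boundary_mask is page_size - 1, for example 0xfff for 4 KiB pages.
 */
static bool sg_list_has_gap(const struct sg_elem *sg, size_t n,
                            uint64_t boundary_mask)
{
        assert(sg != NULL || n == 0);

        for (size_t i = 0; i < n; i++) {
                /* Only the first element may start mid-page. */
                if (i > 0 && (sg[i].addr & boundary_mask))
                        return true;
                /* Only the last element may end mid-page. */
                if (i + 1 < n && ((sg[i].addr + sg[i].len) & boundary_mask))
                        return true;
        }
        return false;
}
```

A list for which this returns true would need either more than one memory region or a gaps-capable registration. Setting a virt boundary on the request queue asks the block layer not to build such requests in the first place.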
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
[not found] ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-12 18:02 ` Laurence Oberman
@ 2017-02-12 20:11 ` Bart Van Assche
[not found] ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
1 sibling, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-02-12 20:11 UTC (permalink / raw)
To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
Cc: hch-jcswGhMUV9g@public.gmane.org, maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Sun, 2017-02-12 at 19:19 +0200, Leon Romanovsky wrote:
> On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > -static void srp_destroy_qp(struct ib_qp *qp)
> > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> >  {
> > -        ib_drain_rq(qp);
> > +        spin_lock_irq(&ch->lock);
> > +        ib_process_cq_direct(ch->send_cq, -1);
>
> I see that you are already using "-1" in your code, but the comments in the
> ib_process_cq_direct states that no new code should use "-1".
>
>  61  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
>  62  * polling. Do not use this feature in new code, it will be removed soon.
>  63  */
>  64 int ib_process_cq_direct(struct ib_cq *cq, int budget)

Although it is possible to avoid passing -1 as 'budget' by passing a number
that is at least as large as the number of expected completions, it would
make it harder to verify the SRP initiator driver. So I propose to modify
the comment above ib_process_cq_direct().

Bart.
* Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
[not found] ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2017-02-13 6:07 ` Leon Romanovsky
0 siblings, 0 replies; 47+ messages in thread
From: Leon Romanovsky @ 2017-02-13 6:07 UTC
To: Bart Van Assche
Cc: hch-jcswGhMUV9g@public.gmane.org,
maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
israelr-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

On Sun, Feb 12, 2017 at 08:11:53PM +0000, Bart Van Assche wrote:
> On Sun, 2017-02-12 at 19:19 +0200, Leon Romanovsky wrote:
> > On Fri, Feb 10, 2017 at 03:56:11PM -0800, Bart Van Assche wrote:
> > > -static void srp_destroy_qp(struct ib_qp *qp)
> > > +static void srp_destroy_qp(struct srp_rdma_ch *ch, struct ib_qp *qp)
> > >  {
> > > -	ib_drain_rq(qp);
> > > +	spin_lock_irq(&ch->lock);
> > > +	ib_process_cq_direct(ch->send_cq, -1);
> >
> > I see that you are already using "-1" in your code, but the comments in the
> > ib_process_cq_direct states that no new code should use "-1".
> >
> >  * Note: for compatibility reasons -1 can be passed in %budget for unlimited
> >  * polling. Do not use this feature in new code, it will be removed soon.
> >  */
> > int ib_process_cq_direct(struct ib_cq *cq, int budget)
>
> Although it is possible to avoid passing -1 as 'budget' by passing a number
> that is at least as large as the number of expected completions, it would
> make it harder to verify the SRP initiator driver. So I propose to modify
> the comment above ib_process_cq_direct().

I don't know. It seems the easiest approach is to change the comment,
especially while SRP is the only user of this call.

However, the ability to properly calculate the number of expected completions
and compare it while doing destroy_qp is valuable for correctness too.

Thanks

>
> Bart.
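
Editorial illustration of the budget discussion above: the two options being
weighed (unlimited polling with a budget of -1 versus a budget equal to the
number of expected completions) can be sketched with a small userspace model.
This is only a sketch under stated assumptions. struct model_cq,
MODEL_POLL_BATCH, model_poll(), model_process_cq_direct() and
model_drain_checked() are hypothetical names invented for this example; they
are not the kernel's struct ib_cq or ib_process_cq_direct(), and the batch
size is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a completion queue (not struct ib_cq). */
struct model_cq {
	int pending;	/* completions waiting to be polled */
	int handled;	/* completions processed so far */
};

#define MODEL_POLL_BATCH 16	/* assumed batch size, for illustration only */

/* Reap up to 'max' completions and return how many were reaped. */
static int model_poll(struct model_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	cq->handled += n;
	return n;
}

/*
 * Process completions until the queue is empty or the budget is used up.
 * A budget of -1 means "no limit", modelled on the compatibility note
 * quoted in the thread.
 */
static int model_process_cq_direct(struct model_cq *cq, int budget)
{
	int completed = 0;

	for (;;) {
		int want = MODEL_POLL_BATCH;
		int n;

		if (budget != -1 && budget - completed < want)
			want = budget - completed;
		if (want <= 0)
			break;		/* budget exhausted */
		n = model_poll(cq, want);
		completed += n;
		if (n < want)
			break;		/* queue is empty */
	}
	return completed;
}

/*
 * Drain with an explicit budget and report whether exactly 'expected'
 * completions were seen and nothing was left behind.
 */
static bool model_drain_checked(struct model_cq *cq, int expected)
{
	assert(expected >= 0);
	return model_process_cq_direct(cq, expected) == expected &&
	       cq->pending == 0;
}
```

In this model, a budget of -1 always empties the queue but says nothing about
whether the number of completions matched what the caller expected, while
model_drain_checked() returns false when fewer or more completions than
expected were queued. That is the trade-off raised in the two messages above:
the explicit budget requires the caller to track outstanding work, but gives a
correctness check at teardown time.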
end of thread, other threads:[~2017-02-14 18:49 UTC | newest]
Thread overview: 47+ messages
2017-02-10 23:56 [PATCH 0/8] IB/srp bug fixes Bart Van Assche
2017-02-10 23:56 ` [PATCH 1/8] IB/srp: Avoid that duplicate responses trigger a kernel bug Bart Van Assche
2017-02-12 17:05 ` Leon Romanovsky
2017-02-12 20:07 ` Bart Van Assche
[not found] ` <1486930017.2918.3.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13 5:54 ` Leon Romanovsky
[not found] ` <20170213055432.GM14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-13 16:02 ` Bart Van Assche
2017-02-10 23:56 ` [PATCH 2/8] IB/srp: Fix race conditions related to task management Bart Van Assche
[not found] ` <20170210235611.3243-1-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-10 23:56 ` [PATCH 3/8] IB/srp: Document locking conventions Bart Van Assche
2017-02-10 23:56 ` [PATCH 4/8] IB/srp: Make a diagnostic message more informative Bart Van Assche
2017-02-10 23:56 ` [PATCH 5/8] IB/srp: Improve an error path Bart Van Assche
2017-02-10 23:56 ` [PATCH 6/8] IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported Bart Van Assche
2017-02-10 23:56 ` [PATCH 7/8] IB/core: Add support for draining IB_POLL_DIRECT completion queues Bart Van Assche
2017-02-10 23:56 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
[not found] ` <20170210235611.3243-9-bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-11 0:07 ` Robert LeBlanc
[not found] ` <CAANLjFr+Jd3ctmhpBnjYGKZ4ZQPtYLAB7EWZxL59vHpgekP=Jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-11 0:13 ` Bart Van Assche
2017-02-12 17:19 ` Leon Romanovsky
[not found] ` <20170212171928.GF14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-12 18:02 ` Laurence Oberman
[not found] ` <1041506550.30101266.1486922573298.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-12 18:06 ` Laurence Oberman
[not found] ` <1051975432.30101289.1486922792858.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 3:02 ` [PATCH 0/8] IB/srp bug fixes Laurence Oberman
[not found] ` <1465409120.30916025.1487041332560.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 17:18 ` Bart Van Assche
[not found] ` <1487092678.2466.6.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-14 17:22 ` Laurence Oberman
2017-02-14 18:47 ` Laurence Oberman
[not found] ` <1364431877.31401761.1487098067033.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 18:49 ` Bart Van Assche
2017-02-12 20:05 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
[not found] ` <1486929901.2918.1.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13 2:07 ` Laurence Oberman
[not found] ` <655392767.30136125.1486951636415.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 3:14 ` Laurence Oberman
[not found] ` <1630482470.30208948.1486955693106.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 13:54 ` Laurence Oberman
[not found] ` <1633827327.30531404.1486994093828.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 14:17 ` Leon Romanovsky
[not found] ` <20170213141724.GQ14015-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-13 14:24 ` Laurence Oberman
[not found] ` <225897984.30545262.1486995841880.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 16:12 ` Laurence Oberman
[not found] ` <1971987443.30613645.1487002375580.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 16:47 ` Laurence Oberman
[not found] ` <21338434.30712464.1487004451595.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:34 ` Laurence Oberman
[not found] ` <1301607843.30852658.1487021644535.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:46 ` Laurence Oberman
[not found] ` <898197116.30855343.1487022400065.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-13 21:52 ` v4.10-rc SRP + mlx5 regression Bart Van Assche
[not found] ` <1487022735.2719.7.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13 21:56 ` Laurence Oberman
2017-02-14 2:19 ` Laurence Oberman
[not found] ` <568916592.30910570.1487038794766.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 6:39 ` Leon Romanovsky
[not found] ` <20170214063953.GF6989-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-02-14 10:00 ` Max Gurtovoy
[not found] ` <bfca98d3-3f74-c370-7455-71e2ebd583e9-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 13:31 ` Laurence Oberman
[not found] ` <656778124.31118982.1487079062235.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-14 16:21 ` Laurence Oberman
2017-02-14 17:15 ` Max Gurtovoy
[not found] ` <a7ae2926-da0a-edf9-7779-09a6edd54d5d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-14 17:29 ` Bart Van Assche
2017-02-14 17:31 ` Laurence Oberman
2017-02-14 17:15 ` Max Gurtovoy
2017-02-14 16:53 ` Bart Van Assche
2017-02-12 20:11 ` [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP Bart Van Assche
[not found] ` <1486930299.2918.5.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2017-02-13 6:07 ` Leon Romanovsky