public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* SRPt oops with 4.5-rc3-ish
@ 2016-02-14 16:09 Doug Ledford
       [not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-02-28  3:37 ` Nicholas A. Bellinger
  0 siblings, 2 replies; 17+ messages in thread
From: Doug Ledford @ 2016-02-14 16:09 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 5132 bytes --]

While testing with my latest kernel (rc3 plus pending RDMA patches), I
ran across this oops:

[dledford@linux-ws ~]$ console rdma-storage-04
Enter dledford-Qj3k6FK1/F6RgOCG8Jv1mWcM3YSXpimQh7FX57BcuXVWk0Htik3J/w@public.gmane.org's password:
[Enter `^Ec?' for help]
[-- MOTD -- https://home.corp.redhat.com/wiki/conserver]
[playback]
[160605.947614]  [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30
[160605.954074]  [<ffffffffc0b3dd84>]
target_fabric_nacl_base_release+0x64/0x70]
[160605.963731]  [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0
[160605.970579]  [<ffffffff813ccdf2>] config_item_put+0x62/0x80
[160605.976936]  [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500
[160605.983396]  [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220
[160605.989375]  [<ffffffff813197db>] do_rmdir+0x1fb/0x260
[160605.995244]  [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30
[160606.001019]  [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71
[160606.009586] ---[ end trace 820588f5ef5f6148 ]---
[160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id
0x7f0ee700032d1de)
[160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port
has not d
[160611.228909] ib_srpt Received IB DREQ ERROR event.
[160613.276862] ib_srpt Received IB TimeWait exit for cm_id
ffff881cc9dc7a00.
[160613.290322] BUG: unable to handle kernel paging request at
0000000000018630
[160613.301470] IP: [<ffffffff81125694>]
native_queued_spin_lock_slowpath+0x2e40
[160613.313112] PGD 0
[160613.318577] Oops: 0002 [#1] SMP
[160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp
ip6t_R]
[160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G        W
I     44
[160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
1.0.4 084
[160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt]
[160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti:
ffff881d020
[160613.539130] RIP: 0010:[<ffffffff81125694>]  [<ffffffff81125694>]
native_que0
[160613.553326] RSP: 0018:ffff881d02017d90  EFLAGS: 00010006
[160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX:
000000000001860
[160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI:
ffff880f2d7c7d8
[160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09:
000000000000000
[160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12:
ffff880f2d7c7d0
[160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15:
000000000000000
[160613.617643] FS:  0000000000000000(0000) GS:ffff881d4c800000(0000)
knlGS:0000
[160613.629793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4:
000000000014060
[160613.650471] Stack:
[160613.655843]  ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8
ffffffff81a7
[160613.667361]  ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255
ffff881cca70
[160613.678885]  ffff881ce426d058 ffff881ce426d000 ffff881d02017e10
ffffffffc070
[160613.690366] Call Trace:
[160613.696195]  [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d
[160613.706533]  [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0
[160613.716586]  [<ffffffff81121255>] complete+0x25/0x70
[160613.725318]  [<ffffffffc07e7e80>]
srpt_release_channel_work+0x180/0x210 [ib]
[160613.736889]  [<ffffffff810e6dd8>] process_one_work+0x228/0x650
[160613.746616]  [<ffffffff810e79be>] worker_thread+0x21e/0x800
[160613.756047]  [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a
[160613.765371]  [<ffffffff810e77a0>] ? kzalloc+0x30/0x30
[160613.774203]  [<ffffffff810efc38>] kthread+0x118/0x150
[160613.783000]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
[160613.792932]  [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70
[160613.801994]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
[160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48
c1 e9
[160613.840260] RIP  [<ffffffff81125694>]
native_queued_spin_lock_slowpath+0x2e0
[160613.851846]  RSP <ffff881d02017d90>
[160613.858812] CR2: 0000000000018630
[160613.874762] ---[ end trace 820588f5ef5f6149 ]---
[160613.937225] Kernel panic - not syncing: Fatal exception
[160613.946167] Kernel Offset: disabled
[160614.004693] ---[ end Kernel panic - not syncing: Fatal exception
[-- MARK -- Sun Feb 14 15:50:00 2016]
[-- dledford-CKb8VAQLn9hXrIkS9f7CXA@public.gmane.org@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb
14 15:5]



Basic description of the situation that caused the oops:

Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating
away on 1 lun via two paths (active/passive setup)

Run dnf upgrade (dnf is yum's replacement, so just a system wide
software update).

Get to the cleanup for targetcli/target-restore and it invokes an
attempt to reload the target service while still in use.  During the
process of deconfiguring the luns that are in use, this oops occurred.
Sending the report to you because it appears to involve the
multi-channel support.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-02-16  1:42   ` Bart Van Assche
  2016-02-29  9:11   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2016-02-16  1:42 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma

On 02/14/16 08:09, Doug Ledford wrote:
> While testing with my latest kernel (rc3 plus pending RDMA patches), I
> ran across this oops:
>
> [dledford@linux-ws ~]$ console rdma-storage-04
> Enter dledford-Qj3k6FK1/F6RgOCG8Jv1mWcM3YSXpimQh7FX57BcuXVWk0Htik3J/w@public.gmane.org's password:
> [Enter `^Ec?' for help]
> [-- MOTD -- https://home.corp.redhat.com/wiki/conserver]
> [playback]
> [160605.947614]  [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30
> [160605.954074]  [<ffffffffc0b3dd84>]
> target_fabric_nacl_base_release+0x64/0x70]
> [160605.963731]  [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0
> [160605.970579]  [<ffffffff813ccdf2>] config_item_put+0x62/0x80
> [160605.976936]  [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500
> [160605.983396]  [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220
> [160605.989375]  [<ffffffff813197db>] do_rmdir+0x1fb/0x260
> [160605.995244]  [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30
> [160606.001019]  [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [160606.009586] ---[ end trace 820588f5ef5f6148 ]---
> [160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> 0x7f0ee700032d1de)
> [160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port
> has not d
> [160611.228909] ib_srpt Received IB DREQ ERROR event.
> [160613.276862] ib_srpt Received IB TimeWait exit for cm_id
> ffff881cc9dc7a00.
> [160613.290322] BUG: unable to handle kernel paging request at
> 0000000000018630
> [160613.301470] IP: [<ffffffff81125694>]
> native_queued_spin_lock_slowpath+0x2e40
> [160613.313112] PGD 0
> [160613.318577] Oops: 0002 [#1] SMP
> [160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp
> ip6t_R]
> [160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G        W
> I     44
> [160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
> 1.0.4 084
> [160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt]
> [160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti:
> ffff881d020
> [160613.539130] RIP: 0010:[<ffffffff81125694>]  [<ffffffff81125694>]
> native_que0
> [160613.553326] RSP: 0018:ffff881d02017d90  EFLAGS: 00010006
> [160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX:
> 000000000001860
> [160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI:
> ffff880f2d7c7d8
> [160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09:
> 000000000000000
> [160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12:
> ffff880f2d7c7d0
> [160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15:
> 000000000000000
> [160613.617643] FS:  0000000000000000(0000) GS:ffff881d4c800000(0000)
> knlGS:0000
> [160613.629793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4:
> 000000000014060
> [160613.650471] Stack:
> [160613.655843]  ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8
> ffffffff81a7
> [160613.667361]  ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255
> ffff881cca70
> [160613.678885]  ffff881ce426d058 ffff881ce426d000 ffff881d02017e10
> ffffffffc070
> [160613.690366] Call Trace:
> [160613.696195]  [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d
> [160613.706533]  [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0
> [160613.716586]  [<ffffffff81121255>] complete+0x25/0x70
> [160613.725318]  [<ffffffffc07e7e80>]
> srpt_release_channel_work+0x180/0x210 [ib]
> [160613.736889]  [<ffffffff810e6dd8>] process_one_work+0x228/0x650
> [160613.746616]  [<ffffffff810e79be>] worker_thread+0x21e/0x800
> [160613.756047]  [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a
> [160613.765371]  [<ffffffff810e77a0>] ? kzalloc+0x30/0x30
> [160613.774203]  [<ffffffff810efc38>] kthread+0x118/0x150
> [160613.783000]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> [160613.792932]  [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70
> [160613.801994]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> [160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48
> c1 e9
> [160613.840260] RIP  [<ffffffff81125694>]
> native_queued_spin_lock_slowpath+0x2e0
> [160613.851846]  RSP <ffff881d02017d90>
> [160613.858812] CR2: 0000000000018630
> [160613.874762] ---[ end trace 820588f5ef5f6149 ]---
> [160613.937225] Kernel panic - not syncing: Fatal exception
> [160613.946167] Kernel Offset: disabled
> [160614.004693] ---[ end Kernel panic - not syncing: Fatal exception
> [-- MARK -- Sun Feb 14 15:50:00 2016]
> [-- dledford-CKb8VAQLn9hXrIkS9f7CXA@public.gmane.org@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb
> 14 15:5]
>
>
>
> Basic description of the situation that caused the oops:
>
> Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating
> away on 1 lun via two paths (active/passive setup)
>
> Run dnf upgrade (dnf is yum's replacement, so just a system wide
> software update).
>
> Get to the cleanup for targetcli/target-restore and it invokes an
> attempt to reload the target service while still in use.  During the
> process of deconfiguring the luns that are in use, this oops occurred.
> Sending the report to you because it appears to involve the
> multi-channel support.

Hello Doug,

As far as I know the session shutdown code in the LIO core has never 
worked reliably in the presence of active I/O in any upstream kernel 
version. All my tests of the ib_srpt patch series I submitted recently 
have been performed on top of a long series of bug fixes for the LIO 
core. The tree I have been testing is available at 
https://github.com/bvanassche/linux/tree/lio-tmf-fixes-2016-01-13. I 
have tried a few times to submit the LIO core patches to Nic Bellinger 
(making TMF handling synchronous + several fixes for race conditions 
related to session shutdown). Apparently Nic is trying to fix the 
existing approach for TMF handling (handling TMF from another context 
than the regular command execution context) but so far without success 
(see e.g. http://www.spinics.net/lists/target-devel/index.html#11822).

Bart.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-14 16:09 SRPt oops with 4.5-rc3-ish Doug Ledford
       [not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-02-28  3:37 ` Nicholas A. Bellinger
       [not found]   ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  2016-02-28  8:26   ` Nicholas A. Bellinger
  1 sibling, 2 replies; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28  3:37 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, target-devel

Hi Doug,

On Sun, 2016-02-14 at 11:09 -0500, Doug Ledford wrote:
> While testing with my latest kernel (rc3 plus pending RDMA patches), I
> ran across this oops:
> 
> [dledford@linux-ws ~]$ console rdma-storage-04
> Enter dledford@conserver-01.app.eng.rdu2.redhat.com's password:
> [Enter `^Ec?' for help]
> [-- MOTD -- https://home.corp.redhat.com/wiki/conserver]
> [playback]
> [160605.947614]  [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30
> [160605.954074]  [<ffffffffc0b3dd84>] target_fabric_nacl_base_release+0x64/0x70]
> [160605.963731]  [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0
> [160605.970579]  [<ffffffff813ccdf2>] config_item_put+0x62/0x80
> [160605.976936]  [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500
> [160605.983396]  [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220
> [160605.989375]  [<ffffffff813197db>] do_rmdir+0x1fb/0x260
> [160605.995244]  [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30
> [160606.001019]  [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [160606.009586] ---[ end trace 820588f5ef5f6148 ]---
> [160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x7f0ee700032d1de)
> [160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port has not d
> [160611.228909] ib_srpt Received IB DREQ ERROR event.
> [160613.276862] ib_srpt Received IB TimeWait exit for cm_id ffff881cc9dc7a00.
> [160613.290322] BUG: unable to handle kernel paging request at 0000000000018630
> [160613.301470] IP: [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e40
> [160613.313112] PGD 0
> [160613.318577] Oops: 0002 [#1] SMP
> [160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp ip6t_R]
> [160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G        W I     44
> [160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 084
> [160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt]
> [160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti: ffff881d020
> [160613.539130] RIP: 0010:[<ffffffff81125694>]  [<ffffffff81125694>] native_que0
> [160613.553326] RSP: 0018:ffff881d02017d90  EFLAGS: 00010006
> [160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX: 000000000001860
> [160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI: ffff880f2d7c7d8
> [160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09: 000000000000000
> [160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12: ffff880f2d7c7d0
> [160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15: 000000000000000
> [160613.617643] FS:  0000000000000000(0000) GS:ffff881d4c800000(0000) knlGS:0000
> [160613.629793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4: 000000000014060
> [160613.650471] Stack:
> [160613.655843]  ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8 ffffffff81a7
> [160613.667361]  ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255 ffff881cca70
> [160613.678885]  ffff881ce426d058 ffff881ce426d000 ffff881d02017e10 ffffffffc070
> [160613.690366] Call Trace:
> [160613.696195]  [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d
> [160613.706533]  [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0
> [160613.716586]  [<ffffffff81121255>] complete+0x25/0x70
> [160613.725318]  [<ffffffffc07e7e80>] srpt_release_channel_work+0x180/0x210 [ib]
> [160613.736889]  [<ffffffff810e6dd8>] process_one_work+0x228/0x650
> [160613.746616]  [<ffffffff810e79be>] worker_thread+0x21e/0x800
> [160613.756047]  [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a
> [160613.765371]  [<ffffffff810e77a0>] ? kzalloc+0x30/0x30
> [160613.774203]  [<ffffffff810efc38>] kthread+0x118/0x150
> [160613.783000]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> [160613.792932]  [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70
> [160613.801994]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> [160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48 c1 e9
> [160613.840260] RIP  [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e0
> [160613.851846]  RSP <ffff881d02017d90>
> [160613.858812] CR2: 0000000000018630
> [160613.874762] ---[ end trace 820588f5ef5f6149 ]---
> [160613.937225] Kernel panic - not syncing: Fatal exception
> [160613.946167] Kernel Offset: disabled
> [160614.004693] ---[ end Kernel panic - not syncing: Fatal exception
> [-- MARK -- Sun Feb 14 15:50:00 2016]
> [-- dledford@REDHAT.COM@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb
> 14 15:5]
> 
> 
> 
> Basic description of the situation that caused the oops:
> 
> Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating
> away on 1 lun via two paths (active/passive setup)
> 
> Run dnf upgrade (dnf is yum's replacement, so just a system wide
> software update).
> 
> Get to the cleanup for targetcli/target-restore and it invokes an
> attempt to reload the target service while still in use.  During the
> process of deconfiguring the luns that are in use, this oops occurred.
> Sending the report to you because it appears to involve the
> multi-channel support.
> 

This is a fairly recent srpt shutdown regression, right..?

Any chance to reproduce with full pr_debug enabled..?

I'm curious to see if HCH's changes in commit 59fae4dea to drop
ib_create_cq() w/ ib_comp_handler -> srpt_compl_thread() usage
are somehow involved.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found]   ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
@ 2016-02-28  4:18     ` Bart Van Assche
       [not found]       ` <56D274F8.9070804-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2016-02-28  4:18 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Doug Ledford; +Cc: linux-rdma, target-devel

On 02/27/16 19:37, Nicholas A. Bellinger wrote:
> This is a fairly recent srpt shutdown regression, right..?

Hi Nic,

My patch series to make TMR handling synchronous fixes what Doug 
reported. If you want I can rebase and repost that patch series.

Bart.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found]       ` <56D274F8.9070804-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-02-28  4:47         ` Nicholas A. Bellinger
  2016-02-28  4:49           ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28  4:47 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sat, 2016-02-27 at 20:18 -0800, Bart Van Assche wrote:
> On 02/27/16 19:37, Nicholas A. Bellinger wrote:
> > This is a fairly recent srpt shutdown regression, right..?
> 
> Hi Nic,
> 
> My patch series to make TMR handling synchronous fixes what Doug 
> reported. If you want I can rebase and repost that patch series.
> 

There aren't even any TMRs being processed, so I don't see how that has
anything to do with it.

From the logs, this oops is related to some manner of recent srpt
configfs se_node_acl + se_session active I/O shutdown regression.

So short of sitting down and reproducing myself on v4.5-rc code,
commit 59fae4de's removal of ib_create_cq() + ib_comp_handler callback
usage looks like a good place to start the investigation.

It would be useful to first find out what changes introduced this
regression, and how far back Doug is able to reproduce.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  4:47         ` Nicholas A. Bellinger
@ 2016-02-28  4:49           ` Bart Van Assche
  2016-02-28  5:00             ` Nicholas A. Bellinger
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2016-02-28  4:49 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: Doug Ledford, linux-rdma, target-devel

On 02/27/16 20:47, Nicholas A. Bellinger wrote:
> On Sat, 2016-02-27 at 20:18 -0800, Bart Van Assche wrote:
>> On 02/27/16 19:37, Nicholas A. Bellinger wrote:
>>> This is a fairly recent srpt shutdown regression, right..?
>>
>> Hi Nic,
>>
>> My patch series to make TMR handling synchronous fixes what Doug
>> reported. If you want I can rebase and repost that patch series.
>>
>
> There aren't even any TMRs being processed, so I don't see how that has
> anything to do with it.
>
> From the logs, this oops is related to some manner of recent srpt
> configfs se_node_acl + se_session active I/O shutdown regression.
>
> So short of sitting down and reproducing myself on v4.5-rc code,
> commit 59fae4de's removal of ib_create_cq() + ib_comp_handler callback
> usage look like a good place to start the investigation.
>
> It would be useful to first find out what changes introduced this
> regression, and how far back Doug is able to reproduce.

As I wrote before, this patch series is 100% stable on top of my most 
recent LIO core patch series, which I have also made available on 
GitHub. So what Doug ran into is a LIO core bug and not an ib_srpt bug.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  4:49           ` Bart Van Assche
@ 2016-02-28  5:00             ` Nicholas A. Bellinger
  2016-03-03 15:24               ` Doug Ledford
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28  5:00 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sat, 2016-02-27 at 20:49 -0800, Bart Van Assche wrote:
> On 02/27/16 20:47, Nicholas A. Bellinger wrote:
> > On Sat, 2016-02-27 at 20:18 -0800, Bart Van Assche wrote:
> >> On 02/27/16 19:37, Nicholas A. Bellinger wrote:
> >>> This is a fairly recent srpt shutdown regression, right..?
> >>
> >> Hi Nic,
> >>
> >> My patch series to make TMR handling synchronous fixes what Doug
> >> reported. If you want I can rebase and repost that patch series.
> >>
> >
> > There aren't even any TMRs being processed, so I don't see how that has
> > anything to do with it.
> >
> > From the logs, this oops is related to some manner of recent srpt
> > configfs se_node_acl + se_session active I/O shutdown regression.
> >
> > So short of sitting down and reproducing myself on v4.5-rc code,
> > commit 59fae4de's removal of ib_create_cq() + ib_comp_handler callback
> > usage looks like a good place to start the investigation.
> >
> > It would be useful to first find out what changes introduced this
> > regression, and how far back Doug is able to reproduce.
> 
> As I wrote before, this patch series works 100% stable on top of my most 
> recent LIO core patch series, a patch series I have also made available 
> on github. So what Doug ran into is a LIO core bug and not an ib_srpt bug.
> 

Active I/O shutdown with srpt has not always triggered this oops.

There is a reason why this is happening now, and it needs to be
identified.

Either you can help out doing that, or not.  Either way, I'm certainly
not going to let you hack up the LIO TMR code when there aren't even
signs that ABORT_TASK and friends are occurring in Doug's particular
shutdown case.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  3:37 ` Nicholas A. Bellinger
       [not found]   ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
@ 2016-02-28  8:26   ` Nicholas A. Bellinger
  2016-02-28 16:14     ` Bart Van Assche
       [not found]     ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  1 sibling, 2 replies; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28  8:26 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, target-devel

On Sat, 2016-02-27 at 19:37 -0800, Nicholas A. Bellinger wrote:
> Hi Doug,
> 
> On Sun, 2016-02-14 at 11:09 -0500, Doug Ledford wrote:
> > While testing with my latest kernel (rc3 plus pending RDMA patches), I
> > ran across this oops:
> > 
> > [dledford@linux-ws ~]$ console rdma-storage-04
> > Enter dledford@conserver-01.app.eng.rdu2.redhat.com's password:
> > [Enter `^Ec?' for help]
> > [-- MOTD -- https://home.corp.redhat.com/wiki/conserver]
> > [playback]
> > [160605.947614]  [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30
> > [160605.954074]  [<ffffffffc0b3dd84>] target_fabric_nacl_base_release+0x64/0x70]
> > [160605.963731]  [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0
> > [160605.970579]  [<ffffffff813ccdf2>] config_item_put+0x62/0x80
> > [160605.976936]  [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500
> > [160605.983396]  [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220
> > [160605.989375]  [<ffffffff813197db>] do_rmdir+0x1fb/0x260
> > [160605.995244]  [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30
> > [160606.001019]  [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71
> > [160606.009586] ---[ end trace 820588f5ef5f6148 ]---
> > [160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x7f0ee700032d1de)
> > [160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port has not d
> > [160611.228909] ib_srpt Received IB DREQ ERROR event.
> > [160613.276862] ib_srpt Received IB TimeWait exit for cm_id ffff881cc9dc7a00.
> > [160613.290322] BUG: unable to handle kernel paging request at 0000000000018630
> > [160613.301470] IP: [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e40
> > [160613.313112] PGD 0
> > [160613.318577] Oops: 0002 [#1] SMP
> > [160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp ip6t_R]
> > [160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G        W I     44
> > [160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 084
> > [160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt]
> > [160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti: ffff881d020
> > [160613.539130] RIP: 0010:[<ffffffff81125694>]  [<ffffffff81125694>] native_que0
> > [160613.553326] RSP: 0018:ffff881d02017d90  EFLAGS: 00010006
> > [160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX: 000000000001860
> > [160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI: ffff880f2d7c7d8
> > [160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09: 000000000000000
> > [160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12: ffff880f2d7c7d0
> > [160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15: 000000000000000
> > [160613.617643] FS:  0000000000000000(0000) GS:ffff881d4c800000(0000) knlGS:0000
> > [160613.629793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4: 000000000014060
> > [160613.650471] Stack:
> > [160613.655843]  ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8 ffffffff81a7
> > [160613.667361]  ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255 ffff881cca70
> > [160613.678885]  ffff881ce426d058 ffff881ce426d000 ffff881d02017e10 ffffffffc070
> > [160613.690366] Call Trace:
> > [160613.696195]  [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d
> > [160613.706533]  [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0
> > [160613.716586]  [<ffffffff81121255>] complete+0x25/0x70
> > [160613.725318]  [<ffffffffc07e7e80>] srpt_release_channel_work+0x180/0x210 [ib]
> > [160613.736889]  [<ffffffff810e6dd8>] process_one_work+0x228/0x650
> > [160613.746616]  [<ffffffff810e79be>] worker_thread+0x21e/0x800
> > [160613.756047]  [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a
> > [160613.765371]  [<ffffffff810e77a0>] ? kzalloc+0x30/0x30
> > [160613.774203]  [<ffffffff810efc38>] kthread+0x118/0x150
> > [160613.783000]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> > [160613.792932]  [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70
> > [160613.801994]  [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
> > [160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48 c1 e9
> > [160613.840260] RIP  [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e0
> > [160613.851846]  RSP <ffff881d02017d90>
> > [160613.858812] CR2: 0000000000018630
> > [160613.874762] ---[ end trace 820588f5ef5f6149 ]---
> > [160613.937225] Kernel panic - not syncing: Fatal exception
> > [160613.946167] Kernel Offset: disabled
> > [160614.004693] ---[ end Kernel panic - not syncing: Fatal exception
> > [-- MARK -- Sun Feb 14 15:50:00 2016]
> > [-- dledford@REDHAT.COM@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb
> > 14 15:5]
> > 
> > 
> > 
> > Basic description of the situation that caused the oops:
> > 
> > Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating
> > away on 1 lun via two paths (active/passive setup)
> > 
> > Run dnf upgrade (dnf is yum's replacement, so just a system wide
> > software update).
> > 
> > Get to the cleanup for targetcli/target-restore and it invokes an
> > attempt to reload the target service while still in use.  During the
> > process of deconfiguring the luns that are in use, this oops occurred.
> > Sending the report to you because it appears to involve the
> > multi-channel support.
> > 
> 
> This is a fairly recent srpt shutdown regression, right..?
> 
> Any chance to reproduce with full pr_debug enabled..?
> 
> I'm curious to see if HCH's changes in commit 59fae4dea to drop
> ib_create_cq() w/ ib_comp_handler -> srpt_compl_thread() usage
> are somehow involved.
> 

AFAIK, the last known-working srpt commit for se_node_acl + se_session
active I/O shutdown is:

ib_srpt: Call target_sess_cmd_list_set_waiting during shutdown_session
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/ulp/srpt?id=1d19f7800d

Note there are ~40 upstream commits between then and now in v4.5-rc5.

Please confirm when you started triggering this regression during target
service restart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  8:26   ` Nicholas A. Bellinger
@ 2016-02-28 16:14     ` Bart Van Assche
       [not found]       ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
       [not found]     ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2016-02-28 16:14 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Doug Ledford; +Cc: linux-rdma, target-devel

On 02/28/16 00:26, Nicholas A. Bellinger wrote:
> Please confirm when you started triggering this regression during target
> service restart.

Hi Nic,

Are you aware that Doug was not the first person to report this crash? I 
had already reported this crash myself seven weeks ago. Together with 
the report of this crash I had also sent a root cause analysis to you 
and a fix. In the patch description I explained clearly that this crash 
is caused by a bug in the LIO core and should be fixed in the LIO core. 
See also Bart Van Assche, [PATCH 07/21] target: Fix a use-after-free in 
core_tpg_del_initiator_node_acl(), target-devel mailing list, January 5, 
2016 
(http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891).

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found]       ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-02-28 20:43         ` Nicholas A. Bellinger
  2016-02-29  0:37           ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28 20:43 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sun, 2016-02-28 at 08:14 -0800, Bart Van Assche wrote:
> On 02/28/16 00:26, Nicholas A. Bellinger wrote:
> > Please confirm when you started triggering this regression during target
> > service restart.
> 
> Hi Nic,
> 
> Are you aware that Doug was not the first person to report this crash? I 
> had already reported this crash myself seven weeks ago. Together with 
> the report of this crash I had also sent a root cause analysis to you 
> and a fix.

As we've discussed, your analysis was incorrect.

http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891

Adding a second, new kref to se_session just for srpt is completely
wrong, and now that I've thrown out the legacy srpt_lookup_acl() junk in
v4.5-rc1, srpt can finally come out of the stone age wrt se_node_acl
shutdown.

>  In the patch description I explained clearly that this crash 
> is caused by a bug in the LIO core and should be fixed in the LIO core. 
> See also Bart Van Assche, [PATCH 07/21] target: Fix a use-after-free in 
> core_tpg_del_initiator_node_acl(), target-devel mailing list, January 5, 
> 2016 
> (http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891).
> 

Anyways, I'll sit down this week and figure out what's going on with
Doug's active I/O shutdown regression.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28 20:43         ` Nicholas A. Bellinger
@ 2016-02-29  0:37           ` Bart Van Assche
       [not found]             ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2016-02-29  0:37 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Doug Ledford, linux-rdma, target-devel, Christoph Hellwig

On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> Anyways, I'll sit down this week and figure out what's going on with
> Doug's active I/O shutdown regression.

The crash occurs in the core_tpg_del_initiator_node_acl() function
and a call to that function has been added recently in
target_fabric_nacl_base_release(). I think it was added through the
following patch:

commit c7d6a803926bae9bbf4510a18fc8dd8957cc0e01
Date:   Mon Apr 13 19:51:14 2015 +0200

    target: refactor init/drop_nodeacl methods
    
    By always allocating and adding, respectively removing and freeing
    the se_node_acl structure in core code we can remove tons of repeated
    code in the init_nodeacl and drop_nodeacl routines.  Additionally
    this now respects the get_default_queue_depth method in this code
    path as well.
    
Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found]             ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-02-29  6:05               ` Christoph Hellwig
  2016-03-01  6:49                 ` Nicholas A. Bellinger
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-02-29  6:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Nicholas A. Bellinger, Doug Ledford, linux-rdma, target-devel,
	Christoph Hellwig

On Sun, Feb 28, 2016 at 04:37:40PM -0800, Bart Van Assche wrote:
> On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> > Anyways, I'll sit down this week and figure out what's going on with
> > Doug's active I/O shutdown regression.
> 
> The crash occurs in the core_tpg_del_initiator_node_acl() function
> and a call to that function has been added recently in
> target_fabric_nacl_base_release(). I think it was added through the
> following patch:

That patch just moved the call from the .fabric_drop_nodeacl instances
(in the SRPT case srpt_drop_nodeacl) to the caller in
target_fabric_nacl_base_release.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-02-16  1:42   ` Bart Van Assche
@ 2016-02-29  9:11   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-02-29  9:11 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, nab-IzHhD5pYlfBP7FQvKIMDCQ

Hi Doug,

can you give my series at:

http://thread.gmane.org/gmane.linux.scsi.target.devel/11518

a try?  This sorts out the session list handling so that whoever
removes a session from the list on the node ACLs gets to tear it down
fully.  Previously the old kref papered over the lack of clear
responsibility here.

I think this should sort out this race, but as I can't reproduce it on
my SRP test setup I'm not 100% sure.

I've also uploaded a git tree to make your life easier, as it sits on
top of Nic's for-next branch:

	git://git.infradead.org/users/hch/scsi.git target-session-cleanup


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-29  6:05               ` Christoph Hellwig
@ 2016-03-01  6:49                 ` Nicholas A. Bellinger
  2016-03-01  7:16                   ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-03-01  6:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Bart Van Assche, Doug Ledford, linux-rdma, target-devel

On Mon, 2016-02-29 at 07:05 +0100, Christoph Hellwig wrote:
> On Sun, Feb 28, 2016 at 04:37:40PM -0800, Bart Van Assche wrote:
> > On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> > > Anyways, I'll sit down this week and figure out what's going on with
> > > Doug's active I/O shutdown regression.
> > 
> > The crash occurs in the core_tpg_del_initiator_node_acl() function
> > and a call to that function has been added recently in
> > target_fabric_nacl_base_release(). I think it was added through the
> > following patch:
> 
> That patch just moved the call from the .fabric_drop_nodeacl instances
> (in the SRPT case srpt_drop_nodeacl) to the caller in
> target_fabric_nacl_base_release.

I've not reproduced with v4.5-rc, but IIRC the pre-commit-59fae4de usage
of ib_create_cq() w/ srpt_compl_thread() -> kthread_stop(ch->thread) in
srpt_destroy_ch_ib() did play a role wrt IB CQ active I/O shutdown
completion in the original code.

Btw, is the original ib_create_cq() usage incompatible with chained RDMA
READ/WRITE requests, or was that an extra improvement..?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-03-01  6:49                 ` Nicholas A. Bellinger
@ 2016-03-01  7:16                   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-03-01  7:16 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Christoph Hellwig, Bart Van Assche, Doug Ledford, linux-rdma,
	target-devel

On Mon, Feb 29, 2016 at 10:49:58PM -0800, Nicholas A. Bellinger wrote:
> Btw, is the original ib_create_cq() usage incompatible with chained RDMA
> READ/WRITE requests, or was that an extra improvement..?

Old-style CQs can be used for chained requests; see the iser target
for an example.  It's just a lot more painful.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  5:00             ` Nicholas A. Bellinger
@ 2016-03-03 15:24               ` Doug Ledford
  0 siblings, 0 replies; 17+ messages in thread
From: Doug Ledford @ 2016-03-03 15:24 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Bart Van Assche; +Cc: linux-rdma, target-devel

[-- Attachment #1: Type: text/plain, Size: 2226 bytes --]

On 02/28/2016 12:00 AM, Nicholas A. Bellinger wrote:
> On Sat, 2016-02-27 at 20:49 -0800, Bart Van Assche wrote:
>> On 02/27/16 20:47, Nicholas A. Bellinger wrote:
>>> On Sat, 2016-02-27 at 20:18 -0800, Bart Van Assche wrote:
>>>> On 02/27/16 19:37, Nicholas A. Bellinger wrote:
>>>>> This is a fairly recent srpt shutdown regression, right..?
>>>>
>>>> Hi Nic,
>>>>
>>>> My patch series to make TMR handling synchronous fixes what Doug
>>>> reported. If you want I can rebase and repost that patch series.
>>>>
>>>
>>> There aren't even any TMRs being processed, so I don't see how that has
>>> anything to do with it.
>>>
>>> From the logs, this oops is related to some manner of recent srpt
>>> configfs se_node_acl + se_session active I/O shutdown regression.
>>>
>>> So short of sitting down and reproducing myself on v4.5-rc code,
>>> commit 59fae4de's removal of ib_create_cq() + ib_comp_handler callback
>>> usage look like a good place to start the investigation.
>>>
>>> It would be useful to first find out what changes introduced this
>>> regression, and how far back Doug is able to reproduce.
>>
>> As I wrote before, this patch series works 100% stably on top of my most
>> recent LIO core patch series, which I have also made available on
>> github.  So what Doug ran into is a LIO core bug and not an ib_srpt bug.
>>
> 
> Active I/O shutdown with srpt has not always triggered this OOPs.
> 
> There is a reason why this is happening now, and it needs to be
> identified.
> 
> Either you can help out doing that, or not.  Either way, I'm certainly
> not going to let you hack up LIO TMR code, when there aren't even signs
> ABORT_TASK and friends are occurring in Doug's particular shutdown case.
> 

Sorry I didn't notice this thread had picked back up, I was off on other
stuff.

I can't say if this is new or not.  We added some new testing that had
considerably more LUNs in use and more transfers taking place, and while
I was rebooting some actively used servers, I saw this issue.  It might
exist on earlier kernels; I would have to try them to know for sure.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: SRPt oops with 4.5-rc3-ish
       [not found]     ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
@ 2016-04-11 20:08       ` Doug Ledford
  0 siblings, 0 replies; 17+ messages in thread
From: Doug Ledford @ 2016-04-11 20:08 UTC (permalink / raw)
  To: Nicholas A. Bellinger; +Cc: Bart Van Assche, linux-rdma, target-devel

[-- Attachment #1: Type: text/plain, Size: 13469 bytes --]

On 02/28/2016 03:26 AM, Nicholas A. Bellinger wrote:

> AFAIK, the last known working srpt commit with se_node_acl + se_session
> active I/O shutdown is:
> 
> ib_srpt: Call target_sess_cmd_list_set_waiting during shutdown_session
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/ulp/srpt?id=1d19f7800d
> 
> Note this is ~40 upstream commits between then and now in v4.5-rc5.
> 
> Please confirm when you started triggering this regression during target
> service restart.

I don't have a clear answer for that, although it just happened again on
a v4.5-rc4 kernel.  It's pretty annoying because the trigger is (as
often as anything else) a yum upgrade process.  And it hangs midway
through the process.  I don't want to know how corrupted my RPM db or my
filesystem is :-(

Anyway, I have a clearer oops this time that I'll attach here, but this
will be my last one from this kernel as I'm upgrading to the most recent
v4.6-rc kernel.  If the oops still happens on v4.6-rc, I'll update here.

Here's the oops series, machine was useless after this (disk access was
blocked for all processes):

[4752021.950589] ------------[ cut here ]------------
[4752021.955992] WARNING: CPU: 5 PID: 10364 at
drivers/infiniband/ulp/srpt/ib_srpt.c:3251
srpt_close_session+0x12f/0x140 [ib_srpt]()
[4752021.969091] Modules linked in: hfi1(C) 8021q garp mrp
target_core_user uio target_core_pscsi target_core_file
target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set
nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc
ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security
ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4
nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser
libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp
scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm
ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp
kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt
crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si
sb_edac mei_me edac_core
[4752022.049588]  ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi
acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc
xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit
drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas
pps_core fjes [last unloaded: nf_conntrack_ipv6]
[4752022.080463] CPU: 5 PID: 10364 Comm: targetctl Tainted: G         CI
    4.5.0-0.rc4.git0.1.fc24.x86_64 #1
[4752022.091366] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
1.0.4 08/28/2014
[4752022.100131]  0000000000000286 00000000189b0c8a ffff880de32ffcc0
ffffffff813d3e0f
[4752022.108624]  0000000000000000 ffffffffa04872f0 ffff880de32ffcf8
ffffffff810a4fe2
[4752022.117126]  ffff881fd427a800 ffff88100fcb7000 0000000000000001
ffff88100fcb70e8
[4752022.125629] Call Trace:
[4752022.128565]  [<ffffffff813d3e0f>] dump_stack+0x63/0x84
[4752022.134513]  [<ffffffff810a4fe2>] warn_slowpath_common+0x82/0xc0
[4752022.141431]  [<ffffffff810a512a>] warn_slowpath_null+0x1a/0x20
[4752022.148155]  [<ffffffffa04830bf>] srpt_close_session+0x12f/0x140
[ib_srpt]
[4752022.156055]  [<ffffffffa0639de4>] target_release_session+0x24/0x30
[target_core_mod]
[4752022.164925]  [<ffffffffa063bb3d>] target_put_session+0x1d/0x20
[target_core_mod]
[4752022.173403]  [<ffffffffa06395eb>]
core_tpg_del_initiator_node_acl+0x16b/0x240 [target_core_mod]
[4752022.183343]  [<ffffffffa062d23f>]
target_fabric_nacl_base_release+0x3f/0x50 [target_core_mod]
[4752022.193082]  [<ffffffff812cc133>] config_item_release+0x63/0xd0
[4752022.199902]  [<ffffffff812cc1c2>] config_item_put+0x22/0x30
[4752022.206326]  [<ffffffff812ca676>] configfs_rmdir+0x1d6/0x2e0
[4752022.212857]  [<ffffffff8124ea0c>] vfs_rmdir+0xbc/0x130
[4752022.218803]  [<ffffffff81253c6a>] do_rmdir+0x19a/0x220
[4752022.224750]  [<ffffffff81254a16>] SyS_rmdir+0x16/0x20
[4752022.230598]  [<ffffffff817cd6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
[4752022.238009] ---[ end trace befc2f337e9f56d7 ]---
[4752027.739051] ib_srpt Received IB DREQ ERROR event.
[4752029.794988] ib_srpt Received IB TimeWait exit for cm_id
ffff881ff5d55800.
[4752029.807121] BUG: unable to handle kernel paging request at
0000000000017930
[4752029.815120] IP: [<ffffffff810ee9a5>]
queued_spin_lock_slowpath+0x105/0x190
[4752029.823015] PGD 0
[4752029.825466] Oops: 0002 [#1] SMP
[4752029.829286] Modules linked in: hfi1(C) 8021q garp mrp
target_core_user uio target_core_pscsi target_core_file
target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set
nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc
ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security
ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4
nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser
libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp
scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm
ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp
kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt
crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si
sb_edac mei_me edac_core
[4752029.913124]  ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi
acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc
xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit
drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas
pps_core fjes [last unloaded: nf_conntrack_ipv6]
[4752029.946121] CPU: 7 PID: 288828 Comm: kworker/7:0 Tainted: G
WCI     4.5.0-0.rc4.git0.1.fc24.x86_64 #1
[4752029.958057] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
1.0.4 08/28/2014
[4752029.967563] Workqueue: events srpt_release_channel_work [ib_srpt]
[4752029.975315] task: ffff8820352e5b80 ti: ffff881f5da10000 task.ti:
ffff881f5da10000
[4752029.984607] RIP: 0010:[<ffffffff810ee9a5>]  [<ffffffff810ee9a5>]
queued_spin_lock_slowpath+0x105/0x190
[4752029.995941] RSP: 0018:ffff881f5da13da8  EFLAGS: 00010006
[4752030.002790] RAX: 0000000000017930 RBX: 0000000000000286 RCX:
ffff88203d2d7900
[4752030.011668] RDX: 00000000000039eb RSI: 00000000e7b31ae8 RDI:
ffff880de32ffd20
[4752030.020528] RBP: ffff881f5da13da8 R08: 0000000000200000 R09:
0000000000000000
[4752030.029374] R10: 0000000000000000 R11: 000000000001a700 R12:
ffff880de32ffd18
[4752030.038206] R13: ffff881fd2c6b780 R14: ffff881fd427a800 R15:
ffff881fd427a8d0
[4752030.047025] FS:  0000000000000000(0000) GS:ffff88203d2c0000(0000)
knlGS:0000000000000000
[4752030.056913] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4752030.064174] CR2: 0000000000017930 CR3: 0000000de33db000 CR4:
00000000001406e0
[4752030.072995] Stack:
[4752030.076087]  ffff881f5da13dc0 ffffffff817cd4c7 ffff880de32ffd20
ffff881f5da13de8
[4752030.085236]  ffffffff810e7cfd ffff881fd427a8d0 ffff88100fcb7000
ffff881fd2c6b780
[4752030.094382]  ffff881f5da13e18 ffffffffa0485931 ffff881fc81c60c0
ffff88203d2d65c0
[4752030.103531] Call Trace:
[4752030.107120]  [<ffffffff817cd4c7>] _raw_spin_lock_irqsave+0x37/0x40
[4752030.114886]  [<ffffffff810e7cfd>] complete+0x1d/0x50
[4752030.121291]  [<ffffffffa0485931>]
srpt_release_channel_work+0xe1/0x140 [ib_srpt]
[4752030.130416]  [<ffffffff810bd6fd>] process_one_work+0x1ad/0x400
[4752030.137791]  [<ffffffff810bd99e>] worker_thread+0x4e/0x480
[4752030.144772]  [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.152327]  [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.159879]  [<ffffffff810c38e8>] kthread+0xd8/0xf0
[4752030.166170]  [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.173823]  [<ffffffff817cd9ff>] ret_from_fork+0x3f/0x70
[4752030.180702]  [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.188352] Code: 02 89 c2 45 31 c9 c1 e2 10 85 d2 74 41 c1 ea 12
83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 00 79 01 00 48 03 04 d5 00
d5 d3 81 <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b
[4752030.211521] RIP  [<ffffffff810ee9a5>]
queued_spin_lock_slowpath+0x105/0x190
[4752030.220180]  RSP <ffff881f5da13da8>
[4752030.224954] CR2: 0000000000017930
[4752030.231895] ---[ end trace befc2f337e9f56d8 ]---
[4752030.312493] BUG: unable to handle kernel paging request at
ffffffffffffffd8
[4752030.322906] IP: [<ffffffff810c3f80>] kthread_data+0x10/0x20
[4752030.331299] PGD 1c0d067 PUD 1c0f067 PMD 0
[4752030.337938] Oops: 0000 [#2] SMP
[4752030.343539] Modules linked in: hfi1(C) 8021q garp mrp
target_core_user uio target_core_pscsi target_core_file
target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set
nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc
ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security
ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4
nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser
libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp
scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm
ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp
kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt
crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si
sb_edac mei_me edac_core
[4752030.432786]  ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi
acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc
xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit
drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas
pps_core fjes [last unloaded: nf_conntrack_ipv6]
[4752030.467298] CPU: 7 PID: 288828 Comm: kworker/7:0 Tainted: G      D
WCI     4.5.0-0.rc4.git0.1.fc24.x86_64 #1
[4752030.479665] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS
1.0.4 08/28/2014
[4752030.489575] task: ffff8820352e5b80 ti: ffff881f5da10000 task.ti:
ffff881f5da10000
[4752030.499244] RIP: 0010:[<ffffffff810c3f80>]  [<ffffffff810c3f80>]
kthread_data+0x10/0x20
[4752030.509511] RSP: 0018:ffff881f5da13a80  EFLAGS: 00010002
[4752030.516747] RAX: 0000000000000000 RBX: 0000000000000007 RCX:
0000000000000007
[4752030.526034] RDX: ffff88103d410000 RSI: 0000000000000007 RDI:
ffff8820352e5b80
[4752030.535318] RBP: ffff881f5da13a80 R08: ffff8820352e5c28 R09:
ffff8820352e5c00
[4752030.544599] R10: 0000000000000000 R11: 000000000000002f R12:
0000000000016dc0
[4752030.553884] R13: ffff8820352e61d8 R14: ffff8820352e5b80 R15:
ffff88203d2d6dc0
[4752030.563161] FS:  0000000000000000(0000) GS:ffff88203d2c0000(0000)
knlGS:0000000000000000
[4752030.573516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4752030.581247] CR2: 0000000000000028 CR3: 0000000de33db000 CR4:
00000000001406e0
[4752030.590525] Stack:
[4752030.594064]  ffff881f5da13a98 ffffffff810be581 ffff88203d2d6dc0
ffff881f5da13ae8
[4752030.603691]  ffffffff817c91ba 00ff881f652b6478 ffff881f00000007
ffff8820352e5b80
[4752030.613311]  ffff881f5da10000 0000000000000000 ffff881f5da13b38
ffff881f5da135d0
[4752030.622926] Call Trace:
[4752030.626959]  [<ffffffff810be581>] wq_worker_sleeping+0x11/0x90
[4752030.634789]  [<ffffffff817c91ba>] __schedule+0x62a/0x9b0
[4752030.642030]  [<ffffffff817c957c>] schedule+0x3c/0x90
[4752030.648874]  [<ffffffff810a7f48>] do_exit+0x7a8/0xb30
[4752030.655813]  [<ffffffff8101992a>] oops_end+0x9a/0xd0
[4752030.662650]  [<ffffffff81067e7e>] no_context+0x13e/0x390
[4752030.669886]  [<ffffffff81068150>] __bad_area_nosemaphore+0x80/0x1f0
[4752030.678193]  [<ffffffff810682d3>] bad_area_nosemaphore+0x13/0x20
[4752030.686209]  [<ffffffff81068597>] __do_page_fault+0xb7/0x400
[4752030.693834]  [<ffffffff81068910>] do_page_fault+0x30/0x80
[4752030.701166]  [<ffffffff817cfa48>] page_fault+0x28/0x30
[4752030.708210]  [<ffffffff810ee9a5>] ?
queued_spin_lock_slowpath+0x105/0x190
[4752030.717062]  [<ffffffff817cd4c7>] _raw_spin_lock_irqsave+0x37/0x40
[4752030.725221]  [<ffffffff810e7cfd>] complete+0x1d/0x50
[4752030.731999]  [<ffffffffa0485931>]
srpt_release_channel_work+0xe1/0x140 [ib_srpt]
[4752030.741523]  [<ffffffff810bd6fd>] process_one_work+0x1ad/0x400
[4752030.749298]  [<ffffffff810bd99e>] worker_thread+0x4e/0x480
[4752030.756677]  [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.764626]  [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.772558]  [<ffffffff810c38e8>] kthread+0xd8/0xf0
[4752030.779231]  [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.787241]  [<ffffffff817cd9ff>] ret_from_fork+0x3f/0x70
[4752030.794438]  [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.802395] Code: 97 69 70 00 e9 53 ff ff ff e8 4d 0e fe ff 0f 1f
00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 e0 05 00 00 55
48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[4752030.826210] RIP  [<ffffffff810c3f80>] kthread_data+0x10/0x20
[4752030.833669]  RSP <ffff881f5da13a80>
[4752030.838651] CR2: ffffffffffffffd8
[4752030.843418] ---[ end trace befc2f337e9f56d9 ]---
[4752030.933774] Fixing recursive fault but reboot is needed!




-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-04-11 20:08 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-14 16:09 SRPt oops with 4.5-rc3-ish Doug Ledford
     [not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16  1:42   ` Bart Van Assche
2016-02-29  9:11   ` Christoph Hellwig
2016-02-28  3:37 ` Nicholas A. Bellinger
     [not found]   ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-02-28  4:18     ` Bart Van Assche
     [not found]       ` <56D274F8.9070804-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-28  4:47         ` Nicholas A. Bellinger
2016-02-28  4:49           ` Bart Van Assche
2016-02-28  5:00             ` Nicholas A. Bellinger
2016-03-03 15:24               ` Doug Ledford
2016-02-28  8:26   ` Nicholas A. Bellinger
2016-02-28 16:14     ` Bart Van Assche
     [not found]       ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-28 20:43         ` Nicholas A. Bellinger
2016-02-29  0:37           ` Bart Van Assche
     [not found]             ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-29  6:05               ` Christoph Hellwig
2016-03-01  6:49                 ` Nicholas A. Bellinger
2016-03-01  7:16                   ` Christoph Hellwig
     [not found]     ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-04-11 20:08       ` Doug Ledford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox