* SRPt oops with 4.5-rc3-ish
@ 2016-02-14 16:09 Doug Ledford
[not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-28 3:37 ` Nicholas A. Bellinger
0 siblings, 2 replies; 17+ messages in thread
From: Doug Ledford @ 2016-02-14 16:09 UTC (permalink / raw)
To: Bart Van Assche, linux-rdma
[-- Attachment #1: Type: text/plain, Size: 5132 bytes --]
While testing with my latest kernel (rc3 plus pending RDMA patches), I
ran across this oops:
[dledford@linux-ws ~]$ console rdma-storage-04
Enter dledford-Qj3k6FK1/F6RgOCG8Jv1mWcM3YSXpimQh7FX57BcuXVWk0Htik3J/w@public.gmane.org's password:
[Enter `^Ec?' for help]
[-- MOTD -- https://home.corp.redhat.com/wiki/conserver]
[playback]
[160605.947614] [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30
[160605.954074] [<ffffffffc0b3dd84>] target_fabric_nacl_base_release+0x64/0x70]
[160605.963731] [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0
[160605.970579] [<ffffffff813ccdf2>] config_item_put+0x62/0x80
[160605.976936] [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500
[160605.983396] [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220
[160605.989375] [<ffffffff813197db>] do_rmdir+0x1fb/0x260
[160605.995244] [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30
[160606.001019] [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71
[160606.009586] ---[ end trace 820588f5ef5f6148 ]---
[160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x7f0ee700032d1de)
[160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port has not d
[160611.228909] ib_srpt Received IB DREQ ERROR event.
[160613.276862] ib_srpt Received IB TimeWait exit for cm_id ffff881cc9dc7a00.
[160613.290322] BUG: unable to handle kernel paging request at 0000000000018630
[160613.301470] IP: [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e40
[160613.313112] PGD 0
[160613.318577] Oops: 0002 [#1] SMP
[160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp ip6t_R]
[160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G W I 44
[160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 084
[160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt]
[160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti: ffff881d020
[160613.539130] RIP: 0010:[<ffffffff81125694>] [<ffffffff81125694>] native_que0
[160613.553326] RSP: 0018:ffff881d02017d90 EFLAGS: 00010006
[160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX: 000000000001860
[160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI: ffff880f2d7c7d8
[160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09: 000000000000000
[160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12: ffff880f2d7c7d0
[160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15: 000000000000000
[160613.617643] FS: 0000000000000000(0000) GS:ffff881d4c800000(0000) knlGS:0000
[160613.629793] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4: 000000000014060
[160613.650471] Stack:
[160613.655843] ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8 ffffffff81a7
[160613.667361] ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255 ffff881cca70
[160613.678885] ffff881ce426d058 ffff881ce426d000 ffff881d02017e10 ffffffffc070
[160613.690366] Call Trace:
[160613.696195] [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d
[160613.706533] [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0
[160613.716586] [<ffffffff81121255>] complete+0x25/0x70
[160613.725318] [<ffffffffc07e7e80>] srpt_release_channel_work+0x180/0x210 [ib]
[160613.736889] [<ffffffff810e6dd8>] process_one_work+0x228/0x650
[160613.746616] [<ffffffff810e79be>] worker_thread+0x21e/0x800
[160613.756047] [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a
[160613.765371] [<ffffffff810e77a0>] ? kzalloc+0x30/0x30
[160613.774203] [<ffffffff810efc38>] kthread+0x118/0x150
[160613.783000] [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
[160613.792932] [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70
[160613.801994] [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0
[160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48 c1 e9
[160613.840260] RIP [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e0
[160613.851846] RSP <ffff881d02017d90>
[160613.858812] CR2: 0000000000018630
[160613.874762] ---[ end trace 820588f5ef5f6149 ]---
[160613.937225] Kernel panic - not syncing: Fatal exception
[160613.946167] Kernel Offset: disabled
[160614.004693] ---[ end Kernel panic - not syncing: Fatal exception
[-- MARK -- Sun Feb 14 15:50:00 2016]
[-- dledford-CKb8VAQLn9hXrIkS9f7CXA@public.gmane.org@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb 14 15:5]
Basic description of the situation that caused the oops:

Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating
away on 1 lun via two paths (active/passive setup).

Run dnf upgrade (dnf is yum's replacement, so just a system-wide
software update).

Get to the cleanup for targetcli/target-restore and it invokes an
attempt to reload the target service while still in use. During the
process of deconfiguring the luns that are in use, this oops occurred.
Sending the report to you because it appears to involve the
multi-channel support.
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG KeyID: 0E572FDD
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-16  1:42 Bart Van Assche
From: Bart Van Assche @ 2016-02-16 1:42 UTC (permalink / raw)
To: Doug Ledford, linux-rdma

On 02/14/16 08:09, Doug Ledford wrote:
> While testing with my latest kernel (rc3 plus pending RDMA patches), I
> ran across this oops:
> [...]
> Get to the cleanup for targetcli/target-restore and it invokes an
> attempt to reload the target service while still in use. During the
> process of deconfiguring the luns that are in use, this oops occurred.
> Sending the report to you because it appears to involve the
> multi-channel support.

Hello Doug,

As far as I know the session shutdown code in the LIO core has never
worked reliably in the presence of active I/O in any upstream kernel
version.

All my tests of the ib_srpt patch series I submitted recently have been
performed on top of a long series of bug fixes for the LIO core. The
tree I have been testing is available at
https://github.com/bvanassche/linux/tree/lio-tmf-fixes-2016-01-13.

I have tried a few times to submit the LIO core patches to Nic
Bellinger (making TMF handling synchronous + several fixes for race
conditions related to session shutdown). Apparently Nic is trying to
fix the existing approach for TMF handling (handling TMF from another
context than the regular command execution context) but so far without
success (see e.g. http://www.spinics.net/lists/target-devel/index.html#11822).

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-29  9:11 Christoph Hellwig
From: Christoph Hellwig @ 2016-02-29 9:11 UTC (permalink / raw)
To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, nab-IzHhD5pYlfBP7FQvKIMDCQ

Hi Doug,

can you give my series at:

http://thread.gmane.org/gmane.linux.scsi.target.devel/11518

a try? This sorts out the session list handling so that whoever removes
a session from the list on the node ACLs gets to tear it down fully.
Previously the old kref papered over the lack of clear responsibility
here.

I think this should sort out this race, but as I can't reproduce it on
my SRP test setup I'm not 100% sure.

I've also uploaded a git tree to make your life easier, as it sits on
top of Nic's for-next branch:

git://git.infradead.org/users/hch/scsi.git target-session-cleanup
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-28  3:37 Nicholas A. Bellinger
From: Nicholas A. Bellinger @ 2016-02-28 3:37 UTC (permalink / raw)
To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, target-devel

Hi Doug,

On Sun, 2016-02-14 at 11:09 -0500, Doug Ledford wrote:
> While testing with my latest kernel (rc3 plus pending RDMA patches), I
> ran across this oops:
> [...]
> Sending the report to you because it appears to involve the
> multi-channel support.

This is a fairly recent srpt shutdown regression, right..?

Any chance to reproduce with full pr_debug enabled..?

I'm curious to see if HCH's changes in commit 59fae4dea to drop
ib_create_cq() w/ ib_comp_handler -> srpt_compl_thread() usage
are somehow involved.
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-28  4:18 Bart Van Assche
From: Bart Van Assche @ 2016-02-28 4:18 UTC (permalink / raw)
To: Nicholas A. Bellinger, Doug Ledford; +Cc: linux-rdma, target-devel

On 02/27/16 19:37, Nicholas A. Bellinger wrote:
> This is a fairly recent srpt shutdown regression, right..?

Hi Nic,

My patch series to make TMR handling synchronous fixes what Doug
reported. If you want I can rebase and repost that patch series.

Bart.
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-28  4:47 Nicholas A. Bellinger
From: Nicholas A. Bellinger @ 2016-02-28 4:47 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sat, 2016-02-27 at 20:18 -0800, Bart Van Assche wrote:
> On 02/27/16 19:37, Nicholas A. Bellinger wrote:
> > This is a fairly recent srpt shutdown regression, right..?
>
> Hi Nic,
>
> My patch series to make TMR handling synchronous fixes what Doug
> reported. If you want I can rebase and repost that patch series.

There aren't even any TMRs being processed, so I don't see how that has
anything to do with it.

From the logs, this OOPsen is related to some manner of recent srpt
configfs se_node_acl + se_session active I/O shutdown regression.

So short of sitting down and reproducing myself on v4.5-rc code,
commit 59fae4de's removal of ib_create_cq() + ib_comp_handler callback
usage looks like a good place to start the investigation.

It would be useful to first find out what changes introduced this
regression, and how far back Doug is able to reproduce.
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-28  4:49 Bart Van Assche
From: Bart Van Assche @ 2016-02-28 4:49 UTC (permalink / raw)
To: Nicholas A. Bellinger; +Cc: Doug Ledford, linux-rdma, target-devel

On 02/27/16 20:47, Nicholas A. Bellinger wrote:
> There aren't even any TMRs being processed, so I don't see how that has
> anything to do with it.
> [...]
> It would be useful to first find out what changes introduced this
> regression, and how far back Doug is able to reproduce.

As I wrote before, this patch series works 100% stable on top of my most
recent LIO core patch series, a patch series I have also made available
on github. So what Doug ran into is a LIO core bug and not an ib_srpt bug.

Bart.
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-02-28  5:00 Nicholas A. Bellinger
From: Nicholas A. Bellinger @ 2016-02-28 5:00 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sat, 2016-02-27 at 20:49 -0800, Bart Van Assche wrote:
> As I wrote before, this patch series works 100% stable on top of my most
> recent LIO core patch series, a patch series I have also made available
> on github. So what Doug ran into is a LIO core bug and not an ib_srpt bug.

Active I/O shutdown with srpt has not always triggered this OOPs.

There is a reason why this is happening now, and it needs to be
identified.

Either you can help out doing that, or not. Either way, I'm certainly
not going to let you hack up LIO TMR code, when there aren't even signs
ABORT_TASK and friends are occurring in Doug's particular shutdown case.
* Re: SRPt oops with 4.5-rc3-ish
@ 2016-03-03 15:24 Doug Ledford
From: Doug Ledford @ 2016-03-03 15:24 UTC (permalink / raw)
To: Nicholas A. Bellinger, Bart Van Assche; +Cc: linux-rdma, target-devel

On 02/28/2016 12:00 AM, Nicholas A. Bellinger wrote:
> Active I/O shutdown with srpt has not always triggered this OOPs.
>
> There is a reason why this is happening now, and it needs to be
> identified.
> [...]

Sorry I didn't notice this thread had picked back up; I was off on other
stuff.

I can't say if this is new or not. We added some new testing that had
considerably more luns in use and more transfers taking place, and while
I was rebooting some actively used servers, I saw this issue. It might
exist on earlier kernels; I would have to try them to know for sure.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
* Re: SRPt oops with 4.5-rc3-ish 2016-02-28 3:37 ` Nicholas A. Bellinger [not found] ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org> @ 2016-02-28 8:26 ` Nicholas A. Bellinger 2016-02-28 16:14 ` Bart Van Assche [not found] ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org> 1 sibling, 2 replies; 17+ messages in thread From: Nicholas A. Bellinger @ 2016-02-28 8:26 UTC (permalink / raw) To: Doug Ledford; +Cc: Bart Van Assche, linux-rdma, target-devel On Sat, 2016-02-27 at 19:37 -0800, Nicholas A. Bellinger wrote: > Hi Doug, > > On Sun, 2016-02-14 at 11:09 -0500, Doug Ledford wrote: > > While testing with my latest kernel (rc3 plus pening RDMA patches), I > > ran across this oops: > > > > [dledford@linux-ws ~]$ console rdma-storage-04 > > Enter dledford@conserver-01.app.eng.rdu2.redhat.com's password: > > [Enter `^Ec?' for help] > > [-- MOTD -- https://home.corp.redhat.com/wiki/conserver] > > [playback] > > [160605.947614] [<ffffffff81150545>] ? call_rcu_sched+0x25/0x30 > > [160605.954074] [<ffffffffc0b3dd84>] target_fabric_nacl_base_release+0x64/0x70] > > [160605.963731] [<ffffffff813ccc6f>] config_item_release+0x9f/0x1c0 > > [160605.970579] [<ffffffff813ccdf2>] config_item_put+0x62/0x80 > > [160605.976936] [<ffffffff813c97d3>] configfs_rmdir+0x343/0x500 > > [160605.983396] [<ffffffff8131287a>] vfs_rmdir+0x13a/0x220 > > [160605.989375] [<ffffffff813197db>] do_rmdir+0x1fb/0x260 > > [160605.995244] [<ffffffff8131adde>] SyS_rmdir+0x1e/0x30 > > [160606.001019] [<ffffffff81a0922e>] entry_SYSCALL_64_fastpath+0x12/0x71 > > [160606.009586] ---[ end trace 820588f5ef5f6148 ]--- > > [160607.051593] ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x7f0ee700032d1de) > > [160607.078225] ib_srpt rejected SRP_LOGIN_REQ because the target port has not d > > [160611.228909] ib_srpt Received IB DREQ ERROR event. > > [160613.276862] ib_srpt Received IB TimeWait exit for cm_id ffff881cc9dc7a00. 
> > [160613.290322] BUG: unable to handle kernel paging request at 0000000000018630 > > [160613.301470] IP: [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e40 > > [160613.313112] PGD 0 > > [160613.318577] Oops: 0002 [#1] SMP > > [160613.325358] Modules linked in: nfnetlink(+) ip6t_rpfilter 8021q garp ip6t_R] > > [160613.492357] CPU: 1 PID: 44982 Comm: kworker/1:1 Tainted: G W I 44 > > [160613.505978] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 084 > > [160613.517697] Workqueue: events srpt_release_channel_work [ib_srpt] > > [160613.527634] task: ffff881d01099000 ti: ffff881d02014000 task.ti: ffff881d020 > > [160613.539130] RIP: 0010:[<ffffffff81125694>] [<ffffffff81125694>] native_que0 > > [160613.553326] RSP: 0018:ffff881d02017d90 EFLAGS: 00010006 > > [160613.562332] RAX: 00000000000000ea RBX: 0000000000000206 RCX: 000000000001860 > > [160613.573401] RDX: 0000000000080000 RSI: ffff881d4c818600 RDI: ffff880f2d7c7d8 > > [160613.584472] RBP: ffff881d02017d90 R08: 0000000000000023 R09: 000000000000000 > > [160613.595491] R10: 00000000ffffffd8 R11: 00000000000211c0 R12: ffff880f2d7c7d0 > > [160613.606568] R13: ffff881ce426d000 R14: ffff881cca702a00 R15: 000000000000000 > > [160613.617643] FS: 0000000000000000(0000) GS:ffff881d4c800000(0000) knlGS:0000 > > [160613.629793] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [160613.639315] CR2: 0000000000018630 CR3: 0000000001ca9000 CR4: 000000000014060 > > [160613.650471] Stack: > > [160613.655843] ffff881d02017da0 ffffffff8122ac4c ffff881d02017db8 ffffffff81a7 > > [160613.667361] ffff880f2d7c7d18 ffff881d02017de0 ffffffff81121255 ffff881cca70 > > [160613.678885] ffff881ce426d058 ffff881ce426d000 ffff881d02017e10 ffffffffc070 > > [160613.690366] Call Trace: > > [160613.696195] [<ffffffff8122ac4c>] queued_spin_lock_slowpath+0x12/0x1d > > [160613.706533] [<ffffffff81a08ea7>] _raw_spin_lock_irqsave+0x87/0xa0 > > [160613.716586] [<ffffffff81121255>] complete+0x25/0x70 > > [160613.725318] 
[<ffffffffc07e7e80>] srpt_release_channel_work+0x180/0x210 [ib] > > [160613.736889] [<ffffffff810e6dd8>] process_one_work+0x228/0x650 > > [160613.746616] [<ffffffff810e79be>] worker_thread+0x21e/0x800 > > [160613.756047] [<ffffffff81a02035>] ? __schedule+0x4b5/0xe6a > > [160613.765371] [<ffffffff810e77a0>] ? kzalloc+0x30/0x30 > > [160613.774203] [<ffffffff810efc38>] kthread+0x118/0x150 > > [160613.783000] [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0 > > [160613.792932] [<ffffffff81a0958f>] ret_from_fork+0x3f/0x70 > > [160613.801994] [<ffffffff810efb20>] ? flush_kthread_worker+0xd0/0xd0 > > [160613.811897] Code: 01 00 00 74 ec e9 d7 fd ff ff 48 89 c1 c1 e8 12 48 c1 e9 > > [160613.840260] RIP [<ffffffff81125694>] native_queued_spin_lock_slowpath+0x2e0 > > [160613.851846] RSP <ffff881d02017d90> > > [160613.858812] CR2: 0000000000018630 > > [160613.874762] ---[ end trace 820588f5ef5f6149 ]--- > > [160613.937225] Kernel panic - not syncing: Fatal exception > > [160613.946167] Kernel Offset: disabled > > [160614.004693] ---[ end Kernel panic - not syncing: Fatal exception > > [-- MARK -- Sun Feb 14 15:50:00 2016] > > [-- dledford@REDHAT.COM@ovpn-116-26.rdu2.redhat.com attached -- Sun Feb > > 14 15:5] > > > > > > > > Basic description of situation that cause the oops: > > > > Server with 30+ SRPt luns, 2 SRP devices, 1 active client busy beating > > away on 1 lun via two paths (active/passive setup) > > > > Run dnf upgrade (dnf is yum's replacement, so just a system wide > > software update). > > > > Get to the cleanup for targetcli/target-restore and it invokes an > > attempt to reload the target service while still in use. During the > > process of deconfiguring the luns that are in use, this oops occurred. > > Sending the report to you because it appears to involve the > > multi-channel support. > > > > This is a fairly recent srpt shutdown regression, right..? > > Any chance to reproduce with full pr_debug enabled..? 
> I'm curious to see if HCH's changes in commit 59fae4dea to drop
> ib_create_cq() w/ ib_comp_handler -> srpt_compl_thread() usage
> are somehow involved.

AFAIK, the oldest last-working srpt commit with se_node_acl + se_session
active I/O shutdown is:

  ib_srpt: Call target_sess_cmd_list_set_waiting during shutdown_session
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/ulp/srpt?id=1d19f7800d

Note there are ~40 upstream commits between then and now in v4.5-rc5.

Please confirm when you started triggering this regression during target
service restart.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28  8:26 ` Nicholas A. Bellinger
@ 2016-02-28 16:14   ` Bart Van Assche
  [not found]         ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  [not found]         ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  1 sibling, 2 replies; 17+ messages in thread
From: Bart Van Assche @ 2016-02-28 16:14 UTC (permalink / raw)
To: Nicholas A. Bellinger, Doug Ledford; +Cc: linux-rdma, target-devel

On 02/28/16 00:26, Nicholas A. Bellinger wrote:
> Please confirm when you started triggering this regression during target
> service restart.

Hi Nic,

Are you aware that Doug was not the first person to report this crash? I
had already reported this crash myself seven weeks ago. Together with the
report of this crash I had also sent you a root-cause analysis and a fix.
In the patch description I explained clearly that this crash is caused by
a bug in the LIO core and should be fixed in the LIO core. See also Bart
Van Assche, [PATCH 07/21] target: Fix a use-after-free in
core_tpg_del_initiator_node_acl(), target-devel mailing list, January 5,
2016 (http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891).

Bart.
[parent not found: <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>]
* Re: SRPt oops with 4.5-rc3-ish
  [not found]         ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-02-28 20:43     ` Nicholas A. Bellinger
  2016-02-29  0:37       ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-02-28 20:43 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Doug Ledford, linux-rdma, target-devel

On Sun, 2016-02-28 at 08:14 -0800, Bart Van Assche wrote:
> On 02/28/16 00:26, Nicholas A. Bellinger wrote:
> > Please confirm when you started triggering this regression during target
> > service restart.
>
> Hi Nic,
>
> Are you aware that Doug was not the first person to report this crash? I
> had already reported this crash myself seven weeks ago. Together with
> the report of this crash I had also sent a root cause analysis to you
> and a fix.

As we've discussed, your analysis was incorrect.

http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891

Adding a second, new kref to se_session just for srpt is completely
wrong, and now that I've thrown out the legacy srpt_lookup_acl() junk in
v4.5-rc1, srpt can finally come out of the stone age wrt se_node_acl
shutdown.

> In the patch description I explained clearly that this crash
> is caused by a bug in the LIO core and should be fixed in the LIO core.
> See also Bart Van Assche, [PATCH 07/21] target: Fix a use-after-free in
> core_tpg_del_initiator_node_acl(), target-devel mailing list, January 5,
> 2016
> (http://thread.gmane.org/gmane.linux.scsi.target.devel/10905/focus=10891).

Anyways, I'll sit down this week and figure out what's going on with
Doug's active I/O shutdown regression.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: SRPt oops with 4.5-rc3-ish
  2016-02-28 20:43     ` Nicholas A. Bellinger
@ 2016-02-29  0:37       ` Bart Van Assche
  [not found]             ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2016-02-29 0:37 UTC (permalink / raw)
To: Nicholas A. Bellinger
Cc: Doug Ledford, linux-rdma, target-devel, Christoph Hellwig

On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> Anyways, I'll sit down this week and figure out what's going on with
> Doug's active I/O shutdown regression.

The crash occurs in the core_tpg_del_initiator_node_acl() function and a
call to that function has been added recently in
target_fabric_nacl_base_release(). I think it was added through the
following patch:

commit c7d6a803926bae9bbf4510a18fc8dd8957cc0e01
Date:   Mon Apr 13 19:51:14 2015 +0200

    target: refactor init/drop_nodeacl methods

    By always allocating and adding, respectively removing and freeing
    the se_node_acl structure in core code we can remove tons of repeated
    code in the init_nodeacl and drop_nodeacl routines. Additionally this
    now respects the get_default_queue_depth method in this code path as
    well.

Bart.
[parent not found: <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>]
* Re: SRPt oops with 4.5-rc3-ish
  [not found]             ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-02-29  6:05           ` Christoph Hellwig
  2016-03-01  6:49             ` Nicholas A. Bellinger
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-02-29 6:05 UTC (permalink / raw)
To: Bart Van Assche
Cc: Nicholas A. Bellinger, Doug Ledford, linux-rdma, target-devel,
	Christoph Hellwig

On Sun, Feb 28, 2016 at 04:37:40PM -0800, Bart Van Assche wrote:
> On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> > Anyways, I'll sit down this week and figure out what's going on with
> > Doug's active I/O shutdown regression.
>
> The crash occurs in the core_tpg_del_initiator_node_acl() function
> and a call to that function has been added recently in
> target_fabric_nacl_base_release(). I think it was added through the
> following patch:

That patch just moved the call from the .fabric_drop_nodeacl instances
(in the SRPT case srpt_drop_nodeacl) to the caller in
target_fabric_nacl_base_release.
* Re: SRPt oops with 4.5-rc3-ish
  2016-02-29  6:05           ` Christoph Hellwig
@ 2016-03-01  6:49             ` Nicholas A. Bellinger
  2016-03-01  7:16               ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas A. Bellinger @ 2016-03-01 6:49 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Bart Van Assche, Doug Ledford, linux-rdma, target-devel

On Mon, 2016-02-29 at 07:05 +0100, Christoph Hellwig wrote:
> On Sun, Feb 28, 2016 at 04:37:40PM -0800, Bart Van Assche wrote:
> > On 02/28/16 12:43, Nicholas A. Bellinger wrote:
> > > Anyways, I'll sit down this week and figure out what's going on with
> > > Doug's active I/O shutdown regression.
> >
> > The crash occurs in the core_tpg_del_initiator_node_acl() function
> > and a call to that function has been added recently in
> > target_fabric_nacl_base_release(). I think it was added through the
> > following patch:
>
> That patch just moved the call from the .fabric_drop_nodeacl instances
> (in the SRPT case srpt_drop_nodeacl) to the caller in
> target_fabric_nacl_base_release.

I've not reproduced with v4.5-rc, but IIRC the pre-commit 59fae4de usage
of ib_create_cq() w/ srpt_compl_thread() -> kthread_stop(ch->thread) in
srpt_destroy_ch_ib() did play a role wrt IB CQ active I/O shutdown
completion in the original code.

Btw, is the original ib_create_cq() usage incompatible with chained RDMA
READ/WRITE requests, or was that an extra improvement..?
* Re: SRPt oops with 4.5-rc3-ish
  2016-03-01  6:49             ` Nicholas A. Bellinger
@ 2016-03-01  7:16               ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-03-01 7:16 UTC (permalink / raw)
To: Nicholas A. Bellinger
Cc: Christoph Hellwig, Bart Van Assche, Doug Ledford, linux-rdma,
	target-devel

On Mon, Feb 29, 2016 at 10:49:58PM -0800, Nicholas A. Bellinger wrote:
> Btw, is the original ib_create_cq() usage incompatible with chained RDMA
> READ/WRITE requests, or was that an extra improvement..?

Old-style CQs can be used for chained requests, see the iser target for
an example. It's just a lot more painful.
[parent not found: <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>]
* Re: SRPt oops with 4.5-rc3-ish
  [not found]         ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
@ 2016-04-11 20:08   ` Doug Ledford
  0 siblings, 0 replies; 17+ messages in thread
From: Doug Ledford @ 2016-04-11 20:08 UTC (permalink / raw)
To: Nicholas A. Bellinger; +Cc: Bart Van Assche, linux-rdma, target-devel

[-- Attachment #1: Type: text/plain, Size: 13469 bytes --]

On 02/28/2016 03:26 AM, Nicholas A. Bellinger wrote:
> AFAIK, the oldest last working srpt commit with se_node_acl + se_session
> active I/O shutdown is:
>
> ib_srpt: Call target_sess_cmd_list_set_waiting during shutdown_session
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/ulp/srpt?id=1d19f7800d
>
> Note this is ~40 upstream commits between then and now in v4.5-rc5.
>
> Please confirm when you started triggering this regression during target
> service restart.

I don't have a clear answer for that, although it just happened again on
a v4.5-rc4 kernel. It's pretty annoying because the trigger is (as often
as anything else) a yum upgrade process. And it hangs midway through the
process. I don't want to know how corrupted my RPM db or my filesystem
is :-(

Anyway, I have a clearer oops this time that I'll attach here, but this
will be my last one from this kernel as I'm upgrading to the most recent
v4.6-rc kernel. If the oops still happens on v4.6-rc, I'll update here.
Here's the oops series, machine was useless after this (disk access was blocked for all processes): [4752021.950589] ------------[ cut here ]------------ [4752021.955992] WARNING: CPU: 5 PID: 10364 at drivers/infiniband/ulp/srpt/ib_srpt.c:3251 srpt_close_session+0x12f/0x140 [ib_srpt]() [4752021.969091] Modules linked in: hfi1(C) 8021q garp mrp target_core_user uio target_core_pscsi target_core_file target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4 nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si sb_edac mei_me edac_core [4752022.049588] ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas pps_core fjes [last unloaded: nf_conntrack_ipv6] [4752022.080463] CPU: 5 PID: 10364 Comm: targetctl Tainted: G CI 4.5.0-0.rc4.git0.1.fc24.x86_64 #1 [4752022.091366] Hardware name: Dell Inc. 
PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014 [4752022.100131] 0000000000000286 00000000189b0c8a ffff880de32ffcc0 ffffffff813d3e0f [4752022.108624] 0000000000000000 ffffffffa04872f0 ffff880de32ffcf8 ffffffff810a4fe2 [4752022.117126] ffff881fd427a800 ffff88100fcb7000 0000000000000001 ffff88100fcb70e8 [4752022.125629] Call Trace: [4752022.128565] [<ffffffff813d3e0f>] dump_stack+0x63/0x84 [4752022.134513] [<ffffffff810a4fe2>] warn_slowpath_common+0x82/0xc0 [4752022.141431] [<ffffffff810a512a>] warn_slowpath_null+0x1a/0x20 [4752022.148155] [<ffffffffa04830bf>] srpt_close_session+0x12f/0x140 [ib_srpt] [4752022.156055] [<ffffffffa0639de4>] target_release_session+0x24/0x30 [target_core_mod] [4752022.164925] [<ffffffffa063bb3d>] target_put_session+0x1d/0x20 [target_core_mod] [4752022.173403] [<ffffffffa06395eb>] core_tpg_del_initiator_node_acl+0x16b/0x240 [target_core_mod] [4752022.183343] [<ffffffffa062d23f>] target_fabric_nacl_base_release+0x3f/0x50 [target_core_mod] [4752022.193082] [<ffffffff812cc133>] config_item_release+0x63/0xd0 [4752022.199902] [<ffffffff812cc1c2>] config_item_put+0x22/0x30 [4752022.206326] [<ffffffff812ca676>] configfs_rmdir+0x1d6/0x2e0 [4752022.212857] [<ffffffff8124ea0c>] vfs_rmdir+0xbc/0x130 [4752022.218803] [<ffffffff81253c6a>] do_rmdir+0x19a/0x220 [4752022.224750] [<ffffffff81254a16>] SyS_rmdir+0x16/0x20 [4752022.230598] [<ffffffff817cd6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d [4752022.238009] ---[ end trace befc2f337e9f56d7 ]--- [4752027.739051] ib_srpt Received IB DREQ ERROR event. [4752029.794988] ib_srpt Received IB TimeWait exit for cm_id ffff881ff5d55800. 
[4752029.807121] BUG: unable to handle kernel paging request at 0000000000017930 [4752029.815120] IP: [<ffffffff810ee9a5>] queued_spin_lock_slowpath+0x105/0x190 [4752029.823015] PGD 0 [4752029.825466] Oops: 0002 [#1] SMP [4752029.829286] Modules linked in: hfi1(C) 8021q garp mrp target_core_user uio target_core_pscsi target_core_file target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4 nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si sb_edac mei_me edac_core [4752029.913124] ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas pps_core fjes [last unloaded: nf_conntrack_ipv6] [4752029.946121] CPU: 7 PID: 288828 Comm: kworker/7:0 Tainted: G WCI 4.5.0-0.rc4.git0.1.fc24.x86_64 #1 [4752029.958057] Hardware name: Dell Inc. 
PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014 [4752029.967563] Workqueue: events srpt_release_channel_work [ib_srpt] [4752029.975315] task: ffff8820352e5b80 ti: ffff881f5da10000 task.ti: ffff881f5da10000 [4752029.984607] RIP: 0010:[<ffffffff810ee9a5>] [<ffffffff810ee9a5>] queued_spin_lock_slowpath+0x105/0x190 [4752029.995941] RSP: 0018:ffff881f5da13da8 EFLAGS: 00010006 [4752030.002790] RAX: 0000000000017930 RBX: 0000000000000286 RCX: ffff88203d2d7900 [4752030.011668] RDX: 00000000000039eb RSI: 00000000e7b31ae8 RDI: ffff880de32ffd20 [4752030.020528] RBP: ffff881f5da13da8 R08: 0000000000200000 R09: 0000000000000000 [4752030.029374] R10: 0000000000000000 R11: 000000000001a700 R12: ffff880de32ffd18 [4752030.038206] R13: ffff881fd2c6b780 R14: ffff881fd427a800 R15: ffff881fd427a8d0 [4752030.047025] FS: 0000000000000000(0000) GS:ffff88203d2c0000(0000) knlGS:0000000000000000 [4752030.056913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [4752030.064174] CR2: 0000000000017930 CR3: 0000000de33db000 CR4: 00000000001406e0 [4752030.072995] Stack: [4752030.076087] ffff881f5da13dc0 ffffffff817cd4c7 ffff880de32ffd20 ffff881f5da13de8 [4752030.085236] ffffffff810e7cfd ffff881fd427a8d0 ffff88100fcb7000 ffff881fd2c6b780 [4752030.094382] ffff881f5da13e18 ffffffffa0485931 ffff881fc81c60c0 ffff88203d2d65c0 [4752030.103531] Call Trace: [4752030.107120] [<ffffffff817cd4c7>] _raw_spin_lock_irqsave+0x37/0x40 [4752030.114886] [<ffffffff810e7cfd>] complete+0x1d/0x50 [4752030.121291] [<ffffffffa0485931>] srpt_release_channel_work+0xe1/0x140 [ib_srpt] [4752030.130416] [<ffffffff810bd6fd>] process_one_work+0x1ad/0x400 [4752030.137791] [<ffffffff810bd99e>] worker_thread+0x4e/0x480 [4752030.144772] [<ffffffff810bd950>] ? process_one_work+0x400/0x400 [4752030.152327] [<ffffffff810bd950>] ? process_one_work+0x400/0x400 [4752030.159879] [<ffffffff810c38e8>] kthread+0xd8/0xf0 [4752030.166170] [<ffffffff810c3810>] ? 
kthread_worker_fn+0x180/0x180 [4752030.173823] [<ffffffff817cd9ff>] ret_from_fork+0x3f/0x70 [4752030.180702] [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180 [4752030.188352] Code: 02 89 c2 45 31 c9 c1 e2 10 85 d2 74 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 00 79 01 00 48 03 04 d5 00 d5 d3 81 <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b [4752030.211521] RIP [<ffffffff810ee9a5>] queued_spin_lock_slowpath+0x105/0x190 [4752030.220180] RSP <ffff881f5da13da8> [4752030.224954] CR2: 0000000000017930 [4752030.231895] ---[ end trace befc2f337e9f56d8 ]--- [4752030.312493] BUG: unable to handle kernel paging request at ffffffffffffffd8 [4752030.322906] IP: [<ffffffff810c3f80>] kthread_data+0x10/0x20 [4752030.331299] PGD 1c0d067 PUD 1c0f067 PMD 0 [4752030.337938] Oops: 0000 [#2] SMP [4752030.343539] Modules linked in: hfi1(C) 8021q garp mrp target_core_user uio target_core_pscsi target_core_file target_core_iblock ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_mangle ip6table_raw nf_defrag_ipv6 ip6table_security ip6table_filter ip6_tables iptable_mangle iptable_raw nf_defrag_ipv4 nf_conntrack(-) iptable_security ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul ipmi_devintf iTCO_wdt crc32_pclmul ghash_clmulni_intel iTCO_vendor_support dcdbas ipmi_si sb_edac mei_me edac_core [4752030.432786] ioatdma mei ipmi_msghandler lpc_ich dca shpchp wmi acpi_power_meter tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx5_ib raid1 raid0 ib_core ib_addr mgag200 i2c_algo_bit drm_kms_helper ttm crc32c_intel mlx5_core tg3 drm ptp megaraid_sas pps_core fjes [last unloaded: nf_conntrack_ipv6] [4752030.467298] CPU: 7 PID: 
288828 Comm: kworker/7:0 Tainted: G D WCI 4.5.0-0.rc4.git0.1.fc24.x86_64 #1 [4752030.479665] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014 [4752030.489575] task: ffff8820352e5b80 ti: ffff881f5da10000 task.ti: ffff881f5da10000 [4752030.499244] RIP: 0010:[<ffffffff810c3f80>] [<ffffffff810c3f80>] kthread_data+0x10/0x20 [4752030.509511] RSP: 0018:ffff881f5da13a80 EFLAGS: 00010002 [4752030.516747] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000007 [4752030.526034] RDX: ffff88103d410000 RSI: 0000000000000007 RDI: ffff8820352e5b80 [4752030.535318] RBP: ffff881f5da13a80 R08: ffff8820352e5c28 R09: ffff8820352e5c00 [4752030.544599] R10: 0000000000000000 R11: 000000000000002f R12: 0000000000016dc0 [4752030.553884] R13: ffff8820352e61d8 R14: ffff8820352e5b80 R15: ffff88203d2d6dc0 [4752030.563161] FS: 0000000000000000(0000) GS:ffff88203d2c0000(0000) knlGS:0000000000000000 [4752030.573516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [4752030.581247] CR2: 0000000000000028 CR3: 0000000de33db000 CR4: 00000000001406e0 [4752030.590525] Stack: [4752030.594064] ffff881f5da13a98 ffffffff810be581 ffff88203d2d6dc0 ffff881f5da13ae8 [4752030.603691] ffffffff817c91ba 00ff881f652b6478 ffff881f00000007 ffff8820352e5b80 [4752030.613311] ffff881f5da10000 0000000000000000 ffff881f5da13b38 ffff881f5da135d0 [4752030.622926] Call Trace: [4752030.626959] [<ffffffff810be581>] wq_worker_sleeping+0x11/0x90 [4752030.634789] [<ffffffff817c91ba>] __schedule+0x62a/0x9b0 [4752030.642030] [<ffffffff817c957c>] schedule+0x3c/0x90 [4752030.648874] [<ffffffff810a7f48>] do_exit+0x7a8/0xb30 [4752030.655813] [<ffffffff8101992a>] oops_end+0x9a/0xd0 [4752030.662650] [<ffffffff81067e7e>] no_context+0x13e/0x390 [4752030.669886] [<ffffffff81068150>] __bad_area_nosemaphore+0x80/0x1f0 [4752030.678193] [<ffffffff810682d3>] bad_area_nosemaphore+0x13/0x20 [4752030.686209] [<ffffffff81068597>] __do_page_fault+0xb7/0x400 [4752030.693834] [<ffffffff81068910>] do_page_fault+0x30/0x80 
[4752030.701166] [<ffffffff817cfa48>] page_fault+0x28/0x30
[4752030.708210] [<ffffffff810ee9a5>] ? queued_spin_lock_slowpath+0x105/0x190
[4752030.717062] [<ffffffff817cd4c7>] _raw_spin_lock_irqsave+0x37/0x40
[4752030.725221] [<ffffffff810e7cfd>] complete+0x1d/0x50
[4752030.731999] [<ffffffffa0485931>] srpt_release_channel_work+0xe1/0x140 [ib_srpt]
[4752030.741523] [<ffffffff810bd6fd>] process_one_work+0x1ad/0x400
[4752030.749298] [<ffffffff810bd99e>] worker_thread+0x4e/0x480
[4752030.756677] [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.764626] [<ffffffff810bd950>] ? process_one_work+0x400/0x400
[4752030.772558] [<ffffffff810c38e8>] kthread+0xd8/0xf0
[4752030.779231] [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.787241] [<ffffffff817cd9ff>] ret_from_fork+0x3f/0x70
[4752030.794438] [<ffffffff810c3810>] ? kthread_worker_fn+0x180/0x180
[4752030.802395] Code: 97 69 70 00 e9 53 ff ff ff e8 4d 0e fe ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 e0 05 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[4752030.826210] RIP  [<ffffffff810c3f80>] kthread_data+0x10/0x20
[4752030.833669] RSP <ffff881f5da13a80>
[4752030.838651] CR2: ffffffffffffffd8
[4752030.843418] ---[ end trace befc2f337e9f56d9 ]---
[4752030.933774] Fixing recursive fault but reboot is needed!

--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: 0E572FDD

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]
Thread overview: 17+ messages
2016-02-14 16:09 SRPt oops with 4.5-rc3-ish Doug Ledford
[not found] ` <56C0A6C3.3010903-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-02-16 1:42 ` Bart Van Assche
2016-02-29 9:11 ` Christoph Hellwig
2016-02-28 3:37 ` Nicholas A. Bellinger
[not found] ` <1456630639.19657.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-02-28 4:18 ` Bart Van Assche
[not found] ` <56D274F8.9070804-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-28 4:47 ` Nicholas A. Bellinger
2016-02-28 4:49 ` Bart Van Assche
2016-02-28 5:00 ` Nicholas A. Bellinger
2016-03-03 15:24 ` Doug Ledford
2016-02-28 8:26 ` Nicholas A. Bellinger
2016-02-28 16:14 ` Bart Van Assche
[not found] ` <56D31CC9.7000609-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-28 20:43 ` Nicholas A. Bellinger
2016-02-29 0:37 ` Bart Van Assche
[not found] ` <56D392D4.2000105-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-02-29 6:05 ` Christoph Hellwig
2016-03-01 6:49 ` Nicholas A. Bellinger
2016-03-01 7:16 ` Christoph Hellwig
[not found] ` <1456647963.19657.135.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-04-11 20:08 ` Doug Ledford