* nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-09-10  4:35 UTC
To: sagi, gscrivan, colin.king; +Cc: linux-nvme

Hello,

I recently found that nvme-tcp connect always fails [1] after running
stress-ng: unshare [2]. Bisecting shows the failure was introduced by
commit [3]; connecting works fine again after reverting it.
I'm not sure whether this is a test case issue or a kernel issue; could
anyone help take a look?

[1]
# sh test.sh
+ ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
stress-ng: info:  [355534] dispatching hogs: 32 unshare
stress-ng: info:  [355534] successful run completed in 5.04s
+ modprobe null-blk nr-devices=1
+ modprobe nvmet-tcp
+ modprobe nvme-tcp
+ nvmetcli restore tcp.json
+ nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420
Failed to write to /dev/nvme-fabrics: Input/output error

# dmesg | tail -9
[  700.012299] null_blk: module loaded
[  700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  715.297031] nvmet: ctrl 1 fatal error occurred!
[  749.939898] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
[  763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
[  763.422979] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[  763.429419] nvme nvme0: failed to connect queue: 0 ret=881

# uname -r
5.9.0-rc4

[2] stress-ng: unshare case
https://github.com/ColinIanKing/stress-ng.git
https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c

[3]
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano <gscrivan@redhat.com>
Date:   Sun Jun 7 21:40:10 2020 -0700

    ipc/namespace.c: use a work queue to free_ipc

    the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.

[4]
# cat tcp.json
{
  "hosts": [],
  "ports": [
    {
      "addr": {
        "adrfam": "ipv4",
        "traddr": "127.0.0.1",
        "treq": "not specified",
        "trsvcid": "4420",
        "trtype": "tcp"
      },
      "portid": 0,
      "referrals": [],
      "subsystems": [
        "blktests-subsystem-1"
      ]
    }
  ],
  "subsystems": [
    {
      "allowed_hosts": [],
      "attr": {
        "allow_any_host": "1",
        "cntlid_max": "65519",
        "cntlid_min": "1",
        "model": "Linux",
        "pi_enable": "0",
        "serial": "7d833f5501f6b240",
        "version": "1.3"
      },
      "namespaces": [
        {
          "device": {
            "nguid": "00000000-0000-0000-0000-000000000000",
            "path": "/dev/nullb0",
            "uuid": "b07c7eef-8428-47bf-8e79-26ec8c30f334"
          },
          "enable": 1,
          "nsid": 1
        }
      ],
      "nqn": "blktests-subsystem-1"
    }
  ]
}

Best Regards,
Yi Zhang
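For reference, the reproduction in [1] boils down to a short script along
these lines. This is only a sketch reconstructed from the traced commands
above; the stress-ng checkout path, the use of set -x style tracing, and the
location of tcp.json (the file shown in [4]) are assumptions, not the exact
test.sh.

#!/bin/sh
# Sketch of test.sh: run the unshare stressor, then set up an nvmet-tcp
# target backed by null_blk and attempt a discovery connect over loopback.
set -ex    # "-x" matches the "+" trace lines in [1]; "-e" is an assumption

# stress-ng unshare stressor for 5 seconds (assumed built in ./stress-ng)
./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log

# nvmet-tcp target on 127.0.0.1:4420 with one null_blk-backed namespace
modprobe null-blk nr-devices=1
modprobe nvmet-tcp
modprobe nvme-tcp
nvmetcli restore tcp.json    # tcp.json as listed in [4]

# this discovery connect is the step that fails with an I/O error
nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420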
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Sagi Grimberg @ 2020-09-14 23:50 UTC
To: Yi Zhang, gscrivan, colin.king; +Cc: linux-nvme

> Hello,
>
> I recently found that nvme-tcp connect always fails [1] after running
> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
> commit [3]; connecting works fine again after reverting it.
> I'm not sure whether this is a test case issue or a kernel issue; could
> anyone help take a look?

Is this failure persistent or transient?

> [...]
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-09-15  1:40 UTC
To: Sagi Grimberg, gscrivan, colin.king; +Cc: linux-nvme

Hi Sagi

On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>
>> Hello,
>>
>> I recently found that nvme-tcp connect always fails [1] after running
>> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
>> commit [3]; connecting works fine again after reverting it.
>> I'm not sure whether this is a test case issue or a kernel issue; could
>> anyone help take a look?
>
> Is this failure persistent or transient?
>
It's persistent, and most recent CKI jobs with the 5.8 stable kernel also
show this failure.

>> [...]
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-10-28 12:19 UTC
To: Sagi Grimberg, gscrivan; +Cc: colin.king, linux-nvme

On 9/15/20 9:40 AM, Yi Zhang wrote:
> Hi Sagi
>
> On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>>
>>> Hello,
>>>
>>> I recently found that nvme-tcp connect always fails [1] after running
>>> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
>>> commit [3]; connecting works fine again after reverting it.
>>> I'm not sure whether this is a test case issue or a kernel issue; could
>>> anyone help take a look?
>>
>> Is this failure persistent or transient?
>>
> It's persistent, and most recent CKI jobs with the 5.8 stable kernel also
> show this failure.
>
Found a "WARNING: possible circular locking dependency detected" splat with
the latest 5.10.0-rc1:

[  230.353213] run blktests nvme/003 at 2020-10-28 07:41:08
[  230.666092] loop: module loaded
[  230.870124] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  230.902252] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  246.097201] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  246.104886] nvmet: ctrl 1 fatal error occurred!
[  246.113796] ======================================================
[  246.120042] WARNING: possible circular locking dependency detected
[  246.126288] 5.10.0-rc1 #1 Not tainted
[  246.129987] ------------------------------------------------------
[  246.136231] kworker/0:2/251 is trying to acquire lock:
[  246.141423] ffff00012ec0b830 ((work_completion)(&queue->io_work)){+.+.}-{0:0}, at: __flush_work+0x48/0x4f0
[  246.151202] but task is already holding lock:
[  246.157095] ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.167743] which lock already depends on the new lock.
[  246.176007] the existing dependency chain (in reverse order) is:
[  246.183569] -> #2 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[  246.191849]        validate_chain+0x5f0/0xa00
[  246.196251]        __lock_acquire+0x3b0/0xb70
[  246.200652]        lock_acquire+0x104/0x428
[  246.204878]        process_one_work+0x21c/0x670
[  246.209456]        worker_thread+0x54/0x408
[  246.213684]        kthread+0x11c/0x148
[  246.217473]        ret_from_fork+0x10/0x18
[  246.221611] -> #1 ((wq_completion)events){+.+.}-{0:0}:
[  246.228602]        validate_chain+0x5f0/0xa00
[  246.233005]        __lock_acquire+0x3b0/0xb70
[  246.237407]        lock_acquire+0x104/0x428
[  246.241634]        flush_workqueue+0x8c/0x4a8
[  246.246041]        nvmet_tcp_install_queue+0x100/0x110 [nvmet_tcp]
[  246.252309]        nvmet_install_queue+0xc4/0x150 [nvmet]
[  246.257775]        nvmet_execute_admin_connect+0xf4/0x1e0 [nvmet]
[  246.263935]        nvmet_tcp_io_work+0x8dc/0x970 [nvmet_tcp]
[  246.269654]        process_one_work+0x268/0x670
[  246.274231]        worker_thread+0x54/0x408
[  246.278458]        kthread+0x11c/0x148
[  246.282246]        ret_from_fork+0x10/0x18
[  246.288493] -> #0 ((work_completion)(&queue->io_work)){+.+.}-{0:0}:
[  246.300559]        check_noncircular+0x100/0x128
[  246.307367]        check_prev_add+0xa4/0x668
[  246.313777]        validate_chain+0x5f0/0xa00
[  246.320230]        __lock_acquire+0x3b0/0xb70
[  246.326652]        lock_acquire+0x104/0x428
[  246.332879]        __flush_work+0x70/0x4f0
[  246.338994]        flush_work+0x1c/0x28
[  246.344833]        nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[  246.353405]        process_one_work+0x268/0x670
[  246.359955]        worker_thread+0x54/0x408
[  246.366131]        kthread+0x11c/0x148
[  246.371843]        ret_from_fork+0x10/0x18
[  246.377876] other info that might help us debug this:
[  246.391416] Chain exists of:
                 (work_completion)(&queue->io_work) --> (wq_completion)events --> (work_completion)(&queue->release_work)
[  246.411895] Possible unsafe locking scenario:
[  246.421363]        CPU0                    CPU1
[  246.427687]        ----                    ----
[  246.433978]   lock((work_completion)(&queue->release_work));
[  246.441413]                                lock((wq_completion)events);
[  246.449816]                                lock((work_completion)(&queue->release_work));
[  246.459780]   lock((work_completion)(&queue->io_work));
[  246.466808]  *** DEADLOCK ***
[  246.478008] 2 locks held by kworker/0:2/251:
[  246.484053]  #0: ffff000100051b38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.495355]  #1: ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.508301] stack backtrace:
[  246.516512] CPU: 0 PID: 251 Comm: kworker/0:2 Not tainted 5.10.0-rc1 #1
[  246.525121] Hardware name: GIGABYTE R120-T32-00/MT30-GS1-00, BIOS F02 08/06/2019
[  246.534560] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[  246.543156] Call trace:
[  246.547634]  dump_backtrace+0x0/0x1c8
[  246.553325]  show_stack+0x20/0x70
[  246.558653]  dump_stack+0xf8/0x164
[  246.564043]  print_circular_bug.isra.0+0x234/0x290
[  246.570834]  check_noncircular+0x100/0x128
[  246.576931]  check_prev_add+0xa4/0x668
[  246.582676]  validate_chain+0x5f0/0xa00
[  246.588512]  __lock_acquire+0x3b0/0xb70
[  246.594346]  lock_acquire+0x104/0x428
[  246.599986]  __flush_work+0x70/0x4f0
[  246.605506]  flush_work+0x1c/0x28
[  246.610765]  nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[  246.618768]  process_one_work+0x268/0x670
[  246.624777]  worker_thread+0x54/0x408
[  246.630438]  kthread+0x11c/0x148
[  246.635661]  ret_from_fork+0x10/0x18
[  295.375201] nvme nvme0: queue 0: timeout request 0x0 type 4
[  295.383074] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[  295.391912] nvme nvme0: failed to connect queue: 0 ret=881
[  479.792299] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0e657e35bc5f402480543a94c7a292ed.

>>> [...]
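Per the first dmesg line above, the splat was hit while blktests case
nvme/003 was running over tcp on a lockdep-enabled kernel. Assuming a
blktests tree where the nvme_trtype variable selects the transport (this
invocation is an assumption, not taken from the report), it can presumably
be re-run with something like:

#!/bin/sh
# Sketch: run the unshare stressor as in [1], then the blktests nvme/003
# case over tcp. Paths and the nvme_trtype setting are assumptions.
./stress-ng/stress-ng --unshare 0 --timeout 5

cd blktests
nvme_trtype=tcp ./check nvme/003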
Thread overview: 4+ messages
[not found] <1594444669.711245.1599710442490.JavaMail.zimbra@redhat.com>
2020-09-10 4:35 ` nvme_tcp: nvme connect failed after execute stress-ng: unshare Yi Zhang
2020-09-14 23:50 ` Sagi Grimberg
2020-09-15 1:40 ` Yi Zhang
2020-10-28 12:19 ` Yi Zhang