* v4.14-rc5 NVMeOF regression?
@ 2017-10-16 22:23 Bart Van Assche
2017-10-17 10:01 ` Sagi Grimberg
0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-10-16 22:23 UTC (permalink / raw)
Hello,
It has been a while since I ran any NVMeOF tests. But when I tried to test
the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
known issue? The following test triggered these call stacks:
# srp-test/run_tests -n -f xfs -e deadline -r 60
Thanks,
Bart.
======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc5-dbg+ #3 Not tainted
------------------------------------------------------
modprobe/2272 is trying to acquire lock:
("events"){+.+.}, at: [<ffffffff81084185>] flush_workqueue+0x75/0x520
but task is already holding lock:
(device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (device_mutex){+.+.}:
lock_acquire+0xdc/0x1d0
__mutex_lock+0x86/0x990
mutex_lock_nested+0x1b/0x20
ib_register_device+0xa3/0x650 [ib_core]
mlx4_ib_add+0xcfd/0x1440 [mlx4_ib]
mlx4_add_device+0x45/0xe0 [mlx4_core]
mlx4_register_interface+0xa8/0x120 [mlx4_core]
0xffffffffa05b2051
do_one_initcall+0x43/0x166
do_init_module+0x5f/0x206
load_module+0x26fe/0x2db0
SYSC_finit_module+0xbc/0xf0
SyS_finit_module+0xe/0x10
entry_SYSCALL_64_fastpath+0x18/0xad
-> #2 (intf_mutex){+.+.}:
lock_acquire+0xdc/0x1d0
__mutex_lock+0x86/0x990
mutex_lock_nested+0x1b/0x20
mlx4_register_device+0x30/0xc0 [mlx4_core]
mlx4_load_one+0x15f4/0x16f0 [mlx4_core]
mlx4_init_one+0x4b9/0x700 [mlx4_core]
local_pci_probe+0x42/0xa0
work_for_cpu_fn+0x14/0x20
process_one_work+0x1fd/0x630
worker_thread+0x1db/0x3b0
kthread+0x11e/0x150
ret_from_fork+0x27/0x40
-> #1 ((&wfc.work)){+.+.}:
lock_acquire+0xdc/0x1d0
process_one_work+0x1da/0x630
worker_thread+0x4e/0x3b0
kthread+0x11e/0x150
ret_from_fork+0x27/0x40
-> #0 ("events"){+.+.}:
__lock_acquire+0x13b5/0x13f0
lock_acquire+0xdc/0x1d0
flush_workqueue+0x98/0x520
nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
ib_unregister_client+0x18f/0x200 [ib_core]
nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
SyS_delete_module+0x18c/0x1e0
entry_SYSCALL_64_fastpath+0x18/0xad
other info that might help us debug this:
Chain exists of:
"events" --> intf_mutex --> device_mutex
Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(device_mutex);
                               lock(intf_mutex);
                               lock(device_mutex);
  lock("events");

*** DEADLOCK ***
1 lock held by modprobe/2272:
#0: (device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
stack backtrace:
CPU: 9 PID: 2272 Comm: modprobe Not tainted 4.14.0-rc5-dbg+ #3
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
Call Trace:
dump_stack+0x68/0x9f
print_circular_bug.isra.38+0x1d8/0x1e6
__lock_acquire+0x13b5/0x13f0
lock_acquire+0xdc/0x1d0
? lock_acquire+0xdc/0x1d0
? flush_workqueue+0x75/0x520
flush_workqueue+0x98/0x520
? flush_workqueue+0x75/0x520
nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
? nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
ib_unregister_client+0x18f/0x200 [ib_core]
nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
SyS_delete_module+0x18c/0x1e0
entry_SYSCALL_64_fastpath+0x18/0xad

* v4.14-rc5 NVMeOF regression?
2017-10-16 22:23 v4.14-rc5 NVMeOF regression? Bart Van Assche
@ 2017-10-17 10:01 ` Sagi Grimberg
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-17 10:01 UTC (permalink / raw)

> Hello,

Hi Bart, thanks for reporting

> It has been a while since I ran any NVMeOF tests. But when I tried to test
> the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
> known issue? The following test triggered these call stacks:

Yea, we probably need to silence those with:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..d7504f180d93 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1988,7 +1988,7 @@ static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
     struct nvme_rdma_ctrl *ctrl;

     /* Delete all controllers using this device */
-    mutex_lock(&nvme_rdma_ctrl_mutex);
+    mutex_lock_nested(&nvme_rdma_ctrl_mutex, SINGLE_DEPTH_NESTING);
     list_for_each_entry(ctrl, &nvme_rdma_ctrl_list, list) {
         if (ctrl->device->dev != ib_device)
             continue;
--

I'll compile with CONFIG_DEBUG_LOCK_ALLOC and check this, but if you
happen to beat me to it, I can use your Tested-by tag.
[parent not found: <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>]

* v4.14-rc5 NVMeOF regression?
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
@ 2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
0 siblings, 2 replies; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-18 5:26 UTC (permalink / raw)

> Hello Sagi,
>
> Had I mentioned that I ran into a real deadlock and not only into a
> lockdep complaint?

No, and the output is a lockdep complaint which I thought needed
silencing. If you ran into a real deadlock, did you have any other
output from hung_task watchdog? I do not yet understand the root cause
from lockdep info provided.

Also, do you know at which test-case this happened?

> Additionally, can you explain to me why you think the
> above patch is necessary? As far as I can see nvme_rdma_ctrl_mutex is
> not held around the ib_register_client(&nvme_rdma_ib_client) call nor
> around the ib_unregister_client(&nvme_rdma_ib_client) calls.

It's held under ib_unregister_client; my understanding is that the
_nested annotation is required for the inner mutex lock acquire.

* v4.14-rc5 NVMeOF regression?
2017-10-18 5:26 ` Sagi Grimberg
@ 2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2017-10-18 6:55 UTC (permalink / raw)

On Wed, Oct 18, 2017 at 08:26:05AM +0300, Sagi Grimberg wrote:
> > Additionally, can you explain to me why you think the above patch is
> > necessary? As far as I can see nvme_rdma_ctrl_mutex is not held around
> > the ib_register_client(&nvme_rdma_ib_client) call nor around the
> > ib_unregister_client(&nvme_rdma_ib_client) calls.
>
> It's held under ib_unregister_client; my understanding is that the
> _nested annotation is required for the inner mutex lock acquire.

No. Anytime we take a single lock out of order it is a bug and a
potential deadlock. _nested only works if we have multiple instances
of a lock (say in each inode) and we have a well-defined ordering
between the instances (e.g. by the address of the inode structure).
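
A minimal sketch of the pattern Christoph describes, not code from this
thread: two instances of one lock class taken in a well-defined (address)
order, where mutex_lock_nested() only annotates the intentional inner
acquisition for lockdep. The struct and helper names below are hypothetical.
--
#include <linux/kernel.h>
#include <linux/mutex.h>

struct obj {
	struct mutex lock;
};

static void obj_lock_pair(struct obj *a, struct obj *b)
{
	/* Every caller orders the two instances by address. */
	if (a > b)
		swap(a, b);

	mutex_lock(&a->lock);
	/*
	 * Second instance of the same lock class: tell lockdep this
	 * nesting is deliberate and ordered, so it is not reported.
	 */
	mutex_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);
}

static void obj_unlock_pair(struct obj *a, struct obj *b)
{
	if (a > b)
		swap(a, b);

	mutex_unlock(&b->lock);
	mutex_unlock(&a->lock);
}
--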

* v4.14-rc5 NVMeOF regression?
2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
@ 2017-10-18 16:24 ` Bart Van Assche
2017-10-22 17:16 ` Sagi Grimberg
1 sibling, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-10-18 16:24 UTC (permalink / raw)

On Wed, 2017-10-18 at 08:26 +0300, Sagi Grimberg wrote:
> If you ran into a real deadlock, did you have any other output from
> hung_task watchdog? I do not yet understand the root cause from
> lockdep info provided.
>
> Also, do you know at which test-case this happened?

Hello Sagi,

Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
produced the following output:

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
kworker/u66:2   D    0   440      2 0x80000000
Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x221/0x580
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x118/0x180
 blk_execute_rq+0x86/0xc0
 __nvme_submit_sync_cmd+0x89/0xf0
 nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
 nvme_shutdown_ctrl+0x41/0xe0
 nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
 nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
 nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
 process_one_work+0x1fd/0x630
 worker_thread+0x1db/0x3b0
 kthread+0x11e/0x150
 ret_from_fork+0x27/0x40
01              D    0  2868   2862 0x00000000
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x260/0x580
 wait_for_completion+0x108/0x170
 flush_work+0x1e0/0x270
 nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
 nvme_sysfs_delete+0x2a/0x40
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x45/0x60
 kernfs_fop_write+0x124/0x1c0
 __vfs_write+0x28/0x150
 vfs_write+0xc7/0x1b0
 SyS_write+0x49/0xa0
 entry_SYSCALL_64_fastpath+0x18/0xad

Bart.

* v4.14-rc5 NVMeOF regression?
2017-10-18 16:24 ` Bart Van Assche
@ 2017-10-22 17:16 ` Sagi Grimberg
0 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-22 17:16 UTC (permalink / raw)

>> If you ran into a real deadlock, did you have any other output from
>> hung_task watchdog? I do not yet understand the root cause from
>> lockdep info provided.
>>
>> Also, do you know at which test-case this happened?
>
> Hello Sagi,
>
> Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
> produced the following output:
>
> sysrq: SysRq : Show Blocked State
>   task                        PC stack   pid father
> kworker/u66:2   D    0   440      2 0x80000000
> Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
> Call Trace:
>  __schedule+0x3e9/0xb00
>  schedule+0x40/0x90
>  schedule_timeout+0x221/0x580
>  io_schedule_timeout+0x1e/0x50
>  wait_for_completion_io_timeout+0x118/0x180
>  blk_execute_rq+0x86/0xc0
>  __nvme_submit_sync_cmd+0x89/0xf0
>  nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
>  nvme_shutdown_ctrl+0x41/0xe0
>  nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
>  nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
>  nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
>  process_one_work+0x1fd/0x630
>  worker_thread+0x1db/0x3b0
>  kthread+0x11e/0x150
>  ret_from_fork+0x27/0x40
> 01              D    0  2868   2862 0x00000000
> Call Trace:
>  __schedule+0x3e9/0xb00
>  schedule+0x40/0x90
>  schedule_timeout+0x260/0x580
>  wait_for_completion+0x108/0x170
>  flush_work+0x1e0/0x270
>  nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
>  nvme_sysfs_delete+0x2a/0x40
>  dev_attr_store+0x18/0x30
>  sysfs_kf_write+0x45/0x60
>  kernfs_fop_write+0x124/0x1c0
>  __vfs_write+0x28/0x150
>  vfs_write+0xc7/0x1b0
>  SyS_write+0x49/0xa0
>  entry_SYSCALL_64_fastpath+0x18/0xad

Hi Bart,

So I've looked into this, and I want to share my findings.

I'm able to reproduce this hang when trying to disconnect from a
controller which is already in reconnecting state.

The problem, as I see it, is that we are returning BLK_STS_RESOURCE from
nvme_rdma_queue_rq() before we start the request timer (we fail before
blk_mq_start_request), so the request timeout never expires (and given
that we are in the deleting sequence, the command is never expected to
complete). But for some reason I don't see the request get re-issued
again; should the driver take care of this by calling
blk_mq_delay_run_hw_queue()?

Thinking more on this, if we are disconnecting from a controller, and we
are unable to issue admin/io commands (queue state is not LIVE), we
probably should not fail with BLK_STS_RESOURCE but rather BLK_STS_IOERR.

This change makes the issue go away:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5b5458012c2c..be77cd098182 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1393,6 +1393,12 @@ nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue, struct request *rq)
         cmd->common.opcode != nvme_fabrics_command ||
         cmd->fabrics.fctype != nvme_fabrics_type_connect) {
         /*
+         * deleting state means that the ctrl will never accept
+         * commands again, fail it permanently.
+         */
+        if (queue->ctrl->ctrl.state == NVME_CTRL_DELETING)
+            return BLK_STS_IOERR;
+        /*
          * reconnecting state means transport disruption, which
          * can take a long time and even might fail permanently,
          * so we can't let incoming I/O be requeued forever.
--

Does anyone have a better idea?
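
A hedged sketch of the blk_mq_delay_run_hw_queue() alternative raised
above, not the fix adopted in this thread; everything except the
block-layer helpers (blk_mq_delay_run_hw_queue(), blk_mq_start_request())
is a made-up name for illustration:
--
#include <linux/blk-mq.h>

struct example_queue {
	bool live;	/* stand-in for the transport's "queue is LIVE" state */
};

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct example_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;

	if (!queue->live) {
		/*
		 * Bailing out before blk_mq_start_request() means no
		 * timeout is armed, so re-arm the hardware queue here
		 * to get the request re-issued later.
		 */
		blk_mq_delay_run_hw_queue(hctx, 100 /* msecs */);
		return BLK_STS_RESOURCE;
	}

	blk_mq_start_request(rq);	/* request timer armed from here on */
	/* ... map data and post the command to the transport ... */
	return BLK_STS_OK;
}
--
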
end of thread, other threads:[~2017-10-22 17:16 UTC | newest]
Thread overview: 6+ messages
2017-10-16 22:23 v4.14-rc5 NVMeOF regression? Bart Van Assche
2017-10-17 10:01 ` Sagi Grimberg
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
2017-10-22 17:16 ` Sagi Grimberg