* v4.14-rc5 NVMeOF regression?
@ 2017-10-16 22:23 Bart Van Assche
2017-10-17 10:01 ` Sagi Grimberg
0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-10-16 22:23 UTC (permalink / raw)
Hello,
It has been a while since I ran any NVMeOF tests. But when I tried to test
the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
known issue? The following test triggered these call stacks:
# srp-test/run_tests -n -f xfs -e deadline -r 60
Thanks,
Bart.
======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc5-dbg+ #3 Not tainted
------------------------------------------------------
modprobe/2272 is trying to acquire lock:
("events"){+.+.}, at: [<ffffffff81084185>] flush_workqueue+0x75/0x520
but task is already holding lock:
(device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (device_mutex){+.+.}:
lock_acquire+0xdc/0x1d0
__mutex_lock+0x86/0x990
mutex_lock_nested+0x1b/0x20
ib_register_device+0xa3/0x650 [ib_core]
mlx4_ib_add+0xcfd/0x1440 [mlx4_ib]
mlx4_add_device+0x45/0xe0 [mlx4_core]
mlx4_register_interface+0xa8/0x120 [mlx4_core]
0xffffffffa05b2051
do_one_initcall+0x43/0x166
do_init_module+0x5f/0x206
load_module+0x26fe/0x2db0
SYSC_finit_module+0xbc/0xf0
SyS_finit_module+0xe/0x10
entry_SYSCALL_64_fastpath+0x18/0xad
-> #2 (intf_mutex){+.+.}:
lock_acquire+0xdc/0x1d0
__mutex_lock+0x86/0x990
mutex_lock_nested+0x1b/0x20
mlx4_register_device+0x30/0xc0 [mlx4_core]
mlx4_load_one+0x15f4/0x16f0 [mlx4_core]
mlx4_init_one+0x4b9/0x700 [mlx4_core]
local_pci_probe+0x42/0xa0
work_for_cpu_fn+0x14/0x20
process_one_work+0x1fd/0x630
worker_thread+0x1db/0x3b0
kthread+0x11e/0x150
ret_from_fork+0x27/0x40
-> #1 ((&wfc.work)){+.+.}:
lock_acquire+0xdc/0x1d0
process_one_work+0x1da/0x630
worker_thread+0x4e/0x3b0
kthread+0x11e/0x150
ret_from_fork+0x27/0x40
-> #0 ("events"){+.+.}:
__lock_acquire+0x13b5/0x13f0
lock_acquire+0xdc/0x1d0
flush_workqueue+0x98/0x520
nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
ib_unregister_client+0x18f/0x200 [ib_core]
nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
SyS_delete_module+0x18c/0x1e0
entry_SYSCALL_64_fastpath+0x18/0xad
other info that might help us debug this:
Chain exists of:
"events" --> intf_mutex --> device_mutex
Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(device_mutex);
                               lock(intf_mutex);
                               lock(device_mutex);
  lock("events");

*** DEADLOCK ***
1 lock held by modprobe/2272:
#0: (device_mutex){+.+.}, at: [<ffffffffa05d6bf7>] ib_unregister_client+0x27/0x200 [ib_core]
stack backtrace:
CPU: 9 PID: 2272 Comm: modprobe Not tainted 4.14.0-rc5-dbg+ #3
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
Call Trace:
dump_stack+0x68/0x9f
print_circular_bug.isra.38+0x1d8/0x1e6
__lock_acquire+0x13b5/0x13f0
lock_acquire+0xdc/0x1d0
? lock_acquire+0xdc/0x1d0
? flush_workqueue+0x75/0x520
flush_workqueue+0x98/0x520
? flush_workqueue+0x75/0x520
nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
? nvmet_rdma_remove_one+0x73/0xa0 [nvmet_rdma]
ib_unregister_client+0x18f/0x200 [ib_core]
nvmet_rdma_exit+0xb3/0x856 [nvmet_rdma]
SyS_delete_module+0x18c/0x1e0
entry_SYSCALL_64_fastpath+0x18/0xad

* v4.14-rc5 NVMeOF regression?
2017-10-16 22:23 v4.14-rc5 NVMeOF regression? Bart Van Assche
@ 2017-10-17 10:01 ` Sagi Grimberg
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-17 10:01 UTC (permalink / raw)

> Hello,

Hi Bart, thanks for reporting

> It has been a while since I ran any NVMeOF tests. But when I tried to test
> the v4.14-rc5 NVMeOF drivers the output shown below appeared. Is this a
> known issue? The following test triggered these call stacks:

Yea, we probably need to silence those with:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..d7504f180d93 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1988,7 +1988,7 @@ static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
     struct nvme_rdma_ctrl *ctrl;

     /* Delete all controllers using this device */
-    mutex_lock(&nvme_rdma_ctrl_mutex);
+    mutex_lock_nested(&nvme_rdma_ctrl_mutex, SINGLE_DEPTH_NESTING);
     list_for_each_entry(ctrl, &nvme_rdma_ctrl_list, list) {
         if (ctrl->device->dev != ib_device)
             continue;
--

I'll compile with CONFIG_DEBUG_LOCK_ALLOC and check this, but if you
happen to beat me to it, I can use your Tested-by tag.
[parent not found: <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>]

* v4.14-rc5 NVMeOF regression?
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
@ 2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
0 siblings, 2 replies; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-18 5:26 UTC (permalink / raw)

> Hello Sagi,
>
> Had I mentioned that I ran into a real deadlock and not only into a
> lockdep complaint?

No, and the output is a lockdep complaint which I thought needed
silencing. If you ran into a real deadlock, did you have any other
output from hung_task watchdog? I do not yet understand the root cause
from lockdep info provided.

Also, do you know at which test-case this happened?

> Additionally, can you explain to me why you think the
> above patch is necessary? As far as I can see nvme_rdma_ctrl_mutex is
> not held around the ib_register_client(&nvme_rdma_ib_client) call nor
> around the ib_unregister_client(&nvme_rdma_ib_client) calls.

It's held under ib_unregister_client; my understanding is that the
_nested annotation is required for the inner mutex lock acquire.

* v4.14-rc5 NVMeOF regression?
2017-10-18 5:26 ` Sagi Grimberg
@ 2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2017-10-18 6:55 UTC (permalink / raw)

On Wed, Oct 18, 2017 at 08:26:05AM +0300, Sagi Grimberg wrote:
> > Additionally, can you explain to me why you think the above patch is
> > necessary? As far as I can see nvme_rdma_ctrl_mutex is not held around
> > the ib_register_client(&nvme_rdma_ib_client) call nor around the
> > ib_unregister_client(&nvme_rdma_ib_client) calls.
>
> It's held under ib_unregister_client; my understanding is that the
> _nested annotation is required for the inner mutex lock acquire.

No. Anytime we take a single lock out of order it is a bug and a
potential deadlock. _nested only works if we have multiple instances
of a lock (say in each inode) and we have a well-defined ordering
between the instances (e.g. by the address of the inode structure).
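
A minimal sketch of the pattern Christoph describes, not code from this
thread: two instances of one lock class taken in a well-defined (address)
order, where mutex_lock_nested() only annotates the intentional inner
acquisition for lockdep. The struct and helper names below are hypothetical.
--
#include <linux/kernel.h>
#include <linux/mutex.h>

struct obj {
	struct mutex lock;
};

static void obj_lock_pair(struct obj *a, struct obj *b)
{
	/* Every caller orders the two instances by address. */
	if (a > b)
		swap(a, b);

	mutex_lock(&a->lock);
	/*
	 * Second instance of the same lock class: tell lockdep this
	 * nesting is deliberate and ordered, so it is not reported.
	 */
	mutex_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);
}

static void obj_unlock_pair(struct obj *a, struct obj *b)
{
	if (a > b)
		swap(a, b);

	mutex_unlock(&b->lock);
	mutex_unlock(&a->lock);
}
--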

* v4.14-rc5 NVMeOF regression?
2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
@ 2017-10-18 16:24 ` Bart Van Assche
2017-10-22 17:16 ` Sagi Grimberg
1 sibling, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-10-18 16:24 UTC (permalink / raw)

On Wed, 2017-10-18 at 08:26 +0300, Sagi Grimberg wrote:
> If you ran into a real deadlock, did you have any other output from
> hung_task watchdog? I do not yet understand the root cause from
> lockdep info provided.
>
> Also, do you know at which test-case this happened?

Hello Sagi,

Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
produced the following output:

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
kworker/u66:2   D    0   440      2 0x80000000
Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x221/0x580
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x118/0x180
 blk_execute_rq+0x86/0xc0
 __nvme_submit_sync_cmd+0x89/0xf0
 nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
 nvme_shutdown_ctrl+0x41/0xe0
 nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
 nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
 nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
 process_one_work+0x1fd/0x630
 worker_thread+0x1db/0x3b0
 kthread+0x11e/0x150
 ret_from_fork+0x27/0x40
01              D    0  2868   2862 0x00000000
Call Trace:
 __schedule+0x3e9/0xb00
 schedule+0x40/0x90
 schedule_timeout+0x260/0x580
 wait_for_completion+0x108/0x170
 flush_work+0x1e0/0x270
 nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
 nvme_sysfs_delete+0x2a/0x40
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x45/0x60
 kernfs_fop_write+0x124/0x1c0
 __vfs_write+0x28/0x150
 vfs_write+0xc7/0x1b0
 SyS_write+0x49/0xa0
 entry_SYSCALL_64_fastpath+0x18/0xad

Bart.

* v4.14-rc5 NVMeOF regression?
2017-10-18 16:24 ` Bart Van Assche
@ 2017-10-22 17:16 ` Sagi Grimberg
0 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2017-10-22 17:16 UTC (permalink / raw)

>> If you ran into a real deadlock, did you have any other output from
>> hung_task watchdog? I do not yet understand the root cause from
>> lockdep info provided.
>>
>> Also, do you know at which test-case this happened?
>
> Hello Sagi,
>
> Running test case 1 should be sufficient to trigger the deadlock. SysRq-w
> produced the following output:
>
> sysrq: SysRq : Show Blocked State
>   task                        PC stack   pid father
> kworker/u66:2   D    0   440      2 0x80000000
> Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
> Call Trace:
>  __schedule+0x3e9/0xb00
>  schedule+0x40/0x90
>  schedule_timeout+0x221/0x580
>  io_schedule_timeout+0x1e/0x50
>  wait_for_completion_io_timeout+0x118/0x180
>  blk_execute_rq+0x86/0xc0
>  __nvme_submit_sync_cmd+0x89/0xf0
>  nvmf_reg_write32+0x4b/0x90 [nvme_fabrics]
>  nvme_shutdown_ctrl+0x41/0xe0
>  nvme_rdma_shutdown_ctrl+0xca/0xd0 [nvme_rdma]
>  nvme_rdma_remove_ctrl+0x2b/0x40 [nvme_rdma]
>  nvme_rdma_del_ctrl_work+0x25/0x30 [nvme_rdma]
>  process_one_work+0x1fd/0x630
>  worker_thread+0x1db/0x3b0
>  kthread+0x11e/0x150
>  ret_from_fork+0x27/0x40
> 01              D    0  2868   2862 0x00000000
> Call Trace:
>  __schedule+0x3e9/0xb00
>  schedule+0x40/0x90
>  schedule_timeout+0x260/0x580
>  wait_for_completion+0x108/0x170
>  flush_work+0x1e0/0x270
>  nvme_rdma_del_ctrl+0x5a/0x80 [nvme_rdma]
>  nvme_sysfs_delete+0x2a/0x40
>  dev_attr_store+0x18/0x30
>  sysfs_kf_write+0x45/0x60
>  kernfs_fop_write+0x124/0x1c0
>  __vfs_write+0x28/0x150
>  vfs_write+0xc7/0x1b0
>  SyS_write+0x49/0xa0
>  entry_SYSCALL_64_fastpath+0x18/0xad

Hi Bart,

So I've looked into this, and I want to share my findings.

I'm able to reproduce this hang when trying to disconnect from a
controller which is already in reconnecting state.

The problem, as I see it, is that we are returning BLK_STS_RESOURCE from
nvme_rdma_queue_rq() before we start the request timer (we fail before
blk_mq_start_request), so the request timeout never expires (and given
that we are in the deleting sequence, the command is never expected to
complete). But for some reason I don't see the request get re-issued
again; should the driver take care of this by calling
blk_mq_delay_run_hw_queue()?

Thinking more on this, if we are disconnecting from a controller, and we
are unable to issue admin/io commands (queue state is not LIVE), we
probably should not fail with BLK_STS_RESOURCE but rather BLK_STS_IOERR.

This change makes the issue go away:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5b5458012c2c..be77cd098182 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1393,6 +1393,12 @@ nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue, struct request *rq)
         cmd->common.opcode != nvme_fabrics_command ||
         cmd->fabrics.fctype != nvme_fabrics_type_connect) {
         /*
+         * deleting state means that the ctrl will never accept
+         * commands again, fail it permanently.
+         */
+        if (queue->ctrl->ctrl.state == NVME_CTRL_DELETING)
+            return BLK_STS_IOERR;
+        /*
          * reconnecting state means transport disruption, which
          * can take a long time and even might fail permanently,
          * so we can't let incoming I/O be requeued forever.
--

Does anyone have a better idea?
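
A hedged sketch of the blk_mq_delay_run_hw_queue() alternative raised
above, not the fix adopted in this thread; everything except the
block-layer helpers (blk_mq_delay_run_hw_queue(), blk_mq_start_request())
is a made-up name for illustration:
--
#include <linux/blk-mq.h>

struct example_queue {
	bool live;	/* stand-in for the transport's "queue is LIVE" state */
};

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct example_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;

	if (!queue->live) {
		/*
		 * Bailing out before blk_mq_start_request() means no
		 * timeout is armed, so re-arm the hardware queue here
		 * to get the request re-issued later.
		 */
		blk_mq_delay_run_hw_queue(hctx, 100 /* msecs */);
		return BLK_STS_RESOURCE;
	}

	blk_mq_start_request(rq);	/* request timer armed from here on */
	/* ... map data and post the command to the transport ... */
	return BLK_STS_OK;
}
--
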
end of thread, other threads:[~2017-10-22 17:16 UTC | newest]
Thread overview: 6+ messages
2017-10-16 22:23 v4.14-rc5 NVMeOF regression? Bart Van Assche
2017-10-17 10:01 ` Sagi Grimberg
[not found] ` <5fb38923-36f7-c069-5f1d-96f4a9c98248@wdc.com>
2017-10-18 5:26 ` Sagi Grimberg
2017-10-18 6:55 ` Christoph Hellwig
2017-10-18 16:24 ` Bart Van Assche
2017-10-22 17:16 ` Sagi Grimberg