* Bug Report: can't unload nvme module in case of disabled device
@ 2017-08-01 12:58 Max Gurtovoy
  2017-08-10  8:59 ` Christoph Hellwig
  2017-08-10 16:45 ` Keith Busch
  0 siblings, 2 replies; 9+ messages in thread
From: Max Gurtovoy @ 2017-08-01 12:58 UTC

Hi all,

I would like to report a bug that reproduced by the following steps (I'm
using 4.13.0-rc3+):

1. modprobe nvme
2. echo 0 > /sys/block/nvme0n1/device/device/enable
3. nvme list (stuck for more than 1-2 mins)
4. modprobe -r nvme (stuck forever)

log:

[ 1342.388888] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[ 1476.021392] INFO: task kworker/u98:1:436 blocked for more than 120 seconds.
[ 1476.029072] Not tainted 4.13.0-rc3+ #19
[ 1476.033878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.042505] kworker/u98:1 D 0 436 2 0x00000000
[ 1476.048569] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 1476.054133] Call Trace:
[ 1476.056862] __schedule+0x1dc/0x780
[ 1476.060706] schedule+0x36/0x80
[ 1476.064180] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 1476.069175] ? remove_wait_queue+0x60/0x60
[ 1476.073693] nvme_wait_freeze+0x33/0x50 [nvme_core]
[ 1476.079068] nvme_reset_work+0x6b9/0xc40 [nvme]
[ 1476.084075] ? __switch_to+0x23e/0x4a0
[ 1476.088209] process_one_work+0x149/0x360
[ 1476.092625] worker_thread+0x4d/0x3c0
[ 1476.096692] kthread+0x109/0x140
[ 1476.100247] ? rescuer_thread+0x380/0x380
[ 1476.104664] ? kthread_park+0x60/0x60
[ 1476.108698] ret_from_fork+0x25/0x30
[ 1598.901883] INFO: task kworker/u98:1:436 blocked for more than 120 seconds.
[ 1598.909557] Not tainted 4.13.0-rc3+ #19
[ 1598.914362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.923004] kworker/u98:1 D 0 436 2 0x00000000
[ 1598.929063] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 1598.934637] Call Trace:
[ 1598.937348] __schedule+0x1dc/0x780
[ 1598.941208] schedule+0x36/0x80
[ 1598.944682] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 1598.949675] ? remove_wait_queue+0x60/0x60
[ 1598.954189] nvme_wait_freeze+0x33/0x50 [nvme_core]
[ 1598.959574] nvme_reset_work+0x6b9/0xc40 [nvme]
[ 1598.964580] ? __switch_to+0x23e/0x4a0
[ 1598.968723] process_one_work+0x149/0x360
[ 1598.973192] worker_thread+0x4d/0x3c0
[ 1598.977240] kthread+0x109/0x140
[ 1598.980797] ? rescuer_thread+0x380/0x380
[ 1598.985226] ? kthread_park+0x60/0x60
[ 1598.989262] ret_from_fork+0x25/0x30
[ 1721.782347] INFO: task kworker/u98:1:436 blocked for more than 120 seconds.
[ 1721.790026] Not tainted 4.13.0-rc3+ #19
[ 1721.795326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.804425] kworker/u98:1 D 0 436 2 0x00000000
[ 1721.810958] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 1721.816999] Call Trace:
[ 1721.820161] __schedule+0x1dc/0x780
[ 1721.824470] schedule+0x36/0x80
[ 1721.828389] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 1721.833835] ? remove_wait_queue+0x60/0x60
[ 1721.838781] nvme_wait_freeze+0x33/0x50 [nvme_core]
[ 1721.844596] nvme_reset_work+0x6b9/0xc40 [nvme]
[ 1721.850208] ? __switch_to+0x23e/0x4a0
[ 1721.854756] process_one_work+0x149/0x360
[ 1721.859606] worker_thread+0x4d/0x3c0
[ 1721.864035] kthread+0x109/0x140
[ 1721.867985] ? rescuer_thread+0x380/0x380
[ 1721.872805] ? kthread_park+0x60/0x60
[ 1721.877222] ret_from_fork+0x25/0x30
[ 1721.881589] INFO: task modprobe:12986 blocked for more than 120 seconds.

Any thoughts ?

-Max.
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-01 12:58 Bug Report: can't unload nvme module in case of disabled device Max Gurtovoy
@ 2017-08-10  8:59 ` Christoph Hellwig
  2017-08-10 17:04   ` Max Gurtovoy
  2017-08-10 16:45 ` Keith Busch
  1 sibling, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2017-08-10  8:59 UTC

Is this a PCIe or fabrics controller? Did you get a chance to bisect
where this behavior appeared?
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-10  8:59 ` Christoph Hellwig
@ 2017-08-10 17:04   ` Max Gurtovoy
  2017-08-10 19:36     ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Max Gurtovoy @ 2017-08-10 17:04 UTC

On 8/10/2017 11:59 AM, Christoph Hellwig wrote:
> Is this a PCIe or fabrics controller? Did you get a chance to bisect
> where this behavior appeared?
>

I'm using PCIe ctrl.
Using 4.13-rc4+ I couldn't even run easier scenario of only unloading the
nvme module (with SAMSUNG MZPLL1T6HEHP-00003 and Intel P3500/3700 devices):

[ 369.997917] INFO: task modprobe:3709 blocked for more than 120 seconds.
[ 370.005215] Not tainted 4.13.0-rc4+ #21
[ 370.010017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 370.018647] modprobe D 0 3709 3654 0x00000000
[ 370.024695] Call Trace:
[ 370.027400] __schedule+0x1dc/0x780
[ 370.031261] schedule+0x36/0x80
[ 370.034756] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 370.039750] ? remove_wait_queue+0x60/0x60
[ 370.044263] blk_freeze_queue+0x1a/0x20
[ 370.048489] blk_cleanup_queue+0x7f/0x150
[ 370.052927] nvme_dev_remove_admin+0x36/0x50 [nvme]
[ 370.058303] nvme_remove+0xa2/0x130 [nvme]
[ 370.062820] pci_device_remove+0x39/0xc0
[ 370.067142] device_release_driver_internal+0x141/0x200
[ 370.072898] driver_detach+0x3f/0x80
[ 370.076852] bus_remove_driver+0x55/0xd0
[ 370.081186] driver_unregister+0x2c/0x50
[ 370.085521] pci_unregister_driver+0x2a/0xa0
[ 370.090227] nvme_exit+0x10/0xb84 [nvme]
[ 370.094562] SyS_delete_module+0x171/0x250
[ 370.099101] ? exit_to_usermode_loop+0x5e/0x88
[ 370.103996] entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 370.109096] RIP: 0033:0x7f146b5106b7
[ 370.113037] RSP: 002b:00007ffd2cae12e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 370.121431] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f146b5106b7
[ 370.129295] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000000000223f5e8
[ 370.137167] RBP: 000000000223f580 R08: 00007f146b7d5060 R09: 00007f146b580a40
[ 370.145029] R10: 00007ffd2cae1070 R11: 0000000000000206 R12: 00007ffd2cae0310
[ 370.152890] R13: 0000000000000000 R14: 000000000223f5e8 R15: 0000000000000000

the new scenario:
1. modprobe nvme
2. sleep 10
3. modprobe -r nvme

works on 4.11.0/4.12.0 but not on 4.13.0-rc4+.
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-10 17:04   ` Max Gurtovoy
@ 2017-08-10 19:36     ` Keith Busch
  2017-08-13  8:29       ` Max Gurtovoy
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2017-08-10 19:36 UTC

On Thu, Aug 10, 2017 at 08:04:13PM +0300, Max Gurtovoy wrote:
>
> I'm using PCIe ctrl.
> Using 4.13-rc4+ I couldn't even run easier scenario of only unloading the
> nvme module (with SAMSUNG MZPLL1T6HEHP-00003 and Intel P3500/3700 devices):
>
> [ 369.997917] INFO: task modprobe:3709 blocked for more than 120 seconds.
> [ 370.005215] Not tainted 4.13.0-rc4+ #21
> [ 370.010017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 370.018647] modprobe D 0 3709 3654 0x00000000
> [ 370.024695] Call Trace:
> [ 370.027400] __schedule+0x1dc/0x780
> [ 370.031261] schedule+0x36/0x80
> [ 370.034756] blk_mq_freeze_queue_wait+0x4b/0xb0
> [ 370.039750] ? remove_wait_queue+0x60/0x60
> [ 370.044263] blk_freeze_queue+0x1a/0x20
> [ 370.048489] blk_cleanup_queue+0x7f/0x150
> [ 370.052927] nvme_dev_remove_admin+0x36/0x50 [nvme]
> [ 370.058303] nvme_remove+0xa2/0x130 [nvme]
> [ 370.062820] pci_device_remove+0x39/0xc0
> [ 370.067142] device_release_driver_internal+0x141/0x200
> [ 370.072898] driver_detach+0x3f/0x80
> [ 370.076852] bus_remove_driver+0x55/0xd0
> [ 370.081186] driver_unregister+0x2c/0x50
> [ 370.085521] pci_unregister_driver+0x2a/0xa0
> [ 370.090227] nvme_exit+0x10/0xb84 [nvme]
> [ 370.094562] SyS_delete_module+0x171/0x250
> [ 370.099101] ? exit_to_usermode_loop+0x5e/0x88
> [ 370.103996] entry_SYSCALL_64_fastpath+0x1a/0xa5
> [ 370.109096] RIP: 0033:0x7f146b5106b7
> [ 370.113037] RSP: 002b:00007ffd2cae12e8 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000b0
> [ 370.121431] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007f146b5106b7
> [ 370.129295] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
> 000000000223f5e8
> [ 370.137167] RBP: 000000000223f580 R08: 00007f146b7d5060 R09:
> 00007f146b580a40
> [ 370.145029] R10: 00007ffd2cae1070 R11: 0000000000000206 R12:
> 00007ffd2cae0310
> [ 370.152890] R13: 0000000000000000 R14: 000000000223f5e8 R15:
> 0000000000000000
>
> the new scenario:
> 1. modprobe nvme
> 2. sleep 10
> 3. modprobe -r nvme
>
> works on 4.11.0/4.12.0 but not on 4.13.0-rc4+.

This I'm not able to reproduce. The stack trace is saying there are
entered requests on the admin queue, but that shouldn't be possible at
this point in nvme_remove. I'll keep looking.
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-10 19:36     ` Keith Busch
@ 2017-08-13  8:29       ` Max Gurtovoy
  2017-08-14 20:24         ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Max Gurtovoy @ 2017-08-13  8:29 UTC

On 8/10/2017 10:36 PM, Keith Busch wrote:
> On Thu, Aug 10, 2017 at 08:04:13PM +0300, Max Gurtovoy wrote:
>>
>> I'm using PCIe ctrl.
>> Using 4.13-rc4+ I couldn't even run easier scenario of only unloading the
>> nvme module (with SAMSUNG MZPLL1T6HEHP-00003 and Intel P3500/3700 devices):
>>
>> [ 369.997917] INFO: task modprobe:3709 blocked for more than 120 seconds.
>> [ 370.005215] Not tainted 4.13.0-rc4+ #21
>> [ 370.010017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 370.018647] modprobe D 0 3709 3654 0x00000000
>> [ 370.024695] Call Trace:
>> [ 370.027400] __schedule+0x1dc/0x780
>> [ 370.031261] schedule+0x36/0x80
>> [ 370.034756] blk_mq_freeze_queue_wait+0x4b/0xb0
>> [ 370.039750] ? remove_wait_queue+0x60/0x60
>> [ 370.044263] blk_freeze_queue+0x1a/0x20
>> [ 370.048489] blk_cleanup_queue+0x7f/0x150
>> [ 370.052927] nvme_dev_remove_admin+0x36/0x50 [nvme]
>> [ 370.058303] nvme_remove+0xa2/0x130 [nvme]
>> [ 370.062820] pci_device_remove+0x39/0xc0
>> [ 370.067142] device_release_driver_internal+0x141/0x200
>> [ 370.072898] driver_detach+0x3f/0x80
>> [ 370.076852] bus_remove_driver+0x55/0xd0
>> [ 370.081186] driver_unregister+0x2c/0x50
>> [ 370.085521] pci_unregister_driver+0x2a/0xa0
>> [ 370.090227] nvme_exit+0x10/0xb84 [nvme]
>> [ 370.094562] SyS_delete_module+0x171/0x250
>> [ 370.099101] ? exit_to_usermode_loop+0x5e/0x88
>> [ 370.103996] entry_SYSCALL_64_fastpath+0x1a/0xa5
>> [ 370.109096] RIP: 0033:0x7f146b5106b7
>> [ 370.113037] RSP: 002b:00007ffd2cae12e8 EFLAGS: 00000206 ORIG_RAX:
>> 00000000000000b0
>> [ 370.121431] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
>> 00007f146b5106b7
>> [ 370.129295] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
>> 000000000223f5e8
>> [ 370.137167] RBP: 000000000223f580 R08: 00007f146b7d5060 R09:
>> 00007f146b580a40
>> [ 370.145029] R10: 00007ffd2cae1070 R11: 0000000000000206 R12:
>> 00007ffd2cae0310
>> [ 370.152890] R13: 0000000000000000 R14: 000000000223f5e8 R15:
>> 0000000000000000
>>
>> the new scenario:
>> 1. modprobe nvme
>> 2. sleep 10
>> 3. modprobe -r nvme
>>
>> works on 4.11.0/4.12.0 but not on 4.13.0-rc4+.
>
> This I'm not able to reproduce. The stack trace is saying there are
> entered requests on the admin queue, but that shouldn't be possible at
> this point in nvme_remove. I'll keep looking.
>

After bisecting I found that the following commit caused the simple
load/unload nvme driver failure:

commit 1ad43c0078b79a76accd0fe64062e47b3430dc6b
Author: Ming Lei <minlei at redhat.com>
Date: Wed Aug 2 08:01:45 2017 +0800

    blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed

Adding Ming to this thread.

I'm continuing with the debug of the new scenario (load nvme && sleep 10 &&
unload nvme).
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-13  8:29       ` Max Gurtovoy
@ 2017-08-14 20:24         ` Keith Busch
  0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2017-08-14 20:24 UTC

On Sun, Aug 13, 2017 at 11:29:59AM +0300, Max Gurtovoy wrote:
>
> After bisecting I found that the following commit caused the simple
> load/unload nvme driver failure:
>
> commit 1ad43c0078b79a76accd0fe64062e47b3430dc6b
> Author: Ming Lei <minlei at redhat.com>
> Date: Wed Aug 2 08:01:45 2017 +0800
>
> blk-mq: don't leak preempt counter/q_usage_counter when allocating rq
> failed
>
> Adding Ming to this thread.
>
> I'm continuing with the debug of the new scenario (load nvme && sleep 10 &&
> unload nvme).

I'm reviewing that commit, and it looks wrong to me. It is only pairing
the blk_queue_exit if request allocation was successful. That will get
the q_usage_counter off when request allocation fails, making a queue
freeze impossible.

I'll send a patch.
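The imbalance described in the message above can be sketched as a toy model. This is not kernel code: `toy_queue`, `toy_enter`, `toy_exit`, `toy_can_freeze` and the two `toy_submit_*` functions are hypothetical names, and a plain integer stands in for q_usage_counter. The only assumption taken from the thread is that a reference taken on entry must be dropped on every path, otherwise the counter never returns to zero and a freeze cannot complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a queue's usage counter (hypothetical, not kernel code). */
struct toy_queue {
    int usage;                  /* references currently held on the queue */
};

void toy_enter(struct toy_queue *q)
{
    q->usage++;
}

void toy_exit(struct toy_queue *q)
{
    assert(q->usage > 0);       /* an exit must match an earlier enter */
    q->usage--;
}

/* In this model a freeze completes only once every reference is dropped. */
bool toy_can_freeze(const struct toy_queue *q)
{
    return q->usage == 0;
}

/* Unbalanced shape: the reference is dropped only if allocation succeeded. */
bool toy_submit_unbalanced(struct toy_queue *q, bool alloc_ok)
{
    toy_enter(q);
    if (!alloc_ok)
        return false;           /* failure path leaves the reference held */
    toy_exit(q);                /* stands in for the drop at request completion */
    return true;
}

/* Balanced shape: every return path drops the reference it took. */
bool toy_submit_balanced(struct toy_queue *q, bool alloc_ok)
{
    toy_enter(q);
    if (!alloc_ok) {
        toy_exit(q);            /* drop the reference on failure as well */
        return false;
    }
    toy_exit(q);
    return true;
}
```

Calling toy_submit_unbalanced() once with alloc_ok set to false leaves usage at 1, so toy_can_freeze() stays false from then on; the balanced variant returns the counter to 0 on both paths.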
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-01 12:58 Bug Report: can't unload nvme module in case of disabled device Max Gurtovoy
  2017-08-10  8:59 ` Christoph Hellwig
@ 2017-08-10 16:45 ` Keith Busch
  2017-08-10 19:17   ` Keith Busch
  1 sibling, 1 reply; 9+ messages in thread
From: Keith Busch @ 2017-08-10 16:45 UTC

On Tue, Aug 01, 2017 at 03:58:10PM +0300, Max Gurtovoy wrote:
> Hi all,
>
> I would like to report a bug that reproduced by the following steps (I'm
> using 4.13.0-rc3+):
>
> 1. modprobe nvme
> 2. echo 0 > /sys/block/nvme0n1/device/device/enable
> 3. nvme list (stuck for more than 1-2 mins)
> 4. modprobe -r nvme (stuck forever)
>
> log:
>
> [ 1342.388888] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1476.021392] INFO: task kworker/u98:1:436 blocked for more than 120
> seconds.
> [ 1476.029072] Not tainted 4.13.0-rc3+ #19
> [ 1476.033878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 1476.042505] kworker/u98:1 D 0 436 2 0x00000000
> [ 1476.048569] Workqueue: nvme-wq nvme_reset_work [nvme]
> [ 1476.054133] Call Trace:
> [ 1476.056862] __schedule+0x1dc/0x780
> [ 1476.060706] schedule+0x36/0x80
> [ 1476.064180] blk_mq_freeze_queue_wait+0x4b/0xb0
> [ 1476.069175] ? remove_wait_queue+0x60/0x60
> [ 1476.073693] nvme_wait_freeze+0x33/0x50 [nvme_core]
> [ 1476.079068] nvme_reset_work+0x6b9/0xc40 [nvme]
> [ 1476.084075] ? __switch_to+0x23e/0x4a0
> [ 1476.088209] process_one_work+0x149/0x360
> [ 1476.092625] worker_thread+0x4d/0x3c0
> [ 1476.096692] kthread+0x109/0x140
> [ 1476.100247] ? rescuer_thread+0x380/0x380
> [ 1476.104664] ? kthread_park+0x60/0x60
> [ 1476.108698] ret_from_fork+0x25/0x30

This looks like a path does not pair the freeze start with the reset's
freeze wait. I'll have to see what the pci 'enable' sysfs entry does.
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-10 16:45 ` Keith Busch
@ 2017-08-10 19:17   ` Keith Busch
  2017-08-10 19:34     ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2017-08-10 19:17 UTC

On Thu, Aug 10, 2017 at 12:45:36PM -0400, Keith Busch wrote:
> On Tue, Aug 01, 2017 at 03:58:10PM +0300, Max Gurtovoy wrote:
> > Hi all,
> >
> > I would like to report a bug that reproduced by the following steps (I'm
> > using 4.13.0-rc3+):
> >
> > 1. modprobe nvme
> > 2. echo 0 > /sys/block/nvme0n1/device/device/enable
> > 3. nvme list (stuck for more than 1-2 mins)
> > 4. modprobe -r nvme (stuck forever)
> >
> > log:
> >
> > [ 1342.388888] nvme nvme0: controller is down; will reset: CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1476.021392] INFO: task kworker/u98:1:436 blocked for more than 120
> > seconds.
> > [ 1476.029072] Not tainted 4.13.0-rc3+ #19
> > [ 1476.033878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [ 1476.042505] kworker/u98:1 D 0 436 2 0x00000000
> > [ 1476.048569] Workqueue: nvme-wq nvme_reset_work [nvme]
> > [ 1476.054133] Call Trace:
> > [ 1476.056862] __schedule+0x1dc/0x780
> > [ 1476.060706] schedule+0x36/0x80
> > [ 1476.064180] blk_mq_freeze_queue_wait+0x4b/0xb0
> > [ 1476.069175] ? remove_wait_queue+0x60/0x60
> > [ 1476.073693] nvme_wait_freeze+0x33/0x50 [nvme_core]
> > [ 1476.079068] nvme_reset_work+0x6b9/0xc40 [nvme]
> > [ 1476.084075] ? __switch_to+0x23e/0x4a0
> > [ 1476.088209] process_one_work+0x149/0x360
> > [ 1476.092625] worker_thread+0x4d/0x3c0
> > [ 1476.096692] kthread+0x109/0x140
> > [ 1476.100247] ? rescuer_thread+0x380/0x380
> > [ 1476.104664] ? kthread_park+0x60/0x60
> > [ 1476.108698] ret_from_fork+0x25/0x30
>
> This looks like a path does not pair the freeze start with the reset's
> freeze wait. I'll have to see what the pci 'enable' sysfs entry does.

I see how the freeze start/stops are not paired in this scenario:
nvme_dev_disable doesn't start the freeze if the pci device isn't
disabled. It uses this to know if it is disabling the device twice.

In this test, though, you are disabling the pci device without the
driver's knowledge, so that breaks that logic. In light of that, we'll
need different criteria to know when the driver should start a freeze.
I'll test some things out and send a patch.
* Bug Report: can't unload nvme module in case of disabled device
  2017-08-10 19:17   ` Keith Busch
@ 2017-08-10 19:34     ` Keith Busch
  0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2017-08-10 19:34 UTC

On Thu, Aug 10, 2017 at 03:17:17PM -0400, Keith Busch wrote:
> On Thu, Aug 10, 2017 at 12:45:36PM -0400, Keith Busch wrote:
> > On Tue, Aug 01, 2017 at 03:58:10PM +0300, Max Gurtovoy wrote:
> > > Hi all,
> > >
> > > I would like to report a bug that reproduced by the following steps (I'm
> > > using 4.13.0-rc3+):
> > >
> > > 1. modprobe nvme
> > > 2. echo 0 > /sys/block/nvme0n1/device/device/enable
> > > 3. nvme list (stuck for more than 1-2 mins)
> > > 4. modprobe -r nvme (stuck forever)
> > >
> > > log:
> > >
> > > [ 1342.388888] nvme nvme0: controller is down; will reset: CSTS=0x3,
> > > PCI_STATUS=0x10
> > > [ 1476.021392] INFO: task kworker/u98:1:436 blocked for more than 120
> > > seconds.
> > > [ 1476.029072] Not tainted 4.13.0-rc3+ #19
> > > [ 1476.033878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > this message.
> > > [ 1476.042505] kworker/u98:1 D 0 436 2 0x00000000
> > > [ 1476.048569] Workqueue: nvme-wq nvme_reset_work [nvme]
> > > [ 1476.054133] Call Trace:
> > > [ 1476.056862] __schedule+0x1dc/0x780
> > > [ 1476.060706] schedule+0x36/0x80
> > > [ 1476.064180] blk_mq_freeze_queue_wait+0x4b/0xb0
> > > [ 1476.069175] ? remove_wait_queue+0x60/0x60
> > > [ 1476.073693] nvme_wait_freeze+0x33/0x50 [nvme_core]
> > > [ 1476.079068] nvme_reset_work+0x6b9/0xc40 [nvme]
> > > [ 1476.084075] ? __switch_to+0x23e/0x4a0
> > > [ 1476.088209] process_one_work+0x149/0x360
> > > [ 1476.092625] worker_thread+0x4d/0x3c0
> > > [ 1476.096692] kthread+0x109/0x140
> > > [ 1476.100247] ? rescuer_thread+0x380/0x380
> > > [ 1476.104664] ? kthread_park+0x60/0x60
> > > [ 1476.108698] ret_from_fork+0x25/0x30
> >
> > This looks like a path does not pair the freeze start with the reset's
> > freeze wait. I'll have to see what the pci 'enable' sysfs entry does.
>
> I see how the freeze start/stops are not paired in this scenario:
> nvme_dev_disable doesn't start the freeze if the pci device isn't
> disabled. It uses this to know if it is disabling the device twice.
>
> In this test, though, you are disabling the pci device without the
> driver's knowledge, so that breaks that logic. In light of that, we'll
> need different criteria to know when the driver should start a freeze.
> I'll test some things out and send a patch.

This should fix it for your scenario, but I am not completely sure this
can't get the freeze depth higher than we need.

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cd888a4..ca03980 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2006,12 +2006,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	mutex_lock(&dev->shutdown_lock);
+
+	if (dev->ctrl.state == NVME_CTRL_LIVE ||
+	    dev->ctrl.state == NVME_CTRL_RESETTING)
+		nvme_start_freeze(&dev->ctrl);
+
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state != pci_channel_io_normal);
 	}
--
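The effect of moving the freeze start out of the pci_is_enabled() check can be sketched with a toy model. Nothing here is driver code: `toy_ctrl` and the `toy_*` functions are hypothetical, `live` stands in for the LIVE/RESETTING state test, and the rule that a freeze wait only finishes after a freeze start is a modelling assumption taken from the "not paired" description in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy controller state; every field is a hypothetical stand-in. */
struct toy_ctrl {
    bool pci_enabled;           /* what pci_is_enabled() would report */
    bool live;                  /* controller state is LIVE or RESETTING */
    int freeze_depth;           /* freeze starts not yet undone */
};

/* Gating before the patch: freeze start sits inside the PCI-enabled check. */
void toy_disable_old(struct toy_ctrl *c)
{
    if (c->pci_enabled) {
        if (c->live)
            c->freeze_depth++;
    }
}

/* Gating after the patch: freeze start depends on controller state only. */
void toy_disable_patched(struct toy_ctrl *c)
{
    if (c->live)
        c->freeze_depth++;
}

/* Model assumption: a freeze wait can finish only if a freeze was started. */
bool toy_wait_can_finish(const struct toy_ctrl *c)
{
    return c->freeze_depth > 0;
}

/* Undo one freeze start; each unfreeze must match an earlier start. */
void toy_unfreeze(struct toy_ctrl *c)
{
    assert(c->freeze_depth > 0);
    c->freeze_depth--;
}
```

With pci_enabled false and live true, which mirrors writing 0 to the sysfs 'enable' file behind the driver's back, toy_disable_old() leaves freeze_depth at 0 and the wait can never finish, while toy_disable_patched() raises it to 1. In this toy, calling the patched function twice with no state change gives a depth of 2 that a single toy_unfreeze() does not clear; whether the real driver can reach that situation is not established here.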