* [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-17  1:27 UTC
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, Zhang Yi, linux-block, Ming Lei, stable

When an NVMe PCI device is being reset and the reset fails,
nvme_remove_dead_ctrl() is called to handle the failure: the blk-mq hw
queues are stopped first, then .remove_work is scheduled to release the
driver.

Unfortunately, if the driver is being released via a sysfs store just
before .remove_work runs, del_gendisk() from nvme_remove() may hang
forever, because the hw queues are stopped and the writeback IOs
submitted from fsync_bdev() can never be completed.

This patch fixes the issue[1][2] by moving nvme_kill_queues() into
nvme_remove_dead_ctrl(). This works because nvme_remove() flushes
.reset_work, and it is safe because nvme_dev_disable() has already
started suspending queues and canceling requests.

[1] test script
fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
	-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
	-bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

sleep 35
echo 1 > $SYSFS_NVME_PCI_PATH/rescan
echo 1 > $SYSFS_NVME_PCI_PATH/reset
echo 1 > $SYSFS_NVME_PCI_PATH/remove
echo 1 > /sys/bus/pci/rescan

[2] kernel hang log
[ 492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
[ 492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
[ 492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 492.255346] nvme-test       D    0  5939   5938 0x00000080
[ 492.261475] Call Trace:
[ 492.264215]  __schedule+0x289/0x8f0
[ 492.268105]  ? write_cache_pages+0x14c/0x510
[ 492.272873]  schedule+0x36/0x80
[ 492.276381]  io_schedule+0x16/0x40
[ 492.280181]  wait_on_page_bit_common+0x137/0x220
[ 492.285336]  ? page_cache_tree_insert+0x120/0x120
[ 492.290589]  __filemap_fdatawait_range+0x128/0x1a0
[ 492.295941]  filemap_fdatawait_range+0x14/0x30
[ 492.300902]  filemap_fdatawait+0x23/0x30
[ 492.305282]  filemap_write_and_wait+0x4c/0x80
[ 492.310151]  __sync_blockdev+0x1f/0x40
[ 492.314336]  fsync_bdev+0x44/0x50
[ 492.318039]  invalidate_partition+0x24/0x50
[ 492.322710]  del_gendisk+0xcd/0x2e0
[ 492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
[ 492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
[ 492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[ 492.343519]  nvme_remove+0x5d/0x170 [nvme]
[ 492.348096]  pci_device_remove+0x39/0xc0
[ 492.352477]  device_release_driver_internal+0x141/0x1f0
[ 492.358311]  device_release_driver+0x12/0x20
[ 492.363072]  pci_stop_bus_device+0x8c/0xa0
[ 492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 492.373965]  remove_store+0x7c/0x90
[ 492.377852]  dev_attr_store+0x18/0x30
[ 492.381941]  sysfs_kf_write+0x3a/0x50
[ 492.386028]  kernfs_fop_write+0xff/0x180
[ 492.390409]  __vfs_write+0x37/0x160
[ 492.394304]  ? selinux_file_permission+0xe5/0x120
[ 492.399556]  ? security_file_permission+0x3b/0xc0
[ 492.404807]  vfs_write+0xb2/0x1b0
[ 492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
[ 492.413462]  SyS_write+0x55/0xc0
[ 492.417064]  do_syscall_64+0x67/0x180
[ 492.421155]  entry_SYSCALL64_slow_path+0x25/0x25

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fed803232edc..5e39abe57c56 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 
 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+
+	/*
+	 * nvme_dev_disable() has suspended queues, then no new I/O
+	 * can be submitted to hardware successfully any more, so
+	 * kill queues now for avoiding race between reset failure
+	 * and remove.
+	 */
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1993,7 +2001,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);
-- 
2.9.3
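
For context, nvme_kill_queues(), the helper this patch moves, looked
roughly like the sketch below in the v4.11-era core (a paraphrase, not
the exact source). The point that matters for this thread is that it
marks each namespace queue dying, fails everything parked on the
requeue list, and restarts any stopped hw queues so outstanding I/O is
dispatched and failed instead of waiting forever:

	void nvme_kill_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		mutex_lock(&ctrl->namespaces_mutex);
		list_for_each_entry(ns, &ctrl->namespaces, list) {
			/* refuse new I/O: the hardware is gone or dead */
			blk_set_queue_dying(ns->queue);

			/* complete everything on the requeue list with -EIO */
			blk_mq_abort_requeue_list(ns->queue);

			/*
			 * Restart stopped hw queues so already-queued
			 * requests are dispatched and completed as errors
			 * rather than sitting in stopped queues.
			 */
			blk_mq_start_stopped_hw_queues(ns->queue, true);
		}
		mutex_unlock(&ctrl->namespaces_mutex);
	}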
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Johannes Thumshirn @ 2017-05-17  6:38 UTC
To: Ming Lei, Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, Zhang Yi, linux-block, stable

On 05/17/2017 03:27 AM, Ming Lei wrote:
> [...]
> [1] test script
> fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	-bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

Nit: the actual size after the -size parameter is missing.

Anyways:
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-17  7:01 UTC
To: Johannes Thumshirn
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
    Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 08:38:01AM +0200, Johannes Thumshirn wrote:
> On 05/17/2017 03:27 AM, Ming Lei wrote:
> > [...]
> > [1] test script
> > fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	-bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
>
> Nit: the actual size after the -size parameter is missing.

Forgot to mention: $NVME_DISK has to be one partition of the nvme disk,
and it needs to contain a filesystem for the reproduction; then the
actual size isn't needed.

> Anyways:
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Thanks for the review!

Thanks,
Ming
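
For anyone trying to reproduce this, a hypothetical setup sketch for
such a partition is shown below. It is not part of the thread; the
device names are placeholders and the choice of xfs is an assumption,
any local filesystem should do:

	# hypothetical setup sketch; /dev/nvme0n1 and xfs are assumptions
	parted --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 100%
	mkfs.xfs -f /dev/nvme0n1p1
	NVME_DISK=/dev/nvme0n1p1
	SYSFS_NVME_PCI_PATH=/sys/bus/pci/devices/0000:01:00.0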
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Christoph Hellwig @ 2017-05-18 13:47 UTC
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
    Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> [...]
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fed803232edc..5e39abe57c56 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
>  
>  	kref_get(&dev->ctrl.kref);
>  	nvme_dev_disable(dev, false);
> +
> +	/*
> +	 * nvme_dev_disable() has suspended queues, then no new I/O
> +	 * can be submitted to hardware successfully any more, so
> +	 * kill queues now for avoiding race between reset failure
> +	 * and remove.

How about this instead:

	/*
	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
	 * be submitted now.  Kill the queues now to avoid races between a
	 * possible reset failure and the controller removal work.
	 */

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-18 15:04 UTC
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Thu, May 18, 2017 at 03:47:34PM +0200, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> > [...]
> > +	/*
> > +	 * nvme_dev_disable() has suspended queues, then no new I/O
> > +	 * can be submitted to hardware successfully any more, so
> > +	 * kill queues now for avoiding race between reset failure
> > +	 * and remove.
>
> How about this instead:
>
> 	/*
> 	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
> 	 * be submitted now.  Kill the queues now to avoid races between a
> 	 * possible reset failure and the controller removal work.
> 	 */

That is fine, will change to this.

Thanks,
Ming
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Keith Busch @ 2017-05-18 14:13 UTC
To: Ming Lei
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> When an NVMe PCI device is being reset and the reset fails,
> nvme_remove_dead_ctrl() is called to handle the failure: the blk-mq hw
> queues are stopped first, then .remove_work is scheduled to release the
> driver.
>
> Unfortunately, if the driver is being released via a sysfs store just
> before .remove_work runs, del_gendisk() from nvme_remove() may hang
> forever, because the hw queues are stopped and the writeback IOs
> submitted from fsync_bdev() can never be completed.
>
> This patch fixes the issue[1][2] by moving nvme_kill_queues() into
> nvme_remove_dead_ctrl(). This works because nvme_remove() flushes
> .reset_work, and it is safe because nvme_dev_disable() has already
> started suspending queues and canceling requests.

I'm still not sure moving where we kill the queues is the correct way
to fix this problem. The nvme_kill_queues restarts all the hardware
queues to force all IO to failure already, so why is this really stuck?

We should be able to make forward progress even if we kill the queues
while calling into del_gendisk, right? That could happen with a different
sequence of events, so that also needs to work.
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-19 12:52 UTC
To: Keith Busch
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Thu, May 18, 2017 at 10:13:07AM -0400, Keith Busch wrote:
> [...]
> I'm still not sure moving where we kill the queues is the correct way
> to fix this problem. The nvme_kill_queues restarts all the hardware
> queues to force all IO to failure already, so why is this really stuck?

That is a good question. Today I investigated further and figured out
that it is because blk_mq_start_stopped_hw_queues() in
nvme_kill_queues() does not actually run the hw queues: the queues have
already been started in nvme_reset_work(), so the restart is a no-op.

Will figure out a patch later to fix that issue. And the reason why
this patch 'fixes' the issue is probably just timing, I guess.

> We should be able to make forward progress even if we kill the queues
> while calling into del_gendisk, right? That could happen with a different
> sequence of events, so that also needs to work.

Now I am not a big fan of this patch as the fix for that issue. But I
still think it is better to move nvme_kill_queues() into
nvme_remove_dead_ctrl() as an improvement, because during this small
window the page cache can be used up by a writing application while no
writeback can make progress.

Thanks,
Ming
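
For context, the v4.11-era restart helper looked roughly like the
sketch below (a paraphrase, not the exact source). The early continue
is exactly the no-op described above: once the reset path has already
restarted the queues, nothing is run, so requeued writeback I/O is
never dispatched to fail:

	void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
	{
		struct blk_mq_hw_ctx *hctx;
		int i;

		queue_for_each_hw_ctx(q, hctx, i) {
			/* queue already started: nothing is run for it */
			if (!blk_mq_hctx_stopped(hctx))
				continue;

			clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
			blk_mq_run_hw_queue(hctx, async);
		}
	}

If memory serves, the follow-up fix switched nvme_kill_queues() to an
unconditional queue run so the requeued I/O always gets dispatched and
failed.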
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Keith Busch @ 2017-05-19 15:15 UTC
To: Ming Lei
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Fri, May 19, 2017 at 08:52:45PM +0800, Ming Lei wrote:
> But I still think it is better to move nvme_kill_queues() into
> nvme_remove_dead_ctrl() as an improvement, because during this small
> window the page cache can be used up by a writing application while no
> writeback can make progress.

Yes, I agree that's a better placement for it. I was just concerned
about the reasoning since it would also mean we're still stuck if an IO
timeout occurs while calling del_gendisk. So I'm okay with the patch
as-is, but I'll also look into my other concern.
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Jens Axboe @ 2017-05-19 14:41 UTC
To: Ming Lei, Keith Busch, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, Zhang Yi, linux-block, stable

On 05/16/2017 07:27 PM, Ming Lei wrote:
> [...]
> [1] test script
> fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	-bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
>
> sleep 35
> echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> echo 1 > $SYSFS_NVME_PCI_PATH/reset
> echo 1 > $SYSFS_NVME_PCI_PATH/remove
> echo 1 > /sys/bus/pci/rescan

The patch looks good to me. But since you have a nice reproducer, how about
turning that into a blktests [1] test case?

[1] https://github.com/osandov/blktests

-- 
Jens Axboe
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-19 15:10 UTC
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-block, stable,
    linux-nvme, Zhang Yi

On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> [...]
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?

This test has triggered several problems recently, so it looks like a
good idea to integrate it into blktests. I am a little busy at the
moment, and it may take a while before I can start on that. If anyone
would like to integrate the case, please go ahead, and I am happy to
provide any details you need.

Thanks,
Ming
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: Ming Lei @ 2017-05-19 16:40 UTC
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-block, stable,
    linux-nvme, Zhang Yi

On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> [...]
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?
>
> [1] https://github.com/osandov/blktests

Forgot to mention, this test case was written by Zhang Yi.

Zhang Yi, maybe you can try to integrate your NVMe test case into
blktests if you are interested, :-)

Thanks,
Ming
* Re: [PATCH 1/2] nvme: fix race between removing and resetting failure
From: yizhan @ 2017-05-19 16:55 UTC
To: Ming Lei, Jens Axboe
Cc: linux-block, Sagi Grimberg, linux-nvme, Keith Busch, stable,
    Christoph Hellwig

On 05/20/2017 12:40 AM, Ming Lei wrote:
> [...]
> Zhang Yi, maybe you can try to integrate your NVMe test case into
> blktests if you are interested, :-)

Sure, I will have a try. :)

Thanks
Yi
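
As a starting point for that integration, a blktests-style wrapper
around the reproducer might look roughly like the sketch below. It is
only an illustration assembled from the script quoted earlier in the
thread: the fio options are trimmed, and the helper and variable names
(_have_fio, _get_pci_dev_from_blkdev, TEST_DEV, FULL) are assumptions
about the blktests framework rather than a verified test:

	#!/bin/bash
	# Hypothetical blktests case: race NVMe reset/remove against
	# heavy buffered writeback, based on Zhang Yi's reproducer.

	DESCRIPTION="race NVMe controller reset/remove against buffered writes"
	TIMED=1

	requires() {
		_have_fio
	}

	test_device() {
		echo "Running ${TEST_NAME}"

		# PCI sysfs directory of the device under test; the helper
		# name is an assumption, adjust to what the framework provides.
		local pdev
		pdev="/sys/bus/pci/devices/$(_get_pci_dev_from_blkdev)"

		# Buffered unaligned random writes; must target a partition
		# holding a filesystem so fsync_bdev() has dirty pages to wait on.
		fio --filename="$TEST_DEV" --iodepth=1 --thread --rw=randwrite \
		    --ioengine=psync --bs_unaligned --runtime=120 --numjobs=60 \
		    --name=mytest --group_reporting &> "$FULL" &

		sleep 35
		echo 1 > "$pdev/rescan"
		echo 1 > "$pdev/reset"
		echo 1 > "$pdev/remove"
		echo 1 > /sys/bus/pci/rescan

		wait

		echo "Test complete"
	}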
* [PATCH 2/2] nvme: avoid hang when removing disk
From: Ming Lei @ 2017-05-17  1:27 UTC
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, Zhang Yi, linux-block, Ming Lei, stable

If writeback requests are submitted just before the queue is killed,
they may not be canceled in nvme_dev_disable() because they have not
been started yet, so it is still possible for blk-mq to hold these
requests in the .requeue list.

So we have to abort these requests before del_gendisk(), because
del_gendisk() may wait for their completion.

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5e0906262ea..8eaeea86509a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+		/*
+		 * If the queue is dead, we have to abort requests in the
+		 * requeue list first, because fsync_bdev() in the disk
+		 * removal path may wait for these IOs, which can never
+		 * be submitted to hardware.
+		 */
+		if (blk_queue_dying(ns->queue))
+			blk_mq_abort_requeue_list(ns->queue);
 		del_gendisk(ns->disk);
 		blk_mq_abort_requeue_list(ns->queue);
 		blk_cleanup_queue(ns->queue);
-- 
2.9.3
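
For context, blk_mq_abort_requeue_list(), which both call sites rely
on, behaved roughly as sketched below at the time (a paraphrase, not
the exact source): it splices all parked requests off the queue's
requeue list and completes each one with -EIO, which is what lets the
fsync_bdev() waiters finish:

	void blk_mq_abort_requeue_list(struct request_queue *q)
	{
		unsigned long flags;
		LIST_HEAD(rq_list);

		/* detach the whole requeue list under the lock */
		spin_lock_irqsave(&q->requeue_lock, flags);
		list_splice_init(&q->requeue_list, &rq_list);
		spin_unlock_irqrestore(&q->requeue_lock, flags);

		/* end each parked request as an I/O error */
		while (!list_empty(&rq_list)) {
			struct request *rq;

			rq = list_first_entry(&rq_list, struct request,
					      queuelist);
			list_del_init(&rq->queuelist);
			rq->errors = -EIO;
			blk_mq_end_request(rq, rq->errors);
		}
	}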
* Re: [PATCH 2/2] nvme: avoid hang when removing disk
From: Christoph Hellwig @ 2017-05-18 13:49 UTC
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
    Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 09:27:29AM +0800, Ming Lei wrote:
> If writeback requests are submitted just before the queue is killed,
> they may not be canceled in nvme_dev_disable() because they have not
> been started yet, so it is still possible for blk-mq to hold these
> requests in the .requeue list.
>
> So we have to abort these requests before del_gendisk(), because
> del_gendisk() may wait for their completion.
>
> [...]
> +		if (blk_queue_dying(ns->queue))
> +			blk_mq_abort_requeue_list(ns->queue);
>  		del_gendisk(ns->disk);
>  		blk_mq_abort_requeue_list(ns->queue);

Why can't we just move the blk_mq_abort_requeue_list call before
del_gendisk in general?
* Re: [PATCH 2/2] nvme: avoid hang when removing disk
From: Ming Lei @ 2017-05-18 15:35 UTC
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Thu, May 18, 2017 at 03:49:31PM +0200, Christoph Hellwig wrote:
> [...]
> Why can't we just move the blk_mq_abort_requeue_list call before
> del_gendisk in general?

That may cause data loss if the queue isn't killed. Normally the queue
is only killed when the controller is dead (such as on reset failure)
or when !pci_device_is_present() (in nvme_remove()).

Thanks,
Ming
* Re: [PATCH 2/2] nvme: avoid hang when removing disk
From: Keith Busch @ 2017-05-18 16:06 UTC
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Thu, May 18, 2017 at 11:35:43PM +0800, Ming Lei wrote:
> That may cause data loss if the queue isn't killed. Normally the queue
> is only killed when the controller is dead (such as on reset failure)
> or when !pci_device_is_present() (in nvme_remove()).

But in your test, your controller isn't even dead. Why are we killing
it when it's still functional? I think we need to first not consider
this perfectly functional controller to be dead under these conditions,
and second, understand why killing the queues after del_gendisk is called
does not allow forward progress.
* Re: [PATCH 2/2] nvme: avoid hang when removing disk
From: Ming Lei @ 2017-05-19 13:19 UTC
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, Zhang Yi,
    linux-block, stable

On Thu, May 18, 2017 at 12:06:24PM -0400, Keith Busch wrote:
> [...]
> But in your test, your controller isn't even dead. Why are we killing
> it when it's still functional? I think we need to first not consider

Last time, you posted a patch which looked like it fixed the hang, but
the NVMe device became non-functional after a later rebind; see the
link:

	http://marc.info/?l=linux-block&m=149430047402103&w=2

Also I am not sure it is worth doing that, because in my case
nvme_dev_disable() has been called in nvme_reset_notify() before the
reset, which means requests have been canceled already.

As I mentioned, the two directions aren't contradictory. But treating
the controller as live in this situation needs big changes, such as not
canceling requests in the whole reset path.

Finally, this patch is not only for the case of reset failure vs.
remove, but also for hot-unplug: queues are still killed in the
hot-remove path.

> this perfectly functional controller to be dead under these conditions,
> and second, understand why killing the queues after del_gendisk is called
> does not allow forward progress.

I explained this cause in the last email, :-)

Thanks,
Ming
Thread overview: 17+ messages
[not found] <20170517012729.13469-1-ming.lei@redhat.com>
2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and resetting failure Ming Lei
2017-05-17 6:38 ` Johannes Thumshirn
2017-05-17 7:01 ` Ming Lei
2017-05-18 13:47 ` Christoph Hellwig
2017-05-18 15:04 ` Ming Lei
2017-05-18 14:13 ` Keith Busch
2017-05-19 12:52 ` Ming Lei
2017-05-19 15:15 ` Keith Busch
2017-05-19 14:41 ` Jens Axboe
2017-05-19 15:10 ` Ming Lei
2017-05-19 16:40 ` Ming Lei
2017-05-19 16:55 ` yizhan
2017-05-17  1:27 ` [PATCH 2/2] nvme: avoid hang when removing disk Ming Lei
2017-05-18 13:49 ` Christoph Hellwig
2017-05-18 15:35 ` Ming Lei
2017-05-18 16:06 ` Keith Busch
2017-05-19 13:19 ` Ming Lei