* [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
@ 2025-01-13 12:44 Ming Lei
2025-01-14 17:06 ` Ewan Milne
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Ming Lei @ 2025-01-13 12:44 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch, linux-nvme
Cc: Ming Lei, Marco Patalano, Ewan Milne, James Smart, Sagi Grimberg
Currently the tagsets are removed from nvme_fc_ctrl_free(), which is the
ctrl->ref release handler, and the admin queue is even unquiesced there.
This is definitely wrong because ctrl->ref is grabbed when submitting a
command.

Marco also observed that nvme_fc_ctrl_free() can be called from the request
completion code path, triggering a kernel warning since requests complete
from softirq context.

Fix the issue by moving tagset removal into nvme_fc_delete_ctrl(), which
also aligns nvme-fc with nvme-tcp and nvme-rdma.
Cc: Marco Patalano <mpatalan@redhat.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/nvme/host/fc.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b81af7919e94..b94b4e50a3df 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2389,17 +2389,11 @@ nvme_fc_ctrl_free(struct kref *ref)
container_of(ref, struct nvme_fc_ctrl, ref);
unsigned long flags;
- if (ctrl->ctrl.tagset)
- nvme_remove_io_tag_set(&ctrl->ctrl);
-
/* remove from rport list */
spin_lock_irqsave(&ctrl->rport->lock, flags);
list_del(&ctrl->ctrl_list);
spin_unlock_irqrestore(&ctrl->rport->lock, flags);
- nvme_unquiesce_admin_queue(&ctrl->ctrl);
- nvme_remove_admin_tag_set(&ctrl->ctrl);
-
kfree(ctrl->queues);
put_device(ctrl->dev);
@@ -3288,6 +3282,13 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
cancel_work_sync(&ctrl->ioerr_work);
cancel_delayed_work_sync(&ctrl->connect_work);
+
+ if (ctrl->ctrl.tagset)
+ nvme_remove_io_tag_set(&ctrl->ctrl);
+
+ nvme_unquiesce_admin_queue(&ctrl->ctrl);
+ nvme_remove_admin_tag_set(&ctrl->ctrl);
+
/*
* kill the association on the link side. this will block
* waiting for io to terminate
--
2.47.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-13 12:44 [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl() Ming Lei
@ 2025-01-14 17:06 ` Ewan Milne
2025-01-14 18:31 ` Ewan Milne
2025-01-16 6:10 ` Christoph Hellwig
2025-01-20 13:51 ` Hannes Reinecke
2 siblings, 1 reply; 9+ messages in thread
From: Ewan Milne @ 2025-01-14 17:06 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
James Smart, Sagi Grimberg
Hi Ming-
So there is definitely something wrong, because we recently started seeing
a 100% reproducible warning on a couple of our test systems (but not on
another one) where the nvme_fc_ctrl kref decrements to zero on command
completion. However, I'm not sure that this patch addresses the underlying
issue. If there was an outstanding command on the admin queue, a
corresponding kref should have been taken in nvme_fc_start_fcp_op(), so we
should not be in nvme_fc_ctrl_free() until after it completes, and the
kref should not be zero at the end of command completion.

Normally, we would have already gone through the code path at the bottom
of __nvme_fc_abort_outstanding_ios(), which should have resulted in no
commands outstanding on either the admin_q or the io queues prior to the
->free_ctrl() callback to nvme_fc_free_ctrl(), which I think should be the
last put on the nvme_fc_ctrl object.

I think there is also a problem with your change regarding when the
tagsets are removed, which addresses the WARN_ON() in
blk_mq_run_hw_queue():

    WARN_ON_ONCE(!async && in_interrupt());

...because moving the removal of the tagsets earlier in
nvme_fc_delete_ctrl() puts it prior to the nvme_fc_delete_association()
call, which then calls __nvme_fc_abort_outstanding_ios(), which invokes
blk_mq_tagset_busy_iter() on the tagsets. Perhaps moving the tagset
removal to after the nvme_fc_delete_association() call would address your
specific concern regarding the correct execution context?
I think the real problem is that nvme_fc_ctrl_free() is called due to the
kref put when the command completes. This shouldn't happen. The last put
should happen when the core code calls ->free_ctrl().

Below is the stack trace we see, for those interested. The test is doing
an echo 1 > /sys/..../delete_controller. I can't reproduce this on my
otherwise identical system (connected to different storage fabric arrays,
though), so we have to run any experiments through the test automation
instead of interactively.
-Ewan
[ 4095.189976] ------------[ cut here ]------------
[ 4095.194596] WARNING: CPU: 9 PID: 38355 at block/blk-mq.c:2325
blk_mq_run_hw_queue+0x26/0x280
[ 4095.203036] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace netfs sunrpc dm_service_time dm_multipath
amd_atl intel_rapl_msr intel_rapl_common amd64_edac dell_pc
platform_profile edac_mce_amd kvm_amd dell_wmi sparse_keymap rfkill
dell_smbios video dcdbas ipmi_ssif mgag200 kvm i2c_algo_bit
drm_client_lib drm_shmem_helper drm_kms_helper dell_wmi_descriptor
wmi_bmof rapl pcspkr i2c_piix4 ptdma i2c_smbus k10temp ipmi_si
acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler drm fuse xfs
libcrc32c sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_keyring ahci
crct10dif_pclmul nvme_core libahci crc32_pclmul nvme_auth crc32c_intel
libata ghash_clmulni_intel ccp scsi_transport_fc tg3 megaraid_sas
sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
[ 4095.271190] CPU: 9 UID: 0 PID: 38355 Comm: kworker/u65:3 Not
tainted 6.13.0-rc4+ #1
[ 4095.278851] Hardware name: Dell Inc. PowerEdge R6515/04F3CJ, BIOS
2.13.3 09/12/2023
[ 4095.286515] Workqueue: 0x0 (writeback)
[ 4095.290361] RIP: 0010:blk_mq_run_hw_queue+0x26/0x280
[ 4095.295334] Code: 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41
54 55 48 89 fd 53 89 f3 40 84 f6 75 1c 65 f7 05 34 e9 40 4e 00 ff ff
00 74 02 <0f> 0b f6 85 a8 00 00 00 10 0f 85 16 02 00 00 48 8b 85 b8 00
00 00
[ 4095.314088] RSP: 0018:ffffb8ca0056cea8 EFLAGS: 00010206
[ 4095.319321] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 4095.326455] RDX: ffff9c82cb227728 RSI: 0000000000000000 RDI: ffff9c82d1e73e00
[ 4095.333585] RBP: ffff9c82d1e73e00 R08: 0000000000000000 R09: ffffb8ca0056ccf8
[ 4095.340720] R10: ffffb8ca0056ccf0 R11: ffffffffb33e1348 R12: 0000000000000000
[ 4095.347851] R13: ffff9c82ddfd1380 R14: 0000000000000010 R15: ffffffffb30070e0
[ 4095.354986] FS: 0000000000000000(0000) GS:ffff9c862f380000(0000)
knlGS:0000000000000000
[ 4095.363078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4095.368825] CR2: 0000563f2b03c0b0 CR3: 000000011dfe6000 CR4: 0000000000350ef0
[ 4095.375966] Call Trace:
[ 4095.378418] <IRQ>
[ 4095.380439] ? __warn+0x84/0x130
[ 4095.383679] ? blk_mq_run_hw_queue+0x26/0x280
[ 4095.388046] ? report_bug+0x18a/0x1a0
[ 4095.391721] ? handle_bug+0x53/0x90
[ 4095.395222] ? exc_invalid_op+0x14/0x70
[ 4095.399063] ? asm_exc_invalid_op+0x16/0x20
[ 4095.403261] ? blk_mq_run_hw_queue+0x26/0x280
[ 4095.407626] blk_mq_run_hw_queues+0x63/0x120
[ 4095.411908] __blk_freeze_queue_start+0x64/0x70
[ 4095.416447] blk_queue_start_drain+0x18/0x50
[ 4095.420728] blk_mq_destroy_queue+0x30/0x60
[ 4095.424915] nvme_remove_io_tag_set+0x26/0x40 [nvme_core]
[ 4095.430330] nvme_fc_ctrl_free+0x2e/0x130 [nvme_fc]
[ 4095.435219] blk_complete_reqs+0x40/0x50
[ 4095.439154] handle_softirqs+0xd3/0x2b0
[ 4095.443004] __irq_exit_rcu+0xcd/0xf0
[ 4095.446673] sysvec_call_function_single+0x71/0x90
[ 4095.451468] </IRQ>
On Mon, Jan 13, 2025 at 7:45 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Currently the tagsets are removed from nvme_fc_ctrl_free(), which is the
> ctrl->ref release handler, and the admin queue is even unquiesced there.
> This is definitely wrong because ctrl->ref is grabbed when submitting a
> command.
>
> Marco also observed that nvme_fc_ctrl_free() can be called from the
> request completion code path, triggering a kernel warning since requests
> complete from softirq context.
>
> Fix the issue by moving tagset removal into nvme_fc_delete_ctrl(), which
> also aligns nvme-fc with nvme-tcp and nvme-rdma.
>
> Cc: Marco Patalano <mpatalan@redhat.com>
> Cc: Ewan Milne <emilne@redhat.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> drivers/nvme/host/fc.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index b81af7919e94..b94b4e50a3df 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -2389,17 +2389,11 @@ nvme_fc_ctrl_free(struct kref *ref)
> container_of(ref, struct nvme_fc_ctrl, ref);
> unsigned long flags;
>
> - if (ctrl->ctrl.tagset)
> - nvme_remove_io_tag_set(&ctrl->ctrl);
> -
> /* remove from rport list */
> spin_lock_irqsave(&ctrl->rport->lock, flags);
> list_del(&ctrl->ctrl_list);
> spin_unlock_irqrestore(&ctrl->rport->lock, flags);
>
> - nvme_unquiesce_admin_queue(&ctrl->ctrl);
> - nvme_remove_admin_tag_set(&ctrl->ctrl);
> -
> kfree(ctrl->queues);
>
> put_device(ctrl->dev);
> @@ -3288,6 +3282,13 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
>
> cancel_work_sync(&ctrl->ioerr_work);
> cancel_delayed_work_sync(&ctrl->connect_work);
> +
> + if (ctrl->ctrl.tagset)
> + nvme_remove_io_tag_set(&ctrl->ctrl);
> +
> + nvme_unquiesce_admin_queue(&ctrl->ctrl);
> + nvme_remove_admin_tag_set(&ctrl->ctrl);
> +
> /*
> * kill the association on the link side. this will block
> * waiting for io to terminate
> --
> 2.47.0
>
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-14 17:06 ` Ewan Milne
@ 2025-01-14 18:31 ` Ewan Milne
2025-01-15 17:33 ` Ewan Milne
0 siblings, 1 reply; 9+ messages in thread
From: Ewan Milne @ 2025-01-14 18:31 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
James Smart, Sagi Grimberg
The patch results in a crash with slub_debug=FZPU enabled. I'll try moving
the tagset removal further down.
-Ewan
[ 36.608477] BUG: unable to handle page fault for address: ffffffffbd92a8d8
[ 36.608479] #PF: supervisor read access in kernel mode
[ 36.608481] #PF: error_code(0x0000) - not-present page
[ 36.608483] PGD 687827067 P4D 687827067 PUD 687828063 PMD 10938d063
PTE 800ffff9788d5022
[ 36.608488] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[ 36.608492] CPU: 3 UID: 0 PID: 952 Comm: nvme Not tainted 6.13.0-rc6+ #2
[ 36.608495] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS
1.4.8 05/22/2018
[ 36.608496] RIP: 0010:lock_timer_base+0x4a/0xa0
[ 36.608504] Code: 1c 73 bf 00 f3 90 44 8b 6b 20 41 f7 c5 00 00 04
00 75 f1 44 89 e8 44 89 ea 25 00 00 10 00 83 f8 01 48 19 c0 81 e2 ff
ff 03 00 <48> 8b 14 d5 80 4d 77 bd 25 80 12 00 00 48 05 00 1b 02 00 4c
8d 34
[ 36.608506] RSP: 0018:ffffabdd80b77b70 EFLAGS: 00010202
[ 36.608509] RAX: ffffffffffffffff RBX: ffff898856cdd970 RCX: 0000000000000002
[ 36.608511] RDX: 0000000000036b6b RSI: ffffabdd80b77ba0 RDI: ffff898856cdd970
[ 36.608512] RBP: 0000000000024000 R08: ffff899350bb6730 R09: 00000008721e3400
[ 36.608513] R10: 0000000000000000 R11: ffff899350bb430c R12: ffffabdd80b77ba0
[ 36.608515] R13: 000000006b6b6b6b R14: ffffabdd80b77d28 R15: ffff898860917060
[ 36.608516] FS: 00007f4d32b0fc40(0000) GS:ffff899350b80000(0000)
knlGS:0000000000000000
[ 36.608519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 36.608520] CR2: ffffffffbd92a8d8 CR3: 00000001154ba006 CR4: 00000000007726f0
[ 36.608522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 36.608523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 36.608524] PKRU: 55555554
[ 36.608525] Call Trace:
[ 36.608526] <TASK>
[ 36.608528] ? show_trace_log_lvl+0x1b0/0x2f0
[ 36.608532] ? show_trace_log_lvl+0x1b0/0x2f0
[ 36.608536] ? __try_to_del_timer_sync+0x2b/0x90
[ 36.608540] ? __die_body.cold+0x8/0x12
[ 36.608543] ? page_fault_oops+0x146/0x160
[ 36.608549] ? exc_page_fault+0x152/0x160
[ 36.608553] ? asm_exc_page_fault+0x26/0x30
[ 36.608559] ? lock_timer_base+0x4a/0xa0
[ 36.608563] __try_to_del_timer_sync+0x2b/0x90
[ 36.608567] __timer_delete_sync+0x28/0x40
[ 36.608571] blk_sync_queue+0x19/0x30
[ 36.608575] __nvme_fc_abort_outstanding_ios+0x76/0x160 [nvme_fc]
[ 36.608582] nvme_fc_delete_association+0x84/0x250 [nvme_fc]
[ 36.608587] ? blk_mq_free_tag_set+0x120/0x160
[ 36.608593] nvme_do_delete_ctrl+0x7d/0x90 [nvme_core]
[ 36.608612] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
[ 36.608624] nvme_sysfs_delete+0x38/0x50 [nvme_core]
[ 36.608638] kernfs_fop_write_iter+0x13b/0x1f0
[ 36.608643] vfs_write+0x28d/0x450
[ 36.608647] ksys_write+0x6c/0xe0
[ 36.608650] do_syscall_64+0x7d/0x160
[ 36.608654] ? syscall_exit_to_user_mode+0x32/0x190
[ 36.608658] ? do_syscall_64+0x89/0x160
[ 36.608661] ? rseq_get_rseq_cs+0x1d/0x220
[ 36.608665] ? rseq_ip_fixup+0x8d/0x1d0
[ 36.608668] ? nvme_dev_ioctl+0x1b3/0x1d0 [nvme_core]
[ 36.608680] ? switch_fpu_return+0x4e/0xd0
[ 36.608684] ? arch_exit_to_user_mode_prepare.isra.0+0x83/0xb0
[ 36.608687] ? syscall_exit_to_user_mode+0x32/0x190
[ 36.608690] ? do_syscall_64+0x89/0x160
[ 36.608694] ? exc_page_fault+0x73/0x160
[ 36.608697] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 36.608701] RIP: 0033:0x7f4d32d8c6a4
[ 36.608704] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
84 00 00 00 00 00 f3 0f 1e fa 80 3d a5 c9 0d 00 00 74 13 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24
18 48
[ 36.608705] RSP: 002b:00007ffef88f49a8 EFLAGS: 00000202 ORIG_RAX:
0000000000000001
[ 36.608708] RAX: ffffffffffffffda RBX: 0000565309cf54f0 RCX: 00007f4d32d8c6a4
[ 36.608709] RDX: 0000000000000001 RSI: 00007f4d32edcdb5 RDI: 0000000000000004
[ 36.608710] RBP: 0000565309cf53c0 R08: 0000000000000073 R09: 00000000ffffffff
[ 36.608712] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004
[ 36.608713] R13: 0000565309cf53c0 R14: 0000565309cf5410 R15: 00007ffef88f4bf0
[ 36.608717] </TASK>
[ 36.608718] Modules linked in: qrtr rfkill intel_rapl_msr
intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm ipmi_ssif dell_wmi_descriptor
iTCO_wdt platform_profile iTCO_vendor_support wmi ipmi_si rapl mei_me
acpi_power_meter dcdbas tg3 mei intel_cstate i2c_i801 intel_uncore
acpi_ipmi intel_pch_thermal pcspkr lpc_ich i2c_smbus sg ipmi_devintf
ipmi_msghandler dm_multipath loop nfnetlink xfs lpfc sd_mod nvmet_fc
nvmet nvme_fc nvme_fabrics nvme_keyring crct10dif_pclmul nvme_core
crc32_pclmul ahci crc32c_intel libahci nvme_auth mgag200
ghash_clmulni_intel megaraid_sas scsi_transport_fc libata i2c_algo_bit
dm_mirror dm_region_hash dm_log dm_mod fuse
[ 36.608765] CR2: ffffffffbd92a8d8
[ 36.608766] ---[ end trace 0000000000000000 ]---
On Tue, Jan 14, 2025 at 12:06 PM Ewan Milne <emilne@redhat.com> wrote:
>
> Hi Ming-
>
> So there is definitely something wrong, because we recently started
> seeing a 100% reproducible warning on a couple of our test systems (but
> not on another one) where the nvme_fc_ctrl kref decrements to zero on
> command completion. However, I'm not sure that this patch addresses the
> underlying issue. If there was an outstanding command on the admin queue,
> a corresponding kref should have been taken in nvme_fc_start_fcp_op(), so
> we should not be in nvme_fc_ctrl_free() until after it completes, and the
> kref should not be zero at the end of command completion.
>
> Normally, we would have already gone through the code path at the bottom
> of __nvme_fc_abort_outstanding_ios(), which should have resulted in no
> commands outstanding on either the admin_q or the io queues prior to the
> ->free_ctrl() callback to nvme_fc_free_ctrl(), which I think should be
> the last put on the nvme_fc_ctrl object.
>
> I think there is also a problem with your change regarding when the
> tagsets are removed, which addresses the WARN_ON() in
> blk_mq_run_hw_queue():
>
>     WARN_ON_ONCE(!async && in_interrupt());
>
> ...because moving the removal of the tagsets earlier in
> nvme_fc_delete_ctrl() puts it prior to the nvme_fc_delete_association()
> call, which then calls __nvme_fc_abort_outstanding_ios(), which invokes
> blk_mq_tagset_busy_iter() on the tagsets. Perhaps moving the tagset
> removal to after the nvme_fc_delete_association() call would address your
> specific concern regarding the correct execution context?
>
> I think the real problem is that nvme_fc_ctrl_free() is called due to the
> kref put when the command completes. This shouldn't happen. The last put
> should happen when the core code calls ->free_ctrl().
>
> Below is the stack trace we see, for those interested. The test is doing
> an echo 1 > /sys/..../delete_controller.
>
> I can't reproduce this on my otherwise identical system (connected to
> different storage fabric arrays, though), so we have to run any
> experiments through the test automation instead of interactively.
>
> -Ewan
>
> [ 4095.189976] ------------[ cut here ]------------
> [ 4095.194596] WARNING: CPU: 9 PID: 38355 at block/blk-mq.c:2325
> blk_mq_run_hw_queue+0x26/0x280
> [ 4095.203036] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4
> dns_resolver nfs lockd grace netfs sunrpc dm_service_time dm_multipath
> amd_atl intel_rapl_msr intel_rapl_common amd64_edac dell_pc
> platform_profile edac_mce_amd kvm_amd dell_wmi sparse_keymap rfkill
> dell_smbios video dcdbas ipmi_ssif mgag200 kvm i2c_algo_bit
> drm_client_lib drm_shmem_helper drm_kms_helper dell_wmi_descriptor
> wmi_bmof rapl pcspkr i2c_piix4 ptdma i2c_smbus k10temp ipmi_si
> acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler drm fuse xfs
> libcrc32c sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_keyring ahci
> crct10dif_pclmul nvme_core libahci crc32_pclmul nvme_auth crc32c_intel
> libata ghash_clmulni_intel ccp scsi_transport_fc tg3 megaraid_sas
> sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
> [ 4095.271190] CPU: 9 UID: 0 PID: 38355 Comm: kworker/u65:3 Not
> tainted 6.13.0-rc4+ #1
> [ 4095.278851] Hardware name: Dell Inc. PowerEdge R6515/04F3CJ, BIOS
> 2.13.3 09/12/2023
> [ 4095.286515] Workqueue: 0x0 (writeback)
> [ 4095.290361] RIP: 0010:blk_mq_run_hw_queue+0x26/0x280
> [ 4095.295334] Code: 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41
> 54 55 48 89 fd 53 89 f3 40 84 f6 75 1c 65 f7 05 34 e9 40 4e 00 ff ff
> 00 74 02 <0f> 0b f6 85 a8 00 00 00 10 0f 85 16 02 00 00 48 8b 85 b8 00
> 00 00
> [ 4095.314088] RSP: 0018:ffffb8ca0056cea8 EFLAGS: 00010206
> [ 4095.319321] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ 4095.326455] RDX: ffff9c82cb227728 RSI: 0000000000000000 RDI: ffff9c82d1e73e00
> [ 4095.333585] RBP: ffff9c82d1e73e00 R08: 0000000000000000 R09: ffffb8ca0056ccf8
> [ 4095.340720] R10: ffffb8ca0056ccf0 R11: ffffffffb33e1348 R12: 0000000000000000
> [ 4095.347851] R13: ffff9c82ddfd1380 R14: 0000000000000010 R15: ffffffffb30070e0
> [ 4095.354986] FS: 0000000000000000(0000) GS:ffff9c862f380000(0000)
> knlGS:0000000000000000
> [ 4095.363078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4095.368825] CR2: 0000563f2b03c0b0 CR3: 000000011dfe6000 CR4: 0000000000350ef0
> [ 4095.375966] Call Trace:
> [ 4095.378418] <IRQ>
> [ 4095.380439] ? __warn+0x84/0x130
> [ 4095.383679] ? blk_mq_run_hw_queue+0x26/0x280
> [ 4095.388046] ? report_bug+0x18a/0x1a0
> [ 4095.391721] ? handle_bug+0x53/0x90
> [ 4095.395222] ? exc_invalid_op+0x14/0x70
> [ 4095.399063] ? asm_exc_invalid_op+0x16/0x20
> [ 4095.403261] ? blk_mq_run_hw_queue+0x26/0x280
> [ 4095.407626] blk_mq_run_hw_queues+0x63/0x120
> [ 4095.411908] __blk_freeze_queue_start+0x64/0x70
> [ 4095.416447] blk_queue_start_drain+0x18/0x50
> [ 4095.420728] blk_mq_destroy_queue+0x30/0x60
> [ 4095.424915] nvme_remove_io_tag_set+0x26/0x40 [nvme_core]
> [ 4095.430330] nvme_fc_ctrl_free+0x2e/0x130 [nvme_fc]
> [ 4095.435219] blk_complete_reqs+0x40/0x50
> [ 4095.439154] handle_softirqs+0xd3/0x2b0
> [ 4095.443004] __irq_exit_rcu+0xcd/0xf0
> [ 4095.446673] sysvec_call_function_single+0x71/0x90
> [ 4095.451468] </IRQ>
>
>
>
> On Mon, Jan 13, 2025 at 7:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Currently the tagsets are removed from nvme_fc_ctrl_free(), which is
> > the ctrl->ref release handler, and the admin queue is even unquiesced
> > there. This is definitely wrong because ctrl->ref is grabbed when
> > submitting a command.
> >
> > Marco also observed that nvme_fc_ctrl_free() can be called from the
> > request completion code path, triggering a kernel warning since
> > requests complete from softirq context.
> >
> > Fix the issue by moving tagset removal into nvme_fc_delete_ctrl(),
> > which also aligns nvme-fc with nvme-tcp and nvme-rdma.
> >
> > Cc: Marco Patalano <mpatalan@redhat.com>
> > Cc: Ewan Milne <emilne@redhat.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/nvme/host/fc.c | 13 +++++++------
> > 1 file changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index b81af7919e94..b94b4e50a3df 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -2389,17 +2389,11 @@ nvme_fc_ctrl_free(struct kref *ref)
> > container_of(ref, struct nvme_fc_ctrl, ref);
> > unsigned long flags;
> >
> > - if (ctrl->ctrl.tagset)
> > - nvme_remove_io_tag_set(&ctrl->ctrl);
> > -
> > /* remove from rport list */
> > spin_lock_irqsave(&ctrl->rport->lock, flags);
> > list_del(&ctrl->ctrl_list);
> > spin_unlock_irqrestore(&ctrl->rport->lock, flags);
> >
> > - nvme_unquiesce_admin_queue(&ctrl->ctrl);
> > - nvme_remove_admin_tag_set(&ctrl->ctrl);
> > -
> > kfree(ctrl->queues);
> >
> > put_device(ctrl->dev);
> > @@ -3288,6 +3282,13 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
> >
> > cancel_work_sync(&ctrl->ioerr_work);
> > cancel_delayed_work_sync(&ctrl->connect_work);
> > +
> > + if (ctrl->ctrl.tagset)
> > + nvme_remove_io_tag_set(&ctrl->ctrl);
> > +
> > + nvme_unquiesce_admin_queue(&ctrl->ctrl);
> > + nvme_remove_admin_tag_set(&ctrl->ctrl);
> > +
> > /*
> > * kill the association on the link side. this will block
> > * waiting for io to terminate
> > --
> > 2.47.0
> >
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-14 18:31 ` Ewan Milne
@ 2025-01-15 17:33 ` Ewan Milne
2025-01-16 1:57 ` Ming Lei
0 siblings, 1 reply; 9+ messages in thread
From: Ewan Milne @ 2025-01-15 17:33 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
James Smart, Sagi Grimberg
Moving the tagset removal to the end of nvme_fc_delete_ctrl() ran OK, and
we no longer see the WARN_ON() seen earlier. However, I think there is a
potential problem with doing this, because the keepalive is not stopped
until nvme_uninit_ctrl() is called. On our systems w/FC, kato is normally
disabled, so we wouldn't see any problem, but this isn't true in the
general case.
-Ewan
On Tue, Jan 14, 2025 at 1:31 PM Ewan Milne <emilne@redhat.com> wrote:
>
> The patch results in a crash with slub_debug=FZPU enabled. I'll try
> moving the tagset removal further down.
>
> -Ewan
>
> [ 36.608477] BUG: unable to handle page fault for address: ffffffffbd92a8d8
> [ 36.608479] #PF: supervisor read access in kernel mode
> [ 36.608481] #PF: error_code(0x0000) - not-present page
> [ 36.608483] PGD 687827067 P4D 687827067 PUD 687828063 PMD 10938d063
> PTE 800ffff9788d5022
> [ 36.608488] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> [ 36.608492] CPU: 3 UID: 0 PID: 952 Comm: nvme Not tainted 6.13.0-rc6+ #2
> [ 36.608495] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS
> 1.4.8 05/22/2018
> [ 36.608496] RIP: 0010:lock_timer_base+0x4a/0xa0
> [ 36.608504] Code: 1c 73 bf 00 f3 90 44 8b 6b 20 41 f7 c5 00 00 04
> 00 75 f1 44 89 e8 44 89 ea 25 00 00 10 00 83 f8 01 48 19 c0 81 e2 ff
> ff 03 00 <48> 8b 14 d5 80 4d 77 bd 25 80 12 00 00 48 05 00 1b 02 00 4c
> 8d 34
> [ 36.608506] RSP: 0018:ffffabdd80b77b70 EFLAGS: 00010202
> [ 36.608509] RAX: ffffffffffffffff RBX: ffff898856cdd970 RCX: 0000000000000002
> [ 36.608511] RDX: 0000000000036b6b RSI: ffffabdd80b77ba0 RDI: ffff898856cdd970
> [ 36.608512] RBP: 0000000000024000 R08: ffff899350bb6730 R09: 00000008721e3400
> [ 36.608513] R10: 0000000000000000 R11: ffff899350bb430c R12: ffffabdd80b77ba0
> [ 36.608515] R13: 000000006b6b6b6b R14: ffffabdd80b77d28 R15: ffff898860917060
> [ 36.608516] FS: 00007f4d32b0fc40(0000) GS:ffff899350b80000(0000)
> knlGS:0000000000000000
> [ 36.608519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 36.608520] CR2: ffffffffbd92a8d8 CR3: 00000001154ba006 CR4: 00000000007726f0
> [ 36.608522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 36.608523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 36.608524] PKRU: 55555554
> [ 36.608525] Call Trace:
> [ 36.608526] <TASK>
> [ 36.608528] ? show_trace_log_lvl+0x1b0/0x2f0
> [ 36.608532] ? show_trace_log_lvl+0x1b0/0x2f0
> [ 36.608536] ? __try_to_del_timer_sync+0x2b/0x90
> [ 36.608540] ? __die_body.cold+0x8/0x12
> [ 36.608543] ? page_fault_oops+0x146/0x160
> [ 36.608549] ? exc_page_fault+0x152/0x160
> [ 36.608553] ? asm_exc_page_fault+0x26/0x30
> [ 36.608559] ? lock_timer_base+0x4a/0xa0
> [ 36.608563] __try_to_del_timer_sync+0x2b/0x90
> [ 36.608567] __timer_delete_sync+0x28/0x40
> [ 36.608571] blk_sync_queue+0x19/0x30
> [ 36.608575] __nvme_fc_abort_outstanding_ios+0x76/0x160 [nvme_fc]
> [ 36.608582] nvme_fc_delete_association+0x84/0x250 [nvme_fc]
> [ 36.608587] ? blk_mq_free_tag_set+0x120/0x160
> [ 36.608593] nvme_do_delete_ctrl+0x7d/0x90 [nvme_core]
> [ 36.608612] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
> [ 36.608624] nvme_sysfs_delete+0x38/0x50 [nvme_core]
> [ 36.608638] kernfs_fop_write_iter+0x13b/0x1f0
> [ 36.608643] vfs_write+0x28d/0x450
> [ 36.608647] ksys_write+0x6c/0xe0
> [ 36.608650] do_syscall_64+0x7d/0x160
> [ 36.608654] ? syscall_exit_to_user_mode+0x32/0x190
> [ 36.608658] ? do_syscall_64+0x89/0x160
> [ 36.608661] ? rseq_get_rseq_cs+0x1d/0x220
> [ 36.608665] ? rseq_ip_fixup+0x8d/0x1d0
> [ 36.608668] ? nvme_dev_ioctl+0x1b3/0x1d0 [nvme_core]
> [ 36.608680] ? switch_fpu_return+0x4e/0xd0
> [ 36.608684] ? arch_exit_to_user_mode_prepare.isra.0+0x83/0xb0
> [ 36.608687] ? syscall_exit_to_user_mode+0x32/0x190
> [ 36.608690] ? do_syscall_64+0x89/0x160
> [ 36.608694] ? exc_page_fault+0x73/0x160
> [ 36.608697] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 36.608701] RIP: 0033:0x7f4d32d8c6a4
> [ 36.608704] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 f3 0f 1e fa 80 3d a5 c9 0d 00 00 74 13 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24
> 18 48
> [ 36.608705] RSP: 002b:00007ffef88f49a8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [ 36.608708] RAX: ffffffffffffffda RBX: 0000565309cf54f0 RCX: 00007f4d32d8c6a4
> [ 36.608709] RDX: 0000000000000001 RSI: 00007f4d32edcdb5 RDI: 0000000000000004
> [ 36.608710] RBP: 0000565309cf53c0 R08: 0000000000000073 R09: 00000000ffffffff
> [ 36.608712] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004
> [ 36.608713] R13: 0000565309cf53c0 R14: 0000565309cf5410 R15: 00007ffef88f4bf0
> [ 36.608717] </TASK>
> [ 36.608718] Modules linked in: qrtr rfkill intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp kvm_intel kvm ipmi_ssif dell_wmi_descriptor
> iTCO_wdt platform_profile iTCO_vendor_support wmi ipmi_si rapl mei_me
> acpi_power_meter dcdbas tg3 mei intel_cstate i2c_i801 intel_uncore
> acpi_ipmi intel_pch_thermal pcspkr lpc_ich i2c_smbus sg ipmi_devintf
> ipmi_msghandler dm_multipath loop nfnetlink xfs lpfc sd_mod nvmet_fc
> nvmet nvme_fc nvme_fabrics nvme_keyring crct10dif_pclmul nvme_core
> crc32_pclmul ahci crc32c_intel libahci nvme_auth mgag200
> ghash_clmulni_intel megaraid_sas scsi_transport_fc libata i2c_algo_bit
> dm_mirror dm_region_hash dm_log dm_mod fuse
> [ 36.608765] CR2: ffffffffbd92a8d8
> [ 36.608766] ---[ end trace 0000000000000000 ]---
>
> On Tue, Jan 14, 2025 at 12:06 PM Ewan Milne <emilne@redhat.com> wrote:
> >
> > Hi Ming-
> >
> > So there is definitely something wrong, because we recently started
> > seeing a 100% reproducible warning on a couple of our test systems
> > (but not on another one) where the nvme_fc_ctrl kref decrements to
> > zero on command completion. However, I'm not sure that this patch
> > addresses the underlying issue. If there was an outstanding command on
> > the admin queue, a corresponding kref should have been taken in
> > nvme_fc_start_fcp_op(), so we should not be in nvme_fc_ctrl_free()
> > until after it completes, and the kref should not be zero at the end
> > of command completion.
> >
> > Normally, we would have already gone through the code path at the
> > bottom of __nvme_fc_abort_outstanding_ios(), which should have
> > resulted in no commands outstanding on either the admin_q or the io
> > queues prior to the ->free_ctrl() callback to nvme_fc_free_ctrl(),
> > which I think should be the last put on the nvme_fc_ctrl object.
> >
> > I think there is also a problem with your change regarding when the
> > tagsets are removed, which addresses the WARN_ON() in
> > blk_mq_run_hw_queue():
> >
> >     WARN_ON_ONCE(!async && in_interrupt());
> >
> > ...because moving the removal of the tagsets earlier in
> > nvme_fc_delete_ctrl() puts it prior to the nvme_fc_delete_association()
> > call, which then calls __nvme_fc_abort_outstanding_ios(), which
> > invokes blk_mq_tagset_busy_iter() on the tagsets. Perhaps moving the
> > tagset removal to after the nvme_fc_delete_association() call would
> > address your specific concern regarding the correct execution context?
> >
> > I think the real problem is that nvme_fc_ctrl_free() is called due to
> > the kref put when the command completes. This shouldn't happen. The
> > last put should happen when the core code calls ->free_ctrl().
> >
> > Below is the stack trace we see, for those interested. The test is
> > doing an echo 1 > /sys/..../delete_controller.
> >
> > I can't reproduce this on my otherwise identical system (connected to
> > different storage fabric arrays, though), so we have to run any
> > experiments through the test automation instead of interactively.
> >
> > -Ewan
> >
> > [ 4095.189976] ------------[ cut here ]------------
> > [ 4095.194596] WARNING: CPU: 9 PID: 38355 at block/blk-mq.c:2325
> > blk_mq_run_hw_queue+0x26/0x280
> > [ 4095.203036] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4
> > dns_resolver nfs lockd grace netfs sunrpc dm_service_time dm_multipath
> > amd_atl intel_rapl_msr intel_rapl_common amd64_edac dell_pc
> > platform_profile edac_mce_amd kvm_amd dell_wmi sparse_keymap rfkill
> > dell_smbios video dcdbas ipmi_ssif mgag200 kvm i2c_algo_bit
> > drm_client_lib drm_shmem_helper drm_kms_helper dell_wmi_descriptor
> > wmi_bmof rapl pcspkr i2c_piix4 ptdma i2c_smbus k10temp ipmi_si
> > acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler drm fuse xfs
> > libcrc32c sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_keyring ahci
> > crct10dif_pclmul nvme_core libahci crc32_pclmul nvme_auth crc32c_intel
> > libata ghash_clmulni_intel ccp scsi_transport_fc tg3 megaraid_sas
> > sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
> > [ 4095.271190] CPU: 9 UID: 0 PID: 38355 Comm: kworker/u65:3 Not
> > tainted 6.13.0-rc4+ #1
> > [ 4095.278851] Hardware name: Dell Inc. PowerEdge R6515/04F3CJ, BIOS
> > 2.13.3 09/12/2023
> > [ 4095.286515] Workqueue: 0x0 (writeback)
> > [ 4095.290361] RIP: 0010:blk_mq_run_hw_queue+0x26/0x280
> > [ 4095.295334] Code: 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41
> > 54 55 48 89 fd 53 89 f3 40 84 f6 75 1c 65 f7 05 34 e9 40 4e 00 ff ff
> > 00 74 02 <0f> 0b f6 85 a8 00 00 00 10 0f 85 16 02 00 00 48 8b 85 b8 00
> > 00 00
> > [ 4095.314088] RSP: 0018:ffffb8ca0056cea8 EFLAGS: 00010206
> > [ 4095.319321] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > [ 4095.326455] RDX: ffff9c82cb227728 RSI: 0000000000000000 RDI: ffff9c82d1e73e00
> > [ 4095.333585] RBP: ffff9c82d1e73e00 R08: 0000000000000000 R09: ffffb8ca0056ccf8
> > [ 4095.340720] R10: ffffb8ca0056ccf0 R11: ffffffffb33e1348 R12: 0000000000000000
> > [ 4095.347851] R13: ffff9c82ddfd1380 R14: 0000000000000010 R15: ffffffffb30070e0
> > [ 4095.354986] FS: 0000000000000000(0000) GS:ffff9c862f380000(0000)
> > knlGS:0000000000000000
> > [ 4095.363078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 4095.368825] CR2: 0000563f2b03c0b0 CR3: 000000011dfe6000 CR4: 0000000000350ef0
> > [ 4095.375966] Call Trace:
> > [ 4095.378418] <IRQ>
> > [ 4095.380439] ? __warn+0x84/0x130
> > [ 4095.383679] ? blk_mq_run_hw_queue+0x26/0x280
> > [ 4095.388046] ? report_bug+0x18a/0x1a0
> > [ 4095.391721] ? handle_bug+0x53/0x90
> > [ 4095.395222] ? exc_invalid_op+0x14/0x70
> > [ 4095.399063] ? asm_exc_invalid_op+0x16/0x20
> > [ 4095.403261] ? blk_mq_run_hw_queue+0x26/0x280
> > [ 4095.407626] blk_mq_run_hw_queues+0x63/0x120
> > [ 4095.411908] __blk_freeze_queue_start+0x64/0x70
> > [ 4095.416447] blk_queue_start_drain+0x18/0x50
> > [ 4095.420728] blk_mq_destroy_queue+0x30/0x60
> > [ 4095.424915] nvme_remove_io_tag_set+0x26/0x40 [nvme_core]
> > [ 4095.430330] nvme_fc_ctrl_free+0x2e/0x130 [nvme_fc]
> > [ 4095.435219] blk_complete_reqs+0x40/0x50
> > [ 4095.439154] handle_softirqs+0xd3/0x2b0
> > [ 4095.443004] __irq_exit_rcu+0xcd/0xf0
> > [ 4095.446673] sysvec_call_function_single+0x71/0x90
> > [ 4095.451468] </IRQ>
> >
> >
> >
> > On Mon, Jan 13, 2025 at 7:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Now the tagset is removed in nvme_fc_ctrl_free(), which is the
> > > ctrl->ref release handler. The admin queue is even unquiesced there;
> > > this is definitely wrong because ctrl->ref is grabbed when
> > > submitting a command.
> > >
> > > And Marco observed that nvme_fc_ctrl_free() can be called from the
> > > request completion code path and trigger a kernel warning, since the
> > > request completes from softirq context.
> > >
> > > Fix the issue by moving the tagset removal into nvme_fc_delete_ctrl(),
> > > which is also aligned with nvme-tcp and nvme-rdma.
> > >
> > > Cc: Marco Patalano <mpatalan@redhat.com>
> > > Cc: Ewan Milne <emilne@redhat.com>
> > > Cc: James Smart <james.smart@broadcom.com>
> > > Cc: Sagi Grimberg <sagi@grimberg.me>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > > drivers/nvme/host/fc.c | 13 +++++++------
> > > 1 file changed, 7 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > > index b81af7919e94..b94b4e50a3df 100644
> > > --- a/drivers/nvme/host/fc.c
> > > +++ b/drivers/nvme/host/fc.c
> > > @@ -2389,17 +2389,11 @@ nvme_fc_ctrl_free(struct kref *ref)
> > > container_of(ref, struct nvme_fc_ctrl, ref);
> > > unsigned long flags;
> > >
> > > - if (ctrl->ctrl.tagset)
> > > - nvme_remove_io_tag_set(&ctrl->ctrl);
> > > -
> > > /* remove from rport list */
> > > spin_lock_irqsave(&ctrl->rport->lock, flags);
> > > list_del(&ctrl->ctrl_list);
> > > spin_unlock_irqrestore(&ctrl->rport->lock, flags);
> > >
> > > - nvme_unquiesce_admin_queue(&ctrl->ctrl);
> > > - nvme_remove_admin_tag_set(&ctrl->ctrl);
> > > -
> > > kfree(ctrl->queues);
> > >
> > > put_device(ctrl->dev);
> > > @@ -3288,6 +3282,13 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
> > >
> > > cancel_work_sync(&ctrl->ioerr_work);
> > > cancel_delayed_work_sync(&ctrl->connect_work);
> > > +
> > > + if (ctrl->ctrl.tagset)
> > > + nvme_remove_io_tag_set(&ctrl->ctrl);
> > > +
> > > + nvme_unquiesce_admin_queue(&ctrl->ctrl);
> > > + nvme_remove_admin_tag_set(&ctrl->ctrl);
> > > +
> > > /*
> > > * kill the association on the link side. this will block
> > > * waiting for io to terminate
> > > --
> > > 2.47.0
> > >
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-15 17:33 ` Ewan Milne
@ 2025-01-16 1:57 ` Ming Lei
2025-01-17 22:03 ` Ewan Milne
0 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-01-16 1:57 UTC (permalink / raw)
To: Ewan Milne
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
James Smart, Sagi Grimberg
Hi Ewan,
On Wed, Jan 15, 2025 at 12:33:50PM -0500, Ewan Milne wrote:
> Moving the tagset removal to the end of nvme_fc_delete_ctrl() ran OK and we
> no longer see the WARN_ON() seen earlier, however I think that there is a
> potential problem with doing this because the keepalive is not stopped until
> nvme_uninit_ctrl() is called. On our systems w/FC kato is normally disabled
> so we wouldn't see any problem but this isn't true in the general case.
Please see e9869c85c811 ("nvme-fabrics: fix kernel crash while shutting down controller")
which stops the keepalive request at the entry of
nvme_remove_admin_tag_set().
So your above concern shouldn't exist.
Thanks,
Ming
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-13 12:44 [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl() Ming Lei
2025-01-14 17:06 ` Ewan Milne
@ 2025-01-16 6:10 ` Christoph Hellwig
2025-01-17 22:06 ` Ewan Milne
2025-01-20 13:51 ` Hannes Reinecke
2 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-01-16 6:10 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
Ewan Milne, James Smart, Sagi Grimberg
On Mon, Jan 13, 2025 at 08:44:46PM +0800, Ming Lei wrote:
> Now the tagset is removed in nvme_fc_ctrl_free(), which is the ctrl->ref
> release handler. The admin queue is even unquiesced there; this is
> definitely wrong because ctrl->ref is grabbed when submitting a command.
>
> And Marco observed that nvme_fc_ctrl_free() can be called from the
> request completion code path and trigger a kernel warning, since the
> request completes from softirq context.
>
> Fix the issue by moving the tagset removal into nvme_fc_delete_ctrl(),
> which is also aligned with nvme-tcp and nvme-rdma.
>
Yeah, the current placement is wrong. Can I talk someone into making
the whole setup/teardown order in nvme-fc match that of the other
fabrics (and, to a slightly lesser extent, PCIe)? I know a long time
ago Sagi was actually looking into sharing more code as well, but that
would be a next step.
But I'd really like to have one known working sequence instead
of various different ones.
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-16 1:57 ` Ming Lei
@ 2025-01-17 22:03 ` Ewan Milne
0 siblings, 0 replies; 9+ messages in thread
From: Ewan Milne @ 2025-01-17 22:03 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Marco Patalano,
James Smart, Sagi Grimberg
So testing looks like it is showing that the WARN_ON() I referenced
earlier goes away after reverting all the recent keepalive changes,
all the way back to and including:
a54a93d0e359 nvme: move stopping keep-alive into nvme_uninit_ctrl()
But I think this is just because it makes the keepalive stop early
enough. I think the problem is that the call to nvme_stop_keep_alive()
does not wait until the keepalive command, if in progress, completes.
It just makes sure that nvme_keep_alive_work() is not queued or
running. But the keepalive command is asynchronous, invoked via
blk_execute_rq_nowait(). (I had thought that the ref taken on the
nvme_fc_ctrl while the command was in progress would have covered
this, though.)
-Ewan
On Wed, Jan 15, 2025 at 8:57 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Ewan,
>
> On Wed, Jan 15, 2025 at 12:33:50PM -0500, Ewan Milne wrote:
> > Moving the tagset removal to the end of nvme_fc_delete_ctrl() ran OK and we
> > no longer see the WARN_ON() seen earlier, however I think that there is a
> > potential problem with doing this because the keepalive is not stopped until
> > nvme_uninit_ctrl() is called. On our systems w/FC kato is normally disabled
> > so we wouldn't see any problem but this isn't true in the general case.
>
> Please see e9869c85c811 ("nvme-fabrics: fix kernel crash while shutting down controller")
> which stops the keepalive request at the entry of
> nvme_remove_admin_tag_set().
>
> So your above concern shouldn't exist.
>
> Thanks,
> Ming
>
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-16 6:10 ` Christoph Hellwig
@ 2025-01-17 22:06 ` Ewan Milne
0 siblings, 0 replies; 9+ messages in thread
From: Ewan Milne @ 2025-01-17 22:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ming Lei, Keith Busch, linux-nvme, Marco Patalano, James Smart,
Sagi Grimberg
I'll work with Ming & Marco on this and see if we can get this straightened out.
-Ewan
On Thu, Jan 16, 2025 at 1:10 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jan 13, 2025 at 08:44:46PM +0800, Ming Lei wrote:
> > Now the tagset is removed in nvme_fc_ctrl_free(), which is the
> > ctrl->ref release handler. The admin queue is even unquiesced there;
> > this is definitely wrong because ctrl->ref is grabbed when
> > submitting a command.
> >
> > And Marco observed that nvme_fc_ctrl_free() can be called from the
> > request completion code path and trigger a kernel warning, since the
> > request completes from softirq context.
> >
> > Fix the issue by moving the tagset removal into nvme_fc_delete_ctrl(),
> > which is also aligned with nvme-tcp and nvme-rdma.
> >
>
> Yeah, the current placement is wrong. Can I talk someone to make
> the whole setup/teardown process in nvme-fc match that of the other
> fabrics (and slightly less so PCIe) order? I know a long time ago
> Sagi was atually looking into sharing more code as well, but that
> would be a next step.
>
> But I'd really like to have one known working sequence instead
> of various different ones.
>
* Re: [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl()
2025-01-13 12:44 [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl() Ming Lei
2025-01-14 17:06 ` Ewan Milne
2025-01-16 6:10 ` Christoph Hellwig
@ 2025-01-20 13:51 ` Hannes Reinecke
2 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2025-01-20 13:51 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme
Cc: Marco Patalano, Ewan Milne, James Smart, Sagi Grimberg
On 1/13/25 13:44, Ming Lei wrote:
> Now the tagset is removed in nvme_fc_ctrl_free(), which is the ctrl->ref
> release handler. The admin queue is even unquiesced there; this is
> definitely wrong because ctrl->ref is grabbed when submitting a command.
>
> And Marco observed that nvme_fc_ctrl_free() can be called from the
> request completion code path and trigger a kernel warning, since the
> request completes from softirq context.
>
> Fix the issue by moving the tagset removal into nvme_fc_delete_ctrl(),
> which is also aligned with nvme-tcp and nvme-rdma.
>
> Cc: Marco Patalano <mpatalan@redhat.com>
> Cc: Ewan Milne <emilne@redhat.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> drivers/nvme/host/fc.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index b81af7919e94..b94b4e50a3df 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -2389,17 +2389,11 @@ nvme_fc_ctrl_free(struct kref *ref)
> container_of(ref, struct nvme_fc_ctrl, ref);
> unsigned long flags;
>
> - if (ctrl->ctrl.tagset)
> - nvme_remove_io_tag_set(&ctrl->ctrl);
> -
> /* remove from rport list */
> spin_lock_irqsave(&ctrl->rport->lock, flags);
> list_del(&ctrl->ctrl_list);
> spin_unlock_irqrestore(&ctrl->rport->lock, flags);
>
> - nvme_unquiesce_admin_queue(&ctrl->ctrl);
> - nvme_remove_admin_tag_set(&ctrl->ctrl);
> -
> kfree(ctrl->queues);
>
> put_device(ctrl->dev);
> @@ -3288,6 +3282,13 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
>
> cancel_work_sync(&ctrl->ioerr_work);
> cancel_delayed_work_sync(&ctrl->connect_work);
> +
> + if (ctrl->ctrl.tagset)
> + nvme_remove_io_tag_set(&ctrl->ctrl);
> +
> + nvme_unquiesce_admin_queue(&ctrl->ctrl);
> + nvme_remove_admin_tag_set(&ctrl->ctrl);
> +
> /*
> * kill the association on the link side. this will block
> * waiting for io to terminate
Not sure that'll work. We're waiting for I/Os to complete just after
the line in the last hunk, and I guess they need a tagset to complete
properly.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
Thread overview: 9+ messages
2025-01-13 12:44 [RFC PATCH] nvme-fc: move tagset removal to nvme_fc_delete_ctrl() Ming Lei
2025-01-14 17:06 ` Ewan Milne
2025-01-14 18:31 ` Ewan Milne
2025-01-15 17:33 ` Ewan Milne
2025-01-16 1:57 ` Ming Lei
2025-01-17 22:03 ` Ewan Milne
2025-01-16 6:10 ` Christoph Hellwig
2025-01-17 22:06 ` Ewan Milne
2025-01-20 13:51 ` Hannes Reinecke