* [PATCH] nvmet-loop: avoid using mutex in IO hotpath
@ 2024-11-29 10:48 Nilay Shroff
2024-12-03 8:38 ` Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Nilay Shroff @ 2024-11-29 10:48 UTC (permalink / raw)
To: linux-nvme
Cc: shinichiro.kawasaki, hch, kbusch, sagi, axboe, chaitanyak, gjoyce
Using a mutex in the IO hot path causes a "sleeping while atomic" kernel
BUG. Shinichiro[1] first encountered this issue while running blktest
nvme/052, as shown below:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
#0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
#1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
__might_resched.cold+0x1f7/0x23d
? __pfx___might_resched+0x10/0x10
? vsnprintf+0xdeb/0x18f0
__mutex_lock+0xf4/0x1220
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_vsnprintf+0x10/0x10
? __pfx___mutex_lock+0x10/0x10
? snprintf+0xa5/0xe0
? xas_load+0x1ce/0x3f0
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
nvmet_req_find_ns+0x24e/0x300 [nvmet]
nvmet_req_init+0x694/0xd40 [nvmet]
? blk_mq_start_request+0x11c/0x750
? nvme_setup_cmd+0x369/0x990 [nvme_core]
nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
? __pfx___lock_acquire+0x10/0x10
? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
__blk_mq_issue_directly+0xe2/0x1d0
? __pfx___blk_mq_issue_directly+0x10/0x10
? blk_mq_request_issue_directly+0xc2/0x140
blk_mq_plug_issue_direct+0x13f/0x630
? lock_acquire+0x2d/0xc0
? blk_mq_flush_plug_list+0xa75/0x1950
blk_mq_flush_plug_list+0xa9d/0x1950
? __pfx_blk_mq_flush_plug_list+0x10/0x10
? __pfx_mpage_readahead+0x10/0x10
__blk_flush_plug+0x278/0x4d0
? __pfx___blk_flush_plug+0x10/0x10
? lock_release+0x460/0x7a0
blk_finish_plug+0x4e/0x90
read_pages+0x51b/0xbc0
? __pfx_read_pages+0x10/0x10
? lock_release+0x460/0x7a0
page_cache_ra_unbounded+0x326/0x5c0
force_page_cache_ra+0x1ea/0x2f0
filemap_get_pages+0x59e/0x17b0
? __pfx_filemap_get_pages+0x10/0x10
? lock_is_held_type+0xd5/0x130
? __pfx___might_resched+0x10/0x10
? find_held_lock+0x2d/0x110
filemap_read+0x317/0xb70
? up_write+0x1ba/0x510
? __pfx_filemap_read+0x10/0x10
? inode_security+0x54/0xf0
? selinux_file_permission+0x36d/0x420
blkdev_read_iter+0x143/0x3b0
vfs_read+0x6ac/0xa20
? __pfx_vfs_read+0x10/0x10
? __pfx_vm_mmap_pgoff+0x10/0x10
? __pfx___seccomp_filter+0x10/0x10
ksys_read+0xf7/0x1d0
? __pfx_ksys_read+0x10/0x10
do_syscall_64+0x93/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
</TASK>
The above issue is caused by taking a mutex while we're in the IO hot
path. It's a regression introduced by commit 505363957fad ("nvmet: fix
nvme status code when namespace is disabled"). The mutex ->su_mutex is
used to find out whether a disabled nsid exists in the config group or
not, in order to differentiate a nsid that is disabled from one that is
non-existent.
To fix the above issue[2], we now insert the nsid into the subsys
Xarray as soon as it's created under the config group. Later, when that
nsid is enabled, we set an Xarray mark on it and set ns->enabled to
true. The Xarray mark is useful when we need to loop through all
enabled namespaces under a subsystem using the xa_for_each_marked()
API. If the nsid is later disabled, we clear the Xarray mark from it
and set ns->enabled to false. Only when the nsid is deleted from the
config group do we delete it from the Xarray.
With this change, we can now easily differentiate a nsid that is
disabled (i.e. the Xarray entry for the ns exists but ns->enabled is
false) from one that is non-existent (i.e. no Xarray entry exists for
the ns).
Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/target/admin-cmd.c | 13 ++--
drivers/nvme/target/core.c | 108 +++++++++++++++++++-------------
drivers/nvme/target/nvmet.h | 1 +
drivers/nvme/target/pr.c | 10 +--
4 files changed, 79 insertions(+), 53 deletions(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 934b401fbc2f..439fc96a82c8 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -139,7 +139,8 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req,
unsigned long idx;
ctrl = req->sq->ctrl;
- xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
/* we don't have the right data for file backed ns */
if (!ns->bdev)
continue;
@@ -331,9 +332,11 @@ static u32 nvmet_format_ana_group(struct nvmet_req *req, u32 grpid,
u32 count = 0;
if (!(req->cmd->get_log_page.lsp & NVME_ANA_LOG_RGO)) {
- xa_for_each(&ctrl->subsys->namespaces, idx, ns)
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
if (ns->anagrpid == grpid)
desc->nsids[count++] = cpu_to_le32(ns->nsid);
+ }
}
desc->grpid = cpu_to_le32(grpid);
@@ -771,7 +774,8 @@ static void nvmet_execute_identify_endgrp_list(struct nvmet_req *req)
goto out;
}
- xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
if (ns->nsid <= min_endgid)
continue;
@@ -814,7 +818,8 @@ static void nvmet_execute_identify_nslist(struct nvmet_req *req, bool match_css)
goto out;
}
- xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
if (ns->nsid <= min_nsid)
continue;
if (match_css && req->ns->csi != req->cmd->identify.csi)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 1f4e9989663b..983f20005b76 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -127,7 +127,7 @@ static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
unsigned long idx;
u32 nsid = 0;
- xa_for_each(&subsys->namespaces, idx, cur)
+ xa_for_each_marked(&subsys->namespaces, idx, cur, NVMET_NS_ENABLED)
nsid = cur->nsid;
return nsid;
@@ -441,11 +441,14 @@ u16 nvmet_req_find_ns(struct nvmet_req *req)
struct nvmet_subsys *subsys = nvmet_req_subsys(req);
req->ns = xa_load(&subsys->namespaces, nsid);
- if (unlikely(!req->ns)) {
+ if (unlikely(!req->ns || !req->ns->enabled)) {
req->error_loc = offsetof(struct nvme_common_command, nsid);
- if (nvmet_subsys_nsid_exists(subsys, nsid))
- return NVME_SC_INTERNAL_PATH_ERROR;
- return NVME_SC_INVALID_NS | NVME_STATUS_DNR;
+ if (!req->ns) /* ns doesn't exist! */
+ return NVME_SC_INVALID_NS | NVME_STATUS_DNR;
+
+ /* ns exists but it's disabled */
+ req->ns = NULL;
+ return NVME_SC_INTERNAL_PATH_ERROR;
}
percpu_ref_get(&req->ns->ref);
@@ -583,8 +586,6 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
goto out_unlock;
ret = -EMFILE;
- if (subsys->nr_namespaces == NVMET_MAX_NAMESPACES)
- goto out_unlock;
ret = nvmet_bdev_ns_enable(ns);
if (ret == -ENOTBLK)
@@ -599,38 +600,19 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
nvmet_p2pmem_ns_add_p2p(ctrl, ns);
- ret = percpu_ref_init(&ns->ref, nvmet_destroy_namespace,
- 0, GFP_KERNEL);
- if (ret)
- goto out_dev_put;
-
- if (ns->nsid > subsys->max_nsid)
- subsys->max_nsid = ns->nsid;
-
- ret = xa_insert(&subsys->namespaces, ns->nsid, ns, GFP_KERNEL);
- if (ret)
- goto out_restore_subsys_maxnsid;
-
if (ns->pr.enable) {
ret = nvmet_pr_init_ns(ns);
if (ret)
- goto out_remove_from_subsys;
+ goto out_dev_put;
}
- subsys->nr_namespaces++;
-
nvmet_ns_changed(subsys, ns->nsid);
ns->enabled = true;
+ xa_set_mark(&subsys->namespaces, ns->nsid, NVMET_NS_ENABLED);
ret = 0;
out_unlock:
mutex_unlock(&subsys->lock);
return ret;
-
-out_remove_from_subsys:
- xa_erase(&subsys->namespaces, ns->nsid);
-out_restore_subsys_maxnsid:
- subsys->max_nsid = nvmet_max_nsid(subsys);
- percpu_ref_exit(&ns->ref);
out_dev_put:
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
pci_dev_put(radix_tree_delete(&ctrl->p2p_ns_map, ns->nsid));
@@ -649,15 +631,37 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
goto out_unlock;
ns->enabled = false;
- xa_erase(&ns->subsys->namespaces, ns->nsid);
- if (ns->nsid == subsys->max_nsid)
- subsys->max_nsid = nvmet_max_nsid(subsys);
+ xa_clear_mark(&subsys->namespaces, ns->nsid, NVMET_NS_ENABLED);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
pci_dev_put(radix_tree_delete(&ctrl->p2p_ns_map, ns->nsid));
mutex_unlock(&subsys->lock);
+ if (ns->pr.enable)
+ nvmet_pr_exit_ns(ns);
+
+ mutex_lock(&subsys->lock);
+ nvmet_ns_changed(subsys, ns->nsid);
+ nvmet_ns_dev_disable(ns);
+out_unlock:
+ mutex_unlock(&subsys->lock);
+}
+
+void nvmet_ns_free(struct nvmet_ns *ns)
+{
+ struct nvmet_subsys *subsys = ns->subsys;
+
+ nvmet_ns_disable(ns);
+
+ mutex_lock(&subsys->lock);
+
+ xa_erase(&subsys->namespaces, ns->nsid);
+ if (ns->nsid == subsys->max_nsid)
+ subsys->max_nsid = nvmet_max_nsid(subsys);
+
+ mutex_unlock(&subsys->lock);
+
/*
* Now that we removed the namespaces from the lookup list, we
* can kill the per_cpu ref and wait for any remaining references
@@ -671,21 +675,9 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
wait_for_completion(&ns->disable_done);
percpu_ref_exit(&ns->ref);
- if (ns->pr.enable)
- nvmet_pr_exit_ns(ns);
-
mutex_lock(&subsys->lock);
-
subsys->nr_namespaces--;
- nvmet_ns_changed(subsys, ns->nsid);
- nvmet_ns_dev_disable(ns);
-out_unlock:
mutex_unlock(&subsys->lock);
-}
-
-void nvmet_ns_free(struct nvmet_ns *ns)
-{
- nvmet_ns_disable(ns);
down_write(&nvmet_ana_sem);
nvmet_ana_group_enabled[ns->anagrpid]--;
@@ -699,15 +691,33 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
{
struct nvmet_ns *ns;
+ mutex_lock(&subsys->lock);
+
+ if (subsys->nr_namespaces == NVMET_MAX_NAMESPACES)
+ goto out_unlock;
+
ns = kzalloc(sizeof(*ns), GFP_KERNEL);
if (!ns)
- return NULL;
+ goto out_unlock;
init_completion(&ns->disable_done);
ns->nsid = nsid;
ns->subsys = subsys;
+ if (percpu_ref_init(&ns->ref, nvmet_destroy_namespace, 0, GFP_KERNEL))
+ goto out_free;
+
+ if (ns->nsid > subsys->max_nsid)
+ subsys->max_nsid = nsid;
+
+ if (xa_insert(&subsys->namespaces, ns->nsid, ns, GFP_KERNEL))
+ goto out_exit;
+
+ subsys->nr_namespaces++;
+
+ mutex_unlock(&subsys->lock);
+
down_write(&nvmet_ana_sem);
ns->anagrpid = NVMET_DEFAULT_ANA_GRPID;
nvmet_ana_group_enabled[ns->anagrpid]++;
@@ -718,6 +728,14 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
ns->csi = NVME_CSI_NVM;
return ns;
+out_exit:
+ subsys->max_nsid = nvmet_max_nsid(subsys);
+ percpu_ref_exit(&ns->ref);
+out_free:
+ kfree(ns);
+out_unlock:
+ mutex_unlock(&subsys->lock);
+ return NULL;
}
static void nvmet_update_sq_head(struct nvmet_req *req)
@@ -1394,7 +1412,7 @@ static void nvmet_setup_p2p_ns_map(struct nvmet_ctrl *ctrl,
ctrl->p2p_client = get_device(req->p2p_client);
- xa_for_each(&ctrl->subsys->namespaces, idx, ns)
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx, ns, NVMET_NS_ENABLED)
nvmet_p2pmem_ns_add_p2p(ctrl, ns);
}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 58328b35dc96..0fc0abf1150c 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -24,6 +24,7 @@
#define NVMET_DEFAULT_VS NVME_VS(2, 1, 0)
+#define NVMET_NS_ENABLED XA_MARK_1
#define NVMET_ASYNC_EVENTS 4
#define NVMET_ERROR_LOG_SLOTS 128
#define NVMET_NO_ERROR_LOC ((u16)-1)
diff --git a/drivers/nvme/target/pr.c b/drivers/nvme/target/pr.c
index 25a02b50d9f3..50acdb603572 100644
--- a/drivers/nvme/target/pr.c
+++ b/drivers/nvme/target/pr.c
@@ -60,7 +60,8 @@ u16 nvmet_set_feat_resv_notif_mask(struct nvmet_req *req, u32 mask)
goto success;
}
- xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
if (ns->pr.enable)
WRITE_ONCE(ns->pr.notify_mask, mask);
}
@@ -1057,7 +1058,7 @@ int nvmet_ctrl_init_pr(struct nvmet_ctrl *ctrl)
* nvmet_pr_init_ns(), see more details in nvmet_ns_enable().
* So just check ns->pr.enable.
*/
- xa_for_each(&subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&subsys->namespaces, idx, ns, NVMET_NS_ENABLED) {
if (ns->pr.enable) {
ret = nvmet_pr_alloc_and_insert_pc_ref(ns, ctrl->cntlid,
&ctrl->hostid);
@@ -1068,7 +1069,7 @@ int nvmet_ctrl_init_pr(struct nvmet_ctrl *ctrl)
return 0;
free_per_ctrl_refs:
- xa_for_each(&subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&subsys->namespaces, idx, ns, NVMET_NS_ENABLED) {
if (ns->pr.enable) {
pc_ref = xa_erase(&ns->pr_per_ctrl_refs, ctrl->cntlid);
if (pc_ref)
@@ -1088,7 +1089,8 @@ void nvmet_ctrl_destroy_pr(struct nvmet_ctrl *ctrl)
kfifo_free(&ctrl->pr_log_mgr.log_queue);
mutex_destroy(&ctrl->pr_log_mgr.lock);
- xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
+ xa_for_each_marked(&ctrl->subsys->namespaces, idx,
+ ns, NVMET_NS_ENABLED) {
if (ns->pr.enable) {
pc_ref = xa_erase(&ns->pr_per_ctrl_refs, ctrl->cntlid);
if (pc_ref)
--
2.45.2
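As an aside for readers, the status logic of the reworked nvmet_req_find_ns() above can be modeled as a small self-contained userspace sketch (toy types and illustrative status values; the real code looks the nsid up with xa_load() and takes a percpu reference on success):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the NVMe status codes used by the patch. */
#define NVME_SC_SUCCESS              0x000
#define NVME_SC_INVALID_NS           0x00b
#define NVME_SC_INTERNAL_PATH_ERROR  0x1c0

struct toy_ns {
	unsigned int nsid;
	bool enabled;	/* mirrors ns->enabled / the Xarray mark */
};

/* A flat table stands in for the subsystem Xarray. */
static struct toy_ns *toy_lookup(struct toy_ns *tbl, size_t n,
				 unsigned int nsid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].nsid == nsid)
			return &tbl[i];
	return NULL;	/* nsid was never created under the config group */
}

/*
 * Mirrors the patched lookup: no entry at all means the nsid does not
 * exist (Invalid Namespace, DNR), while an entry that is not enabled
 * means it exists but is disabled (Internal Path Error, retryable).
 */
static int toy_find_ns_status(struct toy_ns *tbl, size_t n,
			      unsigned int nsid)
{
	struct toy_ns *ns = toy_lookup(tbl, n, nsid);

	if (!ns)
		return NVME_SC_INVALID_NS;
	if (!ns->enabled)
		return NVME_SC_INTERNAL_PATH_ERROR;
	return NVME_SC_SUCCESS;
}
```

With a table holding nsid 1 enabled and nsid 2 disabled, looking up nsid 1 succeeds, nsid 2 yields the internal-path error, and any other nsid yields the invalid-namespace error.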
* Re: [PATCH] nvmet-loop: avoid using mutex in IO hotpath
2024-11-29 10:48 [PATCH] nvmet-loop: avoid using mutex in IO hotpath Nilay Shroff
@ 2024-12-03 8:38 ` Hannes Reinecke
2024-12-03 8:45 ` Chaitanya Kulkarni
2024-12-09 8:01 ` Christoph Hellwig
2 siblings, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2024-12-03 8:38 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme
Cc: shinichiro.kawasaki, hch, kbusch, sagi, axboe, chaitanyak, gjoyce
On 11/29/24 11:48, Nilay Shroff wrote:
> Using a mutex in the IO hot path causes a "sleeping while atomic" kernel
> BUG. Shinichiro[1] first encountered this issue while running blktest
> nvme/052, as shown below:
>
> BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
> in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
> preempt_count: 0, expected: 0
> RCU nest depth: 1, expected: 0
> 2 locks held by (udev-worker)/996:
> #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
> #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
> CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x6a/0x90
> __might_resched.cold+0x1f7/0x23d
> ? __pfx___might_resched+0x10/0x10
> ? vsnprintf+0xdeb/0x18f0
> __mutex_lock+0xf4/0x1220
> ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
> ? __pfx_vsnprintf+0x10/0x10
> ? __pfx___mutex_lock+0x10/0x10
> ? snprintf+0xa5/0xe0
> ? xas_load+0x1ce/0x3f0
> ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
> nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
> ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
> nvmet_req_find_ns+0x24e/0x300 [nvmet]
> nvmet_req_init+0x694/0xd40 [nvmet]
> ? blk_mq_start_request+0x11c/0x750
> ? nvme_setup_cmd+0x369/0x990 [nvme_core]
> nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
> ? __pfx___lock_acquire+0x10/0x10
> ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
> __blk_mq_issue_directly+0xe2/0x1d0
> ? __pfx___blk_mq_issue_directly+0x10/0x10
> ? blk_mq_request_issue_directly+0xc2/0x140
> blk_mq_plug_issue_direct+0x13f/0x630
> ? lock_acquire+0x2d/0xc0
> ? blk_mq_flush_plug_list+0xa75/0x1950
> blk_mq_flush_plug_list+0xa9d/0x1950
> ? __pfx_blk_mq_flush_plug_list+0x10/0x10
> ? __pfx_mpage_readahead+0x10/0x10
> __blk_flush_plug+0x278/0x4d0
> ? __pfx___blk_flush_plug+0x10/0x10
> ? lock_release+0x460/0x7a0
> blk_finish_plug+0x4e/0x90
> read_pages+0x51b/0xbc0
> ? __pfx_read_pages+0x10/0x10
> ? lock_release+0x460/0x7a0
> page_cache_ra_unbounded+0x326/0x5c0
> force_page_cache_ra+0x1ea/0x2f0
> filemap_get_pages+0x59e/0x17b0
> ? __pfx_filemap_get_pages+0x10/0x10
> ? lock_is_held_type+0xd5/0x130
> ? __pfx___might_resched+0x10/0x10
> ? find_held_lock+0x2d/0x110
> filemap_read+0x317/0xb70
> ? up_write+0x1ba/0x510
> ? __pfx_filemap_read+0x10/0x10
> ? inode_security+0x54/0xf0
> ? selinux_file_permission+0x36d/0x420
> blkdev_read_iter+0x143/0x3b0
> vfs_read+0x6ac/0xa20
> ? __pfx_vfs_read+0x10/0x10
> ? __pfx_vm_mmap_pgoff+0x10/0x10
> ? __pfx___seccomp_filter+0x10/0x10
> ksys_read+0xf7/0x1d0
> ? __pfx_ksys_read+0x10/0x10
> do_syscall_64+0x93/0x180
> ? lockdep_hardirqs_on_prepare+0x16d/0x400
> ? do_syscall_64+0x9f/0x180
> ? lockdep_hardirqs_on+0x78/0x100
> ? do_syscall_64+0x9f/0x180
> ? lockdep_hardirqs_on_prepare+0x16d/0x400
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f565bd1ce11
> Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
> RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
> RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
> RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
> R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
> </TASK>
>
> The above issue is caused by taking a mutex while we're in the IO hot
> path. It's a regression introduced by commit 505363957fad ("nvmet: fix
> nvme status code when namespace is disabled"). The mutex ->su_mutex is
> used to find out whether a disabled nsid exists in the config group or
> not, in order to differentiate a nsid that is disabled from one that is
> non-existent.
>
> To fix the above issue[2], we now insert the nsid into the subsys
> Xarray as soon as it's created under the config group. Later, when that
> nsid is enabled, we set an Xarray mark on it and set ns->enabled to
> true. The Xarray mark is useful when we need to loop through all
> enabled namespaces under a subsystem using the xa_for_each_marked()
> API. If the nsid is later disabled, we clear the Xarray mark from it
> and set ns->enabled to false. Only when the nsid is deleted from the
> config group do we delete it from the Xarray.
>
> With this change, we can now easily differentiate a nsid that is
> disabled (i.e. the Xarray entry for the ns exists but ns->enabled is
> false) from one that is non-existent (i.e. no Xarray entry exists for
> the ns).
>
> Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
> Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled")
> Fix-suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/target/admin-cmd.c | 13 ++--
> drivers/nvme/target/core.c | 108 +++++++++++++++++++-------------
> drivers/nvme/target/nvmet.h | 1 +
> drivers/nvme/target/pr.c | 10 +--
> 4 files changed, 79 insertions(+), 53 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH] nvmet-loop: avoid using mutex in IO hotpath
2024-11-29 10:48 [PATCH] nvmet-loop: avoid using mutex in IO hotpath Nilay Shroff
2024-12-03 8:38 ` Hannes Reinecke
@ 2024-12-03 8:45 ` Chaitanya Kulkarni
2024-12-09 8:01 ` Christoph Hellwig
2 siblings, 0 replies; 5+ messages in thread
From: Chaitanya Kulkarni @ 2024-12-03 8:45 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme@lists.infradead.org
Cc: shinichiro.kawasaki@wdc.com, hch@lst.de, kbusch@kernel.org,
sagi@grimberg.me, axboe@kernel.dk, gjoyce@linux.ibm.com
On 11/29/24 02:48, Nilay Shroff wrote:
> Using a mutex in the IO hot path causes a "sleeping while atomic" kernel
> BUG. Shinichiro[1] first encountered this issue while running blktest
> nvme/052, as shown below:
Looks good.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH] nvmet-loop: avoid using mutex in IO hotpath
2024-11-29 10:48 [PATCH] nvmet-loop: avoid using mutex in IO hotpath Nilay Shroff
2024-12-03 8:38 ` Hannes Reinecke
2024-12-03 8:45 ` Chaitanya Kulkarni
@ 2024-12-09 8:01 ` Christoph Hellwig
2024-12-09 17:37 ` Nilay Shroff
2 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2024-12-09 8:01 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, shinichiro.kawasaki, hch, kbusch, sagi, axboe,
chaitanyak, gjoyce
On Fri, Nov 29, 2024 at 04:18:33PM +0530, Nilay Shroff wrote:
> - xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
> + xa_for_each_marked(&ctrl->subsys->namespaces, idx,
> + ns, NVMET_NS_ENABLED) {
Given that this is duplicated quite a bit, maybe add a
nvmet_for_each_enabled_ns wrapper for it? And also a nvmet_for_each_ns
one that iterates all namespaces?
Otherwise this looks good, and I really like the use of the xarray
mark to simplify things!
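For context, the suggested wrappers would presumably be thin macros in nvmet.h over the Xarray iterators, along these lines (a hypothetical sketch; only the names come from the review):

```c
/* Hypothetical wrappers, as suggested in the review above. */
#define nvmet_for_each_enabled_ns(xa, idx, ns) \
	xa_for_each_marked(xa, idx, ns, NVMET_NS_ENABLED)

#define nvmet_for_each_ns(xa, idx, ns) \
	xa_for_each(xa, idx, ns)
```

Call sites such as nvmet_get_smart_log_all() would then read nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) { ... }.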
* Re: [PATCH] nvmet-loop: avoid using mutex in IO hotpath
2024-12-09 8:01 ` Christoph Hellwig
@ 2024-12-09 17:37 ` Nilay Shroff
0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2024-12-09 17:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, shinichiro.kawasaki, kbusch, sagi, axboe, chaitanyak,
gjoyce
On 12/9/24 13:31, Christoph Hellwig wrote:
> On Fri, Nov 29, 2024 at 04:18:33PM +0530, Nilay Shroff wrote:
>> - xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
>> + xa_for_each_marked(&ctrl->subsys->namespaces, idx,
>> + ns, NVMET_NS_ENABLED) {
>
> Given that this is duplicated quite a bit, maybe add a
> nvmet_for_each_enabled_ns wrapper for it? And also a nvmet_for_each_ns
> one that iterates all namespaces?
Yeah sure, I will add those wrappers and spin a new patch.
> Otherwise this looks good, and I really like the use of the xarray
> mark to simplify things!
>
Thanks,
--Nilay