All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nilay Shroff <nilay@linux.ibm.com>
To: Hannes Reinecke <hare@suse.de>, linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, sagi@grimberg.me, axboe@kernel.dk,
	shinichiro.kawasaki@wdc.com, chaitanyak@nvidia.com,
	gjoyce@linux.ibm.com
Subject: Re: [PATCHv2] nvmet-loop: avoid using mutex in IO hotpath
Date: Wed, 11 Dec 2024 17:29:54 +0530	[thread overview]
Message-ID: <bf2a7797-0eb9-4355-a8ac-ffcf2235d227@linux.ibm.com> (raw)
In-Reply-To: <faa65186-839d-4352-9d94-386e440abb48@suse.de>



On 12/11/24 16:59, Hannes Reinecke wrote:
> On 12/11/24 09:58, Nilay Shroff wrote:
>> Using mutex lock in IO hot path causes the kernel BUG sleeping while
>> atomic. Shinichiro[1], first encountered this issue while running blktest
>> nvme/052 shown below:
>>
>> BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
>> in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
>> preempt_count: 0, expected: 0
>> RCU nest depth: 1, expected: 0
>> 2 locks held by (udev-worker)/996:
>>   #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
>>   #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
>> CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
>> Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0x6a/0x90
>>   __might_resched.cold+0x1f7/0x23d
>>   ? __pfx___might_resched+0x10/0x10
>>   ? vsnprintf+0xdeb/0x18f0
>>   __mutex_lock+0xf4/0x1220
>>   ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>>   ? __pfx_vsnprintf+0x10/0x10
>>   ? __pfx___mutex_lock+0x10/0x10
>>   ? snprintf+0xa5/0xe0
>>   ? xas_load+0x1ce/0x3f0
>>   ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>>   nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>>   ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
>>   nvmet_req_find_ns+0x24e/0x300 [nvmet]
>>   nvmet_req_init+0x694/0xd40 [nvmet]
>>   ? blk_mq_start_request+0x11c/0x750
>>   ? nvme_setup_cmd+0x369/0x990 [nvme_core]
>>   nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
>>   ? __pfx___lock_acquire+0x10/0x10
>>   ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
>>   __blk_mq_issue_directly+0xe2/0x1d0
>>   ? __pfx___blk_mq_issue_directly+0x10/0x10
>>   ? blk_mq_request_issue_directly+0xc2/0x140
>>   blk_mq_plug_issue_direct+0x13f/0x630
>>   ? lock_acquire+0x2d/0xc0
>>   ? blk_mq_flush_plug_list+0xa75/0x1950
>>   blk_mq_flush_plug_list+0xa9d/0x1950
>>   ? __pfx_blk_mq_flush_plug_list+0x10/0x10
>>   ? __pfx_mpage_readahead+0x10/0x10
>>   __blk_flush_plug+0x278/0x4d0
>>   ? __pfx___blk_flush_plug+0x10/0x10
>>   ? lock_release+0x460/0x7a0
>>   blk_finish_plug+0x4e/0x90
>>   read_pages+0x51b/0xbc0
>>   ? __pfx_read_pages+0x10/0x10
>>   ? lock_release+0x460/0x7a0
>>   page_cache_ra_unbounded+0x326/0x5c0
>>   force_page_cache_ra+0x1ea/0x2f0
>>   filemap_get_pages+0x59e/0x17b0
>>   ? __pfx_filemap_get_pages+0x10/0x10
>>   ? lock_is_held_type+0xd5/0x130
>>   ? __pfx___might_resched+0x10/0x10
>>   ? find_held_lock+0x2d/0x110
>>   filemap_read+0x317/0xb70
>>   ? up_write+0x1ba/0x510
>>   ? __pfx_filemap_read+0x10/0x10
>>   ? inode_security+0x54/0xf0
>>   ? selinux_file_permission+0x36d/0x420
>>   blkdev_read_iter+0x143/0x3b0
>>   vfs_read+0x6ac/0xa20
>>   ? __pfx_vfs_read+0x10/0x10
>>   ? __pfx_vm_mmap_pgoff+0x10/0x10
>>   ? __pfx___seccomp_filter+0x10/0x10
>>   ksys_read+0xf7/0x1d0
>>   ? __pfx_ksys_read+0x10/0x10
>>   do_syscall_64+0x93/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> RIP: 0033:0x7f565bd1ce11
>> Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
>> RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
>> RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
>> RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
>> R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
>> R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
>>   </TASK>
>>
>> Apparently, the above issue is caused due to using mutex lock while
>> we're in IO hot path. It's a regression caused with commit 505363957fad
>> ("nvmet: fix nvme status code when namespace is disabled"). The mutex
>> ->su_mutex is used to find whether a disabled nsid exists in the config
>> group or not. This is to differentiate between a nsid that is disabled
>> vs non-existent.
>>
>> To mitigate the above issue, we've worked upon a fix[2] where we now
>> insert nsid in subsys Xarray as soon as it's created under config group
>> and later when that nsid is enabled, we add an Xarray mark on it and set
>> ns->enabled to true. The Xarray mark is useful while we need to loop
>> through all enabled namepsaces under a subsystem using xa_for_each_marked()
>> API. If later a nsid is disabled then we clear Xarray mark from it and also
>> set ns->enabled to false. It's only when nsid is deleted from the config
>> group we delete it from the Xarray.
>>
>> So with this change, now we could easily differentiate a nsid is disabled
>> (i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
>> existent (i.e.Xarray entry for ns doesn't exist).
>>
>> Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
>> Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled")
>> Fix-suggested-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> Changes from v1:
>> - Added nvmet_for_each_enabled_ns wrapper for xa_for_each_marked (hch)
>> - Added nvmet_for_each_ns which iterates all namespaces (hch)
>> - Removed now ununsed function nvmet_subsys_nsid_exists
>> - Pick up Reviewed-by: Hannes Reinecke <hare@suse.de>
>> - Pick up Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
>> - Link to v1: https://lore.kernel.org/all/20241129104838.3015665-1-nilay@linux.ibm.com/
>> ---
>>   drivers/nvme/target/admin-cmd.c |   9 +--
>>   drivers/nvme/target/configfs.c  |  12 ----
>>   drivers/nvme/target/core.c      | 108 +++++++++++++++++++-------------
>>   drivers/nvme/target/nvmet.h     |   7 +++
>>   drivers/nvme/target/pr.c        |   8 +--
>>   5 files changed, 79 insertions(+), 65 deletions(-)
>>
>> diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
>> index 2962794ce881..fa89b0549c36 100644
>> --- a/drivers/nvme/target/admin-cmd.c
>> +++ b/drivers/nvme/target/admin-cmd.c
>> @@ -139,7 +139,7 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req,
>>       unsigned long idx;
>>         ctrl = req->sq->ctrl;
>> -    xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
>> +    nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) {
>>           /* we don't have the right data for file backed ns */
>>           if (!ns->bdev)
>>               continue;
>> @@ -331,9 +331,10 @@ static u32 nvmet_format_ana_group(struct nvmet_req *req, u32 grpid,
>>       u32 count = 0;
>>         if (!(req->cmd->get_log_page.lsp & NVME_ANA_LOG_RGO)) {
>> -        xa_for_each(&ctrl->subsys->namespaces, idx, ns)
>> +        nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) {
>>               if (ns->anagrpid == grpid)
>>                   desc->nsids[count++] = cpu_to_le32(ns->nsid);
>> +        }
>>       }
>>         desc->grpid = cpu_to_le32(grpid);
>> @@ -772,7 +773,7 @@ static void nvmet_execute_identify_endgrp_list(struct nvmet_req *req)
>>           goto out;
>>       }
>>   -    xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
>> +    nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) {
>>           if (ns->nsid <= min_endgid)
>>               continue;
>>   @@ -815,7 +816,7 @@ static void nvmet_execute_identify_nslist(struct nvmet_req *req, bool match_css)
>>           goto out;
>>       }
>>   -    xa_for_each(&ctrl->subsys->namespaces, idx, ns) {
>> +    nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) {
>>           if (ns->nsid <= min_nsid)
>>               continue;
>>           if (match_css && req->ns->csi != req->cmd->identify.csi)
>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>> index eeee9e9b854c..c0000db47cbb 100644
>> --- a/drivers/nvme/target/configfs.c
>> +++ b/drivers/nvme/target/configfs.c
>> @@ -810,18 +810,6 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>>       NULL,
>>   };
>>   -bool nvmet_subsys_nsid_exists(struct nvmet_subsys *subsys, u32 nsid)
>> -{
>> -    struct config_item *ns_item;
>> -    char name[12];
>> -
>> -    snprintf(name, sizeof(name), "%u", nsid);
>> -    mutex_lock(&subsys->namespaces_group.cg_subsys->su_mutex);
>> -    ns_item = config_group_find_item(&subsys->namespaces_group, name);
>> -    mutex_unlock(&subsys->namespaces_group.cg_subsys->su_mutex);
>> -    return ns_item != NULL;
>> -}
>> -
>>   static void nvmet_ns_release(struct config_item *item)
>>   {
>>       struct nvmet_ns *ns = to_nvmet_ns(item);
>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>> index 1f4e9989663b..fde6c555af61 100644
>> --- a/drivers/nvme/target/core.c
>> +++ b/drivers/nvme/target/core.c
>> @@ -127,7 +127,7 @@ static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
>>       unsigned long idx;
>>       u32 nsid = 0;
>>   -    xa_for_each(&subsys->namespaces, idx, cur)
>> +    nvmet_for_each_enabled_ns(&subsys->namespaces, idx, cur)
>>           nsid = cur->nsid;
>>         return nsid;
>> @@ -441,11 +441,14 @@ u16 nvmet_req_find_ns(struct nvmet_req *req)
>>       struct nvmet_subsys *subsys = nvmet_req_subsys(req);
>>         req->ns = xa_load(&subsys->namespaces, nsid);
>> -    if (unlikely(!req->ns)) {
>> +    if (unlikely(!req->ns || !req->ns->enabled)) {
>>           req->error_loc = offsetof(struct nvme_common_command, nsid);
>> -        if (nvmet_subsys_nsid_exists(subsys, nsid))
>> -            return NVME_SC_INTERNAL_PATH_ERROR;
>> -        return NVME_SC_INVALID_NS | NVME_STATUS_DNR;
>> +        if (!req->ns) /* ns doesn't exist! */
>> +            return NVME_SC_INVALID_NS | NVME_STATUS_DNR;
>> +
>> +        /* ns exists but it's disabled */
>> +        req->ns = NULL;
>> +        return NVME_SC_INTERNAL_PATH_ERROR;
>>       }
>>         percpu_ref_get(&req->ns->ref);
> 
> Now I have to ask: Why do you set 'req->ns' to NULL here?
> The way I read this it would mean that the first call to nvmet_find_req_ns() would return INTERNAL_PATH_ERROR, and the
> second call would return INVALID_NS.
> 
> Is that intentional?
The nvmet_req_find_ns function returns error code for following two cases:
1) namespace doesn't exist in the config fs 
- the namespace entry doesn't exists in subsystem Xarray
- For this case req->ns would be always NULL as xa_load would return NULL
- We return error (NVME_SC_INVALID_NS | NVME_STATUS_DNR)

2) namespace exists in config fs but it's disabled
- the namespace entry exists in subsystem Xarray but it's marked as disabled
- As the ns is disabled, we don't increment refcount of this ns. So we set req->ns 
  to NULL and return error NVME_SC_INTERNAL_PATH_ERROR to the caller. Later in
  the call stack when we complete this request (please refer __nvmet_req_complete()), 
  we don't need to put the reference of the ns as req->ns is NULL.

Essentially, if nvmet_req_find_ns() returns error then we don't get reference 
to ns and hence set req->ns to NULL. And we don't need to make two different calls
to nvmet_find_req_ns(). It's based on one of two cases explianed above, 
nvmet_find_req_ns() would either return (NVME_SC_INVALID_NS | NVME_STATUS_DNR) or 
NVME_SC_INTERNAL_PATH_ERROR.

You may also refer this commit 505363957fad ("nvmet: fix nvme status code when 
namespace is disabled") where we first added error code NVME_SC_INTERNAL_PATH_ERROR in 
nvmet_req_find_ns().

Thanks,
--Nilay


  reply	other threads:[~2024-12-11 12:00 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-11  8:58 [PATCHv2] nvmet-loop: avoid using mutex in IO hotpath Nilay Shroff
2024-12-11 11:29 ` Hannes Reinecke
2024-12-11 11:59   ` Nilay Shroff [this message]
2024-12-12  5:42 ` Christoph Hellwig
2024-12-24 11:24 ` Sagi Grimberg
2024-12-27 21:28 ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bf2a7797-0eb9-4355-a8ac-ffcf2235d227@linux.ibm.com \
    --to=nilay@linux.ibm.com \
    --cc=axboe@kernel.dk \
    --cc=chaitanyak@nvidia.com \
    --cc=gjoyce@linux.ibm.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.