From: Ming Lei <ming.lei@redhat.com>
To: Nilay Shroff <nilay@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Subject: Re: [PATCH] block: don't grab elevator lock during queue initialization
Date: Tue, 8 Apr 2025 21:50:26 +0800 [thread overview]
Message-ID: <Z_UpoiWlBnwaUW7B@fedora> (raw)
In-Reply-To: <c2c9e913-1c24-49c7-bfc5-671728f8dddc@linux.ibm.com>
On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote:
>
>
> On 4/8/25 1:08 PM, Ming Lei wrote:
> > On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/7/25 8:39 AM, Ming Lei wrote:
> >>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
> >>>>
> >>>>
> >>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> >>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> >>>>>> Fixes the following lockdep warning:
> >>>>>
> >>>>> Please spell the actual dependency out here, links are not permanent
> >>>>> and also not readable for any offline reading of the commit logs.
> >>>>>
> >>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >>>>>> + struct request_queue *q, bool lock)
> >>>>>> +{
> >>>>>> + if (lock) {
> >>>>>
> >>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> >>>>> screaming at you (in this case even for the right reason :))
> >>>>>
> >>>>>> + /* protect against switching io scheduler */
> >>>>>> + mutex_lock(&q->elevator_lock);
> >>>>>> + __blk_mq_realloc_hw_ctxs(set, q);
> >>>>>> + mutex_unlock(&q->elevator_lock);
> >>>>>> + } else {
> >>>>>> + __blk_mq_realloc_hw_ctxs(set, q);
> >>>>>> + }
> >>>>>
> >>>>> I think the problem here is again that because of all the other
> >>>>> dependencies elevator_lock really needs to be per-set instead of
> >>>>> per-queue which will allows us to have much saner locking hierarchies.
> >>>>>
> >>>> I believe you meant here q->tag_set->elevator_lock?
> >>>
> >>> I don't know what locks you are planning to invent.
> >>>
> >>> For set->tag_list_lock, it has been very fragile:
> >>>
> >>> blk_mq_update_nr_hw_queues
> >>> set->tag_list_lock
> >>> freeze_queue
> >>>
> >>> If IO failure happens when waiting in above freeze_queue(), the nvme error
> >>> handling can't provide forward progress any more, because the error
> >>> handling code path requires set->tag_list_lock.
> >>
> >> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> >
> > Yes.
> >
> >> which is called in nvme error handling path. If yes then I believe this function
> >> could be easily modified so that it doesn't require ->tag_list_lock.
> >
> > Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
> >
> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 777db89fdaa7..002d2fd20e0c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
> {
> if (!ctrl->tagset)
> return;
> - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> - blk_mq_quiesce_tagset(ctrl->tagset);
> - else
> - blk_mq_wait_quiesce_done(ctrl->tagset);
> + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> + struct nvme_ns *ns;
> + int srcu_idx;
> +
> + srcu_idx = srcu_read_lock(&ctrl->srcu);
> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> + srcu_read_lock_held(&ctrl->srcu)) {
> + if (!blk_queue_skip_tagset_quiesce(ns->queue))
> + blk_mq_quiesce_queue_nowait(ns->queue);
> + }
> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
> + }
> + blk_mq_wait_quiesce_done(ctrl->tagset);
> }
> EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
>
> Here we iterate through ctrl->namespaces instead of relying on tag_list
> and so we don't need to acquire ->tag_list_lock.
How can you make sure all NSs are covered in this way? RCU/SRCU can't
provide such kind of guarantee.
>
> > And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> >
> >>
> >>>
> >>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> >>> fortunately that is what nvme is doing.
> >>>
> >>>
> >>>> If yes then it means that we should be able to grab ->elevator_lock
> >>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> >>>> order should be in each code path,
> >>>>
> >>>> __blk_mq_update_nr_hw_queues
> >>>> ->elevator_lock
> >>>> ->freeze_lock
> >>>
> >>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>> just make things worse. Why can't we disable elevator switch during
> >>> updating nr_hw_queues?
> >>>
> >> I couldn't quite understand this. As we already first disable the elevator
> >> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >> Once mapping is successful we switch back the elevator.
> >
> > Yes, but user still may switch elevator from none to others during the
> > period, right?
> >
> Yes correct, that's possible. So your suggestion was to disable elevator
> update while we're running __blk_mq_update_nr_hw_queues? And that way user
> couldn't update elevator through sysfs (elv_iosched_store) while we update
> nr_hw_queues? If this is true then still how could it help solve lockdep
> splat?
Then why do you think per-set lock can solve the lockdep splat?
__blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
involved wrt. switching elevator. If elevator switching is not allowed
when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
lock?
Thanks,
Ming
next prev parent reply other threads:[~2025-04-08 13:50 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-03 10:54 [PATCH] block: don't grab elevator lock during queue initialization Ming Lei
2025-04-03 13:19 ` Nilay Shroff
2025-04-03 14:24 ` Ming Lei
2025-04-05 14:00 ` Nilay Shroff
2025-04-03 14:32 ` Jens Axboe
2025-04-04 9:10 ` Christoph Hellwig
2025-04-04 12:09 ` Ming Lei
2025-04-07 6:49 ` Christoph Hellwig
2025-04-05 14:14 ` Nilay Shroff
2025-04-07 3:09 ` Ming Lei
2025-04-07 8:29 ` Nilay Shroff
2025-04-08 7:38 ` Ming Lei
2025-04-08 13:25 ` Nilay Shroff
2025-04-08 13:50 ` Ming Lei [this message]
2025-04-09 9:12 ` Nilay Shroff
2025-04-09 11:46 ` Ming Lei
2025-04-09 13:46 ` Nilay Shroff
2025-04-09 14:08 ` Ming Lei
2025-04-09 19:45 ` Nilay Shroff
2025-04-10 2:10 ` Ming Lei
2025-04-10 13:36 ` Nilay Shroff
2025-04-10 14:23 ` Ming Lei
2025-04-10 14:48 ` Nilay Shroff
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z_UpoiWlBnwaUW7B@fedora \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=linux-block@vger.kernel.org \
--cc=nilay@linux.ibm.com \
--cc=syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.