From: Ming Lei <ming.lei@redhat.com>
To: Nilay Shroff <nilay@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Subject: Re: [PATCH] block: don't grab elevator lock during queue initialization
Date: Thu, 10 Apr 2025 22:23:38 +0800
Message-ID: <Z_fUaumnM2lzQKIh@fedora>
In-Reply-To: <917201e2-acac-4ab0-982e-04635806304c@linux.ibm.com>
On Thu, Apr 10, 2025 at 07:06:23PM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:40 AM, Ming Lei wrote:
> >
> > If a new NS needn't be considered, the issue is easier to deal with when
> > updating nr_hw_queues, such as:
> >
> > - disable the scheduler in the 1st iteration over tag_list
> >
> > - update nr_hw_queues in the 2nd iteration over tag_list
> >
> > - restore the scheduler in the 3rd iteration over tag_list
> >
> > ->tag_list_lock is acquired & released in each iteration.
> >
> > Then the per-set lock isn't needed any more.
> >
> Hmm still thinking...
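To make the three-pass idea concrete, here is a rough, untested sketch; the
elevator-switch helpers are named after the existing blk_mq_elv_switch_none()/
blk_mq_elv_switch_back() pair, but take the whole thing as illustration only:

    static void update_nr_hw_queues_sketch(struct blk_mq_tag_set *set,
                                           int nr_hw_queues)
    {
        struct request_queue *q;

        /* pass 1: detach the scheduler from every queue in the set */
        mutex_lock(&set->tag_list_lock);
        list_for_each_entry(q, &set->tag_list, tag_set_list)
            blk_mq_elv_switch_none(q);      /* remembers the old elevator */
        mutex_unlock(&set->tag_list_lock);

        /* pass 2: resize hw queues while no elevator is attached */
        mutex_lock(&set->tag_list_lock);
        set->nr_hw_queues = nr_hw_queues;
        list_for_each_entry(q, &set->tag_list, tag_set_list)
            blk_mq_realloc_hw_ctxs(set, q);
        mutex_unlock(&set->tag_list_lock);

        /* pass 3: restore the remembered schedulers */
        mutex_lock(&set->tag_list_lock);
        list_for_each_entry(q, &set->tag_list, tag_set_list)
            blk_mq_elv_switch_back(q);
        mutex_unlock(&set->tag_list_lock);
    }

Since no elevator is attached during pass 2, nothing in the resize path needs
->elevator_lock, which is the whole point.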
> >>>
> >>> Please see the attached patch which treats elv_iosched_store() as reader.
> >>>
> >> I looked through the patch and it looks good. However, we still have an issue
> >> in code paths where we acquire ->elevator_lock without first freezing
> >> the queue. In this case, if we try to allocate memory using GFP_KERNEL,
> >> then fs_reclaim would still trigger a lockdep splat. For this case
> >> the lock order would be,
> >>
> >> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)
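(To make that chain concrete: any GFP_KERNEL allocation done under
->elevator_lock can enter fs_reclaim, and reclaim may wait on writeback IO,
which in turn needs q->q_usage_counter(io). A contrived example:

    mutex_lock(&q->elevator_lock);
    /* GFP_KERNEL may enter fs_reclaim, and reclaim can issue
     * writeback IO which needs q->q_usage_counter(io) */
    tags = kcalloc(nr, sizeof(*tags), GFP_KERNEL);
    mutex_unlock(&q->elevator_lock);

so lockdep records elevator_lock -> fs_reclaim -> q->q_usage_counter(io).)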
> >
> > That is why I say ->elevator_lock is used too much; in particular it is
> > relied on for dealing with kobject & sysfs, which isn't needed in this way.
> >
> >>
> >> In fact, I tested your patch and ran into the above splat; I've
> >> attached the lockdep splat for your reference. So IMO we should instead
> >> make the locking order such that ->freeze_lock never depends on
> >> ->elevator_lock. It seems one way to make that possible is to make
> >> ->elevator_lock per-set.
> >
> > The attached patch is just for avoiding the race between add/del disk vs.
> > updating nr_hw_queues; it should also have covered the race between
> > adding/exiting a request queue vs. updating nr_hw_queues.
> >
> Okay understood.
>
> > If the re-ordering can be done, that is definitely good, but...
> Yeah, so I tried re-ordering the locks so that we first grab q->elevator_lock
> and then ->freeze_lock. As we know, there's a challenge with re-arranging
> the lock order in __blk_mq_update_nr_hw_queues, but I managed to rectify
> it: I added one more tag_list iteration in which we first acquire
> ->elevator_lock, and then in the next tag_list iteration we acquire
> ->freeze_lock. I have also updated this lock order at the other call sites.
>
> Then, as we have already discussed, there's another challenge re-arranging
> the lock order in del_gendisk; however, I managed to mitigate that by moving
> elevator_exit and elv_unregister_queue (from blk_unregister_queue) to just after
> we delete the queue tag set (in other words, when we remove the queue from the
> tag list) in del_gendisk. That means we can now safely invoke
> elv_unregister_queue and elevator_exit from del_gendisk without needing to
> acquire ->elevator_lock.
>
> For reference, I attached an (informal) patch. Yes, I know this patch would not
> fix all splats, but at least we would stop observing the splat related to the
> ->freeze_lock dependency on ->elevator_lock.
I just sent out the whole patchset, which basically does one thing: move kobject
& debugfs & cpuhp out of queue freezing. After that, no lockdep splat is observed
in my test VM after running blktests with './check block/'.
So the point is _not_ related to the elevator lock or its order with the
freeze lock. What matters is actually "do not connect the freeze lock with
other subsystems (debugfs, sysfs, cpuhp, ...)", because freeze_lock depends
on fs_reclaim directly since commit ffa1e7ada456, and the other subsystems
easily depend on fs_reclaim again.
I did have patches following your idea to reorder the elevator lock vs. the
freeze lock, but it didn't make a difference, even after I killed most uses
of the elevator_lock.
Finally I realized the above point.
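In code form, the rule the patchset follows is roughly (a hand-written
sketch, not a quote from the series):

    blk_mq_freeze_queue(q);
    /* pure queue state updates only; no kobject, debugfs or cpuhp
     * work, and no GFP_KERNEL allocation, inside the frozen section */
    blk_mq_unfreeze_queue(q);

    /* sysfs/debugfs registration happens only after unfreeze, so
     * freeze_lock never gets connected to those subsystems' locks */
    ret = blk_register_queue(disk);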
>
> >
> >>
> >>>>>
> >>>>> Actually, the freeze lock is already held for nvme before calling
> >>>>> blk_mq_update_nr_hw_queues, and it is reasonable to assume the queue is
> >>>>> frozen while updating nr_hw_queues, so the above order may not match
> >>>>> the existing code.
> >>>>>
> >>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> >>>>>
> >>>> I think we should consider (maybe in a different patch) updating
> >>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues to remove
> >>>> their dependency on ->tag_list_lock.
> >>>
> >>> If we need to take nvme into account, the above lock order doesn't work,
> >>> because nvme freezes the queue before calling blk_mq_update_nr_hw_queues(),
> >>> and the elevator lock still depends on the freeze lock.
> >>>
> >>> If it needn't be considered, a per-set lock becomes necessary.
> >> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues,
> >> we should also update the nvme pci, rdma and tcp drivers so that they
> >> neither freeze the queue prior to invoking blk_mq_update_nr_hw_queues
> >> nor unfreeze it after blk_mq_update_nr_hw_queues returns. I see that
> >> nvme loop and fc don't freeze the queue prior to invoking blk_mq_update_nr_hw_queues.
> >
> > This way you cause a new deadlock, or new trouble, if you rely on freezing in
> > blk_mq_update_nr_hw_queues():
> >
> > 	->tag_list_lock
> > 		freeze_lock
> >
> > If a timeout or io failure happens while the above freeze_lock is held, the
> > timeout handler can not move on, because the new blk_mq_update_nr_hw_queues()
> > can't grab the lock.
> >
> > Either it deadlocks, or the device has to be removed.
> >
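To spell out that window (a hypothetical interleaving, not a captured trace):

    /*
     * CPU0: blk_mq_update_nr_hw_queues()    CPU1: timeout handling
     *
     * mutex_lock(&set->tag_list_lock);
     * blk_mq_freeze_queue(q);
     *   ... waits for in-flight IO ...      IO times out; recovery wants
     *                                       to update nr_hw_queues again,
     *                                       so it blocks on
     *                                       mutex_lock(&set->tag_list_lock)
     *
     * CPU0 never sees the IO finish, CPU1 never gets the lock.
     */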
> With this new attached patch, do you still foresee this issue? Yes, this patch
> doesn't change anything in the nvme driver, but later we may update the nvme
> driver so that it doesn't require freezing the queue before invoking
> blk_mq_update_nr_hw_queues. I think this is required so that we follow the same
> lock order in all call paths wrt ->elevator_lock and ->freeze_lock.
nvme's freeze lock is taken as non_owner, so it can't be verified by lockdep
now; don't use nvme for running this lock test.
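(For readers following along: nvme starts the freeze in one context and
finishes it in another, which is why the freeze "lock" is acquired as
non-owner and lockdep can not track it. From memory the nvme side looks
roughly like the following; check drivers/nvme/host/core.c for the real code:

    void nvme_start_freeze(struct nvme_ctrl *ctrl)
    {
        struct nvme_ns *ns;

        /* namespace-list locking omitted; the point is the
         * non-owner variant, which lockdep cannot verify */
        list_for_each_entry(ns, &ctrl->namespaces, list)
            blk_freeze_queue_start_non_owner(ns->queue);
    }

The matching unfreeze runs from a different task, via
blk_mq_unfreeze_queue_non_owner().)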
>
> >>> As I mentioned, blk_mq_update_nr_hw_queues() can still come in; it is a
> >>> host-wide event, so either a lock or srcu sync is needed.
> >> Yes, agreed. I see that blk_mq_update_nr_hw_queues can run in parallel
> >> with del_gendisk or blk_unregister_queue.
> >
> > Then the per-set lock is required in both add/del disk, so how do we re-order
> > freeze_lock & elevator lock in del_gendisk()?
> >
> > And there is the same risk as the one in blk_mq_update_nr_hw_queues().
> >
> Yes, please see above where I explained how we could potentially avoid the
> lock order issue in del_gendisk.
> >
> > Another way is to make sure that ->elevator_lock isn't required for dealing
> > with the kobject/debugfs things, which needs refactoring of the elevator code.
> >
> > For example, ->elevator_lock is grabbed in the debugfs handler, and removing
> > the sched debugfs entries actually needs to drain readers, so that is a
> > deadlock too.
> >
> I do agree. But I think let's first focus on cutting the dependency of
> ->freeze_lock on ->elevator_lock. Once we get past that we may address the others.
>
> Commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up
> earlier") has now opened up a pandora's box of lockdep splats :)
> Anyway, it's good that we can now catch these issues early on. In general,
> I feel this change will now surface early any issue where ->freeze_lock
> depends on other locks.
It also shows that the block layer's freeze-queue API should be used very
carefully; fortunately we have the powerful lockdep.
Thanks,
Ming