Re: [PATCH] block: don't grab elevator lock during queue initialization

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Nilay Shroff <nilay@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Subject: Re: [PATCH] block: don't grab elevator lock during queue initialization
Date: Thu, 10 Apr 2025 10:10:37 +0800	[thread overview]
Message-ID: <Z_concavD65-DXqA@fedora> (raw)
In-Reply-To: <5c2274c0-fd82-4752-b735-32e52de2a80f@linux.ibm.com>

On Thu, Apr 10, 2025 at 01:15:22AM +0530, Nilay Shroff wrote:
> 
> 
> On 4/9/25 7:38 PM, Ming Lei wrote:
> >>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
> >>>>> involved wrt. switching elevator. If elevator switching is not allowed
> >>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
> >>>>> lock?
> >>>>>
> >>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
> >>>> However my question was if we were to not allow elevator switch while 
> >>>> __blk_mq_update_nr_hw_queues is running then how would we implement it?
> >>>
> >>> It can be done easily by tag_set->srcu.
> >> Ok great if that's possible! But I'm not sure how it could be done in this
> >> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
> >> run in the writer/updater context. So you may still need lock? Can you
> >> please send across a (informal) patch with your idea ?

> On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote:
>> Yes new incoming NS may not be live yet when we iterate through
>> ctrl->namespaces. So we don't need bother about it yet.

If new NS needn't to be considered, the issue is easier to deal with updating
nr_hw_queues, such as:

- disable scheduler in 1st iterating tag_list

- update nr_hw_queues in 2nd iterating tag_list

- restore scheduler in 3rd iterating tag_list

->tag_list_lock is acquired & released in each iteration.

Then per-set lock isn't needed any more.

> > 
> > Please see the attached patch which treats elv_iosched_store() as reader.
> > 
> I looked trough patch and looks good. However we still have an issue 
> in code paths where we acquire ->elevator_lock without first freezing 
> the queue.In this case if we try allocate memory using GFP_KERNEL 
> then fs_reclaim would still trigger lockdep splat. As for this case
> the lock order would be,
> 
> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)

That is why I call ->elevator_lock is used too much, especially it is
depended for dealing with kobject & sysfs, which isn't needed in this way.

> 
> In fact, I tested your patch and got into the above splat. I've 
> attached lockdep splat for your reference. So IMO, we should instead
> try make locking order such that ->freeze_lock shall never depend on
> ->elevator_lock. It seems one way to make that possible if we make
> ->elevator_lock per-set. 

The attached patch is just for avoiding race between add/del disk vs.
updating nr_hw_queues, and it should have been for covering race between
adding/exiting request queue vs. updating nr_hw_queues.

If re-order can be done, that is definitely good, but...

> 
> >>>
> >>> Actually freeze lock is already held for nvme before calling
> >>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
> >>> frozen for updating nr_hw_queues, so the above order may not match
> >>> with the existed code.
> >>>
> >>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> >>>
> >> I think we should consider (may be in different patch) updating
> >> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
> >> its dependency on ->tag_list_lock.
> > 
> > If we need to take nvme into account, the above lock order doesn't work,
> > because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
> > and elevator lock still depends on freeze lock.
> > 
> > If it needn't to be considered, per-set lock becomes necessary.
> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> we shall also update nvme pci, rdma and tcp drivers so that those 
> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues
> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that
> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues.

This way you cause new deadlock or new trouble if you reply on freeze in
blk_mq_update_nr_hw_queues():

 ->tag_list_lock
 	freeze_lock

If timeout or io failure happens during the above freeze_lock, timeout
handler can not move on because new blk_mq_update_nr_hw_queues() can't
grab the lock.

Either deadlock or device has to been removed.

> 
> >>>>
> >>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall
> >>>> help avoid few of the recent lockdep splats reported with fs_reclaim.
> >>>> What do you think?
> >>>
> >>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
> >>> related splat.
> >>>
> >>> However, in del_gendisk(), freeze_lock is still held before calling
> >>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.
> >>
> >> Yes agreed, however elevator_exit() called from del_gendisk() or 
> >> elv_unregister_queue() called from blk_unregister_queue() are called 
> >> after we unregister the queue. And if queue has been already unregistered
> >> while invoking elevator_exit or del_gensidk then ideally we don't need to
> >> acquire ->elevator_lock. The same is true for elevator_exit() called 
> >> from add_disk_fwnode(). So IMO, we should update these paths to avoid 
> >> acquiring ->elevator_lock.
> > 
> > As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
> > host wide event, so either lock or srcu sync is needed.
> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel 
> with del_gendisk or blk_unregister_queue.

Then the per-set lock is required in both add/del disk, then how to re-order
freeze_lock & elevator lock in del_gendisk()? 

And there is same risk with the one in blk_mq_update_nr_hw_queues().

> 
> Thanks,
> --Nilay

> [  399.112145] ======================================================
> [  399.112153] WARNING: possible circular locking dependency detected
> [  399.112160] 6.15.0-rc1+ #159 Not tainted
> [  399.112167] ------------------------------------------------------
> [  399.112174] bash/5891 is trying to acquire lock:
> [  399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4
> [  399.112205]
> [  399.112205] but task is already holding lock:
> [  399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
> [  399.112231]
> [  399.112231] which lock already depends on the new lock.
> [  399.112231]
> [  399.112239]
> [  399.112239] the existing dependency chain (in reverse order) is:
> [  399.112246]
> [  399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}:
> [  399.112260]        blk_alloc_queue+0x3a8/0x3e4
> [  399.112268]        blk_mq_alloc_queue+0x88/0x11c
> [  399.112278]        __blk_mq_alloc_disk+0x34/0xd8
> [  399.112290]        null_add_dev+0x3c8/0x914 [null_blk]
> [  399.112306]        null_init+0x1e0/0x4bc [null_blk]
> [  399.112319]        do_one_initcall+0x8c/0x4b8
> [  399.112340]        do_init_module+0x7c/0x2c4
> [  399.112396]        init_module_from_file+0xb4/0x108
> [  399.112405]        idempotent_init_module+0x26c/0x368
> [  399.112414]        sys_finit_module+0x98/0x150
> [  399.112422]        system_call_exception+0x134/0x360
> [  399.112430]        system_call_vectored_common+0x15c/0x2ec
> [  399.112441]
> [  399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}:
> [  399.112453]        fs_reclaim_acquire+0xe4/0x120
> [  399.112461]        kmem_cache_alloc_noprof+0x74/0x570
> [  399.112469]        __kernfs_new_node+0x98/0x37c
> [  399.112479]        kernfs_new_node+0x94/0xe4
> [  399.112490]        kernfs_create_dir_ns+0x44/0x120
> [  399.112501]        sysfs_create_dir_ns+0x94/0x160
> [  399.112512]        kobject_add_internal+0xf4/0x3c8
> [  399.112524]        kobject_add+0x70/0x10c
> [  399.112533]        elv_register_queue+0x70/0x14c
> [  399.112562]        blk_register_queue+0x1d8/0x2bc
> [  399.112575]        add_disk_fwnode+0x3b4/0x5d0
> [  399.112588]        sd_probe+0x3bc/0x5b4 [sd_mod]
> [  399.112601]        really_probe+0x104/0x4c4
> [  399.112613]        __driver_probe_device+0xb8/0x200
> [  399.112623]        driver_probe_device+0x54/0x128
> [  399.112632]        __driver_attach_async_helper+0x7c/0x150
> [  399.112641]        async_run_entry_fn+0x60/0x1bc
> [  399.112665]        process_one_work+0x2ac/0x7e4
> [  399.112678]        worker_thread+0x238/0x460
> [  399.112691]        kthread+0x158/0x188
> [  399.112700]        start_kernel_thread+0x14/0x18
> [  399.112722]
> [  399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}:

Another ways is to make sure that ->elevator_lock isn't required for dealing
with kobject/debugfs thing, which needs to refactor elevator code.

Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs
actually need to drain reader, that is deadlock too.

Thanks,
Ming

next prev parent reply	other threads:[~2025-04-10  2:10 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-03 10:54 [PATCH] block: don't grab elevator lock during queue initialization Ming Lei
2025-04-03 13:19 ` Nilay Shroff
2025-04-03 14:24   ` Ming Lei
2025-04-05 14:00     ` Nilay Shroff
2025-04-03 14:32 ` Jens Axboe
2025-04-04  9:10 ` Christoph Hellwig
2025-04-04 12:09   ` Ming Lei
2025-04-07  6:49     ` Christoph Hellwig
2025-04-05 14:14   ` Nilay Shroff
2025-04-07  3:09     ` Ming Lei
2025-04-07  8:29       ` Nilay Shroff
2025-04-08  7:38         ` Ming Lei
2025-04-08 13:25           ` Nilay Shroff
2025-04-08 13:50             ` Ming Lei
2025-04-09  9:12               ` Nilay Shroff
2025-04-09 11:46                 ` Ming Lei
2025-04-09 13:46                   ` Nilay Shroff
2025-04-09 14:08                     ` Ming Lei
2025-04-09 19:45                       ` Nilay Shroff
2025-04-10  2:10                         ` Ming Lei [this message]
2025-04-10 13:36                           ` Nilay Shroff
2025-04-10 14:23                             ` Ming Lei
2025-04-10 14:48                               ` Nilay Shroff

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z_concavD65-DXqA@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=linux-block@vger.kernel.org \
    --cc=nilay@linux.ibm.com \
    --cc=syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.