From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 849BB1C1F13 for ; Thu, 10 Apr 2025 02:10:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744251055; cv=none; b=Noiq/H0kHtYjQ+JogYDGthDEdjMCQ0lG0temZX8KBXVEP8JSlD464SlHxWHBQ86Gp174HHKgmUs0h0kwk0Zpg8pI/Utd+HjhuZCe/7aV5b/f2J1zsLTzmtr6fEp64noZ1aLLsQa8rvTHtsY9DyAk42XN3lvoKfzVy4bCz15nMVs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744251055; c=relaxed/simple; bh=5sZvu1Ou31lv+/54ZojKAArbpHA4yfQade2EkyUNAPQ=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=I3EaBw0cEqfZuYMcUfuxVi5zLa5S2FH/BUV1t5y4NfBenWSSlvv9kVKISRG9NT0ph2KGZC7ychlSZrCCO9nNeQowF0PAdC4i/RZ09eU2qBB62LXrscCIMvirnMZ4x89XCdb64UhKucus0rYQn5Or0SuEdsSQyfVwzYw2E+yZkmE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=jJTkMfsm; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="jJTkMfsm" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1744251052; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=oKGnpuPU7kD11I+yz1MSiqXQRYRxHKG0sX4OPHEaOys=; b=jJTkMfsm3jNU6r8V8/qATp1vMHc69YaRY9+BMjYf00fNWU19Rphyj0egPhxDRrmLNsTe/0 +ULYDKS3Ca5pCymPMenp5i+0qQFF5QDoWSATZMxy1/NGosiRD5qI3hiZc/uQLSuyTV3+RI FFNqTxgla7LFyWe4tSxMTEcDNLwNCfI= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-90-MZl1sp2iPjaNdrM7cCR_lQ-1; Wed, 09 Apr 2025 22:10:50 -0400 X-MC-Unique: MZl1sp2iPjaNdrM7cCR_lQ-1 X-Mimecast-MFC-AGG-ID: MZl1sp2iPjaNdrM7cCR_lQ_1744251049 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id C03171800EC5; Thu, 10 Apr 2025 02:10:48 +0000 (UTC) Received: from fedora (unknown [10.72.120.20]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id CEAE91809BF7; Thu, 10 Apr 2025 02:10:43 +0000 (UTC) Date: Thu, 10 Apr 2025 10:10:37 +0800 From: Ming Lei To: Nilay Shroff Cc: Christoph Hellwig , Jens Axboe , linux-block@vger.kernel.org, syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com Subject: Re: [PATCH] block: don't grab elevator lock during queue initialization Message-ID: References: <03ba309d-b266-4596-83aa-1731c6cc1cfb@linux.ibm.com> <5c2274c0-fd82-4752-b735-32e52de2a80f@linux.ibm.com> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5c2274c0-fd82-4752-b735-32e52de2a80f@linux.ibm.com> X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 On Thu, Apr 10, 2025 at 01:15:22AM +0530, Nilay Shroff wrote: > > > On 4/9/25 7:38 PM, Ming Lei wrote: > >>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues > >>>>> involved wrt. switching elevator. If elevator switching is not allowed > >>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set > >>>>> lock? > >>>>> > >>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. > >>>> However my question was if we were to not allow elevator switch while > >>>> __blk_mq_update_nr_hw_queues is running then how would we implement it? > >>> > >>> It can be done easily by tag_set->srcu. > >> Ok great if that's possible! But I'm not sure how it could be done in this > >> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store > >> run in the writer/updater context. So you may still need lock? Can you > >> please send across a (informal) patch with your idea ? > On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote: >> Yes new incoming NS may not be live yet when we iterate through >> ctrl->namespaces. So we don't need bother about it yet. If new NS needn't to be considered, the issue is easier to deal with updating nr_hw_queues, such as: - disable scheduler in 1st iterating tag_list - update nr_hw_queues in 2nd iterating tag_list - restore scheduler in 3rd iterating tag_list ->tag_list_lock is acquired & released in each iteration. Then per-set lock isn't needed any more. > > > > Please see the attached patch which treats elv_iosched_store() as reader. > > > I looked trough patch and looks good. However we still have an issue > in code paths where we acquire ->elevator_lock without first freezing > the queue.In this case if we try allocate memory using GFP_KERNEL > then fs_reclaim would still trigger lockdep splat. As for this case > the lock order would be, > > elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io) That is why I call ->elevator_lock is used too much, especially it is depended for dealing with kobject & sysfs, which isn't needed in this way. > > In fact, I tested your patch and got into the above splat. I've > attached lockdep splat for your reference. So IMO, we should instead > try make locking order such that ->freeze_lock shall never depend on > ->elevator_lock. It seems one way to make that possible if we make > ->elevator_lock per-set. The attached patch is just for avoiding race between add/del disk vs. updating nr_hw_queues, and it should have been for covering race between adding/exiting request queue vs. updating nr_hw_queues. If re-order can be done, that is definitely good, but... > > >>> > >>> Actually freeze lock is already held for nvme before calling > >>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue > >>> frozen for updating nr_hw_queues, so the above order may not match > >>> with the existed code. > >>> > >>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > >>> > >> I think we should consider (may be in different patch) updating > >> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove > >> its dependency on ->tag_list_lock. > > > > If we need to take nvme into account, the above lock order doesn't work, > > because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), > > and elevator lock still depends on freeze lock. > > > > If it needn't to be considered, per-set lock becomes necessary. > IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues > we shall also update nvme pci, rdma and tcp drivers so that those > drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues > nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that > nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues. This way you cause new deadlock or new trouble if you reply on freeze in blk_mq_update_nr_hw_queues(): ->tag_list_lock freeze_lock If timeout or io failure happens during the above freeze_lock, timeout handler can not move on because new blk_mq_update_nr_hw_queues() can't grab the lock. Either deadlock or device has to been removed. > > >>>> > >>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall > >>>> help avoid few of the recent lockdep splats reported with fs_reclaim. > >>>> What do you think? > >>> > >>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim > >>> related splat. > >>> > >>> However, in del_gendisk(), freeze_lock is still held before calling > >>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. > >> > >> Yes agreed, however elevator_exit() called from del_gendisk() or > >> elv_unregister_queue() called from blk_unregister_queue() are called > >> after we unregister the queue. And if queue has been already unregistered > >> while invoking elevator_exit or del_gensidk then ideally we don't need to > >> acquire ->elevator_lock. The same is true for elevator_exit() called > >> from add_disk_fwnode(). So IMO, we should update these paths to avoid > >> acquiring ->elevator_lock. > > > > As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one > > host wide event, so either lock or srcu sync is needed. > Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel > with del_gendisk or blk_unregister_queue. Then the per-set lock is required in both add/del disk, then how to re-order freeze_lock & elevator lock in del_gendisk()? And there is same risk with the one in blk_mq_update_nr_hw_queues(). > > Thanks, > --Nilay > [ 399.112145] ====================================================== > [ 399.112153] WARNING: possible circular locking dependency detected > [ 399.112160] 6.15.0-rc1+ #159 Not tainted > [ 399.112167] ------------------------------------------------------ > [ 399.112174] bash/5891 is trying to acquire lock: > [ 399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4 > [ 399.112205] > [ 399.112205] but task is already holding lock: > [ 399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 > [ 399.112231] > [ 399.112231] which lock already depends on the new lock. > [ 399.112231] > [ 399.112239] > [ 399.112239] the existing dependency chain (in reverse order) is: > [ 399.112246] > [ 399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}: > [ 399.112260] blk_alloc_queue+0x3a8/0x3e4 > [ 399.112268] blk_mq_alloc_queue+0x88/0x11c > [ 399.112278] __blk_mq_alloc_disk+0x34/0xd8 > [ 399.112290] null_add_dev+0x3c8/0x914 [null_blk] > [ 399.112306] null_init+0x1e0/0x4bc [null_blk] > [ 399.112319] do_one_initcall+0x8c/0x4b8 > [ 399.112340] do_init_module+0x7c/0x2c4 > [ 399.112396] init_module_from_file+0xb4/0x108 > [ 399.112405] idempotent_init_module+0x26c/0x368 > [ 399.112414] sys_finit_module+0x98/0x150 > [ 399.112422] system_call_exception+0x134/0x360 > [ 399.112430] system_call_vectored_common+0x15c/0x2ec > [ 399.112441] > [ 399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}: > [ 399.112453] fs_reclaim_acquire+0xe4/0x120 > [ 399.112461] kmem_cache_alloc_noprof+0x74/0x570 > [ 399.112469] __kernfs_new_node+0x98/0x37c > [ 399.112479] kernfs_new_node+0x94/0xe4 > [ 399.112490] kernfs_create_dir_ns+0x44/0x120 > [ 399.112501] sysfs_create_dir_ns+0x94/0x160 > [ 399.112512] kobject_add_internal+0xf4/0x3c8 > [ 399.112524] kobject_add+0x70/0x10c > [ 399.112533] elv_register_queue+0x70/0x14c > [ 399.112562] blk_register_queue+0x1d8/0x2bc > [ 399.112575] add_disk_fwnode+0x3b4/0x5d0 > [ 399.112588] sd_probe+0x3bc/0x5b4 [sd_mod] > [ 399.112601] really_probe+0x104/0x4c4 > [ 399.112613] __driver_probe_device+0xb8/0x200 > [ 399.112623] driver_probe_device+0x54/0x128 > [ 399.112632] __driver_attach_async_helper+0x7c/0x150 > [ 399.112641] async_run_entry_fn+0x60/0x1bc > [ 399.112665] process_one_work+0x2ac/0x7e4 > [ 399.112678] worker_thread+0x238/0x460 > [ 399.112691] kthread+0x158/0x188 > [ 399.112700] start_kernel_thread+0x14/0x18 > [ 399.112722] > [ 399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}: Another ways is to make sure that ->elevator_lock isn't required for dealing with kobject/debugfs thing, which needs to refactor elevator code. Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs actually need to drain reader, that is deadlock too. Thanks, Ming