From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <16ed195d-7f26-41bb-86d6-ea961b3fa691@linux.ibm.com>
Date: Tue, 8 Apr 2025 12:32:50 +0530
From: Nilay Shroff
To: Tejun Heo
Cc: Jakub Kicinski, Jens Axboe, cgroups@vger.kernel.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, Christoph Hellwig, Ming Lei
Subject: Re: [RESEND] Circular locking dependency involving, elevator_lock, rq_qos_mutex and pcpu_alloc_mutex
X-Mailing-List: linux-block@vger.kernel.org
Content-Type: text/plain; charset=UTF-8

Adding Ming and Christoph in Cc.

On 4/8/25 12:43 AM, Tejun Heo wrote:
> [sorry, lost cc list somehow, resending]
>
> Hello,
>
> Jakub reports the following lockdep splat. It looks like q_usage_counter
> somehow depends on elevator_lock. After your recent changes, the iocost init
> path performs a memory allocation while holding elevator_lock, completing the
> circular dependency.
>
> I don't understand the q_usage_counter -> elevator_lock dependency. Where is
> that coming from? Ah, that's q->io_lockdep_map, not the percpu_ref itself. I
> think it's the elevator switch acquiring elevator_lock while the queue is
> frozen, which makes elevator_lock depended on from the IO path, and thus you
> can't do reclaiming memory allocations while holding it.
>
> The involved commits are:
>
> 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")
> 9730763f4756 ("block: correct locking order for protecting blk-wbt parameters")
>
> Can you please take a look? It looks like the second one is expanding the
> locking scopes too wide.
>
> Thanks.
>
> [ 139.119772] fb-cgroups-setu/1238 is trying to acquire lock:
> [ 139.119776] ffffffff867ca448 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x96f/0x1000
> [ 139.169460]
> but task is already holding lock:
> [ 139.169462] ffff88813ba10298 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x218/0x2b0
> [ 139.217563]
> which lock already depends on the new lock.
> [ 139.217566]
> the existing dependency chain (in reverse order) is:
> [ 139.217568]
> -> #4 (&q->rq_qos_mutex){+.+.}-{4:4}:
> [ 139.217577] __mutex_lock+0x17b/0x17c0
> [ 139.217587] blkg_conf_open_bdev_frozen+0x218/0x2b0
> [ 139.280864] ioc_qos_write+0xc9/0xbc0
> [ 139.280870] cgroup_file_write+0x1a3/0x6f0
> [ 139.280878] kernfs_fop_write_iter+0x350/0x520
> [ 139.280885] vfs_write+0x9b2/0xf50
> [ 139.280891] ksys_write+0xf3/0x1d0
> [ 139.280896] do_syscall_64+0x6e/0x190
> [ 139.280901] entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [ 139.280906]
> -> #3 (&q->elevator_lock){+.+.}-{4:4}:
> [ 139.280915] __mutex_lock+0x17b/0x17c0
> [ 139.280919] blkg_conf_open_bdev_frozen+0x1c8/0x2b0
> [ 139.280925] ioc_qos_write+0xc9/0xbc0
> [ 139.280929] cgroup_file_write+0x1a3/0x6f0
> [ 139.280934] kernfs_fop_write_iter+0x350/0x520
> [ 139.280938] vfs_write+0x9b2/0xf50
> [ 139.280943] ksys_write+0xf3/0x1d0
> [ 139.280947] do_syscall_64+0x6e/0x190
> [ 139.280952] entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [ 139.280956]
> -> #2 (&q->q_usage_counter(io)#2){++++}-{0:0}:
> [ 139.280965] blk_alloc_queue+0x5c1/0x700
> [ 139.280971] blk_mq_alloc_queue+0x14c/0x230
> [ 139.280978] __blk_mq_alloc_disk+0x15/0xc0
> [ 139.280983] nvme_alloc_ns+0x21d/0x30f0
> [ 139.280988] nvme_scan_ns+0x4f1/0x850
> [ 139.280991] async_run_entry_fn+0x93/0x4f0
> [ 139.280997] process_one_work+0x89e/0x1910
> [ 139.281001] worker_thread+0x58d/0xcf0
> [ 139.281005] kthread+0x3d5/0x7a0
> [ 139.281010] ret_from_fork+0x2d/0x70
> [ 139.281016] ret_from_fork_asm+0x11/0x20
> [ 139.281023]
> -> #1 (fs_reclaim){+.+.}-{0:0}:
> [ 139.281031] fs_reclaim_acquire+0xff/0x150
> [ 139.281037] __kmalloc_noprof+0xa9/0x5f0
> [ 139.281042] pcpu_create_chunk+0x23/0x6e0
> [ 139.281049] pcpu_alloc_noprof+0xd34/0x1000
> [ 139.281054] bts_init+0xaa/0x180
> [ 139.281060] do_one_initcall+0xfa/0x500
> [ 139.281065] kernel_init_freeable+0x4af/0x6d0
> [ 139.281070] kernel_init+0x1b/0x1d0
> [ 139.281074] ret_from_fork+0x2d/0x70
> [ 139.281078] ret_from_fork_asm+0x11/0x20
> [ 139.281083]
> -> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
> [ 139.281091] __lock_acquire+0x1569/0x2640
> [ 139.281097] lock_acquire+0x179/0x330
> [ 139.281102] __mutex_lock+0x17b/0x17c0
> [ 139.281106] pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281111] blk_iocost_init+0x6f/0x820
> [ 139.281116] ioc_qos_write+0x468/0xbc0
> [ 139.281120] cgroup_file_write+0x1a3/0x6f0
> [ 139.281125] kernfs_fop_write_iter+0x350/0x520
> [ 139.281130] vfs_write+0x9b2/0xf50
> [ 139.281134] ksys_write+0xf3/0x1d0
> [ 139.281138] do_syscall_64+0x6e/0x190
> [ 139.281143] entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [ 139.281147]
> other info that might help us debug this:
> [ 139.281149] Chain exists of:
> pcpu_alloc_mutex --> &q->elevator_lock --> &q->rq_qos_mutex
> [ 139.281158] Possible unsafe locking scenario:
> [ 139.281159]        CPU0                    CPU1
> [ 139.281161]        ----                    ----
> [ 139.281162]   lock(&q->rq_qos_mutex);
> [ 139.281166]                               lock(&q->elevator_lock);
> [ 139.281170]                               lock(&q->rq_qos_mutex);
> [ 139.281174]   lock(pcpu_alloc_mutex);
> [ 139.281178]
> *** DEADLOCK ***
> [ 139.281179] 8 locks held by fb-cgroups-setu/1238:
> [ 139.281183] #0: ffff888114b3aaf8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x22c/0x2e0
> [ 139.281197] #1: ffff888148a7c400 (sb_writers#8){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0
> [ 139.281210] #2: ffff88819c5a1088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x212/0x520
> [ 139.281223] #3: ffff888124ec8b48 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x235/0x520
> [ 139.281236] #4: ffff88813ba100a8 (&q->q_usage_counter(io)#2){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0xe/0x20
> [ 139.281249] #5: ffff88813ba100e0 (&q->q_usage_counter(queue)#2){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0xe/0x20
> [ 139.281262] #6: ffff88813ba105c0 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x1c8/0x2b0
> [ 139.281275] #7: ffff88813ba10298 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x218/0x2b0
> [ 139.281286]
> stack backtrace:
> [ 139.281291] CPU: 34 UID: 0 PID: 1238 Comm: fb-cgroups-setu Tainted: G N 6.14.0-13254-g2ecc111972cc #114 PREEMPT(undef)
> [ 139.281299] Tainted: [N]=TEST
> [ 139.281301] Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
> [ 139.281304] Call Trace:
> [ 139.281307]
> [ 139.281309] dump_stack_lvl+0x7e/0xc0
> [ 139.281318] print_circular_bug+0x2d8/0x410
> [ 139.281326] check_noncircular+0x12b/0x140
> [ 139.281336] __lock_acquire+0x1569/0x2640
> [ 139.281348] lock_acquire+0x179/0x330
> [ 139.281353] ? pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281365] __mutex_lock+0x17b/0x17c0
> [ 139.281370] ? pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281375] ? __kasan_kmalloc+0x77/0x90
> [ 139.281380] ? ioc_qos_write+0x468/0xbc0
> [ 139.281384] ? cgroup_file_write+0x1a3/0x6f0
> [ 139.281390] ? pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281395] ? ksys_write+0xf3/0x1d0
> [ 139.281399] ? do_syscall_64+0x6e/0x190
> [ 139.281404] ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [ 139.281411] ? mutex_lock_io_nested+0x1570/0x1570
> [ 139.281418] ? do_raw_spin_lock+0x12c/0x270
> [ 139.281425] ? find_held_lock+0x2b/0x80
> [ 139.281432] ? mark_held_locks+0x49/0x70
> [ 139.281437] ? _raw_spin_unlock_irqrestore+0x55/0x70
> [ 139.281442] ? lockdep_hardirqs_on+0x78/0x100
> [ 139.281449] ? pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281454] pcpu_alloc_noprof+0x96f/0x1000
> [ 139.281465] ? kasan_save_track+0x10/0x30
> [ 139.281471] blk_iocost_init+0x6f/0x820
> [ 139.281480] ioc_qos_write+0x468/0xbc0
> [ 139.281485] ? __lock_acquire+0x42c/0x2640
> [ 139.281494] ? ioc_cost_model_write+0x7a0/0x7a0
> [ 139.281501] ? __lock_acquire+0x42c/0x2640
> [ 139.281509] ? rcu_is_watching+0x11/0xb0
> [ 139.281519] ? find_held_lock+0x2b/0x80
> [ 139.281525] ? kernfs_root+0xb2/0x1c0
> [ 139.281532] ? kernfs_root+0xbc/0x1c0
> [ 139.281539] cgroup_file_write+0x1a3/0x6f0
> [ 139.281546] ? cgroup_addrm_files+0xa90/0xa90
> [ 139.281552] ? __virt_addr_valid+0x1e1/0x3c0
> [ 139.281563] ? cgroup_addrm_files+0xa90/0xa90
> [ 139.281568] kernfs_fop_write_iter+0x350/0x520
> [ 139.281576] vfs_write+0x9b2/0xf50
> [ 139.281583] ? kernel_write+0x550/0x550
> [ 139.281600] ksys_write+0xf3/0x1d0
> [ 139.281606] ? __ia32_sys_read+0xa0/0xa0
> [ 139.281611] ? rcu_is_watching+0x11/0xb0
> [ 139.281620] do_syscall_64+0x6e/0x190
> [ 139.281626] entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [ 139.281631] RIP: 0033:0x7f9d58116f8d
> [ 139.281643] Code: e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 a8 ca f7 ff 41 89 c0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3b 44 89 c7 48 89 45 f8 e8 df ca f7 ff 48 8b
> [ 139.281648] RSP: 002b:00007ffcdd0a6890 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 139.281653] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f9d58116f8d
> [ 139.281656] RDX: 0000000000000043 RSI: 00007f9d56ecf200 RDI: 0000000000000007
> [ 139.281659] RBP: 00007ffcdd0a68b0 R08: 0000000000000000 R09: 00007f9d57a19010
> [ 139.281662] R10: 00007f9d5800afd0 R11: 0000000000000293 R12: 0000000000000043
> [ 139.281665] R13: 0000000000000007 R14: 00007ffcdd0a70f0 R15: 0000000000000000
> [ 139.281676]
>

Thanks for the report! This is now a known issue, and we're currently working
on cutting the dependency of q->io_lockdep_map (i.e. the queue freeze-lock)
on q->elevator_lock. Please see:

https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
https://lore.kernel.org/all/20250403105402.1334206-1-ming.lei@redhat.com/

Thanks,
--Nilay
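
P.S. For anyone skimming the splat, the ordering on the ioc_qos_write() side
condenses to roughly the sketch below. This is purely illustrative shorthand
for the call chains already shown in the splat above, not the literal code or
exact function signatures:

    /* io.cost.qos writer (ioc_qos_write() path), roughly: */
    blk_mq_freeze_queue_nomemsave(q);   /* q_usage_counter(io) held          */
    mutex_lock(&q->elevator_lock);      /* elevator_lock now nests inside a  */
                                        /* frozen queue, i.e. inside the IO  */
                                        /* path                              */
    mutex_lock(&q->rq_qos_mutex);
    blk_iocost_init(disk);              /* -> pcpu_alloc_noprof(GFP_KERNEL):
                                         *    takes pcpu_alloc_mutex and may
                                         *    enter fs_reclaim, which lockdep
                                         *    already ties back to
                                         *    q_usage_counter(io), i.e. the
                                         *    queue freeze -- closing the
                                         *    cycle                          */

So once elevator_lock and rq_qos_mutex are taken with the queue frozen, any
reclaim-capable allocation under them loops back to the freeze, which is
exactly the dependency the series linked above is trying to cut.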