Date: Tue, 18 Nov 2025 23:42:24 +0100
From: Kevin Wolf <kwolf@redhat.com>
To: andrey.drobyshev@virtuozzo.com
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, f.ebner@proxmox.com,
    hreitz@redhat.com, den@virtuozzo.com
Subject: Re: [BUG] Block graph assertion failure on blockdev-add
In-Reply-To: <20251112164108.506976-1-andrey.drobyshev@virtuozzo.com>

On 12.11.2025 at 17:41, andrey.drobyshev@virtuozzo.com wrote:
> From: Andrey Drobyshev
>
> Hi all,
>
> There's a bug in the block layer which leads to an assertion failure
> and crash. The reproducer is flaky, which suggests there's a race
> involved. Here's how it's reproduced:
>
> 1. Run QEMU with 2 disks: one system disk (with an actual guest) and
>    one empty; attach them to an iothread;
> 2. In the guest, format the empty disk and start sequential IO on it;
> 3. On the host, start growing a chain of snapshots on top of the empty
>    image.
>
> Here are the scripts:
> [...]
> So draining doesn't work as expected: in-flight requests are supposed
> to be polled right before calling bdrv_drain_assert_idle(), but there
> is new IO arriving:
>
> void coroutine_mixed_fn bdrv_drain_all_begin(void)
> {
>     ...
>     bdrv_drain_all_begin_nopoll();
>
>     /* Now poll the in-flight requests */
>     AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());   <---------
>
>     while ((bs = bdrv_next_all_states(bs))) {
>         bdrv_drain_assert_idle(bs);
>     }
> }

Took me a bit, but I can reproduce the problem. Yes, "new" I/O arrives
between AIO_WAIT_WHILE_UNLOCKED() and the assertion, which means that
drain itself doesn't work correctly. After some debugging, I ended up
adding some new assertions that would trigger when the new I/O is added
during this period, and what I caught is this (same in multiple runs):

(gdb) qemu bt
#0  __GI_abort () at abort.c:95
#1  0x00007f99cb823639 in __assert_fail_base (fmt=, assertion=, file=,
    line=, function=) at assert.c:118
#2  0x00007f99cb8340af in __assert_fail (assertion=, file=, line=,
    function=) at assert.c:127
#3  0x0000557c2a0d64c6 in blk_wait_while_drained (blk=0x557c69d18020)
    at ../block/block-backend.c:1325
#4  0x0000557c2a0d5069 in blk_co_do_pwritev_part (blk=0x1,
    offset=140292968848728, bytes=0, qiov=0x7f99b4004568,
    qiov_offset=qiov_offset@entry=0, flags=3416308576)
    at ../block/block-backend.c:1417
#5  0x0000557c2a0d55ae in blk_aio_write_entry (opaque=0x7f99b4002020)
    at ../block/block-backend.c:1635
#6  0x0000557c2a30a376 in coroutine_trampoline (i0=, i1=)
    at ../util/coroutine-ucontext.c:175
#7  0x00007f99cb856f70 in ??? ()
    at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66
#8  0x00007f99c35f0260 in ??? ()
#9  0x0000000000000000 in ??? ()
[Thread 0x7f98a94fc6c0 (LWP 321021) exited]
[Thread 0x7f99c2dfd6c0 (LWP 320958) exited]

Coroutine at 0x7f99c35fe588:
#0  qemu_coroutine_switch (from_=from_@entry=0x7f99c35fe588,
    to_=to_@entry=0x7f99b40051f0, action=action@entry=COROUTINE_ENTER)
    at ../util/coroutine-ucontext.c:322
#1  0x0000557c2a3089e8 in qemu_aio_coroutine_enter
    (ctx=ctx@entry=0x557c68d5cae0, co=) at ../util/qemu-coroutine.c:293
#2  0x0000557c2a3071f7 in co_schedule_bh_cb (opaque=0x557c68d5cae0)
    at ../util/async.c:547
#3  0x0000557c2a306b79 in aio_bh_call (bh=0x557c68d822d0)
    at ../util/async.c:172
#4  aio_bh_poll (ctx=ctx@entry=0x557c68d5cae0) at ../util/async.c:219
#5  0x0000557c2a2ee9e5 in aio_poll (ctx=0x557c68d5cae0, blocking=)
    at ../util/aio-posix.c:719
#6  0x0000557c2a0ac032 in iothread_run (opaque=opaque@entry=0x557c68d810a0)
    at ../iothread.c:63
#7  0x0000557c2a2f2aee in qemu_thread_start (args=)
    at ../util/qemu-thread-posix.c:393
#8  0x00007f99cb893f54 in start_thread (arg=) at pthread_create.c:448
#9  0x00007f99cb91732c in __GI___clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This suggests that the following happens:

1. A drain section starts.

2. A new request still arrives at the BlockBackend from the iothread and
   is queued in blk_wait_while_drained(). This alone is already
   interesting, because we thought that the BlockBackend queuing was
   only necessary for IDE and friends, not for virtio-blk. That is
   probably only true in the main thread.

3. The drain section ends, blk_root_drained_end() calls
   qemu_co_enter_next(), which is effectively aio_co_wake(). As the
   request coroutine is in the iothread and blk_root_drained_end() runs
   in the main thread, the coroutine is only scheduled in its
   AioContext, but doesn't run yet.

4. Another drain section starts:
4a. bdrv_drain_all_begin_nopoll()
4b. AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());
5. The iothread processes its BH list and reenters the queued request,
   which increases blk->in_flight and bs->in_flight and runs the actual
   request even though we're drained again. Oops.
4c. bdrv_drain_assert_idle() fails.

After a simple s/if/while/ in blk_wait_while_drained(), I can't
reproduce the bug any more. I don't think it's a fully correct fix,
because blk->in_flight is still increased for a short time and that
could in theory still trigger the assertion. It's just that the window
is now much smaller because the request doesn't actually execute, but
is requeued right away. I need to think a bit more about the right
synchronisation tomorrow.

> I've bisected the issue, and it seems to have been introduced by
> Fiona's series [0] for fixing a deadlock. Namely, before commit
> b13f546545 ("block: move drain outside of bdrv_root_unref_child()"),
> my reproducer above produces a deadlock similar to what we had in [1].
> And after commit b13f546545 is applied, we get the crash.
>
> Attached is a naive fix which simply eliminates global draining (i.e.
> draining of the whole block graph) in the problematic codepaths. While
> global draining might indeed be redundant there, the real problem,
> i.e. the race, still remains.

Yes, we don't want to just paper over the bug.

Kevin
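
For reference, the s/if/while/ change mentioned above would look roughly
like the sketch below. It assumes the queued_requests_lock-based shape
of blk_wait_while_drained() in block/block-backend.c (the field and
helper names are recalled from that version, not taken from this thread,
so the code in the tree may differ in detail), and as noted it only
narrows the race window rather than properly synchronising the requeued
request with drain:

    /* To be called between exactly one pair of blk_inc/dec_in_flight() */
    static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
    {
        assert(blk->in_flight > 0);

        /*
         * s/if/while/: re-check the quiesce counter after being woken
         * up. The coroutine may only get to run after the next drained
         * section has already started (step 5 above), in which case it
         * has to queue itself again instead of executing the request.
         */
        while (qatomic_read(&blk->quiesce_counter) &&
               !blk->disable_request_queuing) {
            /*
             * Take the lock before decrementing the in-flight counter
             * so the draining thread waits for us to enqueue ourselves
             * before it can leave the drained section.
             */
            qemu_mutex_lock(&blk->queued_requests_lock);
            blk_dec_in_flight(blk);
            qemu_co_queue_wait(&blk->queued_requests,
                               &blk->queued_requests_lock);
            blk_inc_in_flight(blk);
            qemu_mutex_unlock(&blk->queued_requests_lock);
        }
    }

With the loop, a request that blk_root_drained_end() wakes up but that
only gets scheduled after the next bdrv_drain_all_begin_nopoll() goes
back to sleep on blk->queued_requests right away; blk->in_flight is
still bumped briefly around the re-check, which is why this shrinks the
window instead of closing it.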