From: Kevin Wolf <kwolf@redhat.com>
To: Hanna Czenczek <hreitz@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
Paolo Bonzini <pbonzini@redhat.com>,
Fiona Ebner <f.ebner@proxmox.com>, John Snow <jsnow@redhat.com>,
levon@movementarian.org
Subject: Re: [PATCH v2] block-backend: Add new bds_io_in_flight counter
Date: Tue, 19 Sep 2023 15:37:20 +0200 [thread overview]
Message-ID: <ZQmkEAbcVrOPHL6i@redhat.com> (raw)
In-Reply-To: <20230331162335.27518-1-hreitz@redhat.com>
Am 31.03.2023 um 18:23 hat Hanna Czenczek geschrieben:
> IDE TRIM is a BB user that wants to elevate its BB's in-flight counter
> for a "macro" operation that consists of several actual I/O operations.
> Each of those operations is individually started and awaited. It does
> this so that blk_drain() will drain the whole TRIM, and not just a
> single one of the many discard operations it may encompass.
>
> When request queuing is enabled, this leads to a deadlock: The currently
> ongoing discard is drained, and the next one is queued, waiting for the
> drain to stop. Meanwhile, TRIM still keeps the in-flight counter
> elevated, waiting for all discards to stop -- which will never happen,
> because with the in-flight counter elevated, the BB is never considered
> drained, so the drained section does not begin and cannot end.
Alright, let's have another look at this now that another similar
deadlock was reported:
https://lists.gnu.org/archive/html/qemu-block/2023-09/msg00536.html
> There are two separate cases to look at here, namely bdrv_drain*() and
> blk_drain*(). As said above, we do want blk_drain*() to settle the
> whole operation: The only way to do so is to disable request queuing,
> then. So, we do that: Have blk_drain() and blk_drain_all() temporarily
> disable request queuing, which prevents the deadlock.
Two separate cases with two separate fixes suggests that it could be
two separate patches. I feel the blk_*() case is uncontroversial and it
would fix John's case, so splitting wouldn't only make this easier to
understand, but could mean that we can fix a useful subset earlier.
> (The devil's in the details, though: blk_drain_all() runs
> bdrv_drain_all_begin() first, so when we get to the individual BB, there
> may already be queued requests. Therefore, we have to not only disable
> request queuing then, but resume all already-queued requests, too.)
Why can't we just disable request queuing before calling bdrv_drain_*()?
Is it possible that the same problem occurs because someone else already
called bdrv_drain_*()? That is, blk_drain_*() would be called from a
callback in the nested event loop in bdrv_drain_*()? If so, we can't
avoid that there are already queued requests.
Restarting them seems correct anyway.
> For bdrv_drain*(), we want request queuing -- and macro requests such as
> IDE's TRIM request do not matter. bdrv_drain*() wants to keep I/O
> requests from BDS nodes, and the TRIM does not issue such requests; it
> instead does so through blk_*() functions, which themselves elevate the
> BB's in-flight counter. So the idea is to drain (and potentially queue)
> those blk_*() requests, but completely ignore the TRIM.
>
> We can do that by splitting a new counter off of the existing BB
> counter: The new bds_io_in_flight counter counts all those blk_*()
> requests that can issue I/O to a BDS (so must be drained by
> bdrv_drain*()), but will never block waiting on another request on the
> BB.
You end up changing all of the existing blk_inc_in_flight() callers
except those in IDE and virtio-blk.
That makes me wonder if it shouldn't be approached the other way around:
BlockBackend users that want to be included in drain should use a
special function blk_inc_in_flight_external() or something that wouldn't
increase blk->in_flight, but only a new separate counter. And then only
blk_drain*() wait for it (by extending the AIO_WAIT_WHILE() condition),
but not the child_root callbacks.
This would give us more directly the semantics that we actually need:
The root BDS doesn't care if the operation on the device level has
completed as long as nothing new arrives, only external callers which
use blk_drain*() do. I believe internal/external is easier to reason
about than "requests that can issue I/O to a BDS [directly]", it keeps
the external callers the special ones that need extra care while the
normal I/O path is unaffected, and it would make the patch much smaller.
> In blk_drain*(), we disable request queuing and settle all requests (the
> full in_flight count). In bdrv_drain*() (i.e. blk_root_drained_poll()),
> we only settle bds_io_in_flight_count, ignoring all requests that will
> not directly issue I/O requests to BDS nodes.
>
> Reported-by: Fiona Ebner <f.ebner@proxmox.com>
> Fixes: 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca
> ("ide: Increment BB in-flight counter for TRIM BH")
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Kevin
prev parent reply other threads:[~2023-09-19 13:38 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-31 16:23 [PATCH v2] block-backend: Add new bds_io_in_flight counter Hanna Czenczek
2023-09-19 13:37 ` Kevin Wolf [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZQmkEAbcVrOPHL6i@redhat.com \
--to=kwolf@redhat.com \
--cc=f.ebner@proxmox.com \
--cc=hreitz@redhat.com \
--cc=jsnow@redhat.com \
--cc=levon@movementarian.org \
--cc=pbonzini@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.