From: Hanna Czenczek <hreitz@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Richard W . M . Jones" <rjones@redhat.com>,
"Ilya Dryomov" <idryomov@gmail.com>,
"Peter Lieven" <pl@dlhnet.de>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Alex Bennée" <alex.bennee@linaro.org>,
"Fam Zheng" <fam@euphon.net>,
"Ronnie Sahlberg" <ronniesahlberg@gmail.com>
Subject: Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Date: Fri, 31 Oct 2025 10:29:07 +0100
Message-ID: <0d98f477-722c-4023-9b28-54d8faffff66@redhat.com>
In-Reply-To: <aQJ30-Ifcji8lrme@redhat.com>
On 29.10.25 21:23, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> The cache-cleaner runs as a timer CB in the BDS AioContext. With
>> multiqueue, it can run concurrently to I/O requests, and because it does
>> not take any lock, this can break concurrent cache accesses, corrupting
>> the image. While the chances of this happening are low, it can be
>> reproduced e.g. by modifying the code to schedule the timer CB every
>> 5 ms (instead of at most once per second) and modifying the last (inner)
>> while loop of qcow2_cache_clean_unused() like so:
>>
>>     while (i < c->size && can_clean_entry(c, i)) {
>>         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>>             usleep(100);
>>         }
>>         c->entries[i].offset = 0;
>>         c->entries[i].lru_counter = 0;
>>         i++;
>>         to_clean++;
>>     }
>>
>> i.e. making it wait on purpose for the point in time where the cache is
>> in use by something else.
>>
>> The solution chosen for this in this patch is not the best solution, I
>> hope, but I admittedly can’t come up with anything strictly better.
>>
>> We can protect from concurrent cache accesses either by taking the
>> existing s->lock, or we introduce a new (non-coroutine) mutex
>> specifically for cache accesses. I would prefer to avoid the latter so
>> as not to introduce additional (very slight) overhead.
> In theory, the old plan was that eventually qcow2 would use fine grained
> locks instead of the single s->lock, and having a separate cache lock
> would be a step towards it. But if we never actually make use of it to
> hold s->lock for a shorter time, that's not really a good argument. I'm
> not sure if that's ever going to happen unless for a rewrite in Rust or
> something.
>
> I never tried to measure specifically if lock contention is a problem
> with high queue depth and random I/O on a huge disk. Intuitively,
> holding s->lock while doing I/O for loading entries into the cache can't
> be really good.
>
> Anyway, I went a bit on a tangent there...
>
>> Using s->lock, which is a coroutine mutex, however means that we need to
>> take it in a coroutine, so the timer CB must enter such a coroutine. As
>> a result, descheduling the timer is no longer a guarantee that the
>> cache-cleaner will not run, because it may now be yielding in
>> qemu_co_mutex_lock().
> I think creating a coroutine in cache_clean_timer_cb() is the wrong
> approach. Instead, cache_clean_timer_init() could create a coroutine
> and its implementation could be something like this:
>
> while (!s->cache_clean_timer_stopping) {
>     qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
>                               QEMU_CLOCK_VIRTUAL,
>                               s->cache_clean_interval * NANOSECONDS_PER_SECOND);
>
>     qemu_co_mutex_lock(&s->lock);
>     qcow2_cache_clean_unused(s->l2_table_cache);
>     qcow2_cache_clean_unused(s->refcount_block_cache);
>     qemu_co_mutex_unlock(&s->lock);
> }
> s->cache_clean_timer_stopping = false;
Ah, that’s nicer.  I think we can replace the flag with a check for
s->cache_clean_interval > 0, and add a CoQueue to wake up waiting
coroutines.
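
Roughly what I have in mind, as an untested sketch (cache_clean_co,
cache_clean_wake and cache_clean_waiters are placeholder names for the
background coroutine, a QemuCoSleep, and a CoQueue in BDRVQcow2State):

static void coroutine_fn cache_clean_co(void *opaque)
{
    BlockDriverState *bs = opaque;
    BDRVQcow2State *s = bs->opaque;

    while (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake, QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);
        if (s->cache_clean_interval == 0) {
            /* Woken up early because the cleaner is being shut down */
            break;
        }

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }

    /* Tell anyone waiting for the cleaner to settle that it is done */
    qemu_co_mutex_lock(&s->lock);
    s->cache_clean_running = false;
    qemu_co_queue_restart_all(&s->cache_clean_waiters);
    qemu_co_mutex_unlock(&s->lock);
}
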
>> (Note even now this was only guaranteed for cache_clean_timer_del()
>> callers that run in the BDS (the timer’s) AioContext. For callers
>> running in the main context, the problem may have already existed,
>> though maybe the BQL prevents timers from running in other contexts, I’m
>> not sure.)
>>
>> Polling to await the timer to actually settle seems very complicated for
>> something that’s rather a minor problem, but I can’t come up with any
>> better solution that doesn’t again just overlook potential problems.
>>
>> (One cleaner idea may be to have a generic way to have timers run
>> coroutines, and to await those when descheduling the timer. But while
>> cleaner, it would also be more complicated, and I don’t think worth it
>> at this point.)
>>
>> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
>> I’m not too fond of this solution.)
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/qcow2.h | 1 +
>> block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
>> 2 files changed, 79 insertions(+), 12 deletions(-)
>> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
>>      }
>>  }
>>
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext, holding s->lock.
>> + */
>> +static void coroutine_fn
>> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    IO_OR_GS_CODE();
>> +    cache_clean_timer_del(bs);
>> +    if (qatomic_read(&s->cache_clean_running)) {
>> +        qemu_co_mutex_unlock(&s->lock);
>> +        qatomic_set(&s->cache_clean_polling, true);
>> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> Polling in a coroutine_fn is verboten.
>
> If we do need this function, I think it would be a yield here and a wake
> on the other side. I think we might be able to get around it if we move
> the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> coroutine). A bit ugly, so your choice.
We can let a CoQueue do the waking, no?
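
E.g. something like this on the waiting side (again just a sketch, with
the same placeholder names as above):

/* Stops the cleaner and waits for it to settle; called with s->lock held */
static void coroutine_fn
cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
{
    BDRVQcow2State *s = bs->opaque;

    s->cache_clean_interval = 0;
    qemu_co_sleep_wake(&s->cache_clean_wake);
    while (s->cache_clean_running) {
        /* Drops s->lock while waiting, re-takes it before returning */
        qemu_co_queue_wait(&s->cache_clean_waiters, &s->lock);
    }
}
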
>> +        qemu_co_mutex_lock(&s->lock);
>> +    }
>> +}
>> +
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext without s->lock held.
>> + */
>> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    IO_OR_GS_CODE();
>> +    cache_clean_timer_del(bs);
>> +    if (qatomic_read(&s->cache_clean_running)) {
>> +        qatomic_set(&s->cache_clean_polling, true);
>> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
>> +    }
>> +}
>> +
>>  static void qcow2_detach_aio_context(BlockDriverState *bs)
>>  {
>>      cache_clean_timer_del(bs);
>> @@ -1214,12 +1273,20 @@ fail:
>>      return ret;
>>  }
>>
>> +/* s_locked specifies whether s->lock is held or not */
>>  static void qcow2_update_options_commit(BlockDriverState *bs,
>> -                                        Qcow2ReopenState *r)
>> +                                        Qcow2ReopenState *r,
>> +                                        bool s_locked)
>>  {
>>      BDRVQcow2State *s = bs->opaque;
>>      int i;
>>
>> +    if (s_locked) {
>> +        cache_clean_timer_locked_co_del_and_wait(bs);
>> +    } else {
>> +        cache_clean_timer_del_and_wait(bs);
>> +    }
>> +
>>      if (s->l2_table_cache) {
>>          qcow2_cache_destroy(s->l2_table_cache);
>>      }
>> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>      }
>>      s->l2_table_cache = r->l2_table_cache;
>>      s->refcount_block_cache = r->refcount_block_cache;
>> +
>> +    s->cache_clean_interval = r->cache_clean_interval;
>> +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> +
>>      s->l2_slice_size = r->l2_slice_size;
>>
>>      s->overlap_check = r->overlap_check;
>> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>
>>      s->discard_no_unref = r->discard_no_unref;
>>
>> -    if (s->cache_clean_interval != r->cache_clean_interval) {
>> -        cache_clean_timer_del(bs);
>> -        s->cache_clean_interval = r->cache_clean_interval;
>> -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> -    }
>> -
> I think the del/init pair here won't be necessary any more after
> switching to the background coroutine. It will just start using the new
> value of s->cache_clean_interval the next time it sleeps.
One problem is that if we don’t take s->lock, the coroutine can read
s->l2_table_cache and s->refcount_block_cache while they are invalid,
which is why I moved the deletion to the top of
qcow2_update_options_commit().  We also still need to stop the cleaner
when the interval is set to 0 (or special-case that in the coroutine so
it waits forever).
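
If we go with special-casing 0, the sleep at the top of the background
coroutine’s loop could become something like this (sketch only;
qemu_co_sleep() just waits until qemu_co_sleep_wake() is called):

    if (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake, QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);
    } else {
        /* interval == 0: nothing to clean, wait until we are woken */
        qemu_co_sleep(&s->cache_clean_wake);
    }
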
We could run all of this in a coroutine so we can take s->lock, or we
would have to force-stop the timer/coroutine at the start.  Maybe
running it in a coroutine is better…
Hanna