Re: [PATCH 10/16] qcow2: Fix cache_clean_timer

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: Hanna Czenczek <hreitz@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Richard W . M . Jones" <rjones@redhat.com>,
	"Ilya Dryomov" <idryomov@gmail.com>,
	"Peter Lieven" <pl@dlhnet.de>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Fam Zheng" <fam@euphon.net>,
	"Ronnie Sahlberg" <ronniesahlberg@gmail.com>
Subject: Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Date: Fri, 31 Oct 2025 14:03:23 +0100	[thread overview]
Message-ID: <aQSzm_LOySEpFmuG@redhat.com> (raw)
In-Reply-To: <0d98f477-722c-4023-9b28-54d8faffff66@redhat.com>

Am 31.10.2025 um 10:29 hat Hanna Czenczek geschrieben:
> On 29.10.25 21:23, Kevin Wolf wrote:
> > Am 28.10.2025 um 17:33 hat Hanna Czenczek geschrieben:
> > > The cache-cleaner runs as a timer CB in the BDS AioContext.  With
> > > multiqueue, it can run concurrently to I/O requests, and because it does
> > > not take any lock, this can break concurrent cache accesses, corrupting
> > > the image.  While the chances of this happening are low, it can be
> > > reproduced e.g. by modifying the code to schedule the timer CB every
> > > 5 ms (instead of at most once per second) and modifying the last (inner)
> > > while loop of qcow2_cache_clean_unused() like so:
> > > 
> > >      while (i < c->size && can_clean_entry(c, i)) {
> > >          for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> > >              usleep(100);
> > >          }
> > >          c->entries[i].offset = 0;
> > >          c->entries[i].lru_counter = 0;
> > >          i++;
> > >          to_clean++;
> > >      }
> > > 
> > > i.e. making it wait on purpose for the point in time where the cache is
> > > in use by something else.
> > > 
> > > The solution chosen for this in this patch is not the best solution, I
> > > hope, but I admittedly can’t come up with anything strictly better.
> > > 
> > > We can protect from concurrent cache accesses either by taking the
> > > existing s->lock, or we introduce a new (non-coroutine) mutex
> > > specifically for cache accesses.  I would prefer to avoid the latter so
> > > as not to introduce additional (very slight) overhead.
> > In theory, the old plan was that eventually qcow2 would use fine grained
> > locks instead of the single s->lock, and having a separate cache lock
> > would be a step towards it. But if we never actually make use of it to
> > hold s->lock for a shorter time, that's not really a good argument. I'm
> > not sure if that's ever going to happen unless for a rewrite in Rust or
> > something.
> > 
> > I never tried to measure specifically if lock contention is a problem
> > with high queue depth and random I/O on a huge disk. Intuitively,
> > holding s->lock while doing I/O for loading entries into the cache can't
> > be really good.
> > 
> > Anyway, I went a bit on a tangent there...
> > 
> > > Using s->lock, which is a coroutine mutex, however means that we need to
> > > take it in a coroutine, so the timer CB must enter such a coroutine.  As
> > > a result, descheduling the timer is no longer a guarantee that the
> > > cache-cleaner will not run, because it may now be yielding in
> > > qemu_co_mutex_lock().
> > I think creating a coroutine in cache_clean_timer_cb() is the wrong
> > approach. Instead, cache_clean_timer_init() could create a coroutine
> > and its implementation could be something like this:
> > 
> >      while (!s->cache_clean_timer_stopping) {
> >          qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
> >                                    QEMU_CLOCK_VIRTUAL,

Oh, I wanted to comment on this one, too, but apparently forgot...

I have absolutely no idea why we're using QEMU_CLOCK_VIRTUAL in the
existing code. I don't think the cache cleaning should stop just because
the guest is paused. You can still have block jobs and exports that work
on the image and fill the cache.

> >                                    s->cache_clean_interval * NANOSECONDS_PER_SECOND);
> > 
> >          qemu_co_mutex_lock(&s->lock);
> >          qcow2_cache_clean_unused(s->l2_table_cache);
> >          qcow2_cache_clean_unused(s->refcount_block_cache);
> >          qemu_co_mutex_unlock(&s->lock);
> >      }
> >      s->cache_clean_timer_stopping = false;
> 
> Ah, that’s nicer.  I think we can replace the flag by checking
> s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
> coroutines.

Ah, yes, stopping on s->cache_clean_interval == 0 is good.

> > > (Note even now this was only guaranteed for cache_clean_timer_del()
> > > callers that run in the BDS (the timer’s) AioContext.  For callers
> > > running in the main context, the problem may have already existed,
> > > though maybe the BQL prevents timers from running in other contexts, I’m
> > > not sure.)
> > > 
> > > Polling to await the timer to actually settle seems very complicated for
> > > something that’s rather a minor problem, but I can’t come up with any
> > > better solution that doesn’t again just overlook potential problems.
> > > 
> > > (One cleaner idea may be to have a generic way to have timers run
> > > coroutines, and to await those when descheduling the timer.  But while
> > > cleaner, it would also be more complicated, and I don’t think worth it
> > > at this point.)
> > > 
> > > (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> > > I’m not too fond of this solution.)
> > > 
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >   block/qcow2.h |  1 +
> > >   block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
> > >   2 files changed, 79 insertions(+), 12 deletions(-)
> > > @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
> > >       }
> > >   }
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext, holding s->lock.
> > > + */
> > > +static void coroutine_fn
> > > +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qemu_co_mutex_unlock(&s->lock);
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > Polling in a coroutine_fn is verboten.
> > 
> > If we do need this function, I think it would be a yield here and a wake
> > on the other side. I think we might be able to get around it if we move
> > the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> > coroutine). A bit ugly, so your choice.
> 
> We can let a CoQueue do the waking, no?

Yes, that's probably nicer. I don't expect that you'll ever have more
than a single waiter, but a CoQueue would be safe in either case.

> > > +        qemu_co_mutex_lock(&s->lock);
> > > +    }
> > > +}
> > > +
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext without s->lock held.
> > > + */
> > > +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > > +    }
> > > +}
> > > +
> > >   static void qcow2_detach_aio_context(BlockDriverState *bs)
> > >   {
> > >       cache_clean_timer_del(bs);
> > > @@ -1214,12 +1273,20 @@ fail:
> > >       return ret;
> > >   }
> > > +/* s_locked specifies whether s->lock is held or not */
> > >   static void qcow2_update_options_commit(BlockDriverState *bs,
> > > -                                        Qcow2ReopenState *r)
> > > +                                        Qcow2ReopenState *r,
> > > +                                        bool s_locked)
> > >   {
> > >       BDRVQcow2State *s = bs->opaque;
> > >       int i;
> > > +    if (s_locked) {
> > > +        cache_clean_timer_locked_co_del_and_wait(bs);
> > > +    } else {
> > > +        cache_clean_timer_del_and_wait(bs);
> > > +    }
> > > +
> > >       if (s->l2_table_cache) {
> > >           qcow2_cache_destroy(s->l2_table_cache);
> > >       }
> > > @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >       }
> > >       s->l2_table_cache = r->l2_table_cache;
> > >       s->refcount_block_cache = r->refcount_block_cache;
> > > +
> > > +    s->cache_clean_interval = r->cache_clean_interval;
> > > +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > +
> > >       s->l2_slice_size = r->l2_slice_size;
> > >       s->overlap_check = r->overlap_check;
> > > @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >       s->discard_no_unref = r->discard_no_unref;
> > > -    if (s->cache_clean_interval != r->cache_clean_interval) {
> > > -        cache_clean_timer_del(bs);
> > > -        s->cache_clean_interval = r->cache_clean_interval;
> > > -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > -    }
> > > -
> > I think the del/init pair here won't be necessary any more after
> > switching to the background coroutine. It will just start using the new
> > value of s->cache_clean_interval the next time it sleeps.
> 
> One problem is that if we don’t lock s->lock, the coroutine can read
> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
> is why I moved the deletion above.

I see. This is a preexisting problem, right? The timer runs in the BDS
main context, while qcow2_update_options_commit() runs in the main loop
or... essentially anywhere else?

Should it be a separate patch then? A comment in the code wouldn't hurt
either.

> We also still need to delete if the interval is set to 0 (or
> special-case that in the coroutine to wait forever).

If the coroutine terminates on 0 as you suggested above, that would
automatically be solved.

> We could run all of this in a coroutine so we can lock s->lock, or we
> have to force-stop the timer/coroutine at the start.  Maybe running it
> in a coroutine is better…

So qcow2_reopen_commit() would immediately enter a coroutine and
BDRV_POLL_WHILE() for its completion?

It's not exactly pretty (and I hope polling in reopen callbacks is even
allowed), but maybe more local ugliness than having additional state
(like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait for the
cache clean coroutine to stop.

Kevin

next prev parent reply	other threads:[~2025-10-31 13:05 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
2025-10-28 16:33 ` [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
2025-10-28 16:33 ` [PATCH 02/16] rbd: Run co BH CB in the coroutine’s AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 03/16] iscsi: " Hanna Czenczek
2025-10-29 14:27   ` Kevin Wolf
2025-10-31  9:07     ` Hanna Czenczek
2025-10-31 13:27       ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 04/16] nfs: " Hanna Czenczek
2025-10-28 16:33 ` [PATCH 05/16] curl: Fix coroutine waking Hanna Czenczek
2025-10-29 16:57   ` Kevin Wolf
2025-10-31  9:15     ` Hanna Czenczek
2025-10-31 13:17       ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 06/16] gluster: Do not move coroutine into BDS context Hanna Czenczek
2025-10-29 17:10   ` Kevin Wolf
2025-10-31  9:16     ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 07/16] nvme: Kick and check completions in " Hanna Czenczek
2025-10-29 17:23   ` Kevin Wolf
2025-10-29 17:39     ` Kevin Wolf
2025-10-31  9:18       ` Hanna Czenczek
2025-10-31  9:19     ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 08/16] nvme: Fix coroutine waking Hanna Czenczek
2025-10-29 17:43   ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 09/16] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
2025-10-28 16:33 ` [PATCH 10/16] qcow2: Fix cache_clean_timer Hanna Czenczek
2025-10-29 20:23   ` Kevin Wolf
2025-10-31  9:29     ` Hanna Czenczek
2025-10-31 13:03       ` Kevin Wolf [this message]
2025-11-06 16:08         ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 11/16] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 12/16] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 13/16] block: Note in which AioContext AIO CBs are called Hanna Czenczek
2025-10-28 16:33 ` [PATCH 14/16] iscsi: Create AIO BH in original AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 15/16] null-aio: Run CB " Hanna Czenczek
2025-10-28 16:33 ` [PATCH 16/16] win32-aio: Run CB in original context Hanna Czenczek
2025-10-30 14:12   ` Kevin Wolf
2025-10-31  9:31     ` Hanna Czenczek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQSzm_LOySEpFmuG@redhat.com \
    --to=kwolf@redhat.com \
    --cc=alex.bennee@linaro.org \
    --cc=fam@euphon.net \
    --cc=hreitz@redhat.com \
    --cc=idryomov@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@linaro.org \
    --cc=pl@dlhnet.de \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=rjones@redhat.com \
    --cc=ronniesahlberg@gmail.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.