From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
qemu-block@nongnu.org, Hanna Reitz <hreitz@redhat.com>,
John Snow <jsnow@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>,
Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
Fam Zheng <fam@euphon.net>,
qemu-devel@nongnu.org
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Wed, 25 May 2022 10:27:34 +0200
Message-ID: <98027e36-a8a2-e679-4018-c01e08cf124d@redhat.com>
In-Reply-To: <YozLHPif/jCmOfei@redhat.com>
On 24/05/2022 14:10, Kevin Wolf wrote:
> On 18.05.2022 14:28, Emanuele Giuseppe Esposito wrote:
>> label: // read till the end to see why I wrote this here
>>
>> I was hoping someone from the "No" party would answer your question,
>> because as you know we reached this same conclusion together.
>>
>> We thought about dropping the drain for various reasons; the main one
>> (at least as far as I understood) is that we are not sure whether something
>> can still happen between drain_begin/end, and it is a little bit
>> confusing to use the same mechanism to both block I/O and protect the graph.
>>
>> We then thought about implementing a rwlock. A rdlock would clarify what
>> we are protecting and who is using the lock. I sent a draft rwlock
>> implementation in this thread, but it also led to additional
>> problems.
>> The main problem was that this new lock would introduce nested event loops,
>> which together with such locking would just create deadlocks.
>> If readers are in coroutines and writers are not (because graph
>> operations are not running in coroutines), we have a lot of deadlocks.
>> If a writer has to take the lock, it must wait for all other readers to
>> finish. It does so by internally calling AIO_WAIT_WHILE, creating a
>> nested event loop. We don't know what could execute when polling for
>> events; for example, another writer could be resumed.
>
> Why is this a problem? Your AIO_WAIT_WHILE() condition would be that
> there are neither readers nor writers, so you would just keep waiting
> until the other writer is done.
Yes, but by the time we get to the AIO_WAIT_WHILE() condition, the wrlock
has already been taken by the current writer.
I think that's also what you mean below.
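To make the problem concrete, here is a minimal sketch of the kind of
writer-side wait the draft used (GraphRWLock, reader_count and
graph_wrlock() are illustrative names, not the actual draft code):

    /* Illustrative only: a writer that waits for readers by polling the
     * event loop. */
    typedef struct {
        int reader_count;       /* readers currently traversing the graph */
        bool writer_active;
    } GraphRWLock;

    static GraphRWLock graph_lock;

    static void graph_wrlock(void)
    {
        graph_lock.writer_active = true;
        /* AIO_WAIT_WHILE() runs aio_poll() until the condition becomes
         * false.  Whatever aio_poll() dispatches (BHs, completions,
         * another writer being resumed) runs inside this nested event
         * loop, and if it needs this same lock we deadlock. */
        AIO_WAIT_WHILE(qemu_get_aio_context(),
                       graph_lock.reader_count > 0);
    }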
>
>> Ideally, we want writers in coroutines too.
>>
>> Additionally, many readers are running in what we call "mixed"
>> functions: usually implemented automatically with generated_co_wrapper
>> tag, they let a function (usually a bdrv callback) always run in a
>> coroutine, creating one if necessary. For example, bdrv_flush() makes
>> sure that bs->bdrv_co_flush() is always run in a coroutine.
>> Such mixed functions are used in other callbacks too, making it really
>> difficult to understand whether we are in a coroutine or not, and, most
>> importantly, making rwlock usage very difficult.
>
> How do they make rwlock usage difficult?
>
> *goes back to old IRC discussions*
>
> Ah, the problem was not the AIO_WAIT_WHILE() while taking the lock, but
> taking the lock first and then running an AIO_WAIT_WHILE() inside the
> locked section. This is forbidden because the callbacks that run during
> AIO_WAIT_WHILE() may in turn wait for the lock that you own, causing a
> deadlock.
>
Yes
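As a minimal illustration of that forbidden pattern (graph_rdlock(),
graph_rdunlock() and operation_done() are hypothetical names):

    /* Hypothetical reader holding the graph lock across AIO_WAIT_WHILE(). */
    static void reader_example(BlockDriverState *bs)
    {
        graph_rdlock();
        /* The aio_poll() inside AIO_WAIT_WHILE() can dispatch a callback
         * that calls graph_wrlock() and waits for all readers, including
         * this one, to finish: neither side can make progress. */
        AIO_WAIT_WHILE(bdrv_get_aio_context(bs), !operation_done(bs));
        graph_rdunlock();
    }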
> This is indeed a problem that running in coroutines would avoid because
> the inner waiter would just yield and the outer one could complete its
> job as soon as it's its turn.
>
> My conclusion in the IRC discussion was that maybe we need to take the
> graph locks when we're entering coroutine context, i.e. the "mixed"
> functions would rdlock the graph when called from non-coroutine context
> and would assume that it's already locked when called from coroutine
> context.
Yes, and that's what I tried to do.
But the first step was to transform all callbacks into coroutines. I think
you also agree with this, correct?
Therefore the easiest approach was to convert all callbacks into
generated_co_wrapper functions, so that afterwards we could split them
between coroutine and non-coroutine logic, as discussed on IRC.
Once split, we add the lock in the way you suggested.
However, I didn't even get past this first step, because tests were
deadlocking after transforming just 2-3 callbacks.
See Paolo's thread for a nice explanation of why they deadlock and why
converting these callbacks is difficult.
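As a rough sketch of where the lock would then go (graph_rdlock(),
graph_rdunlock(), bdrv_co_callback() and run_callback_in_coroutine() are
placeholder names, only meant to show the shape of the split):

    if (qemu_in_coroutine()) {
        /* Coroutine callers are assumed to already hold the rdlock. */
        ret = bdrv_co_callback(bs);
    } else {
        graph_rdlock();                      /* taken on the coroutine's behalf */
        ret = run_callback_in_coroutine(bs); /* create + enter + poll */
        graph_rdunlock();
    }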
>
>> This led us to step back once more and try to convert all
>> BlockDriverState callbacks into coroutines. This would greatly simplify
>> rwlock usage, because we could make the rwlock coroutine-friendly
>> (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
>> just yielding and queuing itself in coroutine queues).
>>
>> The first step was then to convert all callbacks into coroutines, using
>> generated_coroutine_wrapper (g_c_w).
>> A typical g_c_w is implemented in this way:
>>     if (qemu_in_coroutine()) {
>>         callback();
>>     } else { // much simplified
>>         co = qemu_coroutine_create(callback);
>>         bdrv_coroutine_enter(bs, co);
>>         BDRV_POLL_WHILE(bs, coroutine_in_progress);
>>     }
>> Once all callbacks are implemented using g_c_w, we can start splitting
>> the two sides of the if statement so that we only create a coroutine when
>> we are called from outside a bdrv callback.
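For reference, such a wrapper roughly expands to something like the
following (ExampleCo, bdrv_example_co_entry() and bdrv_co_example() are
illustrative names, not actual QEMU symbols):

    typedef struct {
        BlockDriverState *bs;
        int ret;
        bool done;
    } ExampleCo;

    static void coroutine_fn bdrv_example_co_entry(void *opaque)
    {
        ExampleCo *s = opaque;
        s->ret = bdrv_co_example(s->bs);  /* the actual coroutine_fn callback */
        s->done = true;
        aio_wait_kick();                  /* wake up the polling caller */
    }

    int bdrv_example(BlockDriverState *bs)
    {
        if (qemu_in_coroutine()) {
            return bdrv_co_example(bs);
        } else {
            ExampleCo s = { .bs = bs, .done = false };
            Coroutine *co = qemu_coroutine_create(bdrv_example_co_entry, &s);

            bdrv_coroutine_enter(bs, co);
            BDRV_POLL_WHILE(bs, !s.done);
            return s.ret;
        }
    }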
>>
>> However, we immediately found a problem while starting to convert the
>> first callbacks: the AioContext lock is taken around some non-coroutine
>> callbacks! For example, bs->bdrv_open() is always called with the
>> AioContext lock taken. In addition, callbacks like bdrv_open are
>> graph-modifying functions, which is probably why we are taking the
>> AioContext lock, and they do not like to run in coroutines.
>> Anyway, the real problem comes when we create a coroutine in such
>> places where the AioContext lock is taken and we have a graph-modifying
>> function.
>>
>> bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
>> whether the coroutine is entering a context different from the current one
>> (which is not the case for open) and whether we are already in a coroutine
>> (for sure not). Therefore it resorts to the following calls:
>>     aio_context_acquire(ctx);
>>     qemu_aio_coroutine_enter(ctx, co);
>>     aio_context_release(ctx);
>> This is clearly a problem, because we are taking the lock twice: once
>> from the original caller of the callback, and once here due to the
>> coroutine. This creates a lot of deadlock situations.
>
> What are the deadlock situations that are created by locking twice?
Scratch this, and refer to Paolo's thread.
>
> The only problem I'm aware of is AIO_WAIT_WHILE(), which wants to
> temporarily unlock the AioContext. It calls aio_context_release() once to
> achieve this, which obviously isn't enough when the context was locked
> twice.
>
> But AIO_WAIT_WHILE() isn't allowed in coroutines anyway. So how are we
> running into deadlocks here?
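To spell out that failure mode, a conceptual sketch (not the real macro
body; ctx and done are only illustrative):

    static void locked_twice_example(AioContext *ctx, bool *done)
    {
        aio_context_acquire(ctx);  /* outer caller, e.g. high in the stack */
        aio_context_acquire(ctx);  /* recursive lock: taken again further down */

        /* AIO_WAIT_WHILE() calls aio_context_release(ctx) once before each
         * aio_poll() and re-acquires it afterwards.  With the lock held
         * twice, ctx stays locked during the poll, so whatever should make
         * *done true (e.g. a callback in ctx's thread that needs the lock)
         * never runs: deadlock. */
        AIO_WAIT_WHILE(ctx, !*done);

        aio_context_release(ctx);
        aio_context_release(ctx);
    }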
>
> Note that we're probably already doing this inside the .bdrv_open
> implementations: They will usually read something from the image file,
> calling bdrv_preadv() which is already a generated_coroutine_wrapper
> today and creates a coroutine internally with the same locking pattern
> applied that you describe as problematic here.
>
> Making .bdrv_open itself a generated_coroutine_wrapper wouldn't really
> change anything fundamental, it would just pull the existing mechanism
> one function higher in the call stack.
>
>> For example, all callers of bdrv_open() always take the AioContext lock.
>> Often it is taken very high in the call stack, but it's always taken.
>>
>> Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
>> too, because coroutines expect to have the lock taken. For example, if
>> we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
>> lock for us.
>
> It's not difficult at all in your case where you know that you're
> already in the right thread and the lock is taken: You can call
> qemu_aio_coroutine_enter() directly instead of bdrv_coroutine_enter() in
> this case.
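Something along these lines, presumably (open_co_entry() and the s struct
with its done flag are placeholders):

    /* The caller of .bdrv_open already holds ctx's lock and runs in ctx's
     * home thread, so enter the coroutine directly instead of going
     * through bdrv_coroutine_enter() and its extra acquire/release pair. */
    AioContext *ctx = bdrv_get_aio_context(bs);
    Coroutine *co = qemu_coroutine_create(open_co_entry, &s);

    assert(qemu_get_current_aio_context() == ctx);
    qemu_aio_coroutine_enter(ctx, co);
    BDRV_POLL_WHILE(bs, !s.done);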
>
> But as I said, I'm not sure why we need to get rid of it at all.
>
> Kevin
>