From: Hanna Reitz <hreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>,
qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
Emanuele Giuseppe Esposito <eesposit@redhat.com>,
John Snow <jsnow@redhat.com>,
qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Tue, 29 Mar 2022 14:15:56 +0200
Message-ID: <7c4aa4e1-4809-628b-1a4c-ccd5a8aa89ce@redhat.com>
In-Reply-To: <cba4cda5-c70a-e2e3-9a10-a60b418e153a@mail.ru>
On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2022 11:54, Hanna Reitz wrote:
>> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.03.2022 11:09, Hanna Reitz wrote:
>>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>>> When the stream block job cuts out the nodes between top and
>>>>>>> base in stream_prepare(), it does not drain the subtree
>>>>>>> manually; it fetches the base node, and tries to insert it as
>>>>>>> the top node's backing node with bdrv_set_backing_hd().
>>>>>>> bdrv_set_backing_hd() however will drain, and so the actual
>>>>>>> base node might change (because the base node is actually not
>>>>>>> part of the stream job) before the old base node passed to
>>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>>
>>>>>>> This has two implications:
>>>>>>>
>>>>>>> First, the stream job does not keep a strong reference to the
>>>>>>> base node. Therefore, if it is deleted in
>>>>>>> bdrv_set_backing_hd()'s drain (e.g. because some other block
>>>>>>> job is drained to finish), we will get a use-after-free. We
>>>>>>> should keep a strong reference to that node.
>>>>>>>
>>>>>>> Second, even with such a strong reference, the problem remains
>>>>>>> that the base node might change before bdrv_set_backing_hd()
>>>>>>> actually runs, and as a result the wrong base node is installed.
>>>>>>
>>>>>> Hmm.
>>>>>>
>>>>>> So, we don't really need a strong reference: if it helps to avoid
>>>>>> some use-after-free, that only means we'll end up with a wrong
>>>>>> block graph..
>>>>>
>>>>> Sure. But I found it better style to strongly reference a node
>>>>> while it’s used. I’d rather have an outdated block graph (as in:
>>>>> A node that was supposed to disappear would still be in use) than
>>>>> a use-after-free.
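
For illustration, the strong-reference pattern alone would be roughly
this (a sketch, not the literal patch; names as in block/stream.c):

    /* Pin the base node so that a drain which happens to finish some
     * other job cannot delete it while we still use the pointer. */
    bdrv_ref(base);
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
    bdrv_unref(base);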
>>>>>
>>>>>> Graph-modifying operations must somehow be isolated from each
>>>>>> other.
>>>>>>
>>>>>>>
>>>>>>> Both effects can be seen in 030's
>>>>>>> TestParallelOps.test_overlapping_5() case, which has five
>>>>>>> nodes, and simultaneously streams from the middle node to the
>>>>>>> top node, and commits the middle node down to the base node.
>>>>>>> As it is, this will sometimes crash, namely when we encounter
>>>>>>> the above-described use-after-free.
>>>>>>>
>>>>>>> Taking a strong reference to the base node, we no longer get a
>>>>>>> crash, but the resulting block graph is less than ideal: The
>>>>>>> expected result is obviously that all middle nodes are cut out
>>>>>>> and the base node is the immediate backing child of the top
>>>>>>> node. However, if stream_prepare() takes a strong reference to
>>>>>>> its base node (the middle node), and then the commit job
>>>>>>> finishes in bdrv_set_backing_hd(), supposedly dropping that
>>>>>>> middle node, the stream job will just reinstall it again.
>>>>>>>
>>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>>> stream_prepare(), so that the graph modification it performs
>>>>>>> is effectively atomic, i.e. that the base node it fetches is
>>>>>>> still the base node when bdrv_set_backing_hd() sets it as the
>>>>>>> top node's backing node.
>>>>>>
>>>>>> Emanuele has similar idea of isolating graph changes from each
>>>>>> other by subtree-drain.
>>>>>>
>>>>>> If I understand correctly, the idea is that we'll drain all other
>>>>>> block jobs, so they wouldn't do their block-graph modifications
>>>>>> during the drained section. So, we can safely modify the graph.
>>>>>>
>>>>>> I don't like this idea:
>>>>>>
>>>>>> 1. A drained section = stopped I/O. But we don't need to stop I/O
>>>>>> in the whole subtree to do the needed block-graph modification.
>>>>>
>>>>> If you mean to say that draining just the single node should be
>>>>> sufficient, I’ll be happy to change it.
>>>>>
>>>>> Not sure which node, though, because I’d think it would be `base`,
>>>>> but to safely fetch it I’d need to drain it, which seems to bite
>>>>> itself in the tail. That’s why I went for a subtree drain from
>>>>> `above_base`.
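
To make it concrete, what the patch does is roughly this (a sketch of
stream_prepare(), not the literal diff):

    /* Keep the subtree below above_base drained, so that the base we
     * fetch is still the base when bdrv_set_backing_hd() installs it. */
    bdrv_subtree_drained_begin(s->above_base);

    base = bdrv_filter_or_cow_bs(s->above_base);
    bdrv_ref(base);  /* strong reference for the duration of the drain */

    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);

    bdrv_unref(base);
    bdrv_subtree_drained_end(s->above_base);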
>>>>>
>>>>>> 2. A drained section is not a lock; several clients may drain the
>>>>>> same set of nodes.. So we exploit the fact that concurrent
>>>>>> clients will be paused by the drained section and won't proceed
>>>>>> to graph-modification code.. But are we sure that block jobs are
>>>>>> (and will be?) the only concurrent block-graph-modifying clients?
>>>>>> Can QMP commands interleave somehow?
>>>>>
>>>>> They can under very specific circumstances and that’s a bug. See
>>>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html
>>>>> .
>>>>>
>>>>>> Can some job from another subtree start a block-graph
>>>>>> modification that touches our subtree?
>>>>>
>>>>> That would be wrong.  A block job shouldn’t change nodes it
>>>>> doesn’t own; stream doesn’t own the base, but it also doesn’t
>>>>> change it; it only needs to have the top node point to it.
>>>>>
>>>>>> If we go this way, it would be safer to drain the whole block
>>>>>> graph on any block-graph modification..
>>>>>>
>>>>>> I think we'd better have a separate global mechanism for
>>>>>> isolating graph modifications. Something like a global co-mutex
>>>>>> or queue, where clients wait for their turn at block-graph
>>>>>> modifications.
>>>>>>
>>>>>> Here is my old proposal on that topic:
>>>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>>>>>>
>>>>>
>>>>> That would only solve the very specific issue in 030, right?
>>>
>>> It should solve the same issue as your patch. You don't add a
>>> subtree drain around every graph modification.. Or do we already
>>> have that?
>>
>> Well, I’m not saying it will solve every single bug, but draining in
>> stream_prepare() will at least mean that that place is safe from
>> basically anything else, because it will prevent concurrent automatic
>> graph changes (e.g. because of jobs finishing), and QMP commands
>> shouldn’t be executed in drained sections either (when they are, it’s
>> wrong, but that seems to occur only extremely rarely). Draining alone
>> should make a place safe; it isn’t a lock that all sides need to take.
>>
>>>>> The stream job isn’t protected from any graph modifications but
>>>>> those coming from mirror. Might be a solution going forward (I
>>>>> didn’t look closer at it at the time, given I saw you had a
>>>>> discussion with Kevin), if we lock every graph change operation
>>>>> (though a global lock honestly doesn’t sound strictly better than
>>>>> draining subsections of the graph, both have their drawbacks), but
>>>>> that doesn’t look like it’d be something for 7.1.
>>>
>>> In the same way, with the draining solution, you would have to add a
>>> subtree drain for every graph-change operation.
>>
>> Since we don’t have any lock yet, draining is the de-facto way we use
>> to forbid concurrent graph modifications. I’m not saying we use it
>> correctly and thoroughly, but it is what we do right now.
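
(That is, the de-facto pattern today is just a drain bracket around the
change, sketched:

    bdrv_drained_begin(bs);
    /* ... perform the graph change, e.g. bdrv_replace_node() ... */
    bdrv_drained_end(bs);

without any lock object that the other parties would have to take.)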
>>
>>>
>>>>
>>>> I wonder whether we could have a short-term version of
>>>> `BdrvChild.frozen` that’s a coroutine mutex. If `.frozen` is set,
>>>> you just can’t change the graph, and you also can’t wait, so that’s
>>>> just an error. But if `.frozen_lock` is set, you can wait on it.
>>>> Here, we’d keep `.frozen` set for all links between top and
>>>> above_base, and then in prepare() take `.frozen_lock` on the link
>>>> between above_base and base.
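
To illustrate the idea (purely hypothetical; neither the field nor the
helpers below exist):

    typedef struct BdrvChild {
        /* ... existing fields ... */
        bool frozen;         /* existing: graph change is a hard error */
        CoMutex frozen_lock; /* hypothetical: graph changes wait here */
    } BdrvChild;

    /* A graph-modifying client would then do: */
    qemu_co_mutex_lock(&child->frozen_lock);
    /* ... modify this link ... */
    qemu_co_mutex_unlock(&child->frozen_lock);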
>>>>
>>>
>>> Yes, that seems to be an alternative to a global lock that doesn't
>>> block the whole graph. Still, I don't think it is bad to lock the
>>> whole graph for graph modification, as modifications should be rare
>>> and fast.
>>
>> Fair enough.
>>
>>> Another thought: does subtree-drain really drain the whole connected
>>> component of the graph?
>>>
>>> imagine something like this:
>>>
>>> [A]   [C]
>>>   \   / \
>>>    v v   v
>>>    [B]   [D]
>>>
>>> If we do a subtree drain at node A, this will drain B and C, but
>>> not D..
>>>
>>> Imagine some other job is attached to node D, and it starts a
>>> drained section too. Then, for example, both jobs will share a
>>> drained section on node C. That doesn't seem safe, and draining is
>>> not a lock.
>>>
>>> So, if we are going to rely on the drained section as a lock that
>>> isolates graph modifications from each other, we should drain the
>>> whole connected component of the graph.
>>
>> The drained section is not a lock, but if the other job is only
>> attached to node D, it won’t change node C. It might change the link
>> from C to D, but that doesn’t concern the job that is concerned about
>> A and B. Overlapping drains are fine.
>
> OK. Maybe it works. It's just not obvious to me that subtree drain
> works well in all cases, whereas a global graph-modification lock
> should obviously work.
>
>>
>>> Next, I'm not really sure that two jobs are prevented from
>>> simultaneously entering a drained section and doing graph
>>> modifications. What prevents this? Assume two block-stream jobs
>>> reach their finish simultaneously and go to subtree-drain. That just
>>> means that job_pause() will be called for both jobs.. But what does
>>> that mean for the block-stream job that is in the
>>> bdrv_subtree_drained_begin() call in stream_prepare()? Seemingly
>>> nothing? Then both jobs will start the graph modification process
>>> simultaneously and can interleave at any yield point (for example,
>>> when rewriting backing_file in the qcow2 metadata).
>>
>> So I don’t think that scenario can really happen, because the stream
>> job freezes the chain between above_base and top, so you can’t really
>> have two simultaneous stream jobs that cause problems between each
>> other.
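
(For context: that freezing happens when the job starts, roughly like
this in stream_start():

    /* Freeze the links between top and above_base; any attempt to
     * modify them while the job runs is then rejected. */
    if (bdrv_freeze_backing_chain(bs, above_base, errp) < 0) {
        return;
    }

so no concurrent job can relink that part of the graph.)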
>
> They cause problems at the boundary, as the base of one stream job may
> be the top of another, and we also have a filter that should be
> inserted/removed at some point. As I remember, that's the problematic
> case in 030..
>
>>
>> Furthermore, the prepare() functions are run in the main thread, so
>> the only real danger is that the drain around the actual graph
>> modification (bdrv_set_backing_hd()) causes another block job to
>> finish, modifying the block graph. But then that job will also
>> actually finish, because it’s all in the main thread.
>>
>> It is true that child_job_drained_poll() says that jobs that are about
>> to prepare() are quiesced, but I don’t think that’s a problem, given
>> that all jobs in that state run in the main thread.
>>
>>>
>>> Another reason why I think that subtree drain is the wrong tool is,
>>> as I said, the extra I/O stop.
>>
>> I know and agree, but that’s an optimization question.
>>
>>> Imagine the following graph:
>>>
>>>    [A]
>>>     |
>>>     v
>>>    [B]   [C]
>>>      \   /
>>>       v v
>>>     [base]
>>>
>>> If we want to rebase A onto base, we actually only need to stop I/O
>>> requests in A and B. Why should C suffer from this graph
>>> modification? The I/O requests produced by C, which live in C and in
>>> base, don't intersect with the process of rebasing A onto base in
>>> any way.
>>>
>>> ====
>>>
>>> Actually, I'm not strictly against your patch, and I believe that it
>>> fixes the problem in most cases. It's probably OK in the short term.
>>> The only real doubt about including it now is that drained sections
>>> sometimes lead to deadlocks: is it possible that we fix a bug that
>>> happens only in iotest 030 (or has it been reported somewhere else?)
>>> while risking the introduction of some deadlock?
>>
>> Saying that the example in 030 is contrived would mean we could/should
>> re-include the base in the list of nodes that belong to the stream
>> job, which would simply disallow the case that 030 tests and that
>> currently fails.
>>
>> Then we wouldn’t need a subtree drain, and the plain drain in
>> bdrv_set_backing_hd() would be fine.
>>
>>> It seems that if in some code it's safe to call drained_begin(), it
>>> should be safe to call subtree_drained_begin(), too. And if that
>>> triggers some deadlock, it just shows another bug.. Is it worth
>>> fixing now, close to the 7.0 release? We have lived with this bug for
>>> years.. Or has something changed that I missed?
>>
>> I mean... I can understand your concern that adding a subtree drain
>> has performance implications (when a stream job ends, which shouldn’t
>> be often). But I’m not sure whether I should share the deadlock
>> concern. Sounds like a sad state of affairs if I can’t just drain
>> something when I need it to be drained.
>>
>> I wasn’t aware of this bug, actually. Now I am, and I feel rather
>> uncomfortable living with a use-after-free bug, because that’s on the
>> worse end of the bug spectrum.
>>
>
> OK, I don't know :) And the others are silent. I agree that the
> global-lock solution is not for 7.0 anyway, and this one is simple
> enough. So, take my
>
> Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Thanks!

A global lock solution sounds good to me for 7.1+, and it even provides
a solution to the problem of QMP commands being executed while other
main-thread code is running. (I mean, the QMP command would still be
executed, but at least concurrent graph changes would be impossible.)
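
Purely hypothetically, such a lock could look like this (all names
below are invented; it would have to be taken in coroutine context):

    static CoMutex graph_mod_mutex;  /* init with qemu_co_mutex_init() */

    void coroutine_fn bdrv_graph_mod_lock(void)
    {
        qemu_co_mutex_lock(&graph_mod_mutex);
    }

    void coroutine_fn bdrv_graph_mod_unlock(void)
    {
        qemu_co_mutex_unlock(&graph_mod_mutex);
    }

Every graph-modifying operation (job completion, bdrv_set_backing_hd(),
graph-changing QMP commands, ...) would then take this lock around its
modification, serializing all graph changes against each other.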