From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, jsnow@redhat.com,
	mreitz@redhat.com, philmd@redhat.com, peter.maydell@linaro.org,
	berto@igalia.com, stefanha@redhat.com, pbonzini@redhat.com,
	den@openvz.org, eblake@redhat.com
Subject: Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Date: Mon, 23 Nov 2020 16:44:31 +0300	[thread overview]
Message-ID: <c43ad0a0-8c61-057d-603e-229ae592b94e@virtuozzo.com> (raw)
In-Reply-To: <20201123111024.GB5317@merkur.fritz.box>

23.11.2020 14:10, Kevin Wolf wrote:
> Am 23.11.2020 um 11:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 23.11.2020 13:10, Kevin Wolf wrote:
>>> Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 20.11.2020 20:22, Kevin Wolf wrote:
>>>>> Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> 20.11.2020 19:36, Kevin Wolf wrote:
>>>>>>> Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> As Peter recently noted, iotest 30 fails sporadically.
>>>>>>>>
>>>>>>>> I found that QEMU crashes due to the interleaving of graph-update
>>>>>>>> operations of parallel mirror and stream block jobs.
>>>>>>>
>>>>>>> I haven't found the time yet to properly look into this or your other
>>>>>>> thread where you had a similar question, but there is one thing I'm
>>>>>>> wondering: Why can the nested job even make progress and run its
>>>>>>> completion handler?
>>>>>>>
>>>>>>> When we modify the graph, we should have drained the subtree in
>>>>>>> question, so in theory while one job finishes and modifies the graph,
>>>>>>> there should be no way for the other job to make progress and get
>>>>>>> interleaved - it shouldn't be able to start I/O requests and much less
>>>>>>> to run its completion handler and modify the graph.
>>>>>>>
>>>>>>> Are we missing drained sections somewhere or do they fail to achieve
>>>>>>> what I think they should achieve?
>>>>>>>
>>>>>>
>>>>>> It looks like both jobs reach their finish simultaneously, so all
>>>>>> progress is done in both jobs. They then proceed concurrently to
>>>>>> their completion procedures, which interleave. So there is no more
>>>>>> I/O through blk, which is what the drained sections restrict.
>>>>>
>>>>> They can't be truly simultaneous because they run in the same thread.
>>>>> During job completion, this is the main thread.
>>>>
>>>> No, they are not truly simultaneous, but their completions may
>>>> interleave through nested aio_poll loops.
>>>>
>>>>>
>>>>> However as soon as job_is_completed() returns true, it seems we're not
>>>>> pausing the job any more when one of its nodes gets drained.
>>>>>
>>>>> Possibly also relevant: The job->busy = false in job_exit(). The comment
>>>>> there says it's a lie, but we might deadlock otherwise.
>>>>>
>>>>> This problem will probably affect other callers, too, which drain a
>>>>> subtree and then reasonably expect that nobody will modify the graph
>>>>> until they end the drained section. So I think the problem that we need
>>>>> to address is that jobs run their completion handlers even though they
>>>>> are supposed to be paused by a drain.
>>>>
>>>> Hmm. I have always thought of a drained section as something that
>>>> stops I/O requests, not other operations. And we do graph modifications
>>>> in a drained section to avoid in-flight I/O requests during the graph
>>>> modification.
>>>
>>> Is there any use for an operation that only stops I/O, but doesn't
>>> prevent graph changes?
>>>
>>> I always understood it as a request to have exclusive access to a
>>> subtree, so that nobody else would touch it.
>>>
>>>>> I'm not saying that your graph modification locks are a bad idea, but
>>>>> they are probably not a complete solution.
>>>>>
>>>>
>>>> Hmm. What do you mean? It's of course not complete, as I didn't
>>>> protect every graph-modification procedure. But if we protect all such
>>>> things and always do graph modifications under this mutex, it should
>>>> work, I think.
>>>
>>> What I mean is that not only graph modifications can conflict with each
>>> other, but most callers of drain_begin/end will probably not be prepared
>>> for the graph changing under their feet, even if they don't actively
>>> change the graph themselves.
>>>
>>
>> I understand now. Right. Anyway, it looks like we need some kind of
>> mutex, as the user of a drained section of course wants to do graph
>> modifications and even I/O (for example, updating the backing link in
>> the metadata). The first thing that comes to mind is to protect all
>> outermost drained sections with a global CoMutex and to assert in
>> drain_begin/drain_end that the mutex is locked.
>>
>> Hmm, it also looks like an RW lock, where simple I/O is "read" and
>> anything done under a drained section is "write".
> 
> In a way, drain _is_ the implementation of a lock. But as you've shown,
> it's a buggy implementation.
> 
> What I was looking at was basically fixing the one instance of a bug
> while leaving the design as it is.
> 
> My impression is that you're looking at this more radically and want to
> rearchitect the whole drain mechanism so that such bugs would be much
> less likely to start with. Maybe this is a good idea, but it's probably
> also a lot more effort.
> 
> Basically, for making use of more traditional locks, the naive approach
> would be changing blk/bdrv_inc/dec_in_flight() so that it takes/releases
> an actual coroutine lock. As you suggest, probably a CoRwLock.
> 
> I see a few non-trivial questions already for this part:
> 
> * What about requests for which bs->in_flight is increased more than
>    once? Do we need some sort of recursive lock for them?

Is there a reasonable example? I'd avoid recursive locks if possible; they don't make things simpler.

> 
> * How do you know whether you should take a reader or a writer lock? For
>    drains called from coroutine context, maybe you could store the caller
>    that "owns" the drain section in the BDS, but what about non-coroutine
>    drains?

Intuitively, the readers are all I/O requests and the writers are drained sections.

Why would we need to store the drained-section owner somewhere?

Ah, we need to do "read" operations while holding the write lock. So we need a way for readers to proceed without taking the lock when the write lock is held by the current[*] coroutine.

And I don't see a way to do this synchronization without moving everything to coroutine context.
Outside of a coroutine we'd deadlock on a nested aio_poll loop, which would do a nested drain of another job. In a coroutine we can wait on the mutex more efficiently.
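
To make this concrete, here is a minimal sketch of the naive placement I have in mind (io_graph_lock and the function names are invented for illustration, not existing API):

void coroutine_fn bdrv_co_inc_in_flight(BlockDriverState *bs)
{
    /* An I/O request acts as a reader */
    qemu_co_rwlock_rdlock(&bs->io_graph_lock);
    /* ... existing in_flight accounting ... */
}

void coroutine_fn bdrv_co_dec_in_flight(BlockDriverState *bs)
{
    /* ... existing in_flight accounting ... */
    qemu_co_rwlock_unlock(&bs->io_graph_lock);
}

void coroutine_fn bdrv_co_drained_begin_locked(BlockDriverState *bs)
{
    /* A drained section acts as the writer: this blocks until every
     * reader has dropped its lock */
    qemu_co_rwlock_wrlock(&bs->io_graph_lock);
    /* ... existing drained_begin logic ... */
}

But this naive version immediately hits the problem above: any I/O issued by the drain owner itself (e.g. updating the backing link in metadata) would try to take the read lock while the write lock is held and deadlock. And all of it must run in coroutine context, since qemu_co_rwlock_*() may only be called from a coroutine.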

> 
>    What do you do if coroutine A drains and then (directly or indirectly)
>    spawns coroutine B to do some work?

And that means that [*] is not necessarily the current coroutine, but may be some other coroutine started indirectly by the lock owner.

So a plain RW lock is not enough. We need something like an RW lock, but with the possibility of doing read operations while the write lock is taken (by the owner of the write lock).
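
Very roughly, maybe something like this, where the write lock remembers its owner and reads from that same coroutine pass straight through (all names invented; rwlock initialised with qemu_co_rwlock_init()):

typedef struct BdrvGraphLock {
    CoRwlock rwlock;
    Coroutine *writer;   /* coroutine holding the write lock, or NULL */
} BdrvGraphLock;

void coroutine_fn graph_lock_rdlock(BdrvGraphLock *gl)
{
    if (gl->writer == qemu_coroutine_self()) {
        /* We are the drain owner, we already have exclusive access */
        return;
    }
    qemu_co_rwlock_rdlock(&gl->rwlock);
}

void coroutine_fn graph_lock_rdunlock(BdrvGraphLock *gl)
{
    if (gl->writer == qemu_coroutine_self()) {
        return;
    }
    qemu_co_rwlock_unlock(&gl->rwlock);
}

void coroutine_fn graph_lock_wrlock(BdrvGraphLock *gl)
{
    qemu_co_rwlock_wrlock(&gl->rwlock);
    gl->writer = qemu_coroutine_self();
}

void coroutine_fn graph_lock_wrunlock(BdrvGraphLock *gl)
{
    gl->writer = NULL;
    qemu_co_rwlock_unlock(&gl->rwlock);
}

This only handles the case where the reader literally is the coroutine that took the write lock. For a coroutine spawned indirectly by the owner we'd have to pass some explicit owner token down instead of comparing against qemu_coroutine_self(), which is exactly where it stops being a plain RW lock.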

> 
> * Multiple concurrent requests from the same origin (the drain section
>    owner) shouldn't be serialised, so the CoRwLock needs to be taken once
>    per origin, not once per request. Again, how do we identify origins
>    and where do we store the common lock?

This makes things complex. Why not take the lock per request? I/O requests would take the read lock and still run in parallel.

> 
> * Is it desirable that requests hang in bdrv_inc_in_flight() waiting for
>    the lock to be released? This may be in the middle of another
>    operation that needs to complete before drain_begin can return.

drain_begin should take the write lock. That automatically means waiting until all read locks have been released.

> 
>    I seem to remember that introducing queued_requests in BlockBackend
>    was already non-trivial because of potential deadlocks. We would have
>    to prepare for more of the same in BlockDriverState.
> 
>    The BlockBackend code involves temporarily dropping the in_flight
>    counter change that the request made, but on the BDS level we don't
>    even know which counters we increased how often before reaching
>    bdrv_inc_in_flight().
> 
> Do you have a different approach for placing the locks or do you have
> ideas for how we would find good answers for these problems?
> 

No, I don't have a good complete architecture in mind. I became interested in this because I'm preparing a series to refactor the permission-update code, which is quite close to this area but still a lot simpler than all these problems. So I'd start from my already prepared series anyway.

Do you think it is worth protecting some job handlers with a global lock, as I propose in patch 05? It would of course be better to move the generic job_*() functions to coroutine context, so that we don't have to do the same thing for each job. It won't solve the whole problem, but it would at least fix iotest 030. I don't remember any real bugs around this, and I don't think users really run parallel stream and commit jobs; on the other hand, the fix shouldn't hurt.
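
For reference, roughly the shape I have in mind (the names below are illustrative, not literally the code in patch 05): one global coroutine mutex taken around the graph-modifying completion path of each job, which in turn requires those handlers to run in coroutine context (patch 04):

/* Initialised once at startup with qemu_co_mutex_init() */
static CoMutex graph_update_mutex;

static void coroutine_fn stream_co_clean(Job *job)
{
    qemu_co_mutex_lock(&graph_update_mutex);
    /* ... drop the filter node, update the backing file, etc. ... */
    qemu_co_mutex_unlock(&graph_update_mutex);
}

With mirror and stream both taking the same mutex in their completion handlers, the idea is that those handlers can no longer interleave through nested aio_poll loops.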

-- 
Best regards,
Vladimir


Thread overview: 16+ messages
2020-11-20 16:16 [PATCH RFC 0/5] Fix accidental crash in iotest 30 Vladimir Sementsov-Ogievskiy
2020-11-20 16:16 ` [PATCH 1/5] abort-on-set-to-true Vladimir Sementsov-Ogievskiy
2020-11-20 16:16 ` [PATCH 2/5] iotest-30-shorten: concentrate on failing test case Vladimir Sementsov-Ogievskiy
2020-11-20 16:16 ` [PATCH 3/5] scripts/block-coroutine-wrapper.py: allow more function types Vladimir Sementsov-Ogievskiy
2020-11-20 16:16 ` [PATCH 4/5] block: move some mirror and stream handlers to coroutine Vladimir Sementsov-Ogievskiy
2020-11-20 16:16 ` [PATCH 5/5] block: protect some graph-modifyng things by mutex Vladimir Sementsov-Ogievskiy
2020-11-20 16:27 ` [PATCH RFC 0/5] Fix accidental crash in iotest 30 no-reply
2020-11-20 16:35   ` Vladimir Sementsov-Ogievskiy
2020-11-20 16:36 ` Kevin Wolf
2020-11-20 16:43   ` Vladimir Sementsov-Ogievskiy
2020-11-20 17:22     ` Kevin Wolf
2020-11-20 18:19       ` Vladimir Sementsov-Ogievskiy
2020-11-23 10:10         ` Kevin Wolf
2020-11-23 10:29           ` Vladimir Sementsov-Ogievskiy
2020-11-23 11:10             ` Kevin Wolf
2020-11-23 13:44               ` Vladimir Sementsov-Ogievskiy [this message]
