From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Max Reitz <mreitz@redhat.com>,
qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, jsnow@redhat.com, den@openvz.org
Subject: Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
Date: Fri, 30 Mar 2018 18:32:37 +0300 [thread overview]
Message-ID: <7be588fb-dd63-d6fd-0e39-81fbee1bdd25@virtuozzo.com> (raw)
In-Reply-To: <9f49e65a-fc44-ea11-ff0a-cb38270451aa@virtuozzo.com>
30.03.2018 16:31, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2018 17:03, Max Reitz wrote:
>>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.03.2018 17:53, Max Reitz wrote:
>>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>>> [...]
>>>
>>>>>> isn't it because of a lot of cat processes? will check, updating the loop to:
>>>>>> i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat; done
>>>>> Hmm... I know I tried to kill all of the cats, but for some
>>>>> reason that
>>>>> didn't really help yesterday. Seems to help now, for 2.12.0-rc0 at
>>>>> least (that is, before this series).
>>>> reproduced with the killing... (without this series, just on master)
>>>>
>>>>> After the whole series, I still get a lot of failures in 169
>>>>> (mismatching bitmap hash, mostly).
>>>>>
>>>>> And interestingly, if I add an abort():
>>>>>
>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>> index 486f3e83b7..9204c1c0ac 100644
>>>>> --- a/block/qcow2.c
>>>>> +++ b/block/qcow2.c
>>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>>>>>      }
>>>>>
>>>>>      if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>>> +        abort();
>>>>>          /* It's some kind of reopen with already existing dirty bitmaps. There
>>>>>           * are no known cases where we need loading bitmaps in such situation,
>>>>>           * so it's safer don't load them.
>>>>>
>>>>> Then this fires for a couple of test cases of 169 even without the
>>>>> third
>>>>> patch of this series.
>>>>>
>>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that
>>>>> migration
>>>>> adds or something? Then this would be the wrong condition, because I
>>>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>>>
>>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>>>> condition then, either, though. Maybe let's take a step back: We
>>>>> want
>>>>> to load all the bitmaps from the file exactly once, and that is
>>>>> when it
>>>>> is opened the first time. Or that's what I would have thought... Is
>>>>> that even correct?
>>>>>
>>>>> Why do we load the bitmaps when the device is inactive anyway?
>>>>> Shouldn't we load them only once the device is activated?
>>>> Hmm, not sure. Maybe we don't need to. But we need to load them anyway
>>>> when opening read-only, and we should reopen correspondingly in this
>>>> case.
>>> Yeah, well, yeah, but the current state seems just wrong. Apparently
>>> there are cases where a qcow2 node may have bitmaps before we try to
>>> load them from the file, so the current condition doesn't work.
>>>
>>> Furthermore, it seems like the current "state machine" is too
>>> complex so
>>> we don't know which cases are possible anymore and what to do when.
>>>
>>> So the first thing we could do is add a field to the BDRVQCow2State
>>> that
>>> tells us whether the bitmaps have been loaded already or not. If not,
>>> we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the
>>> value was true already and the BDS is active and R/W now, we call
>>> qcow2_reopen_bitmaps_rw_hint(). That should solve one problem.
>>
>> good idea, will do.
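
Something like this minimal sketch is what I have in mind (the field name
and exact signatures are illustrative, error handling omitted):

/* in BDRVQCow2State (illustrative field name): */
bool dirty_bitmaps_loaded;

/* in qcow2_do_open(), roughly: */
bool header_updated = false;

if (!s->dirty_bitmaps_loaded) {
    /* first time we see this node: load persistent bitmaps from the file */
    qcow2_load_dirty_bitmaps(bs, errp);
    s->dirty_bitmaps_loaded = true;
} else if (!(bs->open_flags & BDRV_O_INACTIVE) && !bdrv_is_read_only(bs)) {
    /* bitmaps were already loaded (e.g. read-only); make them writable */
    qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, errp);
}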
>>
>>>
>>> The other problem of course is the question whether we should call
>>> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
>>> You know the migration model better than me, so I'm asking this
>>> question
>>> to you. We can phrase it differently: Do we need to load the bitmaps
>>> before the drive is activated?
>>
>> Now I think that we don't need to. At least, we don't have such cases in
>> Virtuozzo (I hope :).
>>
>> Why I had doubts:
>>
>> 1. We have cases when we want to start the VM as inactive, to be able to
>> export its drive as an NBD export, push some data to it and then start
>> the VM (which involves activating it).
>> 2. We have cases when we want to start the VM stopped and operate on
>> dirty bitmaps.
>>
>> If we just open all images in inactive mode until VM start, it looks
>> like we need bitmaps in inactive mode (for 2.). But that looks like the
>> wrong approach anyway.
>> At first I tried to solve (1.) by simply calling inactivate_all() when
>> starting the VM in paused mode, but that breaks at least (2.), so in the
>> end I solved (1.) with an approach similar to "-incoming defer". So we
>> have inactive mode in two cases:
>> - incoming migration
>> - pushing data to the VM before start
>>
>> and in these cases we don't need to load dirty bitmaps.
>>
>> There is also an inconsistency: currently we remove persistent bitmaps
>> on inactivation, so it would be inconsistent to load them in inactive
>> mode.
>>
>> Ok, I'll try to respin.
>
> So finally, what cases do we actually have for qcow2_do_open?
>
> 1. INACTIVE -> ACTIVE (through invalidate_cache; we obviously should
> load bitmaps here, given that we decided not to have persistent bitmaps
> loaded in INACTIVE mode)
> 2. creating new bdrv state (first open of the image) in INACTIVE mode
> (will not load bitmaps)
> 3. creating new bdrv state (first open of the image) in ACTIVE mode
> (will load bitmaps, maybe read-only if disk is RO)
>
> If these are the only three cases, it would be enough to just load
> bitmaps if !INACTIVE and do nothing otherwise.
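
In code terms the simple variant would be roughly this (a sketch; the
exact place and condition in qcow2_do_open may need adjusting):

/* sketch: load persistent bitmaps only for an active node */
if (!(bs->open_flags & BDRV_O_INACTIVE)) {
    /* loads bitmaps read-only if bs itself is read-only */
    qcow2_load_dirty_bitmaps(bs, errp);
}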
>
> Or do we have some of the following cases too?
>
> 1?. ACTIVE -> ACTIVE (through invalidate_cache; some kind of no-op, we
> should not reload bitmaps)
> 2?. RO -> RW (we should reopen_bitmaps_rw) (or is that possible only
> through bdrv_reopen, which will not touch qcow2_do_open?)
> 3?. RW -> RO (reopen_bitmaps_ro ?)
> ? anything else??
>
>>
>>>
>>> Max
>>>
>>
>> About 169: how often is it reproducible for you?
>>
>
> It becomes very interesting.
>
> The persistent-migbitmap-online case failed on line 136 with a mismatched
> bitmap sha. This check is not related to migration: at this line the
> bitmap has already been migrated successfully. But for some reason it is
> corrupted after stopping/starting the VM. How that is possible I can't
> imagine, but it doesn't look really related to migration.. Could it be
> related to the case when postcopy was not finished, or something like
> that?.. Maybe changing wait(RESUME) to something more appropriate will
> help. But it is strange anyway.
>
> The persistent-notmigbitmap-offline case failed on line 128, which is
> "self.vm_b.event_wait("RESUME", timeout=10.0)", with a timeout.. Can the
> RESUME event be missed for some reason? Maybe just moving to the
> MIGRATION event will help; will check.
>
Looks like moving from the "RESUME" event to the "MIGRATION" events fixes
the whole issue (I already have 740 successful iterations, which is good
enough, but I will not stop the loop until at least 2000). I'll submit a
patch. But I don't understand why it fixes the sha256 mismatch, and why
the sha256 check fails with "RESUME" (the check succeeds after migration
and fails after stop/start, o_O)...
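
For reference, what I'm testing is roughly the following (the exact
events and match criteria may end up different in the final patch):

# note: the 'events' migration capability has to be enabled, otherwise
# no MIGRATION QMP events are emitted at all
self.vm_b.qmp('migrate-set-capabilities',
              capabilities=[{'capability': 'events', 'state': True}])
...
# instead of waiting for RESUME on the target, wait until the migration
# reports completion there
self.vm_b.event_wait('MIGRATION', timeout=10.0,
                     match={'data': {'status': 'completed'}})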
--
Best regards,
Vladimir
Thread overview: 18+ messages
2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
2018-03-20 17:07 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache DROP IT Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169 Vladimir Sementsov-Ogievskiy
2018-03-21 13:20 ` [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Max Reitz
2018-03-26 18:06 ` Max Reitz
2018-03-27 9:28 ` Vladimir Sementsov-Ogievskiy
2018-03-27 9:53 ` Vladimir Sementsov-Ogievskiy
2018-03-27 10:11 ` Vladimir Sementsov-Ogievskiy
2018-03-28 14:53 ` Max Reitz
2018-03-29 8:08 ` Vladimir Sementsov-Ogievskiy
2018-03-29 14:03 ` Max Reitz
2018-03-29 15:09 ` Vladimir Sementsov-Ogievskiy
2018-03-30 13:31 ` Vladimir Sementsov-Ogievskiy
2018-03-30 15:32 ` Vladimir Sementsov-Ogievskiy [this message]
2018-04-03 16:03 ` Max Reitz