From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:38400)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YB7t0-0002QQ-Mp
	for qemu-devel@nongnu.org; Tue, 13 Jan 2015 15:16:23 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YB7sv-0000Hq-1k
	for qemu-devel@nongnu.org; Tue, 13 Jan 2015 15:16:22 -0500
Received: from mx2.parallels.com ([199.115.105.18]:51649)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YB7su-0000Hi-Rr
	for qemu-devel@nongnu.org; Tue, 13 Jan 2015 15:16:16 -0500
Message-ID: <54B57D04.3010005@openvz.org>
Date: Tue, 13 Jan 2015 23:16:04 +0300
From: "Denis V. Lunev" <den@openvz.org>
MIME-Version: 1.0
References: <1419931701-19362-1-git-send-email-den@openvz.org>
	<1419934032-24216-1-git-send-email-den@openvz.org>
	<1419934032-24216-7-git-send-email-den@openvz.org>
	<20150113145011.GD14136@rkaganb.sw.ru>
	<54B53711.6040601@openvz.org>
In-Reply-To: <54B53711.6040601@openvz.org>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 16/19] block/parallels: no need to flush on
 each block allocation table update
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Roman Kagan <rkagan@parallels.com>, Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

On 13/01/15 18:17, Denis V. Lunev wrote:
> On 13/01/15 17:50, Roman Kagan wrote:
>> On Tue, Dec 30, 2014 at 01:07:09PM +0300, Denis V. Lunev wrote:
>>>  From the point of guest each write to real disk prior to disk barrier
>>> operation could be lost. Therefore there is no problem that "not 
>>> synced"
>>> new block is lost due to not updated allocation table if QEMU is 
>>> crashed.
>>>
>>> This patch improves writing performance of
>>>    qemu-img create -f parallels -o cluster_size=64k ./1.hds 64G
>>>    qemu-io -f parallels -c "write -P 0x11 0 1024k" 1.hds
>>> from 45 Mb/sec to 160 Mb/sec on my SSD disk. The gain on rotational 
>>> media
>>> is much more sufficient, from 800 Kb/sec to 45 Mb/sec.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block/parallels.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/block/parallels.c b/block/parallels.c
>>> index ddc3aee..46cf031 100644
>>> --- a/block/parallels.c
>>> +++ b/block/parallels.c
>>> @@ -194,7 +194,7 @@ static int64_t allocate_sector(BlockDriverState 
>>> *bs, int64_t sector_num)
>>>         tmp = cpu_to_le32(s->bat[idx]);
>>>   -    ret = bdrv_pwrite_sync(bs->file, bat_offset(idx), &tmp, 
>>> sizeof(tmp));
>>> +    ret = bdrv_pwrite(bs->file, bat_offset(idx), &tmp, sizeof(tmp));
>>>       if (ret < 0) {
>>>           return ret;
>>>       }
>> AFAICT this worsens the problem that existed before this patch:
>>
>> if the preceding bdrv_truncate(+cluster_size) gets reordered with this
>> bat entry update, there's a risk that on poweroff (snapshot, etc.) the
>> bat entry gets written to the storage while the file size is not
>> updated.  As a result, the bat entry in the file would point at a
>> cluster which hadn't been allocated yet.  When a new block is eventually
>> added to the file, you'd have two bat entries pointing at the same
>> cluster.
>>
>> The _sync version used to leave this window quite narrow due to the
>> flush following the write.  The patch makes the reordering more likely.
>>
>> I'm afraid the only reliable way to handle it is to put a barrier
>> between truncate and bat update, and mitigate the costs by batching the
>> file expansions, as you seem to do in the followup patches.
>>
>>
>> Roman.
> no, I think that this will not happen.
>
> Here we are under the lock which was was taken in
> parallels_co_writev and thus 2 contexts are impossible.
> On the other hand bdrv_truncate does not delegate
> the job to the other context and is performed
> immediately. This from my point of view means
> that truncate is performed always first and from
> the safe context.
>
> Regards,
>     Den


Thinking a bit more on this. IMHO there is no problem
with a normal workflow, the problem could come
from a crash when the file state and a BAT state will
be not consistent and the BAT entry will point out
of the file (could point out of the file).

But it is not possible to fix at all this way. Single sync
does not provide much warranty. This should be
dealt in the other way, like done in QCOW2.

We should write proper magic into ParallelsHeader->inuse
on open and call sync immediately. On subsequent
open of the broken image we should perform
check consistency. The magic should be set to 0
on clean close.

I think that this stuff could be implemented separately
in the next patchset. If you this that this is mandatory,
OK, I can do that, be this will increase patchset a lot.
Check consistency is not an easy stuff.

Regards,
     Den