From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36119)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YlDj2-00086s-I5
	for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:47:18 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YlDiy-0005BN-EP
	for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:47:16 -0400
Received: from relay.parallels.com ([195.214.232.42]:60745)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@openvz.org>) id 1YlDiy-0005B7-5I
	for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:47:12 -0400
Message-ID: <5538BF94.9080904@openvz.org>
Date: Thu, 23 Apr 2015 12:47:00 +0300
From: "Denis V. Lunev" <den@openvz.org>
MIME-Version: 1.0
References: <1426069701-1405-1-git-send-email-den@openvz.org>
	<1426069701-1405-9-git-send-email-den@openvz.org>
	<20150422130855.GI27617@stefanha-thinkpad.redhat.com>
	<55379F36.3060906@openvz.org>
	<20150423092031.GC8811@stefanha-thinkpad.redhat.com>
	<20150423093223.GC5289@noname.redhat.com>
In-Reply-To: <20150423093223.GC5289@noname.redhat.com>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 08/27] block/parallels: _co_writev callback
 for Parallels format
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>, qemu-devel@nongnu.org

On 23/04/15 12:32, Kevin Wolf wrote:
> Am 23.04.2015 um 11:20 hat Stefan Hajnoczi geschrieben:
>> On Wed, Apr 22, 2015 at 04:16:38PM +0300, Denis V. Lunev wrote:
>>> On 22/04/15 16:08, Stefan Hajnoczi wrote:
>>>> On Wed, Mar 11, 2015 at 01:28:02PM +0300, Denis V. Lunev wrote:
>>>>> +static int64_t allocate_cluster(BlockDriverState *bs, int64_t sector_num)
>>>>> +{
>>>>> +    BDRVParallelsState *s = bs->opaque;
>>>>> +    uint32_t idx, offset, tmp;
>>>>> +    int64_t pos;
>>>>> +    int ret;
>>>>> +
>>>>> +    idx = sector_num / s->tracks;
>>>>> +    offset = sector_num % s->tracks;
>>>>> +
>>>>> +    if (idx >= s->catalog_size) {
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>> +    if (s->catalog_bitmap[idx] != 0) {
>>>>> +        return (uint64_t)s->catalog_bitmap[idx] * s->off_multiplier + offset;
>>>>> +    }
>>>>> +
>>>>> +    pos = bdrv_getlength(bs->file) >> BDRV_SECTOR_BITS;
>>>>> +    bdrv_truncate(bs->file, (pos + s->tracks) << BDRV_SECTOR_BITS);
>>>>> +    s->catalog_bitmap[idx] = pos / s->off_multiplier;
>>>>> +
>>>>> +    tmp = cpu_to_le32(s->catalog_bitmap[idx]);
>>>>> +
>>>>> +    ret = bdrv_pwrite_sync(bs->file,
>>>>> +            sizeof(ParallelsHeader) + idx * sizeof(tmp), &tmp, sizeof(tmp));
>>>> What is the purpose of the sync?
>>> This is necessary to preserve image consistency on crash from
>>> my point of view. There is no check consistency at the moment.
>>> The sync will be removed later when proper crash detection
>>> code will be added (patches 19, 20, 21)
>> Let's look at possible orderings in case of failure:
>>
>> A. BAT update
>> B. Data write
>>
>> This sync enforces A, B ordering.  If we can see B, then A must also
>> have happened thanks to the sync.
>>
>> But A, B ordering is too conservative.  Imagine B, A ordering and the
>> failure where we crash before A.  It means we wrote the data but never
>> linked it into the BAT.
>>
>> What happens in that case?  We've leaked a cluster in the underlying
>> image file but it doesn't corrupt the visible disk from the guest
>> point-of-view.
>>
>> Because your implementation uses truncate to extend the file size before
>> A, even the A, B failure case results in a leaked cluster.  So the B, A
>> case is not worse in any way.
>>
>> Why do other image formats sync cluster allocation updates?  Because
>> they support backing files and in that case an A, B ordering results in
>> data corruption so they enforce B, A ordering (the opposite of what
>> you're trying to do!).
>>
>> The reason why A, B ordering results in data corruption when backing
>> files are in use is because the guest's write request might touch only a
>> subset of the cluster (a couple of sectors out of the whole cluster).
>> So the guest needs to copy the remaining sectors from the backing file.
>> If there is a dangling BAT entry like in the A, B failure case, then the
>> guest will see a zeroed cluster instead of the contents of the backing
>> file.  This is a data corruption, but only if a backing file is being
>> used!
>>
>> So the sync is not necessary, both A, B and B, A ordering work for
>> block/parallels.c.
> Actually, I suspect this means that the parallels driver is restricted
> to protocols with bdrv_has_zero_init() == true, otherwise zeros can turn
> into random data (which means that it can't work e.g. directly on host
> block devices).
>
> Do we enforce this?
>
> Kevin
this is fixed in the patch 26 when the code is replaced with

+    if (s->data_end + s->tracks > pos) {
+        int ret;
+        if (s->prealloc_mode == PRL_PREALLOC_MODE_FALLOCATE) {
+            ret = bdrv_write_zeroes(bs->file, s->data_end,
+                                    s->prealloc_size, 0);
+        } else {
+            ret = bdrv_truncate(bs->file,
+                    (s->data_end + s->prealloc_size) << BDRV_SECTOR_BITS);
+        }
+        if (ret < 0) {
+            return ret;
+        }
+    }

on a default path, but you are correct. Some checking is
necessary to be on a safe side.

Den