From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:33951) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YlDZ7-00031F-Gs for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:37:03 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YlDZ2-0001nE-Gd for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:37:01 -0400 Received: from relay.parallels.com ([195.214.232.42]:59683) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YlDZ2-0001n0-72 for qemu-devel@nongnu.org; Thu, 23 Apr 2015 05:36:56 -0400 Message-ID: <5538BD28.9070403@openvz.org> Date: Thu, 23 Apr 2015 12:36:40 +0300 From: "Denis V. Lunev" MIME-Version: 1.0 References: <1426069701-1405-1-git-send-email-den@openvz.org> <1426069701-1405-9-git-send-email-den@openvz.org> <20150422130855.GI27617@stefanha-thinkpad.redhat.com> <55379F36.3060906@openvz.org> <20150423092031.GC8811@stefanha-thinkpad.redhat.com> In-Reply-To: <20150423092031.GC8811@stefanha-thinkpad.redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 08/27] block/parallels: _co_writev callback for Parallels format List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Kevin Wolf , Stefan Hajnoczi , qemu-devel@nongnu.org On 23/04/15 12:20, Stefan Hajnoczi wrote: > On Wed, Apr 22, 2015 at 04:16:38PM +0300, Denis V. Lunev wrote: >> On 22/04/15 16:08, Stefan Hajnoczi wrote: >>> On Wed, Mar 11, 2015 at 01:28:02PM +0300, Denis V. Lunev wrote: >>>> +static int64_t allocate_cluster(BlockDriverState *bs, int64_t sector_num) >>>> +{ >>>> + BDRVParallelsState *s = bs->opaque; >>>> + uint32_t idx, offset, tmp; >>>> + int64_t pos; >>>> + int ret; >>>> + >>>> + idx = sector_num / s->tracks; >>>> + offset = sector_num % s->tracks; >>>> + >>>> + if (idx >= s->catalog_size) { >>>> + return -EINVAL; >>>> + } >>>> + if (s->catalog_bitmap[idx] != 0) { >>>> + return (uint64_t)s->catalog_bitmap[idx] * s->off_multiplier + offset; >>>> + } >>>> + >>>> + pos = bdrv_getlength(bs->file) >> BDRV_SECTOR_BITS; >>>> + bdrv_truncate(bs->file, (pos + s->tracks) << BDRV_SECTOR_BITS); >>>> + s->catalog_bitmap[idx] = pos / s->off_multiplier; >>>> + >>>> + tmp = cpu_to_le32(s->catalog_bitmap[idx]); >>>> + >>>> + ret = bdrv_pwrite_sync(bs->file, >>>> + sizeof(ParallelsHeader) + idx * sizeof(tmp), &tmp, sizeof(tmp)); >>> What is the purpose of the sync? >> This is necessary to preserve image consistency on crash from >> my point of view. There is no check consistency at the moment. >> The sync will be removed later when proper crash detection >> code will be added (patches 19, 20, 21) > Let's look at possible orderings in case of failure: > > A. BAT update > B. Data write > > This sync enforces A, B ordering. If we can see B, then A must also > have happened thanks to the sync. > > But A, B ordering is too conservative. Imagine B, A ordering and the > failure where we crash before A. It means we wrote the data but never > linked it into the BAT. > > What happens in that case? We've leaked a cluster in the underlying > image file but it doesn't corrupt the visible disk from the guest > point-of-view. > > Because your implementation uses truncate to extend the file size before > A, even the A, B failure case results in a leaked cluster. So the B, A > case is not worse in any way. > > Why do other image formats sync cluster allocation updates? Because > they support backing files and in that case an A, B ordering results in > data corruption so they enforce B, A ordering (the opposite of what > you're trying to do!). > > The reason why A, B ordering results in data corruption when backing > files are in use is because the guest's write request might touch only a > subset of the cluster (a couple of sectors out of the whole cluster). > So the guest needs to copy the remaining sectors from the backing file. > If there is a dangling BAT entry like in the A, B failure case, then the > guest will see a zeroed cluster instead of the contents of the backing > file. This is a data corruption, but only if a backing file is being > used! > > So the sync is not necessary, both A, B and B, A ordering work for > block/parallels.c. > > Stefan this is very interesting speculation. I have never considered the problem from this point of view. Thank you very much for this!