From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:50844) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YBN6S-0000hD-DL for qemu-devel@nongnu.org; Wed, 14 Jan 2015 07:31:17 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YBN6O-0002Sm-A0 for qemu-devel@nongnu.org; Wed, 14 Jan 2015 07:31:16 -0500 Received: from mx2.parallels.com ([199.115.105.18]:55428) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YBN6O-0002II-43 for qemu-devel@nongnu.org; Wed, 14 Jan 2015 07:31:12 -0500 Message-ID: <54B66159.3010906@openvz.org> Date: Wed, 14 Jan 2015 15:30:17 +0300 From: "Denis V. Lunev" MIME-Version: 1.0 References: <1419931701-19362-1-git-send-email-den@openvz.org> <1419934032-24216-1-git-send-email-den@openvz.org> <1419934032-24216-7-git-send-email-den@openvz.org> <20150113145011.GD14136@rkaganb.sw.ru> <54B53711.6040601@openvz.org> <54B57D04.3010005@openvz.org> <20150114120146.GA27418@rkaganb.sw.ru> In-Reply-To: <20150114120146.GA27418@rkaganb.sw.ru> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 16/19] block/parallels: no need to flush on each block allocation table update List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Roman Kagan , Kevin Wolf , qemu-devel@nongnu.org, Stefan Hajnoczi On 14/01/15 15:01, Roman Kagan wrote: > On Tue, Jan 13, 2015 at 11:16:04PM +0300, Denis V. Lunev wrote: >> On 13/01/15 18:17, Denis V. Lunev wrote: >>> On 13/01/15 17:50, Roman Kagan wrote: >>>> On Tue, Dec 30, 2014 at 01:07:09PM +0300, Denis V. Lunev wrote: >>>>> --- a/block/parallels.c >>>>> +++ b/block/parallels.c >>>>> @@ -194,7 +194,7 @@ static int64_t allocate_sector(BlockDriverState >>>>> *bs, int64_t sector_num) >>>>> tmp = cpu_to_le32(s->bat[idx]); >>>>> - ret = bdrv_pwrite_sync(bs->file, bat_offset(idx), &tmp, >>>>> sizeof(tmp)); >>>>> + ret = bdrv_pwrite(bs->file, bat_offset(idx), &tmp, sizeof(tmp)); >>>>> if (ret < 0) { >>>>> return ret; >>>>> } >>>> if the preceding bdrv_truncate(+cluster_size) gets reordered with this >>>> bat entry update, there's a risk that on poweroff (snapshot, etc.) the >>>> bat entry gets written to the storage while the file size is not >>>> updated. As a result, the bat entry in the file would point at a >>>> cluster which hadn't been allocated yet. When a new block is eventually >>>> added to the file, you'd have two bat entries pointing at the same >>>> cluster. >>>> >>>> The _sync version used to leave this window quite narrow due to the >>>> flush following the write. The patch makes the reordering more likely. >>>> >>>> I'm afraid the only reliable way to handle it is to put a barrier >>>> between truncate and bat update, and mitigate the costs by batching the >>>> file expansions, as you seem to do in the followup patches. >> Thinking a bit more on this. IMHO there is no problem >> with a normal workflow, the problem could come >> from a crash when the file state and a BAT state will >> be not consistent and the BAT entry will point out >> of the file (could point out of the file). > Yes that's exactly the scenario I've been talking about. > >> But it is not possible to fix at all this way. Single sync >> does not provide much warranty. > How's that? > >> We should write proper magic into ParallelsHeader->inuse >> on open and call sync immediately. On subsequent >> open of the broken image we should perform >> check consistency. The magic should be set to 0 >> on clean close. > This seems to address the problem, but only if this protocol is adhered > to by all the tools that may access the image (Parallels Desktop, > Parallels Server, ploop, etc.). Furthermore, if it is, it *has* to be > adhered to by this code, too, while adding write support, otherwise > we'll break the assumptions of those other tools. > >> I think that this stuff could be implemented separately >> in the next patchset. If you this that this is mandatory, >> OK, I can do that, be this will increase patchset a lot. >> Check consistency is not an easy stuff. > I think as a 1st approximation, instead of a full-fledged consistency > check and repair, just refusing write access to the image that has > ->inuse set would suffice. > > Roman. ok, seems reasonable