From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=50427 helo=eggs.gnu.org) by lists.gnu.org
	with esmtp (Exim 4.43) id 1Ontk5-00077e-Uv for qemu-devel@nongnu.org;
	Tue, 24 Aug 2010 09:40:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from ) id 1Ontk1-0002Zl-PB for qemu-devel@nongnu.org;
	Tue, 24 Aug 2010 09:40:45 -0400
Received: from mail-iw0-f173.google.com ([209.85.214.173]:52088) by eggs.gnu.org
	with esmtp (Exim 4.69) (envelope-from ) id 1Ontk1-0002Zf-Ld
	for qemu-devel@nongnu.org; Tue, 24 Aug 2010 09:40:41 -0400
Received: by iwn38 with SMTP id 38so2919862iwn.4 for ;
	Tue, 24 Aug 2010 06:40:41 -0700 (PDT)
Message-ID: <4C73CBD6.7000900@codemonkey.ws>
Date: Tue, 24 Aug 2010 08:40:38 -0500
From: Anthony Liguori
MIME-Version: 1.0
References: <1282646430-5777-1-git-send-email-kwolf@redhat.com>
	<4C73C2BF.8050300@codemonkey.ws> <4C73C622.7080808@redhat.com>
	<4C73C926.3010901@codemonkey.ws> <4C73C9CF.7090800@redhat.com>
	<4C73CAA9.2060104@codemonkey.ws> <4C73CB85.9010306@redhat.com>
In-Reply-To: <4C73CB85.9010306@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use
	bdrv_(p)write_sync for metadata writes"
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
To: Avi Kivity
Cc: Kevin Wolf , stefanha@gmail.com, mjt@tls.msk.ru, qemu-devel@nongnu.org,
	hch@lst.de

On 08/24/2010 08:39 AM, Avi Kivity wrote:
> On 08/24/2010 04:35 PM, Anthony Liguori wrote:
>>> It's about metadata writes. If an operation changes metadata, we
>>> must sync it to disk before writing any data or other metadata which
>>> depends on it, regardless of any promises to the guest.
>>
>> Why? If the metadata isn't synced, we lose the write.
>>
>> But that can happen anyway because we're not syncing the data.
>>
>> We need to sync the metadata in the event of a guest-initiated flush,
>> but we shouldn't need to for a normal write.
>
> 1. Allocate a cluster (increase refcount table)
>
> 2. Link cluster to L2 table
>
> 3. Second operation makes it to disk; first still in pagecache
>
> 4. Crash
>
> 5. Dangling pointer from L2 to freed cluster

Yes, we're having this discussion on IRC right now.

The problem is that we maintain a refcount table. If we didn't do
internal disk snapshots, we wouldn't have this problem.

IOW, VMDK doesn't have this problem, so the answer to my very first
question is that qcow2 is too difficult a format to get right.

Regards,

Anthony Liguori