Message-ID: <4C888451.9070306@redhat.com>
Date: Thu, 09 Sep 2010 09:53:05 +0300
From: Avi Kivity
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>
 <4C84E738.3020802@codemonkey.ws> <4C865187.6090508@redhat.com>
 <4C865CFE.7010508@codemonkey.ws> <4C8663C4.1090508@redhat.com>
 <4C866773.2030103@codemonkey.ws> <4C86BC6B.5010809@codemonkey.ws>
 <4C874812.9090807@redhat.com> <395D4377-00F9-4765-94C4-470BDFA1F96E@suse.de>
 <4C874F22.6060802@redhat.com>
List-Id: qemu-devel.nongnu.org
To: Stefan Hajnoczi
Cc: Kevin Wolf, qemu-devel@nongnu.org, Alexander Graf, Stefan Hajnoczi

On 09/08/2010 02:15 PM, Stefan Hajnoczi wrote:
> 3. Metadata update reaches disk but data does not.  The interesting
> case!  The L2 table now points to a cluster which is beyond the last
> cluster in the image file.  Remember that file size is rounded down by
> cluster size, so partial data writes are discarded and this case
> applies.
>
> Now we're in trouble.
> The image cannot be accessed without some
> sanity checking because not only do table entries point to invalid
> clusters, but new allocating writes might make previously invalid
> cluster offsets valid again (then there would be two or more table
> entries pointing to the same cluster)!
>
> Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed
> header to detect a crashed image when opening the image file.  If no
> crash has occurred, then the mounted bit is unset and normal operation
> is safe.  If the mounted bit is set, then a check of the L1/L2 tables
> must be performed and any invalid cluster offsets must be cleared to
> zero.  When an invalid cluster is cleared to zero, we arrive back at
> case 1 above: neither data write nor metadata update reached the disk,
> and we are in a safe state.

While fsck has a lovely ext2 retro feel, there's a reason it's shunned:
it can take quite a while to run.  A fully loaded L1 with 32K entries
will require 32K random I/Os (one L2 table read per entry), which takes
over 5 minutes on a disk that provides 100 IOPS.  On a large shared
disk, you'll have a lot more IOPS in total, but likely far fewer IOPS
per guest, so if you have a power loss, fsck time per guest will likely
be longer (irrespective of guest size).

Preallocation, on the other hand, is amortized, or you can piggy-back
its fsync on a guest flush.  Note it's equally applicable to qcow2 and
qed.

-- 
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
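
For illustration only, here is a rough sketch of the dirty-bit check
described in the quoted text, together with the cost estimate from the
reply.  It is not the qed code: the in-memory layout (a list for L1, a
dict of lists for the L2 tables) and the names check_tables and
estimated_seconds are hypothetical, validation of the L1 entries
themselves is left out, and each L2 table is assumed to cost exactly
one random I/O.

```python
def check_tables(l1, l2_tables, file_size, cluster_size):
    """Clear L2 entries that point past the end of the image file.

    l1        -- list of L2 table offsets, 0 meaning "unallocated"
    l2_tables -- dict mapping an L2 table offset to its list of
                 data cluster offsets (hypothetical in-memory model)
    Returns (number of L2 tables read, number of entries cleared).
    """
    # File size is rounded down to a whole number of clusters, so a
    # partially written trailing cluster does not count as valid.
    usable = (file_size // cluster_size) * cluster_size

    l2_reads = 0
    cleared = 0
    for l2_offset in l1:
        if l2_offset == 0:
            continue  # no L2 table allocated for this L1 entry
        l2_reads += 1  # assumed: one random I/O per L2 table
        table = l2_tables[l2_offset]
        for i, cluster_offset in enumerate(table):
            if cluster_offset == 0:
                continue  # unallocated, already the safe state
            if cluster_offset + cluster_size > usable:
                table[i] = 0  # back to "neither write reached disk"
                cleared += 1
    return l2_reads, cleared


def estimated_seconds(l2_reads, iops):
    """Time to read l2_reads tables at a given random-I/O rate."""
    return l2_reads / iops


if __name__ == "__main__":
    # Fully loaded L1 with 32K entries on a 100 IOPS disk.
    seconds = estimated_seconds(32 * 1024, 100)
    print(f"{seconds:.2f} s, about {seconds / 60:.1f} minutes")
```

The scan itself is cheap in CPU terms; what dominates is the one
random read per allocated L1 entry, which is why the estimate scales
with the number of L2 tables rather than with the amount of guest data.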