From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=52792 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ou4By-0000BZ-CJ
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:03:03 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from ) id 1Ou4Bt-0001R8-H5
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:03:02 -0400
Received: from mx1.redhat.com ([209.132.183.28]:21973)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from ) id 1Ou4Bt-0001Qr-AS
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:02:57 -0400
Message-ID: <4C8A3A88.6050104@redhat.com>
Date: Fri, 10 Sep 2010 17:02:48 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>
	<4C84E738.3020802@codemonkey.ws> <4C865187.6090508@redhat.com>
	<4C865CFE.7010508@codemonkey.ws> <4C8663C4.1090508@redhat.com>
	<4C866773.2030103@codemonkey.ws> <4C86BC6B.5010809@codemonkey.ws>
	<4C874812.9090807@redhat.com> <4C87860A.3060904@codemonkey.ws>
	<4C888287.8020209@redhat.com> <4C88D7CC.5000806@codemonkey.ws>
	<4C8A1311.8070903@redhat.com> <4C8A15C4.40201@redhat.com>
	<4C8A19CA.3040000@redhat.com> <4C8A3106.8050501@codemonkey.ws>
In-Reply-To: <4C8A3106.8050501@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/10/2010 04:22 PM, Anthony Liguori wrote:
>> Looks like it depends on fsck, which is not a good idea for large
>> images.
>
>
> fsck will always be fast on qed because the metadata is small.  For a
> 1PB image, there's 128MB worth of L2s if it's fully allocated

It's 32,000 seeks.

> (keeping in mind, that once you're fully allocated, you'll never fsck
> again).

Why?  Fully populated L1 (so all L2s are allocated) doesn't mean a fully
allocated image.  You're still allocating and linking into L2s.

> If you've got 1PB worth of storage, I'm fairly sure you're going to
> be able to do 128MB of reads in a short period of time.  Even if it's
> a few seconds, it only occurs on power failure so it's pretty reasonable.

Consider a cloud recovering from power loss, even if you're fscking
thousands of 100GB images you'll create a horrible seek storm on your
storage (to be followed by a seek storm from all the guests booting).
No, fsck is not a good idea.

>
>>> I need to look at the actual ATA and SCSI specs for how this will
>>> work.  The issue I am concerned with is sub-cluster trim operations.
>>> If the trim region is less than a cluster, then both qed and qcow2
>>> don't really have a way to handle it.  Perhaps we could punch a hole
>>> in the file, given a userspace interface to do this, but that isn't
>>> ideal because we're losing sparseness again.
>>
>> To deal with a sub-cluster TRIM, look at the surrounding sectors.  If
>> they're zero, free the cluster.  If not, write zeros or use
>> sys_punch() to the range specified by TRIM.
>
> Better yet, if you can't trim a full cluster, just write out zeros and
> have a separate background process that punches out zero clusters.
>

That can work as well, or a combination perhaps.

> That approach is a bit more generic and will help compact images
> independently of guest trims.

You still need a freelist.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
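
The sub-cluster TRIM handling discussed in the message can be sketched roughly as follows. This is a toy in-memory model, not QED or qcow2 code: the `Image` class, its `clusters` dict, its `freelist`, and the 64KB cluster size are all assumptions made for illustration, and zeroing a byte range stands in for "write zeros or punch a hole". The model also does not distinguish logical from physical clusters, which a real image format would.

```python
CLUSTER_SIZE = 64 * 1024  # assumed cluster size, for illustration only


class Image:
    """Toy model of an image: allocated clusters plus a freelist (hypothetical)."""

    def __init__(self, cluster_size=CLUSTER_SIZE):
        self.cluster_size = cluster_size
        self.clusters = {}   # cluster index -> bytearray holding the cluster data
        self.freelist = []   # indexes of clusters released by trim

    def write(self, offset, data):
        """Write data at a byte offset, allocating clusters as needed."""
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, self.cluster_size)
            n = min(self.cluster_size - start, len(data) - pos)
            if index in self.freelist:
                self.freelist.remove(index)  # cluster is being reused
            buf = self.clusters.setdefault(index, bytearray(self.cluster_size))
            buf[start:start + n] = data[pos:pos + n]
            pos += n

    def trim(self, offset, length):
        """Trim a byte range, freeing clusters whose remaining data is all zero."""
        end = offset + length
        pos = offset
        while pos < end:
            index, start = divmod(pos, self.cluster_size)
            stop = min(self.cluster_size, start + (end - pos))
            buf = self.clusters.get(index)
            if buf is not None:
                # "Surrounding sectors": the part of the cluster outside the range.
                if not any(buf[:start]) and not any(buf[stop:]):
                    del self.clusters[index]      # nothing else live: free it
                    self.freelist.append(index)
                else:
                    buf[start:stop] = bytes(stop - start)  # zero only the range
            pos += stop - start


img = Image(cluster_size=16)
img.write(0, b"A" * 4)    # cluster 0: only the first 4 bytes are non-zero
img.write(16, b"B" * 16)  # cluster 1: completely filled
img.trim(0, 4)            # the rest of cluster 0 is zero, so it is freed
img.trim(20, 4)           # cluster 1 still holds data, so the range is zeroed
```

A full-cluster trim falls out of the same check, because nothing in the cluster lies outside the trimmed range. The background pass that punches out zero clusters, suggested in the quoted text, is not modeled here.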