From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=52792 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ou4By-0000BZ-CJ
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:03:03 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Ou4Bt-0001R8-H5
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:03:02 -0400
Received: from mx1.redhat.com ([209.132.183.28]:21973)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Ou4Bt-0001Qr-AS
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 10:02:57 -0400
Message-ID: <4C8A3A88.6050104@redhat.com>
Date: Fri, 10 Sep 2010 17:02:48 +0300
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>	<4C84E738.3020802@codemonkey.ws>	<4C865187.6090508@redhat.com>	<4C865CFE.7010508@codemonkey.ws>	<4C8663C4.1090508@redhat.com>	<4C866773.2030103@codemonkey.ws>	<4C86BC6B.5010809@codemonkey.ws>	<4C874812.9090807@redhat.com>	<4C87860A.3060904@codemonkey.ws>	<4C888287.8020209@redhat.com>	<4C88D7CC.5000806@codemonkey.ws>	<4C8A1311.8070903@redhat.com>	<4C8A15C4.40201@redhat.com>
	<AANLkTi=tt+Zh_i15LFz36kOdD8sFBKpJ2wsPRGM=3zkV@mail.gmail.com>
	<4C8A19CA.3040000@redhat.com> <4C8A3106.8050501@codemonkey.ws>
In-Reply-To: <4C8A3106.8050501@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, qemu-devel@nongnu.org

On 09/10/2010 04:22 PM, Anthony Liguori wrote:
>> Looks like it depends on fsck, which is not a good idea for large 
>> images.
>
>
> fsck will always be fast on qed because the metadata is small.  For a 
> 1PB image, there's 128MB worth of L2s if it's fully allocated

It's 32,000 seeks.
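
The derivation of that figure isn't spelled out here.  One way to land
on a number of this order is to assume each metadata read is a separate
4KB random read; both the read size and the seek latency below are
assumptions for illustration, not measurements or format constants.

```python
# Back-of-envelope estimate; read size and seek latency are assumptions.
L2_METADATA_BYTES = 128 * 1024 * 1024  # 128MB of L2 tables (from the quoted text)
READ_SIZE_BYTES = 4 * 1024             # assumed: one 4KB read per seek
SEEK_TIME_S = 0.008                    # assumed: ~8 ms per random seek


def fsck_seeks(metadata_bytes, read_size_bytes):
    """Number of random reads needed if every read costs one seek."""
    return -(-metadata_bytes // read_size_bytes)  # ceiling division


seeks = fsck_seeks(L2_METADATA_BYTES, READ_SIZE_BYTES)
estimated_seconds = seeks * SEEK_TIME_S
print(seeks, "seeks, roughly", round(estimated_seconds), "seconds")
```

Under those assumed figures a single scan is bounded by seek latency
rather than by the 128MB of data transferred.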

> (keeping in mind, that once you're fully allocated, you'll never fsck 
> again).

Why?  A fully populated L1 (so all L2s are allocated) doesn't mean a fully
allocated image.  You're still allocating clusters and linking them into L2s.

>   If you've got 1PB worth of storage, I'm fairly sure you're going to 
> be able to do 128MB of reads in a short period of time.  Even if it's 
> a few seconds, it only occurs on power failure so it's pretty reasonable.

Consider a cloud recovering from power loss: even if you're fscking
thousands of 100GB images, you'll create a horrible seek storm on your
storage (to be followed by a seek storm from all the guests booting).

No, fsck is not a good idea.

>
>>> I need to look at the actual ATA and SCSI specs for how this will
>>> work.  The issue I am concerned with is sub-cluster trim operations.
>>> If the trim region is less than a cluster, then both qed and qcow2
>>> don't really have a way to handle it.  Perhaps we could punch a hole
>>> in the file, given a userspace interface to do this, but that isn't
>>> ideal because we're losing sparseness again.
>>
>> To deal with a sub-cluster TRIM, look at the surrounding sectors.  If 
>> they're zero, free the cluster.  If not, write zeros or use 
>> sys_punch() to the range specified by TRIM.
>
> Better yet, if you can't trim a full cluster, just write out zeros and 
> have a separate background process that punches out zero clusters.
>

That can work as well, or a combination perhaps.
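
A rough sketch of the combination, on a toy in-memory image rather than
a real qed or qcow2 file.  ToyImage, its methods and the 64KB cluster
size are all hypothetical names and values chosen for illustration;
nothing here is QEMU code.  trim() frees a cluster when everything
outside the trimmed range is already zero and otherwise writes zeros;
compact() is the background pass that frees all-zero clusters.

```python
CLUSTER_SIZE = 64 * 1024  # assumed cluster size, for illustration only


class ToyImage:
    """In-memory stand-in for an image: cluster index -> bytearray.

    A cluster missing from the dict is unallocated and reads as zeros.
    """

    def __init__(self):
        self.clusters = {}

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, CLUSTER_SIZE)
            n = min(CLUSTER_SIZE - start, len(data) - pos)
            cluster = self.clusters.setdefault(index, bytearray(CLUSTER_SIZE))
            cluster[start:start + n] = data[pos:pos + n]
            pos += n

    def trim(self, offset, length):
        end = offset + length
        while offset < end:
            index, start = divmod(offset, CLUSTER_SIZE)
            stop = min(CLUSTER_SIZE, start + (end - offset))
            cluster = self.clusters.get(index)
            if cluster is not None:
                # Look at the data surrounding the trimmed range.
                rest_is_zero = not any(cluster[:start]) and not any(cluster[stop:])
                if rest_is_zero:
                    del self.clusters[index]  # free the whole cluster
                else:
                    cluster[start:stop] = bytes(stop - start)  # write zeros
            offset += stop - start

    def compact(self):
        """Background pass: free clusters that contain only zeros."""
        zero = [i for i, c in self.clusters.items() if not any(c)]
        for i in zero:
            del self.clusters[i]
        return len(zero)
```

The background pass also catches clusters the guest zeroed with
ordinary writes, which is why it helps independently of guest trims.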

> That approach is a bit more generic and will help compact images 
> independently of guest trims.

You still need a freelist.
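
Presumably the point is that freed clusters leave holes that later
allocations have to find again, or the image file only ever grows.  A
minimal in-memory sketch follows; ClusterAllocator is a hypothetical
name, and how a real format would persist the list on disk is not
covered here.

```python
class ClusterAllocator:
    """Toy allocator: reuse freed clusters before growing the file."""

    def __init__(self):
        self.next_cluster = 0  # current file size, in clusters
        self.free = []         # the freelist: indices of freed clusters

    def alloc(self):
        if self.free:
            return self.free.pop()  # reuse a hole left by a freed cluster
        index = self.next_cluster   # otherwise append at the end of the file
        self.next_cluster += 1
        return index

    def release(self, index):
        self.free.append(index)


allocator = ClusterAllocator()
first = allocator.alloc()
second = allocator.alloc()
allocator.release(first)
reused = allocator.alloc()  # comes from the freelist, file does not grow
```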

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.