From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38592) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VsUN9-0002jr-40 for qemu-devel@nongnu.org; Mon, 16 Dec 2013 04:22:00 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VsUN4-0007bV-Gn for qemu-devel@nongnu.org; Mon, 16 Dec 2013 04:21:55 -0500 Received: from mx1.redhat.com ([209.132.183.28]:10099) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VsUN2-0007bA-6q for qemu-devel@nongnu.org; Mon, 16 Dec 2013 04:21:50 -0500 Message-ID: <52AEC613.50800@redhat.com> Date: Mon, 16 Dec 2013 17:21:23 +0800 From: Fam Zheng MIME-Version: 1.0 References: <52956083.5000409@redhat.com> <20131127060109.GG24296@G08FNSTD100614.fnst.cn.fujitsu.com> <529593C5.5060003@redhat.com> <5295C37F.40703@kamp.de> <5295C474.4080002@redhat.com> <5295C5D4.9090108@kamp.de> <20131128084859.GN24296@G08FNSTD100614.fnst.cn.fujitsu.com> <529714D8.5010304@kamp.de> <20131211073339.GE17024@G08FNSTD100614.fnst.cn.fujitsu.com> In-Reply-To: <20131211073339.GE17024@G08FNSTD100614.fnst.cn.fujitsu.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [RFC PATCH v2 5/6] qcow2: implement bdrv_preallocate List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Hu Tao , Peter Lieven Cc: Kevin Wolf , qemu-devel@nongnu.org On 2013=E5=B9=B412=E6=9C=8811=E6=97=A5 15:33, Hu Tao wrote: > On Thu, Nov 28, 2013 at 11:03:04AM +0100, Peter Lieven wrote: >> On 28.11.2013 09:48, Hu Tao wrote: >>> On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote: >>>> Am 27.11.2013 11:07, schrieb Fam Zheng: >>>>> On 2013=E5=B9=B411=E6=9C=8827=E6=97=A5 18:03, Peter Lieven wrote: >>>>>> Am 27.11.2013 07:40, schrieb Fam Zheng: >>>>>>> On 2013=E5=B9=B411=E6=9C=8827=E6=97=A5 14:01, Hu Tao wrote: >>>>>>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote: >>>>>>>>> On 2013=E5=B9=B411=E6=9C=8827=E6=97=A5 10:15, Hu Tao wrote: >>>>>>>>>> Signed-off-by: Hu Tao >>>>>>>>>> --- >>>>>>>>>> block/qcow2.c | 7 +++++++ >>>>>>>>>> 1 file changed, 7 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c >>>>>>>>>> index b054a01..a23fade 100644 >>>>>>>>>> --- a/block/qcow2.c >>>>>>>>>> +++ b/block/qcow2.c >>>>>>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDri= verState *bs, >>>>>>>>>> return 0; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t of= fset, >>>>>>>>>> + int64_t length) >>>>>>>>>> +{ >>>>>>>>>> + return bdrv_preallocate(bs->file, offset, length); >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>> What's the semantics of .bdrv_preallocate? I think you should m= ap >>>>>>>>> [offset, offset + length) to clusters in image file, and then >>>>>>>>> forward to bs->file, rather than this direct wrapper. >>>>>>>>> >>>>>>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call >>>>>>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster, >>>>>>>>> cluster_size). >>>>>>>> You mean data clusters here, right? Is there a single function t= o get >>>>>>>> the offset of the first data cluster? >>>>>>>> >>>>>>> There is a function, qcow2_get_cluster_offset. >>>>>> This should return no valid offset as long as the cluster is not a= llocated. >>>>>> >>>>>> I think you actually have to "write" all clusters of a qcow2 one b= y one. >>>>>> Eventually this write could be an fallocate call instead of a zero= write. >>>>>> >>>>> Yes, I was wrong about qcow2_get_cluster_offset. The logic here is = more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can = reuse that. >>>> What I don't like about the preallocation is that we would loose the= information that a cluster contains no valid data and would read it e.g.= during >>>> conversion. >>> So the information is stored in table and you mean we shouldn't clear >>> table when do preallocation? I'm not sure how the information could b= e >>> useful on a newly-created image, but it seems ideal to keep informati= ons >>> in table. >> When you want to e.g. convert this qcow2 later the performance is lowe= r than needed because >> you read all those preallocated sectors altough you could now they are= empty. >>> >>>> I think what we want is a preallocated image with all clusters seque= ntally mapped into the qcow2 file. Preallocate all the cluster extends, b= ut still >>>> have the information in the table that the cluster in fact has no va= lid data. So we would need a valid cluster offset while still haveing the >>>> flag that the cluster is unallocated. I think this would require tho= ughtfully checking all the cluster functions if they can easily cope with= this. >>>> >>>> The quetion is Hu, what do you want to achieve? Do you want that the= space on the filesystem is preallocated so you can't overcommit or >>>> do you also want a sequential mapping of all the clusters into the f= ile? >>> The goal is to avoid sparse file as it can cause performance problem.= So >>> the first one. I'm not sure about the second but IIUC, one fallocate(= ) >>> is enough for all clusters if they are sequentially mapped. >> If you do not premap them they are allocated in the order they are wri= tten. >> So if you are going to preallocate the whole file anyway, you should s= equentally map all clusters into the file >> AND still keep the information that they are in fact not yet written. > > Can this be achieved by first fallocate() the disk file, then allocate > metadata? This way all metadata clusters are allocated before any data > clusters, leaving all data clusters at the end of file. > I think Peter means your need to sequentially map clusters into the=20 file, so that sequential IO in guest is translated to sequential IO on=20 the image file. fallocate() or posix_fallocate() should work. You need to set zero flag=20 on the allocated cluster when mapping it in L2, instead of actually=20 writing zeros. Fam