Message-ID: <4F6093EC.2090303@redhat.com>
Date: Wed, 14 Mar 2012 13:49:48 +0100
From: Paolo Bonzini
In-Reply-To: <4F60911E.6030201@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
To: Kevin Wolf
Cc: Richard Laager, qemu-devel@nongnu.org

On 14/03/2012 13:37, Kevin Wolf wrote:
> On 14.03.2012 13:14, Paolo Bonzini wrote:
>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>> is it really faster than a normal write when we have to emulate it by a
>>> bdrv_write with a temporary buffer of zeros?
>>
>> No, of course not.
>>
>>> On the other hand we have
>>> the cases where discard really means "I don't care about the data any
>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>
>>> So I think we only want to advertise that discard zeroes data if we can
>>> do it efficiently. This means that the format does support it, and that
>>> the device is able to communicate the discard granularity (= cluster
>>> size) to the guest OS.
>>
>> Note that the discard granularity is only a hint, so it's really more a
>> maximum suggested value than a granularity. Outside of a cluster
>> boundary the format would still have to write zeros manually.
>
> You're talking about SCSI here, I guess? Would be one case where being
> able to define sane semantics for virtio-blk would have been an
> advantage... I had hoped that SCSI was already sane, but if it doesn't
> distinguish between "I don't care about this any more" and "I want to
> have zeros here", then I'm afraid I can't call it sane any more.

It does make the distinction. "I don't care" is UNMAP (or WRITE
SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
SAME(10) or WRITE SAME(16) with an all-zero payload.

> We can make the conditions even stricter, i.e. allow it only if the protocol
> can pass through discards for unaligned requests. This wouldn't free
> clusters on an image format level, but at least on a file system level.
>
>> Also, Linux for example will only round the number of sectors down to
>> the granularity, not the start sector. Rereading the code, for SCSI we
>> want to advertise a zero granularity (aka do whatever you want),
>> otherwise we may get only misaligned discard requests and end up writing
>> zeroes inefficiently all the time.
>
> Does this make sense with real hardware or is it a Linux bug?

It's a bug: SCSI defines the "optimal unmap request starting LBA" to be
"(n × optimal unmap granularity) + unmap granularity alignment".
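
To make the alignment rule concrete, here is a minimal sketch (Python, for
illustration only; it is not QEMU or Linux code). It splits a request into
the part that starts on an LBA of the form n * granularity + alignment and
covers whole granules, plus the unaligned head and tail that a format would
still have to zero with ordinary writes. The function name split_discard and
the tuple layout are hypothetical, chosen just for this example.

```python
def split_discard(start, count, granularity, alignment=0):
    """Split a sector range into (head, aligned, tail) pieces.

    Each piece is a (first_sector, sector_count) tuple. The aligned piece
    starts at an LBA of the form n * granularity + alignment and spans a
    whole number of granules; head and tail are the unaligned leftovers.
    """
    end = start + count
    if granularity <= 1:
        # Zero (or one-sector) granularity: the whole request is fine as-is.
        return (start, 0), (start, count), (end, 0)
    # Round the start up to the next n * granularity + alignment.
    first = start + (alignment - start) % granularity
    # Round the end down onto the same grid.
    last = end - (end - alignment) % granularity
    if first >= last:
        # No whole granule fits inside the range: all of it is unaligned.
        return (start, count), (end, 0), (end, 0)
    return (start, first - start), (first, last - first), (last, end - last)


if __name__ == "__main__":
    # Only the sector count was rounded down to a multiple of 64; the start
    # sector (10) is still off the grid.
    print(split_discard(10, 192, 64))
```

With a granularity of 64 and a start sector of 10, rounding only the count
(to 192) still leaves 54 unaligned sectors at the front and 10 at the back;
only sectors 64..191 can be discarded as whole granules.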

>> The problem is that advertising discard_zeroes_data based on the backend
>> calls for trouble as soon as you migrate between storage formats,
>> filesystems or disks.
>
> True. You would have to emulate if you migrate from a source that can
> discard to zeros efficiently to a destination that can't.
>
> In the end, I guess we'll just have to accept that we can't fix bad
> semantics of ATA and SCSI, and just need to decide whether "I don't
> care" or "I want to have zeros" is more common. My feeling is that "I
> don't care" is the more useful operation because it can't be expressed
> otherwise, but I haven't checked what guests really do.

Yeah, guests right now only use it for unused filesystem pieces, so the
"do not care" semantics are fine. I also hoped to use discard to avoid
blowing up thin-provisioned images when streaming. Perhaps we can use
bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
the block driver.

Anyhow, there are some patches from this series that are relatively
independent and ready for inclusion; I'll extract them and post them
separately.

Paolo