From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:35133)
by lists.gnu.org with esmtp (Exim 4.71)
(envelope-from
) id 1bQD8a-0006mv-CT
for qemu-devel@nongnu.org; Thu, 21 Jul 2016 08:31:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
(envelope-from
) id 1bQD8Y-0002TL-0q
for qemu-devel@nongnu.org; Thu, 21 Jul 2016 08:31:35 -0400
References: <578E4708.5080308@redhat.com>
<20160720033402.GA7641@ad.usersys.redhat.com>
<578EF446.70202@redhat.com>
<20160720043709.GA10539@ad.usersys.redhat.com>
<913397c9-6edc-2561-3d2e-e32032f9db22@redhat.com>
<20160720073836.GF10539@ad.usersys.redhat.com>
<1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com>
<20160720123025.GO2031@devil.localdomain>
<20160720133517.GE23632@ndevos-x240.usersys.redhat.com>
<20160721114342.GQ2031@devil.localdomain>
From: =?UTF-8?Q?P=c3=a1draig_Brady?=
Message-ID: <5790C099.9020900@draigBrady.com>
Date: Thu, 21 Jul 2016 13:31:21 +0100
MIME-Version: 1.0
In-Reply-To: <20160721114342.GQ2031@devil.localdomain>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was
Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Dave Chinner , Niels de Vos
Cc: Paolo Bonzini , Fam Zheng , Eric Blake , Kevin Wolf , qemu-block@nongnu.org, qemu-devel@nongnu.org, Max Reitz , Lukas Czerner
On 21/07/16 12:43, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
>> On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
>>> On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
>>>> Adding ext4 and XFS guys (Lukas and Dave respectively). As a quick =
recap, the
>>>> issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which=
we use in
>>>> "qemu-img map". This command prints metadata about a virtual disk i=
mage---which
>>>> in the case of a raw image amounts to detecting holes and unwritten =
extents.
>>>>
>>>> First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_=
NONZERO" and
>>>> "SEEK_ZERO", on both ext4 and XFS. You can see that unwritten ext=
ents are
>>>> reported by "qemu-img map" as holes:
>>>
>>> Correctly so. seek hole/data knows nothing about the underlying
>>> filesystem storage implementation. A "hole" is defined as a region
>>> of zeros, and the filesystem is free call anything it kows for
>>> certain contains zeros as a hole.
>>>
>>> FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
>>> seek API that uses these definitions for hole and data....
>>>
>>>> $ dd if=3D/dev/urandom of=3Dtest.img bs=3D1M count=3D100
>>>> $ fallocate -z -o 10M -l 10M test.img
>>>> $ du -h test.img
>>>> $ qemu-img map --output=3Djson test.img
>>>> [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "d=
ata": true, "offset": 0},
>>>> { "start": 10485760, "length": 10485760, "depth": 0, "zero": tru=
e, "data": false, "offset": 10485760},
>>>> { "start": 20971520, "length": 83886080, "depth": 0, "zero": fal=
se, "data": true, "offset": 20971520}]
>>>
>>>>
>>>> On the second line, zero=3Dtrue data=3Dfalse identifies a hole. The=
right output
>>>> would either have zero=3Dtrue data=3Dtrue (unwritten extent) or just
>>>>
>>>> [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "=
data": true, "offset": 0},
>>>>
>>>> since the zero flag is advisory (it doesn't try to detect zeroes bey=
ond what the
>>>> filesystem says).
>>>
>>> Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
>>> of a range of zeros" and "this is the start of a range of data". And
>>> for filesystems that don't specifically implement these seek
>>> operations, zeros are considered data. i.e. SEEK_HOLE will take you
>>> to the end of file, SEEK_DATA returns the current position....
>>>
>>> i.e. unwritten extents contain no data, so they are semantically
>>> identical to holes for the purposes of seeking and hence SEEK_DATA
>>> can skip over them.
>>>
>>>> The reason why we disabled FIEMAP was a combination of a corruption =
and performance
>>>> issue. The data corruption bug was at https://bugs.launchpad.net/qe=
mu/+bug/1368815
>>>> and it was reported on Ubuntu Trusty (kernel 3.13 based on the relea=
se notes at
>>>> https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes). We corrected that=
by using
>>>> FIEMAP_FLAG_SYNC, based on a similar patch to coreutils. This turne=
d out to be too
>>>> slow, so we dropped FIEMAP altogether.
>>>
>>> Yes, because FIEMAP output is only useful for diagnostic purposes as
>>> it can be stale even before the syscall returns to userspace. i.e.
>>> it can't be used in userspace for optimising file copies....
>>
>> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
>> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
>> http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.=
c#n87
>=20
> My understanding was that FIEMAP was disabled almost immediately
> after the data corruption problems were understood, and it hasn't
> been turned back on since. I don't see FIEMAP calls in strace when I
> copy sparse files, even when I use --sparse=3Dauto which is supposed
> to trigger the sparse source file checks using FIEMAP.
>=20
> So, yeah, while there's FIEMAP code present, that doesn't mean it's
> actually used. And, yes, there are comments in coreutils commits in
> 2011 saying that the FIEMAP workarounds are temporary until
> SEK_HOLE/DATA are supported, which were added to the kernel in 2011
> soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
> Five years later, and the fiemap code is stil there?
Yes it's still there, and hasn't caused issues.
It's used only for sparse files:
$ truncate -s1G t.fiemap
$ strace -e ioctl cp t.fiemap t2.fiemap
ioctl(3, FS_IOC_FIEMAP, 0x7ffff007c910) =3D 0
It's a pity fiemap is racy (on some file systems?) wrt reporting
of data not yet written out, and is thus fairly useless IMHO
without the FIEMAP_FLAG_SYNC workaround.
We _are_ planning to use SEEK_HOLE in future.
It would be nice though to have a way to distinguish zeros from holes
to efficiently/consistently maintain allocations to avoid future ENOSPC
or inefficient future allocations.
thanks,
P=E1draig