From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:35133) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bQD8a-0006mv-CT for qemu-devel@nongnu.org; Thu, 21 Jul 2016 08:31:37 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bQD8Y-0002TL-0q for qemu-devel@nongnu.org; Thu, 21 Jul 2016 08:31:35 -0400 References: <578E4708.5080308@redhat.com> <20160720033402.GA7641@ad.usersys.redhat.com> <578EF446.70202@redhat.com> <20160720043709.GA10539@ad.usersys.redhat.com> <913397c9-6edc-2561-3d2e-e32032f9db22@redhat.com> <20160720073836.GF10539@ad.usersys.redhat.com> <1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com> <20160720123025.GO2031@devil.localdomain> <20160720133517.GE23632@ndevos-x240.usersys.redhat.com> <20160721114342.GQ2031@devil.localdomain> From: =?UTF-8?Q?P=c3=a1draig_Brady?= Message-ID: <5790C099.9020900@draigBrady.com> Date: Thu, 21 Jul 2016 13:31:21 +0100 MIME-Version: 1.0 In-Reply-To: <20160721114342.GQ2031@devil.localdomain> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Dave Chinner , Niels de Vos Cc: Paolo Bonzini , Fam Zheng , Eric Blake , Kevin Wolf , qemu-block@nongnu.org, qemu-devel@nongnu.org, Max Reitz , Lukas Czerner On 21/07/16 12:43, Dave Chinner wrote: > On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote: >> On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote: >>> On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote: >>>> Adding ext4 and XFS guys (Lukas and Dave respectively). As a quick = recap, the >>>> issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which= we use in >>>> "qemu-img map". This command prints metadata about a virtual disk i= mage---which >>>> in the case of a raw image amounts to detecting holes and unwritten = extents. >>>> >>>> First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_= NONZERO" and >>>> "SEEK_ZERO", on both ext4 and XFS. You can see that unwritten ext= ents are >>>> reported by "qemu-img map" as holes: >>> >>> Correctly so. seek hole/data knows nothing about the underlying >>> filesystem storage implementation. A "hole" is defined as a region >>> of zeros, and the filesystem is free call anything it kows for >>> certain contains zeros as a hole. >>> >>> FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris >>> seek API that uses these definitions for hole and data.... >>> >>>> $ dd if=3D/dev/urandom of=3Dtest.img bs=3D1M count=3D100 >>>> $ fallocate -z -o 10M -l 10M test.img >>>> $ du -h test.img >>>> $ qemu-img map --output=3Djson test.img >>>> [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "d= ata": true, "offset": 0}, >>>> { "start": 10485760, "length": 10485760, "depth": 0, "zero": tru= e, "data": false, "offset": 10485760}, >>>> { "start": 20971520, "length": 83886080, "depth": 0, "zero": fal= se, "data": true, "offset": 20971520}] >>> >>>> >>>> On the second line, zero=3Dtrue data=3Dfalse identifies a hole. The= right output >>>> would either have zero=3Dtrue data=3Dtrue (unwritten extent) or just >>>> >>>> [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "= data": true, "offset": 0}, >>>> >>>> since the zero flag is advisory (it doesn't try to detect zeroes bey= ond what the >>>> filesystem says). >>> >>> Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start >>> of a range of zeros" and "this is the start of a range of data". And >>> for filesystems that don't specifically implement these seek >>> operations, zeros are considered data. i.e. SEEK_HOLE will take you >>> to the end of file, SEEK_DATA returns the current position.... >>> >>> i.e. unwritten extents contain no data, so they are semantically >>> identical to holes for the purposes of seeking and hence SEEK_DATA >>> can skip over them. >>> >>>> The reason why we disabled FIEMAP was a combination of a corruption = and performance >>>> issue. The data corruption bug was at https://bugs.launchpad.net/qe= mu/+bug/1368815 >>>> and it was reported on Ubuntu Trusty (kernel 3.13 based on the relea= se notes at >>>> https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes). We corrected that= by using >>>> FIEMAP_FLAG_SYNC, based on a similar patch to coreutils. This turne= d out to be too >>>> slow, so we dropped FIEMAP altogether. >>> >>> Yes, because FIEMAP output is only useful for diagnostic purposes as >>> it can be stale even before the syscall returns to userspace. i.e. >>> it can't be used in userspace for optimising file copies.... >> >> Oh... And I was surprised to learn that "cp" does use FIEMAP and not >> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that. >> http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.= c#n87 >=20 > My understanding was that FIEMAP was disabled almost immediately > after the data corruption problems were understood, and it hasn't > been turned back on since. I don't see FIEMAP calls in strace when I > copy sparse files, even when I use --sparse=3Dauto which is supposed > to trigger the sparse source file checks using FIEMAP. >=20 > So, yeah, while there's FIEMAP code present, that doesn't mean it's > actually used. And, yes, there are comments in coreutils commits in > 2011 saying that the FIEMAP workarounds are temporary until > SEK_HOLE/DATA are supported, which were added to the kernel in 2011 > soon after the 'cp-corrupts-data-with-fiemap' debacle came to light. > Five years later, and the fiemap code is stil there? Yes it's still there, and hasn't caused issues. It's used only for sparse files: $ truncate -s1G t.fiemap $ strace -e ioctl cp t.fiemap t2.fiemap ioctl(3, FS_IOC_FIEMAP, 0x7ffff007c910) =3D 0 It's a pity fiemap is racy (on some file systems?) wrt reporting of data not yet written out, and is thus fairly useless IMHO without the FIEMAP_FLAG_SYNC workaround. We _are_ planning to use SEEK_HOLE in future. It would be nice though to have a way to distinguish zeros from holes to efficiently/consistently maintain allocations to avoid future ENOSPC or inefficient future allocations. thanks, P=E1draig