From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:45423)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1bQDbg-0000DA-5t for qemu-devel@nongnu.org;
 Thu, 21 Jul 2016 09:01:44 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1bQDbe-0001o2-0v for qemu-devel@nongnu.org;
 Thu, 21 Jul 2016 09:01:39 -0400
References: <1468901281-22858-1-git-send-email-eblake@redhat.com>
 <20160720033402.GA7641@ad.usersys.redhat.com>
 <578EF446.70202@redhat.com>
 <20160720043709.GA10539@ad.usersys.redhat.com>
 <913397c9-6edc-2561-3d2e-e32032f9db22@redhat.com>
 <20160720073836.GF10539@ad.usersys.redhat.com>
 <1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com>
 <20160720123025.GO2031@devil.localdomain>
 <360732077.8875393.1469022006074.JavaMail.zimbra@redhat.com>
 <20160721124119.GR2031@devil.localdomain>
From: Pádraig Brady
Message-ID: <5790C7A8.3010202@draigBrady.com>
Date: Thu, 21 Jul 2016 14:01:28 +0100
MIME-Version: 1.0
In-Reply-To: <20160721124119.GR2031@devil.localdomain>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC
 (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
To: Dave Chinner, Paolo Bonzini
Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block@nongnu.org,
 qemu-devel@nongnu.org, Max Reitz, Lukas Czerner, Niels de Vos

On 21/07/16 13:41, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
>>>> 1) is it expected that SEEK_HOLE skips unwritten extents?
>>>
>>> There are multiple answers to this, all of which are correct depending
>>> on current context and state:
>>>
>>> 1. No - some filesystems will report clean unwritten extents as holes.
>>>
>>> 2. Yes - some filesystems will report clean unwritten extents as data.
>>>
>>> 3. Maybe - if there is written data in memory over the unwritten
>>> extent on disk (i.e. hasn't been flushed to disk), it will be
>>> considered a data region with non-zero data. (FIEMAP will still
>>> report it as unwritten)
>>
>> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
>> in this case (not FIEMAP_EXTENT_UNWRITTEN).
>
> No. FIEMAP only returns the known extent state at the given file
> offset. "delalloc" extents exist in memory, indicating the space
> has already been accounted for over that offset, but the extent has
> not been physically allocated. Like all other types of extents,
> there may or may not be valid data over a delayed allocation extent.
>
> IOWs, fiemap only gives you a snapshot of extent state, not the
> ranges of valid data in the file.
>
>>>> If not, would
>>>> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
>>>> would be similar to what SEEK_HOLE/SEEK_DATA do now?
>>>
>>> To solve what problem? You haven't explained what problem you are
>>> trying to solve yet.
>>>
>>>> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what
>>>> filesystems and kernel releases is it really not needed?
>>>
>>> I can't answer this question, either, because I don't know what
>>> you want the fiemap information for.
>>
>> The answer is the same no matter if we use both lseek and FIEMAP, so
>> I'll answer just once. We want to do two things:
>>
>> 1) avoid copying zero data, to keep the copy process efficient. For this,
>> SEEK_HOLE/SEEK_DATA are enough.
>>
>> 2) copy file contents while preserving the allocation state of the file's extents.
>
> Which is /very difficult/ to do safely and reliably.
>
> We do actually do reliable, safe, exact hole and preallocation
> layout duplication with xfs_fsr, but that uses kernel provided
> cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
> file has not changed while it was being copied before executing the
> final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
> the new copy of the data user visible.
>
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only possible if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> point in time you finalise the destination data copy.
>
>> There can be various reasons why the user has preallocated the file (because they
>> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
>> minimize cases where io_submit is very un-asynchronous; or just because someone
>> had a reason to do a BLKZEROOUT ioctl on the virtual disk). We want to preserve
>> these while converting or otherwise moving the file around.
>
> Sure, there's many reasons for using prealloc/punch/zero. The real
> difference to other file operations is that they interface with low
> level filesystem structure, not the data contained within the
> extents. That's what makes them problematic for duplication -
> userspace cannot serialise against low level filesystem structure
> modifications.
>
> Optimising file copies safely is one of the reasons the
> copy_file_range() syscall has been introduced (in 4.5). While we
> haven't implemented anything special in XFS yet, it will internally
> use splice to do a zero-copy data transfer from source to
> destination file. Optimising for exact layout copies is precisely
> the sort of thing this syscall is intended for.
>
> It's also intended to enable applications to take advantage of
> hardware acceleration of data copying (e.g. server side copies to
> avoid round trips as has been implemented for NFS, or storage array
> offload of data copying) when such support is provided by the kernel.
>
> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....

Yes, higher level calls are useful here and we'll consider using them in cp etc.
When I previously looked at this I noticed some implementations
would fall back to do_splice_direct(), which is essentially sendfile(),
and that expands holes, which wouldn't be a good default.
So there may be some need for control flags for copy_file_range()
to make it generally useful.

thanks for the info,
Pádraig.
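
For reference, goal (1) quoted above (skipping zero data with
SEEK_HOLE/SEEK_DATA) might look roughly like the sketch below. It is only
an illustration in Python, not code from cp, QEMU or the kernel: the names
data_segments() and sparse_copy() are made up for this example, and it
assumes the source file is not modified while the copy runs, which, as
Dave points out, userspace cannot guarantee. It also makes no attempt at
goal (2): unwritten/preallocated extents are not reproduced in the
destination.

```python
import errno
import os


def data_segments(fd, size):
    """Yield (start, end) byte ranges that lseek() reports as data.

    Hypothetical helper for illustration. On filesystems without real
    SEEK_DATA/SEEK_HOLE support the kernel reports the whole file as
    one data segment, so the caller still copies everything.
    """
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # only a hole remains up to EOF
                return
            raise
        # There is always an implicit hole at EOF, so this finds an end.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


def sparse_copy(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst, reading only the ranges reported as data."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        sfd, dfd = src.fileno(), dst.fileno()
        size = os.fstat(sfd).st_size
        for start, end in data_segments(sfd, size):
            pos = start
            while pos < end:
                buf = os.pread(sfd, min(chunk, end - pos), pos)
                if not buf:  # source shrank underneath us; give up on range
                    break
                view = memoryview(buf)
                done = 0
                while done < len(buf):  # handle short writes
                    done += os.pwrite(dfd, view[done:], pos + done)
                pos += len(buf)
        # Set the final length so a trailing hole is kept as a hole.
        os.ftruncate(dfd, size)
```

Ranges that were never written in the destination stay holes there, so the
copy is at least as sparse as what lseek() reported for the source. Whether
an unwritten extent shows up as data or as a hole depends on the filesystem,
as discussed in points 1 to 3 above.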