From: Pádraig Brady
Message-ID: <5790C7A8.3010202@draigBrady.com>
Date: Thu, 21 Jul 2016 14:01:28 +0100
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was
Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
To: Dave Chinner , Paolo Bonzini
Cc: Fam Zheng , Eric Blake , Kevin Wolf , qemu-block@nongnu.org, qemu-devel@nongnu.org, Max Reitz , Lukas Czerner , Niels de Vos
On 21/07/16 13:41, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
>>>> 1) is it expected that SEEK_HOLE skips unwritten extents?
>>>
>>> There are multiple answers to this, all of which are correct depending
>>> on current context and state:
>>>
>>> 1. No - some filesystems will report clean unwritten extents as holes.
>>>
>>> 2. Yes - some filesystems will report clean unwritten extents as data.
>>>
>>> 3. Maybe - if there is written data in memory over the unwritten
>>> extent on disk (i.e. it hasn't been flushed to disk), it will be
>>> considered a data region with non-zero data. (FIEMAP will still
>>> report it as unwritten.)
>>
>> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
>> in this case (not FIEMAP_EXTENT_UNWRITTEN).
>
> No. FIEMAP only returns the known extent state at the given file
> offset. "delalloc" extents exist in memory, indicating the space
> has already been accounted for over that offset, but the extent has
> not been physically allocated. Like all other types of extents,
> there may or may not be valid data over a delayed allocation extent.
>
> IOWs, fiemap only gives you a snapshot of extent state, not the
> ranges of valid data in the file.
>
>>>> If not, would
>>>> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
>>>> would be similar to what SEEK_HOLE/SEEK_DATA do now?
>>>
>>> To solve what problem? You haven't explained what problem you are
>>> trying to solve yet.
>>>
>>>> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what
>>>> filesystems and kernel releases is it really not needed?
>>>
>>> I can't answer this question, either, because I don't know what
>>> you want the fiemap information for.
>>
>> The answer is the same no matter if we use both lseek and FIEMAP, so
>> I'll answer just once. We want to do two things:
>>
>> 1) avoid copying zero data, to keep the copy process efficient. For this,
>> SEEK_HOLE/SEEK_DATA are enough.
>>
>> 2) copy file contents while preserving the allocation state of the file's extents.
>
> Which is /very difficult/ to do safely and reliably.
>=20
> We do actually do reliable, safe, exact hole and preallocation
> layout duplication with xfs_fsr, but that uses kernel provided
> cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
> file has not changed while it was being copied before executing the
> final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
> the new copy of the data user visible.
>
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only possible if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> point in time you finalise the destination data copy.
>
>> There can be various reasons why the user has preallocated the file (because they
>> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
>> minimize cases where io_submit is very un-asynchronous; or just because someone
>> had a reason to do a BLKZEROOUT ioctl on the virtual disk). We want to preserve
>> these while converting or otherwise moving the file around.
>
> Sure, there's many reasons for using prealloc/punch/zero. The real
> difference to other file operations is that they interface with low
> level filesystem structure, not the data contained within the
> extents. That's what makes them problematic for duplication -
> userspace cannot serialise against low level filesystem structure
> modifications.
>
> Optimising file copies safely is one of the reasons the
> copy_file_range() syscall has been introduced (in 4.5). While we
> haven't implemented anything special in XFS yet, it will internally
> use splice to do a zero-copy data transfer from source to
> destination file. Optimising for exact layout copies is precisely
> the sort of thing this syscall is intended for.
>
> It's also intended to enable applications to take advantage of
> hardware acceleration of data copying (e.g. server side copies to
> avoid round trips as has been implemented for NFS, or storage array
> offload of data copying) when such support is provided by the kernel.
>
> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....
Yes, higher level calls are useful here and we'll consider using them in cp etc.
When I previously looked at this I noticed some implementations would
fall back to do_splice_direct(), which is essentially sendfile(),
and that expands holes, which wouldn't be a good default.
So there may be some need for control flags for copy_file_range()
to make it generally useful.
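
To illustrate the hole-expansion concern, here is a minimal sketch (in Python for brevity, and not how cp is implemented) of a caller that keeps holes itself: it walks the source with SEEK_DATA/SEEK_HOLE and only hands the data ranges to copy_file_range(), falling back to pread()/pwrite() where that call is unavailable or makes no progress. The names sparse_copy and copy_range and the 1 MiB chunk size are made up for the example. As noted above, this is only safe if the source is not modified during the copy, and it preserves holes only, not preallocated (unwritten) extents.

```python
import errno
import os


def copy_range(src, dst, offset, length, chunk):
    """Copy [offset, offset + length) from fd src to the same offset in fd dst."""
    end = offset + length
    use_cfr = hasattr(os, "copy_file_range")
    while offset < end:
        want = min(chunk, end - offset)
        if use_cfr:
            try:
                n = os.copy_file_range(src, dst, want, offset, offset)
            except OSError:
                n = 0
            if n == 0:
                use_cfr = False  # unsupported or no progress: fall back
                continue
        else:
            buf = os.pread(src, want, offset)
            if not buf:
                break  # source shrank underneath us
            n = os.pwrite(dst, buf, offset)
        offset += n


def sparse_copy(src_path, dst_path, chunk=1 << 20):
    """Copy a file, skipping the holes reported by SEEK_DATA/SEEK_HOLE."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        # Set the final length first; ranges we never write remain holes.
        os.ftruncate(dst, size)
        offset = 0
        while offset < size:
            try:
                data = os.lseek(src, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # only a hole remains up to EOF
                    break
                raise
            hole = os.lseek(src, data, os.SEEK_HOLE)
            copy_range(src, dst, data, hole - data, chunk)
            offset = hole
    finally:
        os.close(src)
        os.close(dst)
```

On a filesystem that does not report holes, SEEK_DATA/SEEK_HOLE treat the whole file as one data range, so the sketch degrades to a plain copy.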
thanks for the info,
Pádraig.