From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Anna Schumaker <Anna.Schumaker@netapp.com>,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Andy Lutomirski <luto@amacapital.net>
Cc: "Pádraig Brady" <P@draigbrady.com>,
linux-nfs@vger.kernel.org,
"Linux btrfs Developers List" <linux-btrfs@vger.kernel.org>,
"Linux FS Devel" <linux-fsdevel@vger.kernel.org>,
"Linux API" <linux-api@vger.kernel.org>,
"Zach Brown" <zab@zabbo.net>, "Al Viro" <viro@zeniv.linux.org.uk>,
"Chris Mason" <clm@fb.com>,
"Michael Kerrisk-manpages" <mtk.manpages@gmail.com>,
andros@netapp.com, "Christoph Hellwig" <hch@infradead.org>,
Coreutils <coreutils@gnu.org>
Subject: Re: [PATCH v1 0/8] VFS: In-kernel copy system call
Date: Thu, 10 Sep 2015 07:40:00 -0400
Message-ID: <55F16C10.6000905@gmail.com>
In-Reply-To: <55F07FD8.4020507@Netapp.com>
On 2015-09-09 14:52, Anna Schumaker wrote:
> On 09/08/2015 06:39 PM, Darrick J. Wong wrote:
>> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>>> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>>>> On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
>>>>> On 08/09/15 20:10, Andy Lutomirski wrote:
>>>>>> On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
>>>>>> <Anna.Schumaker@netapp.com> wrote:
>>>>>>> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
>>>>>>>> I see copy_file_range() is a reflink() on BTRFS?
>>>>>>>> That's a bit surprising, as it avoids the copy completely.
>>>>>>>> cp(1) for example considered doing a BTRFS clone by default,
>>>>>>>> but didn't due to expectations that users actually wanted
>>>>>>>> the data duplicated on disk for resilience reasons,
>>>>>>>> and for performance reasons so that write latencies were
>>>>>>>> restricted to the copy operation, rather than being
>>>>>>>> introduced at usage time as the dest file is CoW'd.
>>>>>>>>
>>>>>>>> If reflink() is a possibility for copy_file_range()
>>>>>>>> then could it be done optionally with a flag?
>>>>>>>
>>>>>>> The idea is that filesystems get to choose how to handle copies in the
>>>>>>> default case. BTRFS could do a reflink, but NFS could do a server side
>>>>
>>>> Eww, different default behaviors depending on the filesystem. :)
>>>>
>>>>>>> copy instead. I can change the default behavior to only do a data copy
>>>>>>> (unless the reflink flag is specified) instead, if that is desirable.
>>>>>>>
>>>>>>> What does everybody think?
>>>>>>
>>>>>> I think the best you could do is to have a hint asking politely for
>>>>>> the data to be deep-copied. After all, some filesystems reserve the
>>>>>> right to transparently deduplicate.
>>>>>>
>>>>>> Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
>>>>>> advantage to deep copying unless you actually want two copies for
>>>>>> locality reasons.
>>>>>
>>>>> Agreed. The reflink and server side copy are separate things.
>>>>> There's no advantage to not doing a server side copy,
>>>>> but as mentioned there may be advantages to doing deep copies on BTRFS
>>>>> (another reason, not previously mentioned in this thread, would be
>>>>> to avoid ENOSPC errors at some time in the future).
>>>>>
>>>>> So having control over the deep copy seems useful.
>>>>> It's debatable whether ALLOW_REFLINK should be on/off by default
>>>>> for copy_file_range(). I'd be inclined to have such a setting off by default,
>>>>> but cp(1) at least will work with whatever is chosen.
>>>>
>>>> So far it looks like people are interested in at least these "make data appear
>>>> in this other place" filesystem operations:
>>>>
>>>> 1. reflink
>>>> 2. reflink, but only if the contents are the same (dedupe)
>>>
>>> What I meant by this was: if you ask for "regular copy", you may end
>>> up with a reflink anyway. Anyway, how can you reflink a range and
>>> have the contents *not* be the same?
>>
>> reflink forcibly remaps fd_dest's range to fd_src's range. If they didn't
>> match before, they will afterwards.
>>
>> dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>>
>> Perhaps I should have said "...if the contents are the same before the call"?
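To make the reflink/dedupe distinction above concrete, here is a toy userspace model. It is not kernel code: the struct and function names are made up for illustration, a "range" is just a pointer to the bytes that currently back it, and remapping is modelled as pointer assignment. The return values are likewise an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy model (hypothetical): a range points at the bytes backing it. */
struct toy_range {
	const char *data;
	size_t len;
};

/* reflink: unconditionally remap dest onto src's data. */
static int toy_reflink(struct toy_range *dest, const struct toy_range *src)
{
	assert(dest && src);
	if (dest->len != src->len)
		return -EINVAL;
	dest->data = src->data;
	return 0;
}

/*
 * dedupe: remap dest onto src's data only if the contents already match.
 * Returns 0 if remapped, 1 if the contents differ (dest left untouched).
 */
static int toy_dedupe(struct toy_range *dest, const struct toy_range *src)
{
	assert(dest && src);
	if (dest->len != src->len)
		return -EINVAL;
	if (memcmp(dest->data, src->data, src->len) != 0)
		return 1;
	dest->data = src->data;
	return 0;
}
```

In this model, toy_reflink() changes what dest reads back whenever the two ranges differed beforehand, while toy_dedupe() never changes the visible contents, only the sharing.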
>>
>>>
>>>> 3. regular copy
>>>> 4. regular copy, but make the hardware do it for us
>>>> 5. regular copy, but require a second copy on the media (no-dedupe)
>>>
>>> If this comes from me, I have no desire to ever use this as a flag.
>>
>> I meant (5) as a "disable auto-dedupe for this operation" flag, not as
>> a "reallocate all the shared blocks now" op...
>>
>>> If someone wants to use chattr or some new operation to say "make this
>>> range of this file belong just to me for purpose of optimizing future
>>> writes", then sure, go for it, with the understanding that there are
>>> plenty of filesystems for which that doesn't even make sense.
>>
>> "Unshare these blocks" sounds more like something fallocate could do.
>>
>> So far in my XFS reflink playground, it seems that using the defrag tool to
>> un-cow a file makes most sense. AFAICT the XFS and ext4 defraggers copy a
>> fragmented file's data to a second file and use a 'swap extents' operation,
>> after which the donor file is unlinked.
>>
>> Hey, if this syscall turns into a more generic "do something involving two
>> (fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
>> extents" as a 7th operation, to refactor the ioctls. <smirk>
>>
>>>
>>>> 6. regular copy, but don't CoW (eatmyothercopies) (joke)
>>>>
>>>> (Please add whatever ops I missed.)
>>>>
>>>> I think I can see a case for letting (4) fall back to (3) since (4) is an
>>>> optimization of (3).
>>>>
>>>> However, I particularly don't like the idea of (1) falling back to (3-5).
>>>> Either the kernel can satisfy a request or it can't, but let's not just
>>>> assume that we should transmogrify one type of request into another. Userspace
>>>> should decide if a reflink failure should turn into one of the copy variants,
>>>> depending on whether the user wants to spread allocation costs over rewrites or
>>>> pay it all up front. Also, if we allow reflink to fall back to copy, how do
>>>> programs find out what actually took place? Or do we simply not allow them to
>>>> find out?
>>>>
>>>> Also, programs that expect reflink either to finish or fail quickly might be
>>>> surprised if it's possible for reflink to take a longer time than usual and
>>>> with the side effect that a deep(er) copy was made.
>>>>
>>>> I guess if someone asks for both (1) and (3) we can do the fallback in the
>>>> kernel, like how we handle it right now.
>>>>
>>>
>>> I think we should focus on what the actual legit use cases might be.
>>> Certainly we want to support a mode that's "reflink or fail". We
>>> could have these flags:
>>>
>>> COPY_FILE_RANGE_ALLOW_REFLINK
>>> COPY_FILE_RANGE_ALLOW_COPY
>>>
>>> Setting neither gets -EINVAL. Setting both works as is. Setting just
>>> ALLOW_REFLINK will fail if a reflink can't be supported. Setting just
>>> ALLOW_COPY will make a best-effort attempt not to reflink but
>>> expressly permits reflinking in cases where either (a) plain old
>>> write(2) might also result in a reflink or (b) there is no advantage
>>> to not reflinking.
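For reference, the ALLOW_* semantics described above could be sketched roughly as follows. This is only an illustration, not code from the patch set: the flag names come from the proposal, but the numeric values, the pick_action() helper, the can_reflink parameter, and the choice of -EOPNOTSUPP for the "reflink or fail" case are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Names from the proposal; the numeric values are made up. */
#define COPY_FILE_RANGE_ALLOW_REFLINK 0x1u
#define COPY_FILE_RANGE_ALLOW_COPY    0x2u

enum copy_action { DO_REFLINK, DO_COPY };

/*
 * Hypothetical helper: choose an action for the given flags.
 * can_reflink stands in for "the filesystem can reflink this range".
 * Returns 0 and sets *action, or a negative errno.
 */
static int pick_action(unsigned int flags, bool can_reflink,
		       enum copy_action *action)
{
	const unsigned int known = COPY_FILE_RANGE_ALLOW_REFLINK |
				   COPY_FILE_RANGE_ALLOW_COPY;

	assert(action);
	if (flags == 0 || (flags & ~known))
		return -EINVAL;	/* neither flag set, or unknown bits */

	if ((flags & COPY_FILE_RANGE_ALLOW_REFLINK) && can_reflink) {
		*action = DO_REFLINK;
		return 0;
	}
	if (flags & COPY_FILE_RANGE_ALLOW_COPY) {
		/* Best effort only: the filesystem may still share blocks. */
		*action = DO_COPY;
		return 0;
	}
	return -EOPNOTSUPP;	/* reflink-only request, reflink unavailable */
}
```

Note that ALLOW_COPY alone still yields DO_COPY here even when reflinking is possible; cases (a) and (b) above would be decided by the filesystem, which this sketch does not model.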
>>
>> I don't agree with having a 'copy' flag that can reflink when we also have a
>> 'reflink' flag. I guess I just don't like having a flag with different
>> meanings depending on context.
>>
>> Users should be able to get the default behavior by passing '0' for flags, so
>> provide FORBID_REFLINK and FORBID_COPY flags to turn off those behaviors, with
>> an admonishment that one should only use them if they have a goooood reason.
>> Passing neither gets you reflink-xor-copy, which is what I think we both want
>> in the general case.
>
> I agree here that 0 for flags should do something useful, and I wanted to double check if reflink-xor-copy is a good default behavior.
>
>>
>> FORBID_REFLINK = 1
>> FORBID_COPY = 2
>
> I don't like the idea of using flags to forbid behavior. I think it would be more straightforward to have flags like REFLINK_ONLY or COPY_ONLY so users can tell us what they want, instead of what they don't want.
>
> While I'm thinking about flags, COPY_FILE_RANGE_REFLINK_ONLY would be a bit of a mouthful. Does anybody have suggestions for ways that I could make this shorter?
>
If (and only if) it's _very_ well documented, you could probably drop
the _ONLY part.
>
>> CHECK_SAME = 4
>> HW_COPY = 8
>>
>> DEDUPE = (FORBID_COPY | CHECK_SAME)
>>
>> What do you say to that?
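As a rough sketch of how the FORBID_* proposal above would decode: the flag values are the ones listed in the quoted text, while struct copy_plan, decode_flags(), and the decision to reject "forbid both" and unknown bits with -EINVAL are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values as listed in the quoted proposal. */
#define FORBID_REFLINK 1u
#define FORBID_COPY    2u
#define CHECK_SAME     4u
#define HW_COPY        8u
#define DEDUPE         (FORBID_COPY | CHECK_SAME)

/* Hypothetical decoded form of the flags word. */
struct copy_plan {
	bool may_reflink;
	bool may_copy;
	bool must_match;	/* CHECK_SAME: remap only if contents match */
	bool offload;		/* HW_COPY: ask the hardware to do the copy */
};

static int decode_flags(unsigned int flags, struct copy_plan *plan)
{
	const unsigned int known = FORBID_REFLINK | FORBID_COPY |
				   CHECK_SAME | HW_COPY;

	assert(plan);
	if (flags & ~known)
		return -EINVAL;
	/* Assumption: forbidding both behaviors leaves nothing to do. */
	if ((flags & FORBID_REFLINK) && (flags & FORBID_COPY))
		return -EINVAL;

	plan->may_reflink = !(flags & FORBID_REFLINK);
	plan->may_copy = !(flags & FORBID_COPY);
	plan->must_match = (flags & CHECK_SAME) != 0;
	plan->offload = (flags & HW_COPY) != 0;
	return 0;
}
```

With this layout, passing 0 leaves both reflink and copy permitted (the reflink-xor-copy default), and DEDUPE decodes to "reflink only, and only if the contents match".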
>>
>>> An example of (b) would be a filesystem backed by deduped
>>> thinly-provisioned storage that can't do anything about ENOSPC because
>>> it doesn't control it in the first place.
>>>
>>> Another option would be to split up the copy case into "I expect to
>>> overwrite a lot of the target file soon, so (c) try to commit space
>>> for that or (d) try to make it time-efficient". Of course, (d) is
>>> irrelevant on filesystems with no random access (nvdimms, for
>>> example).
>>>
>>> I guess the tl;dr is that I'm highly skeptical of any use for
>>> disallowing reflinking other than forcibly committing space in cases
>>> where committing space actually means something.
>>
>> That's more or less where I was going too. :)
>>
>> --D
>>
>