From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "Pádraig Brady" <P@draigbrady.com>,
"Anna Schumaker" <Anna.Schumaker@netapp.com>,
linux-nfs@vger.kernel.org,
"Linux btrfs Developers List" <linux-btrfs@vger.kernel.org>,
"Linux FS Devel" <linux-fsdevel@vger.kernel.org>,
"Linux API" <linux-api@vger.kernel.org>,
"Zach Brown" <zab@zabbo.net>, "Al Viro" <viro@zeniv.linux.org.uk>,
"Chris Mason" <clm@fb.com>,
"Michael Kerrisk-manpages" <mtk.manpages@gmail.com>,
andros@netapp.com, "Christoph Hellwig" <hch@infradead.org>,
Coreutils <coreutils@gnu.org>
Subject: Re: [PATCH v1 0/8] VFS: In-kernel copy system call
Date: Tue, 8 Sep 2015 18:19:33 -0700
Message-ID: <20150909011933.GF30681@birch.djwong.org>
In-Reply-To: <CALCETrVsWBdqvAgwxHcG=gbcWRNPG2ZziWUg1g=siKDrDu7S2Q@mail.gmail.com>

On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
> >> >> On 08/09/15 20:10, Andy Lutomirski wrote:
> >> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
> >> >> > <Anna.Schumaker@netapp.com> wrote:
> >> >> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> >> >> >>> I see copy_file_range() is a reflink() on BTRFS?
> >> >> >>> That's a bit surprising, as it avoids the copy completely.
> >> >> >>> cp(1) for example considered doing a BTRFS clone by default,
> >> >> >>> but didn't due to expectations that users actually wanted
> >> >> >>> the data duplicated on disk for resilience reasons,
> >> >> >>> and for performance reasons so that write latencies were
> >> >> >>> restricted to the copy operation, rather than being
> >> >> >>> introduced at usage time as the dest file is CoW'd.
> >> >> >>>
> >> >> >>> If reflink() is a possibility for copy_file_range()
> >> >> >>> then could it be done optionally with a flag?
> >> >> >>
> >> >> >> The idea is that filesystems get to choose how to handle copies in the
> >> >> >> default case. BTRFS could do a reflink, but NFS could do a server side
> >> >
> >> > Eww, different default behaviors depending on the filesystem. :)
> >> >
> >> >> >> copy instead. I can change the default behavior to only do a data copy
> >> >> >> (unless the reflink flag is specified) instead, if that is desirable.
> >> >> >>
> >> >> >> What does everybody think?
> >> >> >
> >> >> > I think the best you could do is to have a hint asking politely for
> >> >> > the data to be deep-copied. After all, some filesystems reserve the
> >> >> > right to transparently deduplicate.
> >> >> >
> >> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> >> >> > advantage to deep copying unless you actually want two copies for
> >> >> > locality reasons.
> >> >>
> >> >> Agreed. The reflink and server side copy are separate things.
> >> >> There's no advantage to not doing a server side copy,
> >> >> but as mentioned there may be advantages to doing deep copies on BTRFS
> >> >> (another reason, not previously mentioned in this thread, would be
> >> >> to avoid ENOSPC errors at some time in the future).
> >> >>
> >> >> So having control over the deep copy seems useful.
> >> >> It's debatable whether ALLOW_REFLINK should be on/off by default
> >> >> for copy_file_range(). I'd be inclined to have such a setting off by default,
> >> >> but cp(1) at least will work with whatever is chosen.
> >> >
> >> > So far it looks like people are interested in at least these "make data appear
> >> > in this other place" filesystem operations:
> >> >
> >> > 1. reflink
> >> > 2. reflink, but only if the contents are the same (dedupe)
> >>
> >> What I meant by this was: if you ask for "regular copy", you may end
> >> up with a reflink anyway. Anyway, how can you reflink a range and
> >> have the contents *not* be the same?
> >
> > reflink forcibly remaps fd_dest's range to fd_src's range. If they didn't
> > match before, they will afterwards.
> >
> > dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
> >
> > Perhaps I should have said "...if the contents are the same before the call"?
> >
>
> Oh, I see.
>
> Can we have a clean way to figure out whether two file ranges are the
> same in a way that allows false negatives? I.e. return 1 if the
> ranges are reflinks of each other and 0 if not? Pretty please? I've
> implemented that in the past on btrfs by syncing the ranges and then
> comparing FIEMAP output, but that's hideous.

Another mode for this call... :)
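A rough sketch of what such a "same range, false negatives allowed" check
could compute, assuming the caller has already obtained extent maps for both
files (for example from FIEMAP after syncing) and that both files live on the
same device.  struct extent, lookup() and ranges_shared() are hypothetical
names for illustration only, not an existing kernel or libc interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record: one contiguous mapping from
 * file offsets to device offsets (roughly what FIEMAP reports). */
struct extent {
	uint64_t logical;	/* byte offset within the file */
	uint64_t physical;	/* byte offset on the device */
	uint64_t length;	/* length in bytes, assumed non-zero */
};

/* Look up the physical address backing file offset 'off' and how many
 * bytes remain in that extent.  Returns 0 if the offset is unmapped. */
static int lookup(const struct extent *ext, size_t n, uint64_t off,
		  uint64_t *phys, uint64_t *avail)
{
	for (size_t i = 0; i < n; i++) {
		if (off >= ext[i].logical &&
		    off - ext[i].logical < ext[i].length) {
			uint64_t delta = off - ext[i].logical;

			*phys = ext[i].physical + delta;
			*avail = ext[i].length - delta;
			return 1;
		}
	}
	return 0;
}

/* Return 1 if both ranges are backed by the same physical bytes.
 * Return 0 otherwise, including every "don't know" case (holes,
 * unmapped offsets), so false negatives are possible by design. */
int ranges_shared(const struct extent *src, size_t nsrc, uint64_t src_off,
		  const struct extent *dst, size_t ndst, uint64_t dst_off,
		  uint64_t len)
{
	while (len > 0) {
		uint64_t sp, sa, dp, da, step;

		if (!lookup(src, nsrc, src_off, &sp, &sa) ||
		    !lookup(dst, ndst, dst_off, &dp, &da))
			return 0;
		if (sp != dp)
			return 0;

		/* Advance by the shorter of the two extent remainders. */
		step = sa < da ? sa : da;
		if (step > len)
			step = len;
		src_off += step;
		dst_off += step;
		len -= step;
	}
	return 1;
}
```

The walk only ever answers "yes" when every byte of both ranges resolves to
the same device offset, which is why it can be wrong in the negative
direction but not the positive one.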
> >>
> >> > 3. regular copy
> >> > 4. regular copy, but make the hardware do it for us
> >> > 5. regular copy, but require a second copy on the media (no-dedupe)
> >>
> >> If this comes from me, I have no desire to ever use this as a flag.
> >
> > I meant (5) as a "disable auto-dedupe for this operation" flag, not as
> > a "reallocate all the shared blocks now" op...
>
> Hmm, interesting. What effect does it have on systems that do
> deferred auto-dedupe?

If it's a userspace deferred auto-dedupe, then hopefully the calling program
coordinates with the dedupe program.
Otherwise, it's only effective with a dedupe that runs in the write path.

> >>
> >> I think we should focus on what the actual legit use cases might be.
> >> Certainly we want to support a mode that's "reflink or fail". We
> >> could have these flags:
> >>
> >> COPY_FILE_RANGE_ALLOW_REFLINK
> >> COPY_FILE_RANGE_ALLOW_COPY
> >>
> >> Setting neither gets -EINVAL. Setting both works as is. Setting just
> >> ALLOW_REFLINK will fail if a reflink can't be supported. Setting just
> >> ALLOW_COPY will make a best-effort attempt not to reflink but
> >> expressly permits reflinking in cases where either (a) plain old
> >> write(2) might also result in a reflink or (b) there is no advantage
> >> to not reflinking.
> >
> > I don't agree with having a 'copy' flag that can reflink when we also have a
> > 'reflink' flag. I guess I just don't like having a flag with different
> > meanings depending on context.
> >
> > Users should be able to get the default behavior by passing '0' for flags, so
> > provide FORBID_REFLINK and FORBID_COPY flags to turn off those behaviors, with
> > an admonishment that one should only use them if they have a goooood reason.
> > Passing neither gets you reflink-xor-copy, which is what I think we both want
> > in the general case.
> >
> > FORBID_REFLINK = 1
> > FORBID_COPY = 2
> > CHECK_SAME = 4
> > HW_COPY = 8
> >
> > DEDUPE = (FORBID_COPY | CHECK_SAME)
> >
> > What do you say to that?
>
> What does HW_COPY mean?

It /probably/ means that the FS tells the storage device to copy the block
rather than streaming it through the page cache. Your autodedupe thinp device,
for example, would "copy" block X to Y by mapping both X and Y to the same
piece of media.
(Effectively the same thing as FS reflink/dedupe, but in the storage dev.)
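To make the FORBID_* proposal quoted above a bit more concrete, here is a
minimal sketch of how the flags might be resolved into an operation.  The
flag values are the ones proposed in this thread; the CFR_ prefix, the
cfr_pick_op() helper, and the choice of error codes (-EINVAL when both
FORBID flags are set, -EOPNOTSUPP when only a reflink is allowed but the
filesystem can't do one) are assumptions for illustration.  CHECK_SAME and
HW_COPY are accepted here but not interpreted:

```c
#include <errno.h>

/* Flag values as proposed in this thread; the CFR_ prefix is made up. */
#define CFR_FORBID_REFLINK	1u
#define CFR_FORBID_COPY		2u
#define CFR_CHECK_SAME		4u
#define CFR_HW_COPY		8u
#define CFR_DEDUPE		(CFR_FORBID_COPY | CFR_CHECK_SAME)
#define CFR_ALL_FLAGS		(CFR_FORBID_REFLINK | CFR_FORBID_COPY | \
				 CFR_CHECK_SAME | CFR_HW_COPY)

enum cfr_op { CFR_OP_REFLINK = 1, CFR_OP_COPY = 2 };

/* Pick the operation to attempt, or return a negative errno.
 * 'fs_can_reflink' stands in for whatever the filesystem reports. */
int cfr_pick_op(unsigned int flags, int fs_can_reflink)
{
	/* Unknown bits are rejected so they can be given meaning later. */
	if (flags & ~CFR_ALL_FLAGS)
		return -EINVAL;

	/* Assumption: forbidding both behaviors leaves nothing to do. */
	if ((flags & CFR_FORBID_REFLINK) && (flags & CFR_FORBID_COPY))
		return -EINVAL;

	/* flags == 0 gives reflink-xor-copy: reflink when possible... */
	if (!(flags & CFR_FORBID_REFLINK) && fs_can_reflink)
		return CFR_OP_REFLINK;

	/* ...otherwise fall back to a regular copy, unless forbidden. */
	if (!(flags & CFR_FORBID_COPY))
		return CFR_OP_COPY;

	/* Assumption: "reflink or fail" fails with -EOPNOTSUPP. */
	return -EOPNOTSUPP;
}
```

With this shape, passing 0 keeps the default behavior, and each FORBID flag
only ever removes one of the two possibilities.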
>
> If we have enough weird combinations, maybe having a mode instead of
> flags makes sense.

Let's hope not. :)

--D

>
> --Andy