From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>,
Eric Biggers <ebiggers3@gmail.com>
Cc: mtk.manpages@gmail.com, david@fromorbit.com,
linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
xfs@oss.sgi.com, linux-btrfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
Date: Mon, 8 Aug 2016 03:47:34 +1000 [thread overview]
Message-ID: <594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com> (raw)
In-Reply-To: <20160112091432.GB7832@birch.djwong.org>
Hi Darrick,
On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]
In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?
Thanks,
Michael
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary. There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does". I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove. Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
> __s64 src_fd;
> __u64 src_offset;
> __u64 length;
> __u64 dest_offset;
> __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd. Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today. I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files. I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic). But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not? The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors. So why is there some non-symmetric
>> behavior? There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination. If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete. Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible? Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'? 'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors? The man
>> page implies they must be open for writing and not appending. The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP? The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd. I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB. Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in. And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
next prev parent reply other threads:[~2016-08-07 17:47 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-19 8:55 [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Darrick J. Wong
2015-12-19 8:55 ` [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper Darrick J. Wong
2015-12-19 8:55 ` [PATCH 2/9] x86: add sys_copy_file_range to syscall tables Darrick J. Wong
2015-12-19 8:55 ` [PATCH 3/9] btrfs: add .copy_file_range file operation Darrick J. Wong
2015-12-19 8:55 ` [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies Darrick J. Wong
2015-12-19 8:55 ` [PATCH 5/9] locks: new locks_mandatory_area calling convention Darrick J. Wong
2015-12-19 8:55 ` [PATCH 6/9] vfs: pull btrfs clone API to vfs layer Darrick J. Wong
2015-12-19 8:55 ` [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE Darrick J. Wong
2015-12-19 8:55 ` [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs Darrick J. Wong
2016-01-12 6:07 ` Eric Biggers
2016-01-12 9:14 ` Darrick J. Wong
2016-01-13 2:36 ` Eric Biggers
2016-01-23 0:54 ` Darrick J. Wong
2016-08-07 17:47 ` Michael Kerrisk (man-pages) [this message]
2016-07-27 21:51 ` Kirill A. Shutemov
2016-07-28 18:07 ` Darrick J. Wong
2016-07-28 19:25 ` Darrick J. Wong
2015-12-19 8:56 ` [PATCH 9/9] btrfs: use new dedupe data function pointer Darrick J. Wong
2015-12-20 15:30 ` [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com \
--to=mtk.manpages@gmail.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=ebiggers3@gmail.com \
--cc=hch@infradead.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).