linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Biggers <ebiggers3@gmail.com>
Cc: mtk.manpages@gmail.com, david@fromorbit.com,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	xfs@oss.sgi.com, linux-btrfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
Date: Mon, 8 Aug 2016 03:47:34 +1000	[thread overview]
Message-ID: <594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com> (raw)
In-Reply-To: <20160112091432.GB7832@birch.djwong.org>

Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

  parent reply	other threads:[~2016-08-07 17:47 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-19  8:55 [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Darrick J. Wong
2015-12-19  8:55 ` [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper Darrick J. Wong
2015-12-19  8:55 ` [PATCH 2/9] x86: add sys_copy_file_range to syscall tables Darrick J. Wong
2015-12-19  8:55 ` [PATCH 3/9] btrfs: add .copy_file_range file operation Darrick J. Wong
2015-12-19  8:55 ` [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies Darrick J. Wong
2015-12-19  8:55 ` [PATCH 5/9] locks: new locks_mandatory_area calling convention Darrick J. Wong
2015-12-19  8:55 ` [PATCH 6/9] vfs: pull btrfs clone API to vfs layer Darrick J. Wong
2015-12-19  8:55 ` [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE Darrick J. Wong
2015-12-19  8:55 ` [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs Darrick J. Wong
2016-01-12  6:07   ` Eric Biggers
2016-01-12  9:14     ` Darrick J. Wong
2016-01-13  2:36       ` Eric Biggers
2016-01-23  0:54         ` Darrick J. Wong
2016-08-07 17:47       ` Michael Kerrisk (man-pages) [this message]
2016-07-27 21:51   ` Kirill A. Shutemov
2016-07-28 18:07     ` Darrick J. Wong
2016-07-28 19:25     ` Darrick J. Wong
2015-12-19  8:56 ` [PATCH 9/9] btrfs: use new dedupe data function pointer Darrick J. Wong
2015-12-20 15:30 ` [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com \
    --to=mtk.manpages@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=ebiggers3@gmail.com \
    --cc=hch@infradead.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).