linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Steve French <smfrench@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Copy tools on Linux
Date: Sun, 1 Jul 2018 10:10:05 +1000
Message-ID: <20180701001005.GR19934@dastard>
In-Reply-To: <CAH2r5msSSe-qPzeZNTw8cGt0GXH=f5YEeCrne4KToHS1D0vmTA@mail.gmail.com>

On Fri, Jun 29, 2018 at 09:37:27PM -0500, Steve French wrote:
> I have been looking at i/o patterns from various copy tools on Linux,
> and it is pretty discouraging - I am hoping that I am forgetting an
> important one that someone can point me to ...
> 
> Some general problems:
> 1) if source and target on the same file system it would be nice to
> call the copy_file_range syscall (AFAIK only test tools call that),
> although in some cases at least cp can do it for --reflink

copy_file_range() should be made to do the right thing in as many
scenarios as we can document, and then switch userspace over to use
it at all times. Aggregate all the knowledge in one place, where we
know what the filesystem implementations are and can get hints to do
the right thing.
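
As a rough sketch of what "use it at all times" could look like from
userspace (Python purely for illustration; the copy_file() name, the 8M
chunk size and the set of errno values that trigger the fallback are
assumptions for this example, not what cp or any other tool does today):

```python
import errno
import os

# errno values treated here as "copy_file_range() can't do this pair of files".
# EPERM is included because some sandboxes report a blocked syscall that way.
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL,
                    errno.EOPNOTSUPP, errno.EPERM}


def _write_all(fd, data):
    """Write the whole buffer, coping with short writes."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def copy_file(src_path, dst_path, chunk=8 * 1024 * 1024):
    """Copy a file, preferring copy_file_range() and falling back to read/write."""
    use_cfr = hasattr(os, "copy_file_range")
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                if use_cfr:
                    try:
                        # No explicit offsets: both fd positions advance.
                        n = os.copy_file_range(src, dst, chunk)
                    except OSError as e:
                        if e.errno not in _FALLBACK_ERRNOS:
                            raise
                        use_cfr = False  # continue from the current offsets
                        continue
                    if n == 0:
                        break
                else:
                    data = os.read(src, chunk)
                    if not data:
                        break
                    _write_all(dst, data)
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

The point is only that the caller never has to know whether the kernel
did a reflink, a server-side copy or a plain data copy.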

> 2) if source and target on different file systems there are multiple problems
>     a) smaller i/o  (rsync e.g. maxes at 128K!)
>     b) no async parallelized writes sent down to the kernel so writes
> get serialized (either through page cache, or some fs offer option to
> disable it - but it still is one thread at a time)

That's because, historically, copying data into the page cache for
buffering has been orders of magnitude faster than actually doing
IO. These days, with PCIe-based storage, not so much. Indeed, for
bulk data copy on NVMe-based storage I wonder if we even need
buffered IO anymore...

>     c) sparse file support is mediocre (although cp has some support
> for it, and can call fiemap in some cases)

Using fiemap for this is broken and will lead to data corruption,
especially if you start parallelising IO to individual files.
SEEK_DATA/SEEK_HOLE should be used instead.
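
A minimal sketch of a hole-aware copy built on lseek() with
SEEK_DATA/SEEK_HOLE (Python for illustration only; copy_sparse() and the
1M chunk size are hypothetical, and it assumes the source file is not
being modified while the copy runs):

```python
import errno
import os


def copy_sparse(src_path, dst_path, chunk=1024 * 1024):
    """Copy only the data regions of src_path, leaving holes unwritten."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            size = os.fstat(src).st_size
            pos = 0
            while pos < size:
                try:
                    data_start = os.lseek(src, pos, os.SEEK_DATA)
                except OSError as e:
                    if e.errno == errno.ENXIO:
                        break  # nothing but a hole up to EOF
                    raise
                data_end = os.lseek(src, data_start, os.SEEK_HOLE)
                off = data_start
                while off < data_end:
                    buf = os.pread(src, min(chunk, data_end - off), off)
                    if not buf:
                        break  # file shrank underneath us
                    done = 0
                    while done < len(buf):
                        done += os.pwrite(dst, buf[done:], off + done)
                    off += len(buf)
                pos = data_end
            # Set the final size so a trailing hole is preserved.
            os.ftruncate(dst, size)
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

On a filesystem without hole support the whole file is reported as one
data region, so this degrades to an ordinary copy.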

>     d) for file systems that prefer setting the file size first (to
> avoid metadata penalties with multiple extending writes) - AFAIK only
> rsync offers that, but rsync is one of the slowest tools otherwise

We don't want to do this for most local filesystems as it defeats
things like speculative preallocation which are used to optimise IO
patterns and prevent file fragmentation when concurrent parallel
writes are done.

> I have looked at cp, dd, scp, rsync, gio, gcp ... are there others?
> 
> What I am looking for (and maybe we just need to patch cp and rsync
> etc.) is more like what you see with other OS ...
> 1) options for large i/o sizes (network latencies in network/cluster
> fs can be large, so prefer larger 1M or 8M in some cases I/Os)

The -o largeio mount option on XFS will expose the stripe unit as the
minimum efficient IO size returned in stat (st_blksize). IIRC there's another
combination that makes it emit the stripe width rather than stripe
unit.
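
For what it's worth, a copy tool could pick that hint up with something
like the sketch below (Python for illustration; preferred_io_size() is a
made-up helper, and the 128K floor and 8M ceiling are assumptions
borrowed from the numbers mentioned in this thread, not recommended
values):

```python
import os


def preferred_io_size(path, floor=128 * 1024, ceiling=8 * 1024 * 1024):
    """Derive a copy buffer size from the st_blksize hint reported by stat."""
    blksize = os.stat(path).st_blksize
    if blksize <= 0:
        return floor  # no usable hint
    if blksize >= floor:
        return min(blksize, ceiling)
    # Small hint: round the floor up to a whole number of preferred blocks.
    return -(-floor // blksize) * blksize
```

Calling it on the destination directory before opening the target file
gives the tool a buffer size that follows whatever the filesystem chooses
to advertise.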

> 2) parallelizing writes so not just one write in flight at a time

That won't make buffered IO any faster - writeback will still be the
bottleneck, and it already does allow many IOs to be in flight at
once.

> 3) options to turn off the page cache (large number of large file
> copies are not going to benefit from reuse of pages in the page cache
> so going through the page cache may be suboptimal in that case)

If you do this, you really need to use AIO+DIO to avoid blocking
(i.e. userspace can still be single threaded!), and need the
filesystem to be able to tell the app what optimal DIO sizes are
(e.g. XFS_IOC_DIOINFO has historically been used for this)

> 4) option to set the file size first, and then fill in writes (so
> non-extending writes)

This must only be an option, as it will cause problems with buffered IO
because it defeats all the extending write optimisations that local
filesystems do. Same with using fallocate() to preallocate files -
this defeats all the anti-fragmentation and cross-file allocation
packing optimisations we do at writeback time that are enabled by
delayed allocation.

In general, fine-grained control of extent allocation in local
filesystems from userspace is a recipe for rapidly aging and
degrading filesystem performance. We want to avoid that as much as
possible - it sets us back at least 25 years in terms of file layout
and allocation algorithm sophistication.

> 5) sparse file support
> (and it would also be nice to support copy_file_range syscall ... but
> that is unrelated to the above)

Make copy_file_range() handle that properly.

You also forgot all the benefits we'd get from parallelising
recursive directory walks and stat()ing inodes along the way (i.e.
the dir walk that rsync does to work out what it needs to copy).
Also, sorting the files to be copied based on something like inode
number rather than just operating on readdir order can improve
copying of large numbers of files substantially. Chris Mason
demonstrated this years ago:

https://oss.oracle.com/~mason/acp/
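
The inode-ordering idea can be sketched roughly as follows (Python for
illustration; files_in_inode_order() is a hypothetical helper and is not
how acp is implemented). It relies on the inode number that readdir
already returns, so no extra stat() is needed just to sort:

```python
import os


def files_in_inode_order(root):
    """Walk root and return (inode, path) pairs for regular files, sorted by inode."""
    found = []
    pending = [root]
    while pending:
        directory = pending.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # entry.inode() comes from the directory entry itself.
                    found.append((entry.inode(), entry.path))
    found.sort()
    return found
```

A copy loop would then iterate over the returned list instead of
processing files in readdir order.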

> Am I missing some magic tool?  Seems like Windows has various options
> for copy tools - but looking at Linux i/o patterns from these tools
> was pretty depressing - I am hoping that there are other choices.

Not that I know of.  The Linux IO tools need to be dragged kicking and
screaming out of the 1980s. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 9+ messages
2018-06-30  2:37 Copy tools on Linux Steve French
2018-06-30 13:13 ` Goldwyn Rodrigues
2018-06-30 14:12   ` Steve French
2018-06-30 14:47     ` Goldwyn Rodrigues
2018-06-30 16:34 ` Andreas Dilger
2018-07-01  0:10 ` Dave Chinner [this message]
2018-07-01  2:59   ` Steve French
2018-07-01 17:44   ` Goldwyn Rodrigues
2018-07-02  0:17     ` Dave Chinner
