linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Theodore Tso <tytso@MIT.EDU>
Cc: Christoph Hellwig <hch@infradead.org>,
	Pádraig Brady <P@draigBrady.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: fallocate vs ENOSPC
Date: Mon, 28 Nov 2011 16:10:54 +1100
Message-ID: <20111128051054.GZ2386@dastard>
In-Reply-To: <AB63FBE6-A28A-4A46-81BF-E44E31E9AFD8@mit.edu>

On Sun, Nov 27, 2011 at 07:40:14PM -0500, Theodore Tso wrote:
> 
> On Nov 27, 2011, at 6:43 PM, Dave Chinner wrote:
> 
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
> > 
> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during write back.
> 
> Well, the question is whether you're optimizing for writing the files,
> or reading the files.    In some cases, files are write once, read never
> (well, almost never) --- i.e., the backup case.  In other cases, the files
> are write once, read many --- i.e., when installing software.

Doesn't matter. If delayed allocation is doing its job properly,
then you'll get unfragmented files when they are written. Delayed
allocation is supposed to make up front preallocation of disk space
-unnecessary- to prevent fragmentation. Using preallocation instead
of delayed allocation implies your delayed allocation implementation
is sub-optimal and needs to be fixed.

Indeed, there is no guarantee that preallocation will even lay the
files out in a sane manner that will give you good read speeds
across multiple files - it may place them so far apart that the seek
penalty between files is worse than having a few fragments...

> In that case, optimizing for the file reading might mean that you
> want to make the files aligned on RAID stripes, although it will
> fragment free space.   It all depends on what you're optimizing
> for.

If you want to optimise for read speed - especially for small
files or random IO patterns - you want to *avoid* alignment to RAID
stripes. Doing so overloads the first disk in the RAID stripe
because all small file reads (and writes) hit that disk/LUN in the
stripe. Indeed, if you have RAID5/6 and lots of small files, it is
recommended that you turn off filesystem alignment at mkfs time for
XFS.
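
The hot-spot effect can be illustrated with a toy model. This is a
sketch only: it assumes a simple striped layout in which consecutive
stripe units rotate across the data disks (RAID5/6 parity rotation is
ignored), and that "aligned" means every small file starts at the
beginning of a full stripe. The names disk_for_offset, STRIPE_UNIT and
NUM_DISKS, and the sizes chosen, are made up for the example.

```python
from collections import Counter

STRIPE_UNIT = 64 * 1024      # bytes per disk per stripe (assumed value)
NUM_DISKS = 4                # data disks in the stripe (assumed value)
STRIPE_WIDTH = STRIPE_UNIT * NUM_DISKS
FILE_SIZE = 16 * 1024        # small file, well under one stripe unit


def disk_for_offset(offset):
    """Data disk holding the byte at `offset` in a simple striped layout."""
    return (offset // STRIPE_UNIT) % NUM_DISKS


def first_disk_histogram(start_offsets):
    """Count how many files begin on each disk."""
    return Counter(disk_for_offset(off) for off in start_offsets)


num_files = 64

# Every file aligned to the start of a full stripe: all of them begin on disk 0.
aligned = first_disk_histogram(i * STRIPE_WIDTH for i in range(num_files))

# Files packed back to back: the start offsets walk across all the disks.
packed = first_disk_histogram(i * FILE_SIZE for i in range(num_files))

print("aligned:", dict(aligned))
print("packed: ", dict(sorted(packed.items())))
```

With these assumed numbers, all 64 aligned files begin on disk 0, while
the packed files spread evenly at 16 per disk.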

SGI hit this problem back in the early 90s, and it is one of the reasons
that XFS lays its metadata out such that it does not hot-spot one
drive in a RAID stripe trying to read/write frequently accessed
metadata (e.g. AG headers).

> I didn't realize that XFS was not aligning to RAID stripes when doing
> delayed allocation writes.

It certainly does do alignment during delayed allocation.

/me waits for the "but you said..."

That's because XFS does -selective- alignment during delayed
allocation.... :)

What people seem to forget about delayed allocation is that when
delayed allocation occurs, we have lots of information about the
data being written that is not available in the fallocate() context
- how big the delalloc extent is, how large the file currently is,
how much more data needs to be written, whether the file is still
growing, etc, and so delayed allocation can make a much more informed
decision about how to allocate the data extents compared to
fallocate().

For example, if the allocation is for offset zero of the file, the
filesystem is using aligned allocation and the file size is larger
than the stripe unit, the allocation will be stripe unit aligned.

Hence, if you've got lots of small files, they get packed because
aligned allocation is not triggered and each allocation gets peeled
from the front edge of the same free space extent.

If you've got large files, then they get aligned, leaving space
between them for the file to potentially grow and fill full stripe
units and widths.

And if you've got really large files still being written to, they
get aligned and over-allocated thanks to the speculative prealloc
beyond EOF, which effectively prevents fragmentation of large files
due to interleaving allocations between files when many files are
being written concurrently by writeback.....
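
The selective alignment rule described above can be sketched as a toy
model. This is not the XFS allocator: it only encodes the three
conditions from the example (offset zero, aligned allocation in use,
file larger than the stripe unit) and carves allocations from the front
of a single free-space extent. Speculative preallocation beyond EOF and
any gap left for growth are ignored, and the names
wants_stripe_alignment and place_files are hypothetical.

```python
def wants_stripe_alignment(offset, file_size, stripe_unit):
    """Toy version of the rule described above, not the real XFS logic.

    Align only when the filesystem uses aligned allocation
    (stripe_unit > 0), the allocation is for offset zero of the file,
    and the file is larger than one stripe unit.
    """
    return stripe_unit > 0 and offset == 0 and file_size > stripe_unit


def round_up(value, unit):
    """Round `value` up to the next multiple of `unit`."""
    return -(-value // unit) * unit


def place_files(file_sizes, stripe_unit):
    """Assign a start offset to each file from one free-space extent."""
    cursor = 0
    starts = []
    for size in file_sizes:
        # Each file is allocated in one go, starting at file offset zero.
        if wants_stripe_alignment(0, size, stripe_unit):
            cursor = round_up(cursor, stripe_unit)
        starts.append(cursor)
        cursor += size
    return starts


SU = 64 * 1024  # assumed stripe unit

small = place_files([4096, 8192, 4096], SU)
mixed = place_files([4096, 200 * 1024, 4096], SU)
print(small)  # [0, 4096, 12288]
print(mixed)  # [0, 65536, 270336]
```

In the first run the small files are packed end to end; in the second
only the 200k file is pushed up to a stripe unit boundary.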

> I'm curious --- does it do this only when
> there are multiple files outstanding for delayed allocation in an 
> allocation group? 

Irrelevant - the consideration is solely to do with the state of the
current inode the allocation is being done for. If you're only
writing a single file, then it doesn't matter for performance
whether it is aligned or not. But it will matter from a free space
management POV, and hence for how the filesystem ages.
 
> If someone does a singleton cp of a large file
> without using fallocate, will XFS try to align the write?

The above should hopefully answer that question, especially with
respect to why delayed allocation should not be short-circuited by
using fallocate by default in generic system utilities.

> Also, if we are going to use fallocate() as a way of implicitly signaling
> to the file system that the file should be optimized for reads, as
> opposed to the write, maybe we should explicitly document it as such
> in the fallocate(2) man page, so that  application programmers
> understand that this is the semantics they should expect.

Preallocation is for preventing fragmentation that leads to
performance problems. Use of fallocate() does not imply the file
layout has been optimised for read access and, IMO, never should.

Quite frankly, if system utilities like cp and tar start to abuse
fallocate() by default so they can get "upfront ENOSPC detection",
then I will seriously consider making XFS use delayed allocation for
fallocate rather than unwritten extents so we don't lose the past 15
years' worth of IO and aging optimisations that delayed allocation
provides us with....
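
As an illustration of keeping preallocation opt-in rather than the
default, here is a minimal sketch of a copy helper. It is not what cp
or tar do; the function name copy_with_optional_prealloc and its
preallocate flag are hypothetical, and it assumes a platform where
Python's os.posix_fallocate() is available (e.g. Linux).

```python
import errno
import os
import tempfile


def copy_with_optional_prealloc(src_path, dst_path, preallocate=False):
    """Copy a file; reserve space up front only if explicitly requested.

    Returns True if space was preallocated, False otherwise.
    """
    reserved = False
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        if preallocate and size > 0:
            try:
                # Raises OSError(ENOSPC) here if the space is not available.
                os.posix_fallocate(dst.fileno(), 0, size)
                reserved = True
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    raise
                # Preallocation not usable here: fall back to plain writes.
        while True:
            chunk = src.read(1 << 20)
            if not chunk:
                break
            dst.write(chunk)
    return reserved


# Small demo in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    with open(src, "wb") as f:
        f.write(b"x" * 10000)
    # Default path: no fallocate, allocation is left to the filesystem.
    copy_with_optional_prealloc(src, os.path.join(tmp, "dst"))
    demo_size = os.path.getsize(os.path.join(tmp, "dst"))
    print(demo_size)
```

By default the helper just writes the data and leaves allocation
decisions to the filesystem; a caller has to ask for preallocation.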

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 31+ messages
2011-11-25 10:26 fallocate vs ENOSPC Pádraig Brady
2011-11-25 10:40 ` Christoph Hellwig
2011-11-27  3:14   ` Ted Ts'o
2011-11-27 23:43     ` Dave Chinner
2011-11-28  0:13       ` Pádraig Brady
2011-11-28  3:51         ` Dave Chinner
2011-11-28  0:40       ` Theodore Tso
2011-11-28  5:10         ` Dave Chinner [this message]
2011-11-28  8:55           ` Pádraig Brady
2011-11-28 10:41             ` tao.peng
2011-11-28 12:02               ` Pádraig Brady
2011-11-28 14:36             ` Theodore Tso
2011-11-28 14:51               ` Pádraig Brady
2011-11-28 20:29                 ` Ted Ts'o
2011-11-28 20:49                   ` Jeremy Allison
2011-11-29 22:39                     ` Eric Sandeen
2011-11-29 23:04                       ` Jeremy Allison
2011-11-29 23:19                         ` Eric Sandeen
2011-11-28 18:49               ` Jeremy Allison
2011-11-29  0:26                 ` Dave Chinner
2011-11-29  0:45                   ` Jeremy Allison
2011-11-29  0:24             ` Dave Chinner
2011-11-29 14:11               ` Pádraig Brady
2011-11-29 23:37                 ` Dave Chinner
2011-11-30  9:28                   ` Pádraig Brady
2011-11-30 15:32                     ` Ted Ts'o
2011-11-30 16:11                       ` Pádraig Brady
2011-11-30 17:01                         ` Ted Ts'o
2011-11-30 23:39                           ` Dave Chinner
2011-12-01  0:11                           ` Pádraig Brady
2011-12-07 11:42                             ` Pádraig Brady
