Re: fallocate vs ENOSPC - Pádraig Brady

From: "Pádraig Brady" <P@draigBrady.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Theodore Tso <tytso@MIT.EDU>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: fallocate vs ENOSPC
Date: Wed, 30 Nov 2011 09:28:32 +0000
Message-ID: <4ED5F740.8090005@draigBrady.com>
In-Reply-To: <20111129233729.GS7046@dastard>

On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection",
>>>>> then I will seriously consider making XFS use delayed allocation for
>>>>> fallocate rather than unwritten extents so we don't lose the past 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record I was considering fallocate() for these reasons.
>>>>
>>>>   1. Improved file layout for subsequent access
>>>>   2. Immediate indication of ENOSPC
>>>>   3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1. which I suppose could be somewhat
>>>> mitigated by not fallocating < say 1MB, though I suppose file systems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have alignment
>>> characteristics over 50MB (e.g. so each uncompressed 4k resolution
>>> video frame is located on a different set of non-overlapping disks),
>>> arbitrary "don't fallocate below this amount" heuristics will always
>>> have unforeseen failure cases...
>>
>> So about this alignment policy, I don't understand the issues so I'm guessing here.
> 
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default. Every filesystem behaves differently, and optimises
> allocation differently to be tuned for the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
> 
>> You say delalloc packs files, while fallocate() will align on XFS according to
>> the stripe config. Is that assuming that when writing lots of files, they
>> will be more likely to be read together, rather than independently?
> 
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for).  In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
> 
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame).  Systems doing this sort of work are made from lots of
> HW RAID5/6 Luns (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these Luns will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 14MB stripe unit so that each
> frame IO goes to a different RAID Lun. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly.  i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
> 
> This is the sort of application that cannot be done effectively
> without a lot of filesystem allocator support (indeed, XFS has the
> special filestreams allocation policy for this workload), and it's
> this sort of high performance application that we optimise
> preallocation for.
> 
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage. If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more.  Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes.  But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC.  This aggregation then
> avoids RMW cycles in the RAID array and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth).  If you don't pack small files tightly (and XFS won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array and the throughput is effectively going to be about
> half the IOPS of a random write workload....
> 
>> That's a big assumption if true. Also the converse is a big assumption, that
>> fallocate() should be aligned, as that's more likely to be read independently.
> 
> You're guessing, making assumptions, etc all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything.  It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
> 
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it is how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done. IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSPC
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
> 
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
> 
> I'd suggest that we'd implement a reservation mechanism as a
> separate fallocate() flag, to tell fallocate() to reserve the space
> over the given range rather than needing to preallocate it. I'd also
> suggest that a reservation is not persistent (e.g. only guaranteed
> to last for the life of the file descriptor the reservation was made
> for). That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
> 
> Does that sound reasonable?

But then posix_fallocate() would always be slow I think,
requiring one to actually write the NULs.
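
To make that cost concrete, here is a minimal Python sketch of up-front
allocation with an explicit fallback. It assumes a POSIX system where
os.posix_fallocate() is available; the helper name reserve_or_fill and
the 1MiB chunk size are purely illustrative. Where the filesystem has no
native preallocation, the NULs really have to be written, which is the
slow path I mean:

```python
import errno
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB of NULs per write in the fallback path (arbitrary)


def reserve_or_fill(fd, length):
    """Allocate `length` bytes for fd up front, or raise OSError (e.g. ENOSPC).

    Returns "fallocate" if posix_fallocate() did the work, or "written"
    if the NULs had to be written explicitly (hypothetical helper).
    """
    try:
        os.posix_fallocate(fd, 0, length)
        return "fallocate"
    except OSError as e:
        # Only fall back when the operation itself is unsupported;
        # ENOSPC and other errors go straight back to the caller.
        if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL):
            raise
    # Slow path: physically write the NULs so the blocks really exist.
    zeros = bytes(CHUNK)
    done = 0
    while done < length:
        done += os.pwrite(fd, zeros[: min(CHUNK, length - done)], done)
    return "written"


with tempfile.TemporaryFile() as f:
    how = reserve_or_fill(f.fileno(), 4 * CHUNK + 123)
    size = os.fstat(f.fileno()).st_size

print(how, size)
```

Either way the caller sees ENOSPC before copying any real data; the
difference is only in how much IO it took to find out.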

TBH, it sounds like the best/minimal change is to the uncommon case,
i.e. add an ALIGN flag to fallocate() which specialised apps like
those described above can use.
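
For reference, the alignment such apps depend on is simple arithmetic.
This is only a sketch using the figures quoted above (14MB stripe unit,
4-6 LUNs, ~12.5MB frames); it assumes plain round-robin striping, and
the function names are mine, not anything XFS exposes:

```python
MiB = 1 << 20


def stripe_width(stripe_unit, luns):
    """Filesystem-level stripe width: one stripe unit per LUN."""
    return stripe_unit * luns


def align_up(offset, stripe_unit):
    """Round offset up to the next stripe-unit boundary."""
    return -(-offset // stripe_unit) * stripe_unit


def lun_for(offset, stripe_unit, luns):
    """Index of the LUN holding the stripe unit that contains offset."""
    return (offset // stripe_unit) % luns


su = 14 * MiB              # stripe unit from the quoted example
frame = int(12.5 * MiB)    # approximate size of one video frame
luns = 4

print(stripe_width(su, 4) // MiB, stripe_width(su, 6) // MiB)  # 56 84

# Aligned: frame n starts on its own stripe unit, so each frame maps
# to exactly one LUN and consecutive frames go to different LUNs.
aligned = [lun_for(n * su, su, luns) for n in range(4)]

# Packed: frames laid end to end; (first LUN, last LUN) touched by each.
packed = [
    (lun_for(n * frame, su, luns), lun_for((n + 1) * frame - 1, su, luns))
    for n in range(4)
]

print(aligned)
print(packed)
```

With packing, most frames straddle a stripe-unit boundary and so touch
two LUNs, which is the case the aligned layout avoids.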

>>> As to efficient writing of NULL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NULL ranges
>>> when copying files. Indeed, the most efficient way of dealing with
>>> NULL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> well not for `cp --sparse=never` which might be used
>> so that processing of the copy will not result in ENOSPC.
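
The two behaviours being contrasted can be sketched as follows. This is
a simplified illustration, not how cp is implemented: the 4096 byte
block size and the copy() helper are assumptions. With sparse=True,
all-NUL blocks are skipped with a seek so the filesystem can leave a
hole; with sparse=False every NUL is written, as with --sparse=never:

```python
import os
import tempfile

BLOCK = 4096  # illustrative block size


def copy(src, dst, sparse):
    """Copy src to dst; when sparse, seek over all-NUL blocks instead of writing."""
    zero = bytes(BLOCK)
    total = 0
    while True:
        buf = src.read(BLOCK)
        if not buf:
            break
        if sparse and buf == zero[: len(buf)]:
            dst.seek(len(buf), os.SEEK_CUR)  # skip ahead, leaving a hole
        else:
            dst.write(buf)
        total += len(buf)
    dst.truncate(total)  # set the final size in case the file ends in a hole
    dst.flush()
    return total


data = b"head" + bytes(256 * BLOCK) + b"tail"

with tempfile.TemporaryFile() as src, \
        tempfile.TemporaryFile() as a, \
        tempfile.TemporaryFile() as b:
    src.write(data)
    src.seek(0)
    n_sparse = copy(src, a, sparse=True)
    src.seek(0)
    n_full = copy(src, b, sparse=False)
    a.seek(0)
    b.seek(0)
    sparse_bytes = a.read()
    full_bytes = b.read()
    # Allocated 512-byte blocks; the values depend on the filesystem.
    blocks = (os.fstat(a.fileno()).st_blocks, os.fstat(b.fileno()).st_blocks)

same = sparse_bytes == full_bytes == data
print(same, blocks)
```

Both copies read back identically; only the allocated block counts can
differ, and that difference is what decides whether later writes into
the NUL ranges can hit ENOSPC.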
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
> 
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc
> whilst also minimising long term aging effects as compared to
> preallocation:
> 
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
> 
>> Note also that the gold linker does fallocate() on output files by default.
> 
> "He's doing it, so we should do it" is not a very convincing
> technical argument.

Just FYI.

cheers,
Pádraig.