From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?ISO-8859-1?Q?P=E1draig_Brady?= <P@draigBrady.com>
Subject: Re: fallocate vs ENOSPC
Date: Wed, 30 Nov 2011 09:28:32 +0000
Message-ID: <4ED5F740.8090005@draigBrady.com>
References: <4ECF6D41.2040801@draigBrady.com> <20111125104050.GA26729@infradead.org> <20111127031455.GK5167@thunk.org> <20111127234331.GW2386@dastard> <AB63FBE6-A28A-4A46-81BF-E44E31E9AFD8@mit.edu> <20111128051054.GZ2386@dastard> <4ED34C66.8050300@draigBrady.com> <20111129002432.GA2386@dastard> <4ED4E824.4030107@draigBrady.com> <20111129233729.GS7046@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Theodore Tso <tytso@MIT.EDU>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org
To: Dave Chinner <david@fromorbit.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail3.vodafone.ie ([213.233.128.45]:45387 "EHLO
	mail3.vodafone.ie" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753450Ab1K3J2g (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 30 Nov 2011 04:28:36 -0500
In-Reply-To: <20111129233729.GS7046@dastard>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, P=E1draig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, P=E1draig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection"=
,
>>>>> then I will seriously consider making XFS use delayed allocation =
for
>>>>> fallocate rather than unwritten extents so we don't lose the past=
 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record I was considering fallocate() for these reasons.
>>>>
>>>>   1. Improved file layout for subsequent access
>>>>   2. Immediate indication of ENOSPC
>>>>   3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1. which I suppose could be somew=
hat
>>>> mitigated by not fallocating < say 1MB, though I suppose file syst=
ems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have al=
ignment
>>> characteristics over 50MB (e.g. so each uncompressed 4k resolution
>>> video frame is located on a different set of non-overlapping disks)=
,
>>> arbitrary "don't fallocate below this amount" heuristics will alway=
s
>>> have unforseen failure cases...
>>
>> So about this alignment policy, I don't understand the issues so I'm=
 guessing here.
>=20
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default. Every filesystem behaves differently, and is optimises
> allocation differently to be tuned for the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
>=20
>> You say delalloc packs files, while fallocate() will align on XFS ac=
cording to
>> the stripe config. Is that assuming that when writing lots of files,=
 that they
>> will be more likely to be read together, rather than independently.
>=20
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for).  In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
>=20
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame).  Systems doing this sort of work are made from lots of
> HW RAID5/6 Luns (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these Luns will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 16MB stripe unit so that each
> frame IO goes to a different RAID Lun. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly.  i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
>=20
> This is the sort of application that cannot be done effectively with
> a lot of filesystem allocator support (indeed, XFS has the special
> filestreams allocation policy for this workload), and it's this sort
> of high peformance application that what we optimise preallocation
> for.
>=20
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage. If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more.  Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes.  But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC.  This aggregation then
> avoids RMW cycles in the RAID array and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth).  If you don't pack small files tightly (and XFs won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array and the throughput is effective going to be about
> half the IOPS of a random write workload....
>=20
>> That's a big assumption if true. Also the converse is a big assumpti=
on, that
>> fallocate() should be aligned, as that's more likely to be read inde=
pendently.
>=20
> You're guessing, making assumptions, etc all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything.  It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
>=20
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done. IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSP=
C
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
>=20
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
>=20
> I'd suggest that we'd implement a reservation mechanism as a
> separate fallocate() flag, to tell fallocate() to reserve the space
> over the given range rather than needing to preallocate it. I'd also
> suggest that a reservation is not persistent (e.g. only guaranteed
> to last for the life of the file descriptor the reservation was made
> for). That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
>=20
> Does that sound reasonable?

But then posix_fallocate() would always be slow I think,
requiring one to actually write the NULs.

TBH, it sounds like the best/minimal change is to the uncommon case.
I.E. add an ALIGN flag to fallocate() which specialised apps like
described above can use.

>>> As to efficient writing of NULL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NULL ranges
>>> when copying files. Indeed, the most efficient way of dealing with
>>> NULL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> well not for `cp --sparse=3Dnever` which might be used
>> so that processing of the copy will not result in ENOSPC.
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
>=20
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc
> whilst also minimising long term aging effects as compared to
> preallocation:
>=20
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
>=20
>> Note also that the gold linker does fallocate() on output files by d=
efault.
>=20
> "He's doing it, so we should do it" is not a very convincing
> technical argument.

Just FYI.

cheers,
P=E1draig.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html