* fallocate vs ENOSPC
@ 2011-11-25 10:26 Pádraig Brady
  2011-11-25 10:40 ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread

From: Pádraig Brady @ 2011-11-25 10:26 UTC (permalink / raw)
To: linux-fsdevel

I was wondering about adding fallocate() to cp, where one of the
benefits would be an immediate indication of ENOSPC.

I'm now wondering, though: might fallocate() fail to allocate an
extent with ENOSPC even though there is fragmented space available
that write() could use?

If that is the case, then perhaps we could assume that any file system
returning ENOSPC from fallocate() provides accurate fstatvfs()
f_b{free,avail} values for a subsequent check, though needing to do
that seems hacky.

cheers,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 31+ messages in thread
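[Editor's note: the fallback Pádraig describes — treat an ENOSPC from fallocate() as advisory and double-check free space via fstatvfs() — could be sketched roughly as below. `ensure_space()` is a hypothetical helper for illustration, not anything cp actually implements.]

```c
/* Sketch only: try to reserve space up front; if fallocate reports
 * ENOSPC, fall back to an fstatvfs() free-space check as discussed
 * in the message above. */
#include <errno.h>
#include <fcntl.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Return 0 if len bytes look writable to fd, -1 otherwise. */
int ensure_space(int fd, off_t len)
{
    int err = posix_fallocate(fd, 0, len);  /* returns an errno value */
    if (err == 0)
        return 0;                           /* space reserved up front */
    if (err != ENOSPC)
        return -1;

    /* fallocate() may report ENOSPC even when enough fragmented space
     * exists for ordinary writes, so double-check with fstatvfs(). */
    struct statvfs vfs;
    if (fstatvfs(fd, &vfs) != 0)
        return -1;
    return ((off_t)vfs.f_bavail * (off_t)vfs.f_frsize >= len) ? 0 : -1;
}
```

Note that posix_fallocate(3) returns an error number rather than setting errno, and that (as the rest of the thread explains) a later write() can still fail, so a check like this is only an early warning, never a guarantee.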
* Re: fallocate vs ENOSPC
  2011-11-25 10:26 fallocate vs ENOSPC Pádraig Brady
@ 2011-11-25 10:40 ` Christoph Hellwig
  2011-11-27  3:14   ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread

From: Christoph Hellwig @ 2011-11-25 10:40 UTC (permalink / raw)
To: Pádraig Brady; +Cc: linux-fsdevel

On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
> I was wondering about adding fallocate() to cp,
> where one of the benefits would be immediate indication of ENOSPC.
> I'm now wondering though might fallocate() fail to allocate an
> extent with ENOSPC, but there could be fragmented space available to write()?

fallocate isn't guaranteed to allocate a single extent, or even
contiguous extents; it just allocates the given amount of space, and if
the fs isn't too fragmented and the allocator isn't braindead the result
will be sufficiently contiguous. Also, all Linux implementations may
still fail a later write in extreme corner cases, when btree splits or
other metadata operations during unwritten extent conversion go over
the space limit.
* Re: fallocate vs ENOSPC
  2011-11-25 10:40 ` Christoph Hellwig
@ 2011-11-27  3:14   ` Ted Ts'o
  2011-11-27 23:43     ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread

From: Ted Ts'o @ 2011-11-27 3:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Pádraig Brady, linux-fsdevel

On Fri, Nov 25, 2011 at 05:40:50AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
> > I was wondering about adding fallocate() to cp,
> > where one of the benefits would be immediate indication of ENOSPC.
> > I'm now wondering though might fallocate() fail to allocate an
> > extent with ENOSPC, but there could be fragmented space available to write()?
>
> fallocate isn't guaranteed to allocate a single or even contiguous
> extents, it just allocates the given amount of space, and if the fs isn't
> too fragmented and the allocator not braindead it will be sufficiently
> contiguous. Also all Linux implementations may still fail a write
> later in extreme corner cases when btree splits or other metadata
> operations during unwritten extent conversions go over the space limit.

While this is true, *usually* fallocate will allocate enough space, but
as Christoph has said, you still have to check the error returns of the
write(2) and close(2) system calls, and deal appropriately with any
errors.

The other reason to use fallocate is that if you are copying a huge
number of files, it's possible you'll get better block allocation
layout, depending on the file system, and on how insane the writeback
code for a particular kernel version might be. (Some versions of the
kernel had writeback algorithms that would write 4MB of one file, then
4MB of another file, then 4MB of yet another file, then 4MB of the first
file, etc. --- and some file systems can deal with this kind of write
pattern better than others.) Using fallocate if you know the size of
the file up front won't hurt, and on some systems it might help.

- Ted
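[Editor's note: Ted's point — every write(2), and then close(2), must be checked even after a successful fallocate() — can be sketched as below. `write_all()` is an illustrative helper, not actual cp code.]

```c
/* Sketch of the error checking Ted describes: even after a successful
 * fallocate(), every write() and the final close() can still fail
 * (e.g. with ENOSPC from metadata allocation), so both must be checked. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, retrying short writes.
 * Returns 0 on success, -1 on error with errno set. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry */
            return -1;          /* ENOSPC, EIO, ... surface here */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The caller must still check `close(fd) != -1` afterwards, since — as Ted notes later in the thread — some network file systems only push data (and report quota or space errors) at close time.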
* Re: fallocate vs ENOSPC
  2011-11-27  3:14 ` Ted Ts'o
@ 2011-11-27 23:43   ` Dave Chinner
  2011-11-28  0:13     ` Pádraig Brady
  2011-11-28  0:40     ` Theodore Tso
  0 siblings, 2 replies; 31+ messages in thread

From: Dave Chinner @ 2011-11-27 23:43 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Sat, Nov 26, 2011 at 10:14:55PM -0500, Ted Ts'o wrote:
> While this is true, *usually* fallocate will allocate enough space,
> but as Christoph has said, you still have to check the error returns
> for the write(2) and close(2) system call, and deal appropriately with
> any errors.
>
> The other reason to use fallocate is if you are copying a huge number
> of files, it's possible you'll get better block allocation layout,
> depending on the file system, and how insane the writeback code for a
> particular kernel version might be. (Some versions of the kernel had
> writeback algorithms that would write 4MB of one file, then 4MB for
> another file, then 4MB for yet another file, then 4MB of the first
> file, etc. --- and some file systems can deal with this kind of write
> pattern better than others.)

Right, but....

> Using fallocate if you know the size of
> the file up front won't hurt, and on some systems it might help.

... this is - as a generalisation - wrong. Up-front fallocate() can
and does hurt performance, even when you know the size of the file
ahead of time.

Why? Because it defeats the primary, seek-reducing writeback
optimisation that filesystems have these days: delayed allocation.
This has been mentioned before in previous threads where you've been
considering adding fallocate to cp, e.g.:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10819.html

fallocate()-style (or non-delalloc, write-syscall-time) allocation
leads to non-optimal file layouts and slower writeback, because the
location that blocks are allocated in no way matches the writeback
pattern, hence causing an increase in seeks during writeback of large
numbers of files.

Further, filesystems that are alignment aware (e.g. XFS) will align
every fallocate()-based allocation, greatly fragmenting free space when
used on small files on a filesystem sitting on a RAID array. In XFS,
however, delayed allocation will actually pack the allocations across
files tightly on disk, resulting in full stripe writes (even for
sub-stripe unit/width files) during writeback.

Delayed allocation allows workloads such as cp to run as a
bandwidth-bound operation, because allocation is optimised to cause
sequential write IO, whereas up-front fallocate() causes it to run as
an IOPS-bound operation, because the file layout does not match the
writeback pattern. And on large, high-performance RAID arrays,
bandwidth capacity is much, much higher than IOPS capacity, so delayed
allocation is going to be far faster and have less long-term impact on
the filesystem than using fallocate.

IOWs, use of fallocate() -by default- will speed filesystem aging,
because it removes the benefits delayed allocation has on both short
and long term filesystem performance.

The three major Linux filesystems (XFS, BTRFS and ext4) use delayed
allocation, and hence do not need fallocate() to be used by userspace
utilities like cp, tar, etc. to avoid fragmentation. If a given
filesystem is still prone to fragmentation of data extents when copying
data via cp or tar, then that is a problem with the filesystem that
needs to be fixed, not worked around in the userspace utilities in a
manner that is detrimental to other filesystems that don't suffer from
those problems...

Yes, fallocate can help reduce fragmentation and increase performance
in some situations, so making it an -option- for people who know what
they are doing is a good idea. However, it should not be made the
default, for all of the reasons above.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-27 23:43 ` Dave Chinner
@ 2011-11-28  0:13   ` Pádraig Brady
  2011-11-28  3:51     ` Dave Chinner
  2011-11-28  0:40   ` Theodore Tso
  1 sibling, 1 reply; 31+ messages in thread

From: Pádraig Brady @ 2011-11-28 0:13 UTC (permalink / raw)
To: Dave Chinner; +Cc: Ted Ts'o, Christoph Hellwig, linux-fsdevel

On 11/27/2011 11:43 PM, Dave Chinner wrote:
> fallocate() style (or non-delalloc, write syscall time) allocation
> leads to non-optimal file layouts and slower writeback because the
> location that blocks are allocated in no way matches the writeback
> pattern, hence causing an increase in seeks during writeback of
> large numbers of files.

I'm interpreting the above to mean that, in the presence of concurrent
writes to multiple files, fallocate() may cause slower _writes_, due to
bypassing the delalloc write scheduler. Subsequent reads of the file
should be no slower though, and perhaps faster, due to the greater
likelihood of all the blocks for the file being contiguous.

> Further, filesystems that are alignment aware (e.g. XFS) will align
> every fallocate() based allocation, greatly fragmenting free space
> when used on small files and the filesystem is on a RAID array.
> However, in XFS, delayed allocation will actually pack the
> allocation across files tightly on disk, resulting in full stripe
> writes (even for sub-stripe unit/width files) during writeback.

Interesting. So what are the typical alignments involved? If you had
to, what would you choose as a default minimum file size at which to
enable fallocate()?

> Delayed allocation allows workloads such as cp to run as a bandwidth
> bound operation because allocation is optimised to cause sequential
> write IO, whereas up-front fallocate() causes it to run as an IOPS
> bound operation because file layout does not match the writeback
> pattern. And on large, high performance RAID arrays, bandwidth
> capacity is much, much higher than IOPS capacity, so delayed
> allocation is going to be far faster and have less long term impact
> on the filesystem than using fallocate.

So the consequences are the same as those in the first paragraph?
Though I don't understand the detrimental "long term impact" you
mention.

thanks for the excellent info,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-28  0:13 ` Pádraig Brady
@ 2011-11-28  3:51   ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread

From: Dave Chinner @ 2011-11-28 3:51 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Ted Ts'o, Christoph Hellwig, linux-fsdevel

On Mon, Nov 28, 2011 at 12:13:31AM +0000, Pádraig Brady wrote:
> On 11/27/2011 11:43 PM, Dave Chinner wrote:
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
>
> I'm interpreting the above to mean that,
> in the presence of concurrent writes to multiple files,
> fallocate() may cause slower _writes_, due to bypassing the
> delalloc write scheduler.

It's not even concurrent writes. A single process writing multiple
files into cache serially does not necessarily result in writeback 30s
later writing the data to disk in the same order.

> Subsequent reads of the file should be no slower though,
> and perhaps faster, due to the greater likelihood of
> all the blocks for the file being contiguous.

If delayed allocation does its job, the files will be contiguous and
unfragmented, and no slower to read.

> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during writeback.
>
> Interesting. So what are the typical alignments involved.

The typical range of alignments can be anything from 16k through to
16MB or larger.

Consider this: a 1TB filesystem with a 1MB alignment unit (stripe unit
in the case of XFS) doing one 16k aligned allocation per file will run
out of aligned allocation slots after ~1,000,000 files have been
created. At that point, the largest contiguous free space in the
filesystem is now under 16MB. When you want to create that multi-GB
file now, it's going to have lots of extents rather than 1-2, because
the preallocation has spread the small file data all over the place.
If you used delayed allocation, the small file data would be packed
close together without alignment, leaving large, multi-GB free space
extents for the large file you then want to create....

> If you had to, what would you choose as a default min file size
> to enable fallocate() for?

I would not enable fallocate by default at all.

> > Delayed allocation allows workloads such as cp to run as a bandwidth
> > bound operation because allocation is optimised to cause sequential
> > write IO, whereas up-front fallocate() causes it to run as an IOPS
> > bound operation because file layout does not match the writeback
> > pattern. And on large, high performance RAID arrays, bandwidth
> > capacity is much, much higher than IOPS capacity, so delayed
> > allocation is going to be far faster and have less long term impact
> > on the filesystem than using fallocate.
>
> So the consequences are the same as those in the first paragraph?
> Though I don't understand the detrimental "long term impact" you mention.

Free space fragmentation will result in severe degradation of
performance as soon as all alignment-sized free spaces are partially
consumed. Then fragmentation will result from any large allocation,
i.e. small aligned preallocations accelerate filesystem aging effects
by increasing free space fragmentation. This typically won't be noticed
for months, until the fragmentation starts causing noticeable
performance issues - at which point it will be difficult, if not
impossible, to correct without a backup/mkfs/restore cycle....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
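[Editor's note: Dave's slot arithmetic can be sanity-checked with a trivial calculation; `aligned_slots()` is just an illustrative helper.]

```c
/* Quick check of the aging arithmetic above: a 1 TB filesystem with a
 * 1 MB alignment (stripe) unit offers about a million aligned
 * allocation slots, so one aligned small-file allocation per file
 * exhausts them after roughly 1,000,000 files. */

unsigned long long aligned_slots(unsigned long long fs_bytes,
                                 unsigned long long align_bytes)
{
    return fs_bytes / align_bytes;  /* number of aligned start points */
}

/* aligned_slots(1ULL << 40, 1ULL << 20) == 1048576, i.e. ~1,000,000 */
```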
* Re: fallocate vs ENOSPC
  2011-11-27 23:43 ` Dave Chinner
@ 2011-11-28  0:40   ` Theodore Tso
  2011-11-28  5:10     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread

From: Theodore Tso @ 2011-11-28 0:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Nov 27, 2011, at 6:43 PM, Dave Chinner wrote:
> fallocate() style (or non-delalloc, write syscall time) allocation
> leads to non-optimal file layouts and slower writeback because the
> location that blocks are allocated in no way matches the writeback
> pattern, hence causing an increase in seeks during writeback of
> large numbers of files.
>
> Further, filesystems that are alignment aware (e.g. XFS) will align
> every fallocate() based allocation, greatly fragmenting free space
> when used on small files and the filesystem is on a RAID array.
> However, in XFS, delayed allocation will actually pack the
> allocation across files tightly on disk, resulting in full stripe
> writes (even for sub-stripe unit/width files) during writeback.

Well, the question is whether you're optimizing for writing the files
or reading the files. In some cases, files are write once, read never
(well, almost never) --- i.e., the backup case. In other cases, the
files are write once, read many --- i.e., when installing software. In
that case, optimizing for file reading might mean that you want to make
the files aligned on RAID stripes, although it will fragment free
space. It all depends on what you're optimizing for.

I didn't realize that XFS was not aligning to RAID stripes when doing
delayed allocation writes. I'm curious --- does it do this only when
there are multiple files outstanding for delayed allocation in an
allocation group? If someone does a singleton cp of a large file
without using fallocate, will XFS try to align the write?

Also, if we are going to use fallocate() as a way of implicitly
signaling to the file system that the file should be optimized for
reads, as opposed to writes, maybe we should explicitly document it as
such in the fallocate(2) man page, so that application programmers
understand that this is the semantics they should expect.

-- Ted
* Re: fallocate vs ENOSPC
  2011-11-28  0:40 ` Theodore Tso
@ 2011-11-28  5:10   ` Dave Chinner
  2011-11-28  8:55     ` Pádraig Brady
  0 siblings, 1 reply; 31+ messages in thread

From: Dave Chinner @ 2011-11-28 5:10 UTC (permalink / raw)
To: Theodore Tso; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Sun, Nov 27, 2011 at 07:40:14PM -0500, Theodore Tso wrote:
> Well, the question is whether you're optimizing for writing the files,
> or reading the files. In some cases, files are write once, read never
> (well, almost never) --- i.e., the backup case. In other cases, the files
> are write once, read many --- i.e., when installing software.

Doesn't matter. If delayed allocation is doing its job properly, then
you'll get unfragmented files when they are written. Delayed allocation
is supposed to make up-front preallocation of disk space -unnecessary-
to prevent fragmentation. Using preallocation instead of delayed
allocation implies your delayed allocation implementation is
sub-optimal and needs to be fixed.

Indeed, there is no guarantee that preallocation will even lay the
files out in a sane manner that will give you good read speeds across
multiple files - it may place them so far apart that the seek penalty
between files is worse than having a few fragments...

> In that case, optimizing for the file reading might mean that you
> want to make the files aligned on RAID stripes, although it will
> fragment free space. It all depends on what you're optimizing
> for.

If you want to optimise for read speed - especially for small files or
random IO patterns - you want to *avoid* alignment to RAID stripes.
Doing so overloads the first disk in the RAID stripe, because all small
file reads (and writes) hit that disk/LUN in the stripe. Indeed, if you
have RAID5/6 and lots of small files, it is recommended that you turn
off filesystem alignment at mkfs time for XFS. SGI hit this problem
back in the early 90s, and it is one of the reasons that XFS lays its
metadata out such that it does not hot-spot one drive in a RAID stripe
trying to read/write frequently accessed metadata (e.g. AG headers).

> I didn't realize that XFS was not aligning to RAID stripes when doing
> delayed allocation writes.

It certainly does do alignment during delayed allocation.

/me waits for the "but you said..."

That's because XFS does -selective- alignment during delayed
allocation.... :)

What people seem to forget about delayed allocation is that when it
occurs, we have lots of information about the data being written that
is not available in the fallocate() context - how big the delalloc
extent is, how large the file currently is, how much more data needs to
be written, whether the file is still growing, etc. - and so delayed
allocation can make a much more informed decision about how to allocate
the data extents than fallocate() can.

For example, if the allocation is for offset zero of the file, the
filesystem is using aligned allocation and the file size is larger than
the stripe unit, the allocation will be stripe unit aligned. Hence, if
you've got lots of small files, they get packed, because aligned
allocation is not triggered and each allocation gets peeled from the
front edge of the same free space extent. If you've got large files,
then they get aligned, leaving space between them for the file to
potentially grow and fill full stripe units and widths. And if you've
got really large files still being written to, they get aligned and
over-allocated thanks to the speculative prealloc beyond EOF, which
effectively prevents fragmentation of large files due to interleaving
allocations between files when many files are being written
concurrently by writeback.....

> I'm curious --- does it do this only when
> there are multiple files outstanding for delayed allocation in an
> allocation group?

Irrelevant - the consideration is solely the state of the current inode
the allocation is being done for. If you're only writing a single file,
then it doesn't matter for performance whether it is aligned or not.
But it will matter from a free space management POV, and hence for how
the filesystem ages.

> If someone does a singleton cp of a large file
> without using fallocate, will XFS try to align the write?

The above should hopefully answer that question, especially with
respect to why delayed allocation should not be short-circuited by
using fallocate by default in generic system utilities.

> Also, if we are going to use fallocate() as a way of implicitly signaling
> to the file system that the file should be optimized for reads, as
> opposed to the write, maybe we should explicitly document it as such
> in the fallocate(2) man page, so that application programmers
> understand that this is the semantics they should expect.

Preallocation is for preventing fragmentation that leads to performance
problems. Use of fallocate() does not imply the file layout has been
optimised for read access and, IMO, never should.

Quite frankly, if system utilities like cp and tar start to abuse
fallocate() by default so they can get "upfront ENOSPC detection", then
I will seriously consider making XFS use delayed allocation for
fallocate rather than unwritten extents, so we don't lose the past 15
years' worth of IO and aging optimisations that delayed allocation
provides us with....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-28  5:10 ` Dave Chinner
@ 2011-11-28  8:55   ` Pádraig Brady
  2011-11-28 10:41     ` tao.peng
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread

From: Pádraig Brady @ 2011-11-28 8:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel

On 11/28/2011 05:10 AM, Dave Chinner wrote:
> Quite frankly, if system utilities like cp and tar start to abuse
> fallocate() by default so they can get "upfront ENOSPC detection",
> then I will seriously consider making XFS use delayed allocation for
> fallocate rather than unwritten extents so we don't lose the past 15
> years worth of IO and aging optimisations that delayed allocation
> provides us with....

For the record, I was considering fallocate() for these reasons:

1. Improved file layout for subsequent access
2. Immediate indication of ENOSPC
3. Efficient writing of NUL portions

You lucidly detailed issues with 1., which I suppose could be somewhat
mitigated by not fallocating files smaller than, say, 1MB, though I
suppose file systems could be smarter here and not preallocate small
chunks (or when otherwise not appropriate). We can already get ENOSPC
from a write() after an fallocate() in certain edge cases, so it would
probably make sense to expand those cases.

cheers,
Pádraig.
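[Editor's note: the mitigation Pádraig floats — only preallocating above a size threshold, so small files keep the benefits of delayed allocation — might look like the sketch below. Both the helper name and the 1MB cutoff are hypothetical, not something cp implements.]

```c
/* Hypothetical policy sketch: preallocate only when the source size
 * exceeds a threshold, so small files are left to delayed allocation.
 * The 1 MB cutoff is the one floated in the message above. */
#include <fcntl.h>
#include <unistd.h>

#define FALLOC_THRESHOLD (1024 * 1024)  /* 1 MB, hypothetical */

/* Returns 0 on success (including the small-file no-op case),
 * or a positive errno value from posix_fallocate(). */
int maybe_preallocate(int dest_fd, off_t src_size)
{
    if (src_size < FALLOC_THRESHOLD)
        return 0;                       /* small file: let delalloc work */
    return posix_fallocate(dest_fd, 0, src_size);
}
```

Note that, per Dave's objections, even a threshold like this only reduces the free-space fragmentation cost of aligned preallocations; it does not restore the writeback-ordering benefits of delayed allocation for the large files that are still preallocated.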
* RE: fallocate vs ENOSPC
  2011-11-28  8:55 ` Pádraig Brady
@ 2011-11-28 10:41   ` tao.peng
  2011-11-28 12:02     ` Pádraig Brady
  1 sibling, 1 reply; 31+ messages in thread

From: tao.peng @ 2011-11-28 10:41 UTC (permalink / raw)
To: Pádraig Brady, david; +Cc: tytso, hch, linux-fsdevel

> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org
> [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Pádraig Brady
> Sent: Monday, November 28, 2011 4:55 PM
> To: Dave Chinner
> Cc: Theodore Tso; Christoph Hellwig; linux-fsdevel@vger.kernel.org
> Subject: Re: fallocate vs ENOSPC
>
> For the record I was considering fallocate() for these reasons.
>
> 1. Improved file layout for subsequent access
> 2. Immediate indication of ENOSPC
> 3. Efficient writing of NUL portions
>
> You lucidly detailed issues with 1. which I suppose could be somewhat
> mitigated by not fallocating < say 1MB, though I suppose file systems
> could be smarter here and not preallocate small chunks (or when
> otherwise not appropriate). We can already get ENOSPC from a write()
> after an fallocate() in certain edge cases, so it would probably make
> sense to expand those cases.

Just out of curiosity, how is this going to work with sparse files? By
default, cp uses --sparse=auto, and for sparse files it avoids some
disk allocation automatically. With fallocate(), do you plan to change
the semantics?

Cheers,
Tao
> -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 10:41 ` tao.peng @ 2011-11-28 12:02 ` Pádraig Brady 0 siblings, 0 replies; 31+ messages in thread From: Pádraig Brady @ 2011-11-28 12:02 UTC (permalink / raw) To: tao.peng; +Cc: david, tytso, hch, linux-fsdevel On 11/28/2011 10:41 AM, tao.peng@emc.com wrote: > Just out of curiosity, how is it going to work with sparse files? By default, cp uses --sparse=auto. And for sparse files, it avoids some disk allocation automatically. With fallocate(), do you plan to change the semantics? With sparse files, coreutils currently uses fiemap (with a sync), and thus has full details of what is sparse or empty etc. So as detailed¹ on the coreutils list, the conversions would be: --sparse=auto => 'Empty' -> 'Empty' --sparse=always => 'Empty' -> 'Hole' --sparse=never => 'Hole' -> 'Empty' cheers, Pádraig. ¹ http://debbugs.gnu.org/9500#14 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
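[Editorial sketch: the data/hole map described above, recovered via lseek(SEEK_DATA)/lseek(SEEK_HOLE) rather than the FIEMAP ioctl coreutils actually used at the time; the information obtained is the same. Helper name is illustrative.]

```python
import os

def extent_map(path):
    """Return (offset, length, is_data) segments of a file.  On
    filesystems without hole reporting, the generic lseek() fallback
    simply returns the whole file as one data segment."""
    segs = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                # no data after pos: the rest is a trailing hole
                segs.append((pos, end - pos, False))
                break
            if data > pos:
                segs.append((pos, data - pos, False))  # hole before data
            hole = os.lseek(fd, data, os.SEEK_HOLE)    # EOF counts as a hole
            segs.append((data, hole - data, True))
            pos = hole
    finally:
        os.close(fd)
    return segs
```

With such a map, the --sparse=auto/always/never conversions above reduce to deciding, per segment, whether to copy data, seek past a hole, or write explicit NULs.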
* Re: fallocate vs ENOSPC 2011-11-28 8:55 ` Pádraig Brady 2011-11-28 10:41 ` tao.peng @ 2011-11-28 14:36 ` Theodore Tso 2011-11-28 14:51 ` Pádraig Brady 2011-11-28 18:49 ` Jeremy Allison 2011-11-29 0:24 ` Dave Chinner 2 siblings, 2 replies; 31+ messages in thread From: Theodore Tso @ 2011-11-28 14:36 UTC (permalink / raw) To: Pádraig Brady Cc: Theodore Tso, Dave Chinner, Christoph Hellwig, linux-fsdevel On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > You lucidly detailed issues with 1. which I suppose could be somewhat > mitigated by not fallocating < say 1MB, though I suppose file systems > could be smarter here and not preallocate small chunks (or when > otherwise not appropriate). We can already get ENOSPC from a write() > after an fallocate() in certain edge cases, so it would probably make > sense to expand those cases. I'm curious -- why are you so worried about ENOSPC? You need to check the error returns on write(2) anyway (and it's good programming practice anyways --- don't forget to check on close because some network file systems only push to the network on close, and in some cases they might only get quota errors on the close), so I don't see why using fallocate() to get an early ENOSPC is so interesting for you. -- Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
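[Editorial sketch of the error-checking discipline Ted describes: check every write(), retry short writes, and treat close() as fallible too. Python stand-in for illustration, not cp's actual code.]

```python
import os

def copy_fd(src_fd, dst_fd, bufsize=128 * 1024):
    """Copy src to dst until EOF, checking every write() and retrying
    short writes.  The caller must still check close() on dst: some
    network filesystems only report ENOSPC or EDQUOT when the data is
    flushed at close time."""
    total = 0
    while True:
        buf = os.read(src_fd, bufsize)
        if not buf:
            return total
        view = memoryview(buf)
        while view:
            n = os.write(dst_fd, view)  # raises OSError on ENOSPC etc.
            view = view[n:]
        total += len(buf)
```

The final `os.close(dst_fd)` belongs inside the same error-handled path as the writes.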
* Re: fallocate vs ENOSPC 2011-11-28 14:36 ` Theodore Tso @ 2011-11-28 14:51 ` Pádraig Brady 2011-11-28 20:29 ` Ted Ts'o 2011-11-28 18:49 ` Jeremy Allison 1 sibling, 1 reply; 31+ messages in thread From: Pádraig Brady @ 2011-11-28 14:51 UTC (permalink / raw) To: Theodore Tso; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/28/2011 02:36 PM, Theodore Tso wrote: > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > >> >> You lucidly detailed issues with 1. which I suppose could be somewhat >> mitigated by not fallocating < say 1MB, though I suppose file systems >> could be smarter here and not preallocate small chunks (or when >> otherwise not appropriate). We can already get ENOSPC from a write() >> after an fallocate() in certain edge cases, so it would probably make >> sense to expand those cases. > > I'm curious -- why are you so worried about ENOSPC? > > You need to check the error returns on write(2) anyway (and it's good > programming practice anyways --- don't forget to check on close because > some network file systems only push to the network on close, and in > some cases they might only get quota errors on the close), so I don't see > why using fallocate() to get an early ENOSPC is so interesting for you. It would be better to indicate ENOSPC _before_ copying a (potentially large) file to a (potentially slow) device. If the implementation complexity and side effects of doing this are sufficiently small, then it's worth doing. These discussions are to quantify the side effects. cheers, Pádraig. p.s. You can be sure that `cp` deals with errors from write() and close(). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 14:51 ` Pádraig Brady @ 2011-11-28 20:29 ` Ted Ts'o 2011-11-28 20:49 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Ted Ts'o @ 2011-11-28 20:29 UTC (permalink / raw) To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > It would be better to indicate ENOSPC _before_ copying a (potentially large) > file to a (potentially slow) device. If the implementation complexity > and side effects of doing this are sufficiently small, then it's worth > doing. These discussions are to quantify the side effects. In that case, why not use statfs(2) as a first approximation? You won't know for sure how much fs metadata will be required, but for the common case where someone is trying to fit 10 pounds of horse manure in a 5 pound bag, that can be caught very readily without needing to use fallocate(2). Regards, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
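[Editorial sketch: Ted's statfs(2) first approximation is essentially a one-liner via statvfs. The helper name is illustrative, and as he notes, it cannot account for metadata overhead.]

```python
import os

def space_probably_available(dir_path, size):
    """Compare the bytes about to be written against f_bavail (blocks
    free to unprivileged users) times the fragment size.  Cheaply
    catches the ten-pounds-in-a-five-pound-bag case; says nothing
    about metadata overhead or concurrent writers."""
    st = os.statvfs(dir_path)
    return size <= st.f_bavail * st.f_frsize
```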
* Re: fallocate vs ENOSPC 2011-11-28 20:29 ` Ted Ts'o @ 2011-11-28 20:49 ` Jeremy Allison 2011-11-29 22:39 ` Eric Sandeen 0 siblings, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-28 20:49 UTC (permalink / raw) To: Ted Ts'o Cc: Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: > On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > > It would be better to indicate ENOSPC _before_ copying a (potentially large) > > file to a (potentially slow) device. If the implementation complexity > > and side effects of doing this are sufficiently small, then it's worth > > doing. These discussions are to quantify the side effects. > > In that case, why not use statfs(2) as a first class approximation? > You won't know for user how much fs metadata will be required, but for > the common case where someone trying to fit 10 pounds of horse manure > in a 5 pound bag, that can be caught very readily without needing to > use fallocate(2). Yeah, we do that too, if the fallocate call fails. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 20:49 ` Jeremy Allison @ 2011-11-29 22:39 ` Eric Sandeen 2011-11-29 23:04 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Eric Sandeen @ 2011-11-29 22:39 UTC (permalink / raw) To: Jeremy Allison Cc: Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/28/11 2:49 PM, Jeremy Allison wrote: > On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: >> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: >>> It would be better to indicate ENOSPC _before_ copying a (potentially large) >>> file to a (potentially slow) device. If the implementation complexity >>> and side effects of doing this are sufficiently small, then it's worth >>> doing. These discussions are to quantify the side effects. >> >> In that case, why not use statfs(2) as a first class approximation? >> You won't know for user how much fs metadata will be required, but for >> the common case where someone trying to fit 10 pounds of horse manure >> in a 5 pound bag, that can be caught very readily without needing to >> use fallocate(2). > > Yeah, we do that too, if the fallocate call fails. That seems backwards to me; if fallocate fails, statfs(2) isn't going to reveal more space, is it? (modulo metadata issues, anyway?) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 22:39 ` Eric Sandeen @ 2011-11-29 23:04 ` Jeremy Allison 2011-11-29 23:19 ` Eric Sandeen 0 siblings, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-29 23:04 UTC (permalink / raw) To: Eric Sandeen Cc: Jeremy Allison, Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 04:39:08PM -0600, Eric Sandeen wrote: > On 11/28/11 2:49 PM, Jeremy Allison wrote: > > On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: > >> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > >>> It would be better to indicate ENOSPC _before_ copying a (potentially large) > >>> file to a (potentially slow) device. If the implementation complexity > >>> and side effects of doing this are sufficiently small, then it's worth > >>> doing. These discussions are to quantify the side effects. > >> > >> In that case, why not use statfs(2) as a first class approximation? > >> You won't know for user how much fs metadata will be required, but for > >> the common case where someone trying to fit 10 pounds of horse manure > >> in a 5 pound bag, that can be caught very readily without needing to > >> use fallocate(2). > > > > Yeah, we do that too, if the fallocate call fails. > > That seems backwards to me; if fallocate fails, statfs(2) isn't going > to reveal more space, is it? (modulo metadata issues, anyway?) It might if fallocate fails with ENOSYS :-). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 23:04 ` Jeremy Allison @ 2011-11-29 23:19 ` Eric Sandeen 0 siblings, 0 replies; 31+ messages in thread From: Eric Sandeen @ 2011-11-29 23:19 UTC (permalink / raw) To: Jeremy Allison Cc: Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/29/11 5:04 PM, Jeremy Allison wrote: > On Tue, Nov 29, 2011 at 04:39:08PM -0600, Eric Sandeen wrote: >> On 11/28/11 2:49 PM, Jeremy Allison wrote: >>> On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: >>>> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: >>>>> It would be better to indicate ENOSPC _before_ copying a (potentially large) >>>>> file to a (potentially slow) device. If the implementation complexity >>>>> and side effects of doing this are sufficiently small, then it's worth >>>>> doing. These discussions are to quantify the side effects. >>>> >>>> In that case, why not use statfs(2) as a first class approximation? >>>> You won't know for user how much fs metadata will be required, but for >>>> the common case where someone trying to fit 10 pounds of horse manure >>>> in a 5 pound bag, that can be caught very readily without needing to >>>> use fallocate(2). >>> >>> Yeah, we do that too, if the fallocate call fails. >> >> That seems backwards to me; if fallocate fails, statfs(2) isn't going >> to reveal more space, is it? (modulo metadata issues, anyway?) > > It might if fallocate fails with ENOSYS :-). Doh. Sorry, was not thinking of that failure. :) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
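[Editorial sketch of the fallback order this exchange settles on: try fallocate() first, and consult the free-space estimate only when the call itself is unsupported, since a plain ENOSPC from fallocate() is already the definitive answer. An illustration, not Samba's actual code; glibc may emulate posix_fallocate(3) on unsupported filesystems, so the fallback branch is rarely reached in practice.]

```python
import errno
import os

def reserve_or_estimate(fd, dir_path, size):
    """True if the space is (probably) there, False on ENOSPC or a
    too-small free-space estimate.  Helper name is illustrative."""
    try:
        os.posix_fallocate(fd, 0, size)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False                    # definitive: no space
        # call unsupported (ENOSYS/EOPNOTSUPP/...): coarse estimate
        st = os.statvfs(dir_path)
        return size <= st.f_bavail * st.f_frsize
```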
* Re: fallocate vs ENOSPC 2011-11-28 14:36 ` Theodore Tso 2011-11-28 14:51 ` Pádraig Brady @ 2011-11-28 18:49 ` Jeremy Allison 2011-11-29 0:26 ` Dave Chinner 1 sibling, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-28 18:49 UTC (permalink / raw) To: Theodore Tso Cc: Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > mitigated by not fallocating < say 1MB, though I suppose file systems > > could be smarter here and not preallocate small chunks (or when > > otherwise not appropriate). We can already get ENOSPC from a write() > > after an fallocate() in certain edge cases, so it would probably make > > sense to expand those cases. > > I'm curious -- why are you so worried about ENOSPC? > > You need to check the error returns on write(2) anyway (and it's good > programming practice anyways --- don't forget to check on close because > some network file systems only push to the network on close, and in > some cases they might only get quota errors on the close), so I don't see > why using fallocate() to get an early ENOSPC is so interesting for you. Unfortunately for Samba, Windows clients will *only* report ENOSPC to the userspace apps if the initial fallocation fails. Most of the Windows apps don't bother to check for write() fails after the initial allocation succeeds. We check for and report them back to the Windows client anyway of course, but most Windows apps just silently corrupt their data in this case. That's why we use fallocate() in Samba :-(. Jeremy. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 18:49 ` Jeremy Allison @ 2011-11-29 0:26 ` Dave Chinner 2011-11-29 0:45 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 0:26 UTC (permalink / raw) To: Jeremy Allison Cc: Theodore Tso, Pádraig Brady, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 10:49:40AM -0800, Jeremy Allison wrote: > On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > > mitigated by not fallocating < say 1MB, though I suppose file systems > > > could be smarter here and not preallocate small chunks (or when > > > otherwise not appropriate). We can already get ENOSPC from a write() > > > after an fallocate() in certain edge cases, so it would probably make > > > sense to expand those cases. > > > > I'm curious -- why are you so worried about ENOSPC? > > > > You need to check the error returns on write(2) anyway (and it's good > > programming practice anyways --- don't forget to check on close because > > some network file systems only push to the network on close, and in > > some cases they might only get quota errors on the close), so I don't see > > why using fallocate() to get an early ENOSPC is so interesting for you. > > Unfortunately for Samba, Windows clients will *only* report ENOSPC > to the userspace apps if the initial fallocation fails. Most of > the Windows apps don't bother to check for write() fails after > the initial allocation succeeds. > > We check for and report them back to the Windows client anyway of > course, but most Windows apps just silently corrupt their data in > this case. > > That's why we use fallocate() in Samba :-(. IOWs, what you really want is a space reservation mechanism. You've only got this preallocate hammer, so you use it, yes? Cheers, Dave. 
-- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 0:26 ` Dave Chinner @ 2011-11-29 0:45 ` Jeremy Allison 0 siblings, 0 replies; 31+ messages in thread From: Jeremy Allison @ 2011-11-29 0:45 UTC (permalink / raw) To: Dave Chinner Cc: Jeremy Allison, Theodore Tso, Pádraig Brady, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 11:26:29AM +1100, Dave Chinner wrote: > On Mon, Nov 28, 2011 at 10:49:40AM -0800, Jeremy Allison wrote: > > On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > > > > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > > > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > > > mitigated by not fallocating < say 1MB, though I suppose file systems > > > > could be smarter here and not preallocate small chunks (or when > > > > otherwise not appropriate). We can already get ENOSPC from a write() > > > > after an fallocate() in certain edge cases, so it would probably make > > > > sense to expand those cases. > > > > > > I'm curious -- why are you so worried about ENOSPC? > > > > > > You need to check the error returns on write(2) anyway (and it's good > > > programming practice anyways --- don't forget to check on close because > > > some network file systems only push to the network on close, and in > > > some cases they might only get quota errors on the close), so I don't see > > > why using fallocate() to get an early ENOSPC is so interesting for you. > > > > Unfortunately for Samba, Windows clients will *only* report ENOSPC > > to the userspace apps if the initial fallocation fails. Most of > > the Windows apps don't bother to check for write() fails after > > the initial allocation succeeds. > > > > We check for and report them back to the Windows client anyway of > > course, but most Windows apps just silently corrupt their data in > > this case. > > > > That's why we use fallocate() in Samba :-(. > > IOWs, what you really want is a space reservation mechanism. 
You've > only got this preallocate hammer, so you use it, yes? Yes, absolutely. We're just trying to provide the Windows semantics the clients expect. Jeremy. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 8:55 ` Pádraig Brady 2011-11-28 10:41 ` tao.peng 2011-11-28 14:36 ` Theodore Tso @ 2011-11-29 0:24 ` Dave Chinner 2011-11-29 14:11 ` Pádraig Brady 2 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 0:24 UTC (permalink / raw) To: Pádraig Brady; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: > On 11/28/2011 05:10 AM, Dave Chinner wrote: > > Quite frankly, if system utilities like cp and tar start to abuse > > fallocate() by default so they can get "upfront ENOSPC detection", > > then I will seriously consider making XFS use delayed allocation for > > fallocate rather than unwritten extents so we don't lose the past 15 > > years worth of IO and aging optimisations that delayed allocation > > provides us with.... > > For the record I was considering fallocate() for these reasons. > > 1. Improved file layout for subsequent access > 2. Immediate indication of ENOSPC > 3. Efficient writing of NUL portions > > You lucidly detailed issues with 1. which I suppose could be somewhat > mitigated by not fallocating < say 1MB, though I suppose file systems > could be smarter here and not preallocate small chunks (or when > otherwise not appropriate). When you consider that some high end filesystem deployments have alignment characteristics over 50MB (e.g. so each uncompressed 4k resolution video frame is located on a different set of non-overlapping disks), arbitrary "don't fallocate below this amount" heuristics will always have unforeseen failure cases... In short: leave optimising general allocation strategies to the filesystems and their developers - there is no One True Solution for optimal file layout in a given filesystem, let alone across different filesystems. 
In fact, I don't even want to think about the mess fallocate() on everything would make of btrfs because of its COW structure - it seems to me to guarantee worse fragmentation than using delayed allocation... > We can already get ENOSPC from a write() > after an fallocate() in certain edge cases, so it would probably make > sense to expand those cases. fallocate is for preallocation, not for ENOSPC detection. If you want efficient and effective ENOSPC detection before writing anything, then you really want a space -reservation- extension to fallocate. Filesystems that use delayed allocation already have a space reservation subsystem - it's how they account for space that is reserved by delayed allocation prior to the real allocation being done. IMO, allowing userspace some level of access to those reservations would be more appropriate for early detection of ENOSPC than using preallocation for everything... As to efficient writing of NULL ranges - that's what sparse files are for - you do not need to write or even preallocate NULL ranges when copying files. Indeed, the most efficient way of dealing with NULL ranges is to punch a hole and let the filesystem deal with it..... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
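[Editorial sketch of the sparse-file point above, for a freshly created output file: an all-NUL block needs neither fallocate() nor write(), since seeking past it leaves a hole. For ranges that are already allocated, the FALLOC_FL_PUNCH_HOLE flag to fallocate(2) deallocates them in place. Helper name and policy are illustrative.]

```python
import os

def write_block_or_hole(fd, buf):
    """Write a data block, or seek past an all-NUL block to leave a
    hole.  Returns True if real data was written; if the file ends in
    a hole the caller must ftruncate() to fix the final size."""
    if not buf.strip(b"\0"):                  # all-NUL (or empty) block
        os.lseek(fd, len(buf), os.SEEK_CUR)   # hole instead of data
        return False
    view = memoryview(buf)
    while view:
        view = view[os.write(fd, view):]      # retry short writes
    return True
```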
* Re: fallocate vs ENOSPC 2011-11-29 0:24 ` Dave Chinner @ 2011-11-29 14:11 ` Pádraig Brady 2011-11-29 23:37 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Pádraig Brady @ 2011-11-29 14:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On 11/29/2011 12:24 AM, Dave Chinner wrote: > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: >> On 11/28/2011 05:10 AM, Dave Chinner wrote: >>> Quite frankly, if system utilities like cp and tar start to abuse >>> fallocate() by default so they can get "upfront ENOSPC detection", >>> then I will seriously consider making XFS use delayed allocation for >>> fallocate rather than unwritten extents so we don't lose the past 15 >>> years worth of IO and aging optimisations that delayed allocation >>> provides us with.... >> >> For the record I was considering fallocate() for these reasons. >> >> 1. Improved file layout for subsequent access >> 2. Immediate indication of ENOSPC >> 3. Efficient writing of NUL portions >> >> You lucidly detailed issues with 1. which I suppose could be somewhat >> mitigated by not fallocating < say 1MB, though I suppose file systems >> could be smarter here and not preallocate small chunks (or when >> otherwise not appropriate). > > When you consider that some high end filesystem deployments have alignment > characteristics over 50MB (e.g. so each uncompressed 4k resolution > video frame is located on a different set of non-overlapping disks), > arbitrary "don't fallocate below this amount" heuristics will always > have unforseen failure cases... So about this alignment policy, I don't understand the issues so I'm guessing here. You say delalloc packs files, while fallocate() will align on XFS according to the stripe config. Is that assuming that when writing lots of files, that they will be more likely to be read together, rather than independently. That's a big assumption if true. 
Also the converse is a big assumption, that fallocate() should be aligned, as that's more likely to be read independently. > In short: leave optimising general allocation strategies to the > filesytems and their developers - there is no One True Solution for > optimal file layout in a given filesystem, let alone across > different filesytems. In fact, I don't even want to think about the > mess fallocate() on everything would make of btrfs because of it's > COW structure - it seems to me to guarantee worse fragmentation than > using delayed allocation... > >> We can already get ENOSPC from a write() >> after an fallocate() in certain edge cases, so it would probably make >> sense to expand those cases. > > fallocate is for preallocation, not for ENOSPC detection. If you > want efficient and effective ENOSPC detection before writing > anything, then you really want a space -reservation- extension to > fallocate. Filesystems that use delayed allocation already have a > space reservation subsystem - it how they account for space that is > reserved by delayed allocation prior to the real allocation being > done. IMO, allowing userspace some level of access to those > reservations would be more appropriate for early detection of ENOSPC > than using preallocation for everything... Fair enough, so fallocate() would be a superset of reserve(), though I'm having a hard time thinking of why one might ever need to fallocate() then. > As to efficient writing of NULL ranges - that's what sparse files > are for - you do not need to write or even preallocate NULL ranges > when copying files. Indeed, the most efficient way of dealing with > NULL ranges is to punch a hole and let the filesystem deal with > it..... well not for `cp --sparse=never` which might be used so that processing of the copy will not result in ENOSPC. I'm also linking here to a related discussion. 
http://oss.sgi.com/archives/xfs/2011-06/msg00064.html Note also that the gold linker does fallocate() on output files by default. cheers, Pádraig. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 14:11 ` Pádraig Brady @ 2011-11-29 23:37 ` Dave Chinner 2011-11-30 9:28 ` Pádraig Brady 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 23:37 UTC (permalink / raw) To: Pádraig Brady; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote: > On 11/29/2011 12:24 AM, Dave Chinner wrote: > > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: > >> On 11/28/2011 05:10 AM, Dave Chinner wrote: > >>> Quite frankly, if system utilities like cp and tar start to abuse > >>> fallocate() by default so they can get "upfront ENOSPC detection", > >>> then I will seriously consider making XFS use delayed allocation for > >>> fallocate rather than unwritten extents so we don't lose the past 15 > >>> years worth of IO and aging optimisations that delayed allocation > >>> provides us with.... > >> > >> For the record I was considering fallocate() for these reasons. > >> > >> 1. Improved file layout for subsequent access > >> 2. Immediate indication of ENOSPC > >> 3. Efficient writing of NUL portions > >> > >> You lucidly detailed issues with 1. which I suppose could be somewhat > >> mitigated by not fallocating < say 1MB, though I suppose file systems > >> could be smarter here and not preallocate small chunks (or when > >> otherwise not appropriate). > > > > When you consider that some high end filesystem deployments have alignment > > characteristics over 50MB (e.g. so each uncompressed 4k resolution > > video frame is located on a different set of non-overlapping disks), > > arbitrary "don't fallocate below this amount" heuristics will always > > have unforseen failure cases... > > So about this alignment policy, I don't understand the issues so I'm guessing here. Which, IMO, is exactly why you shouldn't be using fallocate() by default. 
Every filesystem behaves differently, and optimises allocation differently, tuned for the filesystem's unique structure and capability. fallocate() is a big hammer that ensures filesystems cannot optimise allocation to match observed operational patterns. > You say delalloc packs files, while fallocate() will align on XFS according to > the stripe config. Is that assuming that when writing lots of files, that they > will be more likely to be read together, rather than independently. No, it's assuming that preallocation is used for enabling extremely high performance, high bandwidth IO. This is what it has been used for in XFS for the past 10+ years, and so that is what the implementation in XFS is optimised for (and will continue to be optimised for). In this environment, even when the file size is smaller than the alignment unit, we want allocation alignment to be done. A real world example for you: supporting multiple, concurrent, realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB per frame). Systems doing this sort of work are made from lots of HW RAID5/6 Luns (often spread across multiple arrays) that will have a stripe width of 14MB. XFS will be configured with a stripe unit of 14MB. 4-6 of these Luns will be striped together to give a stripe width of 56-84MB from a filesystem perspective. Each file that is preallocated needs to be aligned to a 16MB stripe unit so that each frame IO goes to a different RAID Lun. Each frame write can be done as a full stripe write without a RMW cycle in the back end array, and each frame read loads all the disks in the LUN evenly. i.e. the load is distributed evenly, optimally and deterministically across all the back end storage. This is the sort of application that cannot be done effectively without a lot of filesystem allocator support (indeed, XFS has the special filestreams allocation policy for this workload), and it's this sort of high performance application that we optimise preallocation for. 
In short, what XFS is doing here is optimising allocation patterns for high performance, RAID based storage. If your write pattern triggers repeated RMW cycles in a RAID array, your write performance will fall by an order of magnitude or more. Large files don't need packing because the writeback flusher threads can do full stripe writes which avoids RMW cycles in the RAID array if the files are aligned to the underlying RAID stripes. But small files need tight packing to enable them to be aggregated into full stripe writes in the elevator and/or RAID controller BBWC. This aggregation then avoids RMW cycles in the RAID array and hence writeback performance for both small and large files is similar (i.e. close to maximum IO bandwidth). If you don't pack small files tightly (and XFS won't if you use preallocation), then each file write will cause a RMW cycle in the RAID array and the throughput is effectively going to be about half the IOPS of a random write workload.... > That's a big assumption if true. Also the converse is a big assumption, that > fallocate() should be aligned, as that's more likely to be read independently. You're guessing, making assumptions, etc., all about how one filesystem works and what the impact of the change is going to be. What about ext4, or btrfs? They are very different structurally to XFS, and hence have different sets of issues when you start preallocating everything. It is not a simple problem: allocation optimisation is, IMO, the single most difficult and complex area of filesystems, with many different, non-obvious, filesystem specific trade-offs to be made.... > > fallocate is for preallocation, not for ENOSPC detection. If you > > want efficient and effective ENOSPC detection before writing > > anything, then you really want a space -reservation- extension to > > fallocate. 
> > Filesystems that use delayed allocation already have a
> > space reservation subsystem - it's how they account for space that is
> > reserved by delayed allocation prior to the real allocation being
> > done. IMO, allowing userspace some level of access to those
> > reservations would be more appropriate for early detection of ENOSPC
> > than using preallocation for everything...
>
> Fair enough, so fallocate() would be a superset of reserve(),
> though I'm having a hard time thinking of why one might ever need to
> fallocate() then.

Exactly my point - the number of applications that actually need
-preallocation- for performance reasons is actually quite small.

I'd suggest we implement a reservation mechanism as a separate
fallocate() flag, to tell fallocate() to reserve the space over the
given range rather than needing to preallocate it. I'd also suggest
that a reservation is not persistent (e.g. only guaranteed to last
for the life of the file descriptor the reservation was made for).
That would make it simple to implement in memory for all filesystems,
and provide you with the short-term ENOSPC-or-success style
reservation you are looking for...

Does that sound reasonable?

> > As to efficient writing of NUL ranges - that's what sparse files
> > are for - you do not need to write or even preallocate NUL ranges
> > when copying files. Indeed, the most efficient way of dealing with
> > NUL ranges is to punch a hole and let the filesystem deal with
> > it.....
>
> Well, not for `cp --sparse=never`, which might be used
> so that processing of the copy will not result in ENOSPC.
>
> I'm also linking here to a related discussion.
> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html

Right, and from that discussion you can see exactly why delayed
allocation in XFS significantly improves both data and metadata
allocation and IO patterns for operations like tar, cp, rsync, etc.,
whilst also minimising long term aging effects as compared to
preallocation:

http://oss.sgi.com/archives/xfs/2011-06/msg00092.html

> Note also that the gold linker does fallocate() on output files by default.

"He's doing it, so we should do it" is not a very convincing
technical argument.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-29 23:37 ` Dave Chinner
@ 2011-11-30  9:28 ` Pádraig Brady
  2011-11-30 15:32 ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-11-30 9:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel

On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection",
>>>>> then I will seriously consider making XFS use delayed allocation for
>>>>> fallocate rather than unwritten extents so we don't lose the past 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record, I was considering fallocate() for these reasons:
>>>>
>>>> 1. Improved file layout for subsequent access
>>>> 2. Immediate indication of ENOSPC
>>>> 3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1., which I suppose could be somewhat
>>>> mitigated by not fallocating < say 1MB, though I suppose file systems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have alignment
>>> characteristics over 50MB (e.g. so each uncompressed 4k resolution
>>> video frame is located on a different set of non-overlapping disks),
>>> arbitrary "don't fallocate below this amount" heuristics will always
>>> have unforeseen failure cases...
>>
>> So about this alignment policy, I don't understand the issues so I'm guessing here.
>
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default.
> Every filesystem behaves differently, and optimises
> allocation differently, tuned to the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
>
>> You say delalloc packs files, while fallocate() will align on XFS according to
>> the stripe config. Is that assuming that when writing lots of files, they
>> will be more likely to be read together, rather than independently?
>
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for). In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
>
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame). Systems doing this sort of work are made from lots of
> HW RAID5/6 LUNs (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these LUNs will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 14MB stripe unit so that each
> frame IO goes to a different RAID LUN. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly, i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
>
> This is the sort of application that cannot be done effectively
> without a lot of filesystem allocator support (indeed, XFS has the
> special filestreams allocation policy for this workload), and it's
> this sort of high performance application that we optimise
> preallocation for.
>
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage. If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more. Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes, which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes. But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC. This aggregation then
> avoids RMW cycles in the RAID array, and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth). If you don't pack small files tightly (and XFS won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array, and the throughput is effectively going to be
> about half the IOPS of a random write workload....
>
>> That's a big assumption if true. Also the converse is a big assumption, that
>> fallocate() should be aligned, as that's more likely to be read independently.
>
> You're guessing, making assumptions, etc., all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything. It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
>
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it's how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done. IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSPC
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
>
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
>
> I'd suggest we implement a reservation mechanism as a separate
> fallocate() flag, to tell fallocate() to reserve the space over the
> given range rather than needing to preallocate it. I'd also suggest
> that a reservation is not persistent (e.g. only guaranteed to last
> for the life of the file descriptor the reservation was made for).
> That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
>
> Does that sound reasonable?

But then posix_fallocate() would always be slow I think,
requiring one to actually write the NULs.

TBH, it sounds like the best/minimal change is to the uncommon case,
i.e. add an ALIGN flag to fallocate() which specialised apps like
those described above can use.

>>> As to efficient writing of NUL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NUL ranges
>>> when copying files.
>>> Indeed, the most efficient way of dealing with
>>> NUL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> Well, not for `cp --sparse=never`, which might be used
>> so that processing of the copy will not result in ENOSPC.
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
>
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc.,
> whilst also minimising long term aging effects as compared to
> preallocation:
>
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
>
>> Note also that the gold linker does fallocate() on output files by default.
>
> "He's doing it, so we should do it" is not a very convincing
> technical argument.

Just FYI.

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-30  9:28 ` Pádraig Brady
@ 2011-11-30 15:32 ` Ted Ts'o
  2011-11-30 16:11 ` Pádraig Brady
  0 siblings, 1 reply; 31+ messages in thread
From: Ted Ts'o @ 2011-11-30 15:32 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 09:28:32AM +0000, Pádraig Brady wrote:
>
> But then posix_fallocate() would always be slow I think,
> requiring one to actually write the NULs.

Almost no one should ever use posix_fallocate(); it can be a
performance disaster because you don't know whether or not the file
system will really do fallocate, or will do the slow "write zeros"
thing.

You really should use fallocate(), take the failure if the file system
doesn't support fallocate, and then you can decide what the
appropriate thing to do might be.

- Ted
* Re: fallocate vs ENOSPC
  2011-11-30 15:32 ` Ted Ts'o
@ 2011-11-30 16:11 ` Pádraig Brady
  2011-11-30 17:01 ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-11-30 16:11 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On 11/30/2011 03:32 PM, Ted Ts'o wrote:
> On Wed, Nov 30, 2011 at 09:28:32AM +0000, Pádraig Brady wrote:
>>
>> But then posix_fallocate() would always be slow I think,
>> requiring one to actually write the NULs.
>
> Almost no one should ever use posix_fallocate(); it can be a
> performance disaster because you don't know whether or not the file
> system will really do fallocate, or will do the slow "write zeros"
> thing.
>
> You really should use fallocate(), take the failure if the file system
> doesn't support fallocate, and then you can decide what the
> appropriate thing to do might be.

s/posix_fallocate()/functionality provided by &/

I.e. `cp --sparse=never` could use that, and it would be beneficial
if it was as fast as possible.

I took a quick look at the XFS preallocation behaviour,
and it seems that these ioctls predate fallocate():
http://linux.die.net/man/3/xfsctl
I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
So fallocate() support was directly mapped on top of the existing ALLOCSP.

I think the specialised alignment behaviour should be restricted to
direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
Better would be to provide generic access to that functionality
through an ALIGN option to fallocate().

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-30 16:11 ` Pádraig Brady
@ 2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  0 siblings, 2 replies; 31+ messages in thread
From: Ted Ts'o @ 2011-11-30 17:01 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
> I took a quick look at the XFS preallocation behaviour,
> and it seems that these ioctls predate fallocate():
> http://linux.die.net/man/3/xfsctl
> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
> So fallocate() support was directly mapped on top of the existing ALLOCSP.
> I think the specialised alignment behaviour should be restricted to
> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
> Better would be to provide generic access to that functionality
> through an ALIGN option to fallocate().

Well, XFS_IOC_RESVSP is the same as fallocate with the
FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
attached to the inode --- which blocks out of the pool of free blocks
should be selected is decided at the time that you call fallocate()
with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
the way, works on any file system that supports fallocate on modern
kernels --- the kernel provides the translation from XFS_IOC_RESVSP
to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).

What Dave was talking about is something different. He's suggesting a
new call which reserves space, but which does not actually make the
block allocation decision until the time of the write. He suggested
tying it to the file descriptor, but I wonder if it's actually more
functional to tie it to the process --- that is, the process says,
"guarantee that I will be able to write 5MB", and writes made by that
process get counted against that 5MB reservation. When the process
exits, any reservation made by that process evaporates.
Whether we tie this space reservation to an fd or a process, we also
would need to decide up front whether this space shows up as
"missing" in statfs(2)/df or not.

- Ted
* Re: fallocate vs ENOSPC
  2011-11-30 17:01 ` Ted Ts'o
@ 2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  1 sibling, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2011-11-30 23:39 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Pádraig Brady, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 12:01:16PM -0500, Ted Ts'o wrote:
> What Dave was talking about is something different. He's suggesting a
> new call which reserves space, but which does not actually make the
> block allocation decision until the time of the write. He suggested
> tying it to the file descriptor, but I wonder if it's actually more
> functional to tie it to the process --- that is, the process says,
> "guarantee that I will be able to write 5MB", and writes made by that
> process get counted against that 5MB reservation. When the process
> exits, any reservation made by that process evaporates.

It needs to be tied to the inode in some way - there are metadata
reservations that need to be made per inode alongside the delayed
allocation reservations, to take into account the potential need to
allocate extent tree blocks as well. If we don't do this, then we'll
get ENOSPC reported for writes during writeback that should have
succeeded. And that is a Bad Thing.

Further, you need to track all the ranges that have space reserved,
like a special type of delayed allocation extent. That way, when the
write() comes along into the reserved range, you don't account for it
a second time as delayed allocation, as the space usage has already
been accounted for.

And then there is the problem of freeing space that you don't use.
Close the fd and you automatically terminate the reservation. fiemap
can be used to find unused reserved ranges. You could probably even
release them by punching the range.

If you have a per-process pool, how do you only use it for the
write() calls you want, on the file you want, over the range you
wanted reserved?
And when you have finished writing to that file, how do you release
any unused reservation? How do you know that you've got reservations
remaining?

Then the interesting questions start - how does a per-process
reservation interact with quotas? The quota needs to be checked when
the reservation is made, and without knowing what file it is being
made for this cannot be done sanely. Especially for project
quotas....

Also, per-process reservation pools can't really be managed through
existing APIs, so we'd need new ones. And then we'd be asking
application developers to use two different models for almost
identical functionality, which means they'll just use the one that is
most effective for their purpose (i.e. fallocate(), because they
already have an fd open on the file they are going to write to).

IOWs, all I see from an implementation perspective of per-process
reservation pools is complexity and nasty corner cases. And from the
user perspective, an API that doesn't match up with the operations at
hand, i.e. that of writing a file....

> Whether we tie this space reservation to a fd or a process, we also
> would need to decide up front whether this space shows up as "missing"
> by statfs(2)/df or not.

IMO, reserved space is used space - it's not free for just anyone to
use anymore, and it has to be checked and accounted against quotas
even before it gets used....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
@ 2011-12-01  0:11 ` Pádraig Brady
  2011-12-07 11:42 ` Pádraig Brady
  1 sibling, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-12-01 0:11 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On 11/30/2011 05:01 PM, Ted Ts'o wrote:
> On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
>> I took a quick look at the XFS preallocation behaviour,
>> and it seems that these ioctls predate fallocate():
>> http://linux.die.net/man/3/xfsctl
>> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
>> So fallocate() support was directly mapped on top of the existing ALLOCSP.
>> I think the specialised alignment behaviour should be restricted to
>> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
>> Better would be to provide generic access to that functionality
>> through an ALIGN option to fallocate().
>
> Well, XFS_IOC_RESVSP is the same as fallocate with the
> FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
> attached to the inode --- which blocks out of the pool of free blocks
> should be selected is decided at the time that you call fallocate()
> with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
> the way, works on any file system that supports fallocate on modern
> kernels --- the kernel provides the translation from XFS_IOC_RESVSP
> to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).

Thanks for the clarification.
My main point is that these related ioctls existed before fallocate().

> What Dave was talking about is something different. He's suggesting a
> new call which reserves space, but which does not actually make the
> block allocation decision until the time of the write.

Yes, that was clear.
I'm still not sure it's needed, TBH.
The separation of functionality is needed for the reasons Dave
detailed, but it might be better to add an ALIGN flag to fallocate()
for that special use case. I'm not trying to reinforce my argument
with repetition here, just trying to be clear.

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-12-01  0:11 ` Pádraig Brady
@ 2011-12-07 11:42 ` Pádraig Brady
  0 siblings, 0 replies; 31+ messages in thread
From: Pádraig Brady @ 2011-12-07 11:42 UTC (permalink / raw)
To: linux-fsdevel

On 12/01/2011 12:11 AM, Pádraig Brady wrote:
> On 11/30/2011 05:01 PM, Ted Ts'o wrote:
>> On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
>>> I took a quick look at the XFS preallocation behaviour,
>>> and it seems that these ioctls predate fallocate():
>>> http://linux.die.net/man/3/xfsctl
>>> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
>>> So fallocate() support was directly mapped on top of the existing ALLOCSP.
>>> I think the specialised alignment behaviour should be restricted to
>>> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
>>> Better would be to provide generic access to that functionality
>>> through an ALIGN option to fallocate().
>>
>> Well, XFS_IOC_RESVSP is the same as fallocate with the
>> FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
>> attached to the inode --- which blocks out of the pool of free blocks
>> should be selected is decided at the time that you call fallocate()
>> with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
>> the way, works on any file system that supports fallocate on modern
>> kernels --- the kernel provides the translation from XFS_IOC_RESVSP
>> to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).
>
> Thanks for the clarification.
> My main point is that these related ioctls existed before fallocate().
>
>> What Dave was talking about is something different. He's suggesting a
>> new call which reserves space, but which does not actually make the
>> block allocation decision until the time of the write.
>
> Yes, that was clear.
> I'm still not sure it's needed, TBH.
> The separation of functionality is needed for the reasons Dave
> detailed, but it might be better to add an ALIGN flag to fallocate()
> for that special use case. I'm not trying to reinforce my argument
> with repetition here, just trying to be clear.

Is XFS the only file system that overloads this alignment behaviour
on fallocate()?

Why I ask is: if that were the case, then perhaps XFS could change to
using the FALLOC_FL_ALIGN flag for this (or its existing ioctl), and
so would not be negatively impacted by tools which start using
fallocate(), unaware of the subtle performance implications on XFS.

cheers,
Pádraig.
end of thread, other threads: [~2011-12-07 11:42 UTC | newest]

Thread overview: 31+ messages -- links below jump to the message on this page:
  2011-11-25 10:26 fallocate vs ENOSPC Pádraig Brady
  2011-11-25 10:40 ` Christoph Hellwig
  2011-11-27  3:14 ` Ted Ts'o
  2011-11-27 23:43 ` Dave Chinner
  2011-11-28  0:13 ` Pádraig Brady
  2011-11-28  3:51 ` Dave Chinner
  2011-11-28  0:40 ` Theodore Tso
  2011-11-28  5:10 ` Dave Chinner
  2011-11-28  8:55 ` Pádraig Brady
  2011-11-28 10:41 ` tao.peng
  2011-11-28 12:02 ` Pádraig Brady
  2011-11-28 14:36 ` Theodore Tso
  2011-11-28 14:51 ` Pádraig Brady
  2011-11-28 20:29 ` Ted Ts'o
  2011-11-28 20:49 ` Jeremy Allison
  2011-11-29 22:39 ` Eric Sandeen
  2011-11-29 23:04 ` Jeremy Allison
  2011-11-29 23:19 ` Eric Sandeen
  2011-11-28 18:49 ` Jeremy Allison
  2011-11-29  0:26 ` Dave Chinner
  2011-11-29  0:45 ` Jeremy Allison
  2011-11-29  0:24 ` Dave Chinner
  2011-11-29 14:11 ` Pádraig Brady
  2011-11-29 23:37 ` Dave Chinner
  2011-11-30  9:28 ` Pádraig Brady
  2011-11-30 15:32 ` Ted Ts'o
  2011-11-30 16:11 ` Pádraig Brady
  2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  2011-12-07 11:42 ` Pádraig Brady