* fallocate vs ENOSPC
@ 2011-11-25 10:26 Pádraig Brady
  2011-11-25 10:40 ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread

From: Pádraig Brady @ 2011-11-25 10:26 UTC (permalink / raw)
To: linux-fsdevel

I was wondering about adding fallocate() to cp, where one of the
benefits would be an immediate indication of ENOSPC.

I'm now wondering, though: might fallocate() fail to allocate an
extent with ENOSPC even though there is fragmented space available
that write() could use?

If that is the case, then perhaps we could assume that any file system
returning ENOSPC from fallocate() provides accurate fstatvfs()
f_b{free,avail} values for a subsequent check, though needing to do
that seems hacky.

cheers,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 31+ messages in thread
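[Editor's note: the fallback Pádraig describes — treat an ENOSPC from fallocate() as advisory and double-check free space via fstatvfs() — could be sketched roughly as below. `ensure_space()` is a hypothetical helper for illustration, not anything cp actually implements.]

```c
/* Sketch only: try to reserve space up front; if fallocate reports
 * ENOSPC, fall back to an fstatvfs() free-space check as discussed
 * in the message above. */
#include <errno.h>
#include <fcntl.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Return 0 if len bytes look writable to fd, -1 otherwise. */
int ensure_space(int fd, off_t len)
{
    int err = posix_fallocate(fd, 0, len);  /* returns an errno value */
    if (err == 0)
        return 0;                           /* space reserved up front */
    if (err != ENOSPC)
        return -1;

    /* fallocate() may report ENOSPC even when enough fragmented space
     * exists for ordinary writes, so double-check with fstatvfs(). */
    struct statvfs vfs;
    if (fstatvfs(fd, &vfs) != 0)
        return -1;
    return ((off_t)vfs.f_bavail * (off_t)vfs.f_frsize >= len) ? 0 : -1;
}
```

Note that posix_fallocate(3) returns an error number rather than setting errno, and that (as the rest of the thread explains) a later write() can still fail, so a check like this is only an early warning, never a guarantee.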
* Re: fallocate vs ENOSPC
  2011-11-25 10:26 fallocate vs ENOSPC Pádraig Brady
@ 2011-11-25 10:40 ` Christoph Hellwig
  2011-11-27  3:14   ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread

From: Christoph Hellwig @ 2011-11-25 10:40 UTC (permalink / raw)
To: Pádraig Brady; +Cc: linux-fsdevel

On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
> I was wondering about adding fallocate() to cp,
> where one of the benefits would be immediate indication of ENOSPC.
> I'm now wondering though might fallocate() fail to allocate an
> extent with ENOSPC, but there could be fragmented space available to write()?

fallocate isn't guaranteed to allocate a single extent, or even
contiguous extents; it just allocates the given amount of space, and if
the fs isn't too fragmented and the allocator isn't braindead the result
will be sufficiently contiguous. Also, all Linux implementations may
still fail a later write in extreme corner cases, when btree splits or
other metadata operations during unwritten extent conversion go over
the space limit.
* Re: fallocate vs ENOSPC
  2011-11-25 10:40 ` Christoph Hellwig
@ 2011-11-27  3:14   ` Ted Ts'o
  2011-11-27 23:43     ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread

From: Ted Ts'o @ 2011-11-27 3:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Pádraig Brady, linux-fsdevel

On Fri, Nov 25, 2011 at 05:40:50AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
> > I was wondering about adding fallocate() to cp,
> > where one of the benefits would be immediate indication of ENOSPC.
> > I'm now wondering though might fallocate() fail to allocate an
> > extent with ENOSPC, but there could be fragmented space available to write()?
>
> fallocate isn't guaranteed to allocate a single or even contiguous
> extents, it just allocates the given amount of space, and if the fs isn't
> too fragmented and the allocator not braindead it will be sufficiently
> contiguous. Also all Linux implementations may still fail a write
> later in extreme corner cases when btree splits or other metadata
> operations during unwritten extent conversions go over the space limit.

While this is true, *usually* fallocate will allocate enough space, but
as Christoph has said, you still have to check the error returns of the
write(2) and close(2) system calls, and deal appropriately with any
errors.

The other reason to use fallocate is that if you are copying a huge
number of files, it's possible you'll get better block allocation
layout, depending on the file system, and on how insane the writeback
code for a particular kernel version might be. (Some versions of the
kernel had writeback algorithms that would write 4MB of one file, then
4MB of another file, then 4MB of yet another file, then 4MB of the first
file, etc. --- and some file systems can deal with this kind of write
pattern better than others.) Using fallocate if you know the size of
the file up front won't hurt, and on some systems it might help.

- Ted
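[Editor's note: Ted's point — every write(2), and then close(2), must be checked even after a successful fallocate() — can be sketched as below. `write_all()` is an illustrative helper, not actual cp code.]

```c
/* Sketch of the error checking Ted describes: even after a successful
 * fallocate(), every write() and the final close() can still fail
 * (e.g. with ENOSPC from metadata allocation), so both must be checked. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, retrying short writes.
 * Returns 0 on success, -1 on error with errno set. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry */
            return -1;          /* ENOSPC, EIO, ... surface here */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The caller must still check `close(fd) != -1` afterwards, since — as Ted notes later in the thread — some network file systems only push data (and report quota or space errors) at close time.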
* Re: fallocate vs ENOSPC
  2011-11-27  3:14 ` Ted Ts'o
@ 2011-11-27 23:43   ` Dave Chinner
  2011-11-28  0:13     ` Pádraig Brady
  2011-11-28  0:40     ` Theodore Tso
  0 siblings, 2 replies; 31+ messages in thread

From: Dave Chinner @ 2011-11-27 23:43 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Sat, Nov 26, 2011 at 10:14:55PM -0500, Ted Ts'o wrote:
> While this is true, *usually* fallocate will allocate enough space,
> but as Christoph has said, you still have to check the error returns
> for the write(2) and close(2) system call, and deal appropriately with
> any errors.
>
> The other reason to use fallocate is if you are copying a huge number
> of files, it's possible you'll get better block allocation layout,
> depending on the file system, and how insane the writeback code for a
> particular kernel version might be. (Some versions of the kernel had
> writeback algorithms that would write 4MB of one file, then 4MB for
> another file, then 4MB for yet another file, then 4MB of the first
> file, etc. --- and some file systems can deal with this kind of write
> pattern better than others.)

Right, but....

> Using fallocate if you know the size of
> the file up front won't hurt, and on some systems it might help.

... this is - as a generalisation - wrong. Up-front fallocate() can
and does hurt performance, even when you know the size of the file
ahead of time.

Why? Because it defeats the primary, seek-reducing writeback
optimisation that filesystems have these days: delayed allocation.
This has been mentioned before in previous threads where you've been
considering adding fallocate to cp, e.g.:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10819.html

fallocate()-style (or non-delalloc, write-syscall-time) allocation
leads to non-optimal file layouts and slower writeback, because the
location that blocks are allocated in no way matches the writeback
pattern, hence causing an increase in seeks during writeback of large
numbers of files.

Further, filesystems that are alignment aware (e.g. XFS) will align
every fallocate()-based allocation, greatly fragmenting free space when
used on small files on a filesystem sitting on a RAID array. In XFS,
however, delayed allocation will actually pack the allocations across
files tightly on disk, resulting in full stripe writes (even for
sub-stripe unit/width files) during writeback.

Delayed allocation allows workloads such as cp to run as a
bandwidth-bound operation, because allocation is optimised to cause
sequential write IO, whereas up-front fallocate() causes it to run as
an IOPS-bound operation, because the file layout does not match the
writeback pattern. And on large, high-performance RAID arrays,
bandwidth capacity is much, much higher than IOPS capacity, so delayed
allocation is going to be far faster and have less long-term impact on
the filesystem than using fallocate.

IOWs, use of fallocate() -by default- will speed filesystem aging,
because it removes the benefits delayed allocation has on both short
and long term filesystem performance.

The three major Linux filesystems (XFS, BTRFS and ext4) use delayed
allocation, and hence do not need fallocate() to be used by userspace
utilities like cp, tar, etc. to avoid fragmentation. If a given
filesystem is still prone to fragmentation of data extents when copying
data via cp or tar, then that is a problem with the filesystem that
needs to be fixed, not worked around in the userspace utilities in a
manner that is detrimental to other filesystems that don't suffer from
those problems...

Yes, fallocate can help reduce fragmentation and increase performance
in some situations, so making it an -option- for people who know what
they are doing is a good idea. However, it should not be made the
default, for all of the reasons above.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-27 23:43 ` Dave Chinner
@ 2011-11-28  0:13   ` Pádraig Brady
  2011-11-28  3:51     ` Dave Chinner
  2011-11-28  0:40   ` Theodore Tso
  1 sibling, 1 reply; 31+ messages in thread

From: Pádraig Brady @ 2011-11-28 0:13 UTC (permalink / raw)
To: Dave Chinner; +Cc: Ted Ts'o, Christoph Hellwig, linux-fsdevel

On 11/27/2011 11:43 PM, Dave Chinner wrote:
> fallocate() style (or non-delalloc, write syscall time) allocation
> leads to non-optimal file layouts and slower writeback because the
> location that blocks are allocated in no way matches the writeback
> pattern, hence causing an increase in seeks during writeback of
> large numbers of files.

I'm interpreting the above to mean that, in the presence of concurrent
writes to multiple files, fallocate() may cause slower _writes_, due to
bypassing the delalloc write scheduler. Subsequent reads of the file
should be no slower though, and perhaps faster, due to the greater
likelihood of all the blocks for the file being contiguous.

> Further, filesystems that are alignment aware (e.g. XFS) will align
> every fallocate() based allocation, greatly fragmenting free space
> when used on small files and the filesystem is on a RAID array.
> However, in XFS, delayed allocation will actually pack the
> allocation across files tightly on disk, resulting in full stripe
> writes (even for sub-stripe unit/width files) during writeback.

Interesting. So what are the typical alignments involved? If you had
to, what would you choose as a default minimum file size at which to
enable fallocate()?

> Delayed allocation allows workloads such as cp to run as a bandwidth
> bound operation because allocation is optimised to cause sequential
> write IO, whereas up-front fallocate() causes it to run as an IOPS
> bound operation because file layout does not match the writeback
> pattern. And on large, high performance RAID arrays, bandwidth
> capacity is much, much higher than IOPS capacity, so delayed
> allocation is going to be far faster and have less long term impact
> on the filesystem than using fallocate.

So the consequences are the same as those in the first paragraph?
Though I don't understand the detrimental "long term impact" you
mention.

thanks for the excellent info,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-28  0:13 ` Pádraig Brady
@ 2011-11-28  3:51   ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread

From: Dave Chinner @ 2011-11-28 3:51 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Ted Ts'o, Christoph Hellwig, linux-fsdevel

On Mon, Nov 28, 2011 at 12:13:31AM +0000, Pádraig Brady wrote:
> On 11/27/2011 11:43 PM, Dave Chinner wrote:
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
>
> I'm interpreting the above to mean that,
> in the presence of concurrent writes to multiple files,
> fallocate() may cause slower _writes_, due to bypassing the
> delalloc write scheduler.

It's not even concurrent writes. A single process writing multiple
files into cache serially does not necessarily result in writeback 30s
later writing the data to disk in the same order.

> Subsequent reads of the file should be no slower though,
> and perhaps faster, due to the greater likelihood of
> all the blocks for the file being contiguous.

If delayed allocation does its job, the files will be contiguous and
unfragmented, and no slower to read.

> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during writeback.
>
> Interesting. So what are the typical alignments involved.

The typical range of alignments can be anything from 16k through to
16MB or larger.

Consider this: a 1TB filesystem with a 1MB alignment unit (stripe unit
in the case of XFS) doing one 16k aligned allocation per file will run
out of aligned allocation slots after ~1,000,000 files have been
created. At that point, the largest contiguous free space in the
filesystem is now under 16MB. When you want to create that multi-GB
file now, it's going to have lots of extents rather than 1-2, because
the preallocation has spread the small file data all over the place.
If you used delayed allocation, the small file data would be packed
close together without alignment, leaving large, multi-GB free space
extents for the large file you then want to create....

> If you had to, what would you choose as a default min file size
> to enable fallocate() for?

I would not enable fallocate by default at all.

> > Delayed allocation allows workloads such as cp to run as a bandwidth
> > bound operation because allocation is optimised to cause sequential
> > write IO, whereas up-front fallocate() causes it to run as an IOPS
> > bound operation because file layout does not match the writeback
> > pattern. And on large, high performance RAID arrays, bandwidth
> > capacity is much, much higher than IOPS capacity, so delayed
> > allocation is going to be far faster and have less long term impact
> > on the filesystem than using fallocate.
>
> So the consequences are the same as those in the first paragraph?
> Though I don't understand the detrimental "long term impact" you mention.

Free space fragmentation will result in severe degradation of
performance as soon as all alignment-sized free spaces are partially
consumed. Then fragmentation will result from any large allocation,
i.e. small aligned preallocations accelerate filesystem aging effects
by increasing free space fragmentation. This typically won't be noticed
for months, until the fragmentation starts causing noticeable
performance issues - at which point it will be difficult, if not
impossible, to correct without a backup/mkfs/restore cycle....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
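[Editor's note: Dave's slot arithmetic can be sanity-checked with a trivial calculation; `aligned_slots()` is just an illustrative helper.]

```c
/* Quick check of the aging arithmetic above: a 1 TB filesystem with a
 * 1 MB alignment (stripe) unit offers about a million aligned
 * allocation slots, so one aligned small-file allocation per file
 * exhausts them after roughly 1,000,000 files. */

unsigned long long aligned_slots(unsigned long long fs_bytes,
                                 unsigned long long align_bytes)
{
    return fs_bytes / align_bytes;  /* number of aligned start points */
}

/* aligned_slots(1ULL << 40, 1ULL << 20) == 1048576, i.e. ~1,000,000 */
```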
* Re: fallocate vs ENOSPC
  2011-11-27 23:43 ` Dave Chinner
@ 2011-11-28  0:40   ` Theodore Tso
  2011-11-28  5:10     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread

From: Theodore Tso @ 2011-11-28 0:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Nov 27, 2011, at 6:43 PM, Dave Chinner wrote:
> fallocate() style (or non-delalloc, write syscall time) allocation
> leads to non-optimal file layouts and slower writeback because the
> location that blocks are allocated in no way matches the writeback
> pattern, hence causing an increase in seeks during writeback of
> large numbers of files.
>
> Further, filesystems that are alignment aware (e.g. XFS) will align
> every fallocate() based allocation, greatly fragmenting free space
> when used on small files and the filesystem is on a RAID array.
> However, in XFS, delayed allocation will actually pack the
> allocation across files tightly on disk, resulting in full stripe
> writes (even for sub-stripe unit/width files) during writeback.

Well, the question is whether you're optimizing for writing the files
or reading the files. In some cases, files are write once, read never
(well, almost never) --- i.e., the backup case. In other cases, the
files are write once, read many --- i.e., when installing software. In
that case, optimizing for file reading might mean that you want to make
the files aligned on RAID stripes, although it will fragment free
space. It all depends on what you're optimizing for.

I didn't realize that XFS was not aligning to RAID stripes when doing
delayed allocation writes. I'm curious --- does it do this only when
there are multiple files outstanding for delayed allocation in an
allocation group? If someone does a singleton cp of a large file
without using fallocate, will XFS try to align the write?

Also, if we are going to use fallocate() as a way of implicitly
signaling to the file system that the file should be optimized for
reads, as opposed to writes, maybe we should explicitly document it as
such in the fallocate(2) man page, so that application programmers
understand that this is the semantics they should expect.

-- Ted
* Re: fallocate vs ENOSPC
  2011-11-28  0:40 ` Theodore Tso
@ 2011-11-28  5:10   ` Dave Chinner
  2011-11-28  8:55     ` Pádraig Brady
  0 siblings, 1 reply; 31+ messages in thread

From: Dave Chinner @ 2011-11-28 5:10 UTC (permalink / raw)
To: Theodore Tso; +Cc: Christoph Hellwig, Pádraig Brady, linux-fsdevel

On Sun, Nov 27, 2011 at 07:40:14PM -0500, Theodore Tso wrote:
> Well, the question is whether you're optimizing for writing the files,
> or reading the files. In some cases, files are write once, read never
> (well, almost never) --- i.e., the backup case. In other cases, the files
> are write once, read many --- i.e., when installing software.

Doesn't matter. If delayed allocation is doing its job properly, then
you'll get unfragmented files when they are written. Delayed allocation
is supposed to make up-front preallocation of disk space -unnecessary-
to prevent fragmentation. Using preallocation instead of delayed
allocation implies your delayed allocation implementation is
sub-optimal and needs to be fixed.

Indeed, there is no guarantee that preallocation will even lay the
files out in a sane manner that will give you good read speeds across
multiple files - it may place them so far apart that the seek penalty
between files is worse than having a few fragments...

> In that case, optimizing for the file reading might mean that you
> want to make the files aligned on RAID stripes, although it will
> fragment free space. It all depends on what you're optimizing
> for.

If you want to optimise for read speed - especially for small files or
random IO patterns - you want to *avoid* alignment to RAID stripes.
Doing so overloads the first disk in the RAID stripe, because all small
file reads (and writes) hit that disk/LUN in the stripe. Indeed, if you
have RAID5/6 and lots of small files, it is recommended that you turn
off filesystem alignment at mkfs time for XFS. SGI hit this problem
back in the early 90s, and it is one of the reasons that XFS lays its
metadata out such that it does not hot-spot one drive in a RAID stripe
trying to read/write frequently accessed metadata (e.g. AG headers).

> I didn't realize that XFS was not aligning to RAID stripes when doing
> delayed allocation writes.

It certainly does do alignment during delayed allocation.

/me waits for the "but you said..."

That's because XFS does -selective- alignment during delayed
allocation.... :)

What people seem to forget about delayed allocation is that when it
occurs, we have lots of information about the data being written that
is not available in the fallocate() context - how big the delalloc
extent is, how large the file currently is, how much more data needs to
be written, whether the file is still growing, etc. - and so delayed
allocation can make a much more informed decision about how to allocate
the data extents than fallocate() can.

For example, if the allocation is for offset zero of the file, the
filesystem is using aligned allocation and the file size is larger than
the stripe unit, the allocation will be stripe unit aligned. Hence, if
you've got lots of small files, they get packed, because aligned
allocation is not triggered and each allocation gets peeled from the
front edge of the same free space extent. If you've got large files,
then they get aligned, leaving space between them for the file to
potentially grow and fill full stripe units and widths. And if you've
got really large files still being written to, they get aligned and
over-allocated thanks to the speculative prealloc beyond EOF, which
effectively prevents fragmentation of large files due to interleaving
allocations between files when many files are being written
concurrently by writeback.....

> I'm curious --- does it do this only when
> there are multiple files outstanding for delayed allocation in an
> allocation group?

Irrelevant - the consideration is solely the state of the current inode
the allocation is being done for. If you're only writing a single file,
then it doesn't matter for performance whether it is aligned or not.
But it will matter from a free space management POV, and hence for how
the filesystem ages.

> If someone does a singleton cp of a large file
> without using fallocate, will XFS try to align the write?

The above should hopefully answer that question, especially with
respect to why delayed allocation should not be short-circuited by
using fallocate by default in generic system utilities.

> Also, if we are going to use fallocate() as a way of implicitly signaling
> to the file system that the file should be optimized for reads, as
> opposed to the write, maybe we should explicitly document it as such
> in the fallocate(2) man page, so that application programmers
> understand that this is the semantics they should expect.

Preallocation is for preventing fragmentation that leads to performance
problems. Use of fallocate() does not imply the file layout has been
optimised for read access and, IMO, never should.

Quite frankly, if system utilities like cp and tar start to abuse
fallocate() by default so they can get "upfront ENOSPC detection", then
I will seriously consider making XFS use delayed allocation for
fallocate rather than unwritten extents, so we don't lose the past 15
years' worth of IO and aging optimisations that delayed allocation
provides us with....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-28  5:10 ` Dave Chinner
@ 2011-11-28  8:55   ` Pádraig Brady
  2011-11-28 10:41     ` tao.peng
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread

From: Pádraig Brady @ 2011-11-28 8:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel

On 11/28/2011 05:10 AM, Dave Chinner wrote:
> Quite frankly, if system utilities like cp and tar start to abuse
> fallocate() by default so they can get "upfront ENOSPC detection",
> then I will seriously consider making XFS use delayed allocation for
> fallocate rather than unwritten extents so we don't lose the past 15
> years worth of IO and aging optimisations that delayed allocation
> provides us with....

For the record, I was considering fallocate() for these reasons:

1. Improved file layout for subsequent access
2. Immediate indication of ENOSPC
3. Efficient writing of NUL portions

You lucidly detailed issues with 1., which I suppose could be somewhat
mitigated by not fallocating files smaller than, say, 1MB, though I
suppose file systems could be smarter here and not preallocate small
chunks (or when otherwise not appropriate). We can already get ENOSPC
from a write() after an fallocate() in certain edge cases, so it would
probably make sense to expand those cases.

cheers,
Pádraig.
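[Editor's note: the mitigation Pádraig floats — only preallocating above a size threshold, so small files keep the benefits of delayed allocation — might look like the sketch below. Both the helper name and the 1MB cutoff are hypothetical, not something cp implements.]

```c
/* Hypothetical policy sketch: preallocate only when the source size
 * exceeds a threshold, so small files are left to delayed allocation.
 * The 1 MB cutoff is the one floated in the message above. */
#include <fcntl.h>
#include <unistd.h>

#define FALLOC_THRESHOLD (1024 * 1024)  /* 1 MB, hypothetical */

/* Returns 0 on success (including the small-file no-op case),
 * or a positive errno value from posix_fallocate(). */
int maybe_preallocate(int dest_fd, off_t src_size)
{
    if (src_size < FALLOC_THRESHOLD)
        return 0;                       /* small file: let delalloc work */
    return posix_fallocate(dest_fd, 0, src_size);
}
```

Note that, per Dave's objections, even a threshold like this only reduces the free-space fragmentation cost of aligned preallocations; it does not restore the writeback-ordering benefits of delayed allocation for the large files that are still preallocated.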
* RE: fallocate vs ENOSPC
  2011-11-28  8:55 ` Pádraig Brady
@ 2011-11-28 10:41   ` tao.peng
  2011-11-28 12:02     ` Pádraig Brady
  1 sibling, 1 reply; 31+ messages in thread

From: tao.peng @ 2011-11-28 10:41 UTC (permalink / raw)
To: Pádraig Brady, david; +Cc: tytso, hch, linux-fsdevel

> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org
> [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Pádraig Brady
> Sent: Monday, November 28, 2011 4:55 PM
> To: Dave Chinner
> Cc: Theodore Tso; Christoph Hellwig; linux-fsdevel@vger.kernel.org
> Subject: Re: fallocate vs ENOSPC
>
> For the record I was considering fallocate() for these reasons.
>
> 1. Improved file layout for subsequent access
> 2. Immediate indication of ENOSPC
> 3. Efficient writing of NUL portions
>
> You lucidly detailed issues with 1. which I suppose could be somewhat
> mitigated by not fallocating < say 1MB, though I suppose file systems
> could be smarter here and not preallocate small chunks (or when
> otherwise not appropriate). We can already get ENOSPC from a write()
> after an fallocate() in certain edge cases, so it would probably make
> sense to expand those cases.

Just out of curiosity, how is this going to work with sparse files? By
default, cp uses --sparse=auto, and for sparse files it avoids some
disk allocation automatically. With fallocate(), do you plan to change
the semantics?

Cheers,
Tao
> -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 10:41 ` tao.peng @ 2011-11-28 12:02 ` Pádraig Brady 0 siblings, 0 replies; 31+ messages in thread From: Pádraig Brady @ 2011-11-28 12:02 UTC (permalink / raw) To: tao.peng; +Cc: david, tytso, hch, linux-fsdevel On 11/28/2011 10:41 AM, tao.peng@emc.com wrote: > Just out of curiosity, how is it going to work with sparse files? By default, cp uses --sparse=auto. And for sparse files, it avoids some disk allocation automatically. With fallocate(), do you plan to change the semantics? With sparse files, coreutils currently uses fiemap (with a sync), and thus has full details of what is sparse or empty etc. So as detailed¹ on the coreutils list, the conversions would be: --sparse=auto => 'Empty' -> 'Empty' --sparse=always => 'Empty' -> 'Hole' --sparse=never => 'Hole' -> 'Empty' cheers, Pádraig. ¹ http://debbugs.gnu.org/9500#14 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
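[Editorial sketch: the data/hole map described above, recovered via lseek(SEEK_DATA)/lseek(SEEK_HOLE) rather than the FIEMAP ioctl coreutils actually used at the time; the information obtained is the same. Helper name is illustrative.]

```python
import os

def extent_map(path):
    """Return (offset, length, is_data) segments of a file.  On
    filesystems without hole reporting, the generic lseek() fallback
    simply returns the whole file as one data segment."""
    segs = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                # no data after pos: the rest is a trailing hole
                segs.append((pos, end - pos, False))
                break
            if data > pos:
                segs.append((pos, data - pos, False))  # hole before data
            hole = os.lseek(fd, data, os.SEEK_HOLE)    # EOF counts as a hole
            segs.append((data, hole - data, True))
            pos = hole
    finally:
        os.close(fd)
    return segs
```

With such a map, the --sparse=auto/always/never conversions above reduce to deciding, per segment, whether to copy data, seek past a hole, or write explicit NULs.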
* Re: fallocate vs ENOSPC 2011-11-28 8:55 ` Pádraig Brady 2011-11-28 10:41 ` tao.peng @ 2011-11-28 14:36 ` Theodore Tso 2011-11-28 14:51 ` Pádraig Brady 2011-11-28 18:49 ` Jeremy Allison 2011-11-29 0:24 ` Dave Chinner 2 siblings, 2 replies; 31+ messages in thread From: Theodore Tso @ 2011-11-28 14:36 UTC (permalink / raw) To: Pádraig Brady Cc: Theodore Tso, Dave Chinner, Christoph Hellwig, linux-fsdevel On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > You lucidly detailed issues with 1. which I suppose could be somewhat > mitigated by not fallocating < say 1MB, though I suppose file systems > could be smarter here and not preallocate small chunks (or when > otherwise not appropriate). We can already get ENOSPC from a write() > after an fallocate() in certain edge cases, so it would probably make > sense to expand those cases. I'm curious -- why are you so worried about ENOSPC? You need to check the error returns on write(2) anyway (and it's good programming practice anyways --- don't forget to check on close because some network file systems only push to the network on close, and in some cases they might only get quota errors on the close), so I don't see why using fallocate() to get an early ENOSPC is so interesting for you. -- Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
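[Editorial sketch of the error-checking discipline Ted describes: check every write(), retry short writes, and treat close() as fallible too. Python stand-in for illustration, not cp's actual code.]

```python
import os

def copy_fd(src_fd, dst_fd, bufsize=128 * 1024):
    """Copy src to dst until EOF, checking every write() and retrying
    short writes.  The caller must still check close() on dst: some
    network filesystems only report ENOSPC or EDQUOT when the data is
    flushed at close time."""
    total = 0
    while True:
        buf = os.read(src_fd, bufsize)
        if not buf:
            return total
        view = memoryview(buf)
        while view:
            n = os.write(dst_fd, view)  # raises OSError on ENOSPC etc.
            view = view[n:]
        total += len(buf)
```

The final `os.close(dst_fd)` belongs inside the same error-handled path as the writes.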
* Re: fallocate vs ENOSPC 2011-11-28 14:36 ` Theodore Tso @ 2011-11-28 14:51 ` Pádraig Brady 2011-11-28 20:29 ` Ted Ts'o 2011-11-28 18:49 ` Jeremy Allison 1 sibling, 1 reply; 31+ messages in thread From: Pádraig Brady @ 2011-11-28 14:51 UTC (permalink / raw) To: Theodore Tso; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/28/2011 02:36 PM, Theodore Tso wrote: > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > >> >> You lucidly detailed issues with 1. which I suppose could be somewhat >> mitigated by not fallocating < say 1MB, though I suppose file systems >> could be smarter here and not preallocate small chunks (or when >> otherwise not appropriate). We can already get ENOSPC from a write() >> after an fallocate() in certain edge cases, so it would probably make >> sense to expand those cases. > > I'm curious -- why are you so worried about ENOSPC? > > You need to check the error returns on write(2) anyway (and it's good > programming practice anyways --- don't forget to check on close because > some network file systems only push to the network on close, and in > some cases they might only get quota errors on the close), so I don't see > why using fallocate() to get an early ENOSPC is so interesting for you. It would be better to indicate ENOSPC _before_ copying a (potentially large) file to a (potentially slow) device. If the implementation complexity and side effects of doing this are sufficiently small, then it's worth doing. These discussions are to quantify the side effects. cheers, Pádraig. p.s. You can be sure that `cp` deals with errors from write() and close(). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 14:51 ` Pádraig Brady @ 2011-11-28 20:29 ` Ted Ts'o 2011-11-28 20:49 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Ted Ts'o @ 2011-11-28 20:29 UTC (permalink / raw) To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > It would be better to indicate ENOSPC _before_ copying a (potentially large) > file to a (potentially slow) device. If the implementation complexity > and side effects of doing this are sufficiently small, then it's worth > doing. These discussions are to quantify the side effects. In that case, why not use statfs(2) as a first approximation? You won't know for sure how much fs metadata will be required, but for the common case where someone is trying to fit 10 pounds of horse manure in a 5 pound bag, that can be caught very readily without needing to use fallocate(2). Regards, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
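[Editorial sketch: Ted's statfs(2) first approximation is essentially a one-liner via statvfs. The helper name is illustrative, and as he notes, it cannot account for metadata overhead.]

```python
import os

def space_probably_available(dir_path, size):
    """Compare the bytes about to be written against f_bavail (blocks
    free to unprivileged users) times the fragment size.  Cheaply
    catches the ten-pounds-in-a-five-pound-bag case; says nothing
    about metadata overhead or concurrent writers."""
    st = os.statvfs(dir_path)
    return size <= st.f_bavail * st.f_frsize
```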
* Re: fallocate vs ENOSPC 2011-11-28 20:29 ` Ted Ts'o @ 2011-11-28 20:49 ` Jeremy Allison 2011-11-29 22:39 ` Eric Sandeen 0 siblings, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-28 20:49 UTC (permalink / raw) To: Ted Ts'o Cc: Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: > On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > > It would be better to indicate ENOSPC _before_ copying a (potentially large) > > file to a (potentially slow) device. If the implementation complexity > > and side effects of doing this are sufficiently small, then it's worth > > doing. These discussions are to quantify the side effects. > > In that case, why not use statfs(2) as a first class approximation? > You won't know for user how much fs metadata will be required, but for > the common case where someone trying to fit 10 pounds of horse manure > in a 5 pound bag, that can be caught very readily without needing to > use fallocate(2). Yeah, we do that too, if the fallocate call fails. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 20:49 ` Jeremy Allison @ 2011-11-29 22:39 ` Eric Sandeen 2011-11-29 23:04 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Eric Sandeen @ 2011-11-29 22:39 UTC (permalink / raw) To: Jeremy Allison Cc: Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/28/11 2:49 PM, Jeremy Allison wrote: > On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: >> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: >>> It would be better to indicate ENOSPC _before_ copying a (potentially large) >>> file to a (potentially slow) device. If the implementation complexity >>> and side effects of doing this are sufficiently small, then it's worth >>> doing. These discussions are to quantify the side effects. >> >> In that case, why not use statfs(2) as a first class approximation? >> You won't know for user how much fs metadata will be required, but for >> the common case where someone trying to fit 10 pounds of horse manure >> in a 5 pound bag, that can be caught very readily without needing to >> use fallocate(2). > > Yeah, we do that too, if the fallocate call fails. That seems backwards to me; if fallocate fails, statfs(2) isn't going to reveal more space, is it? (modulo metadata issues, anyway?) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 22:39 ` Eric Sandeen @ 2011-11-29 23:04 ` Jeremy Allison 2011-11-29 23:19 ` Eric Sandeen 0 siblings, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-29 23:04 UTC (permalink / raw) To: Eric Sandeen Cc: Jeremy Allison, Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 04:39:08PM -0600, Eric Sandeen wrote: > On 11/28/11 2:49 PM, Jeremy Allison wrote: > > On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: > >> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: > >>> It would be better to indicate ENOSPC _before_ copying a (potentially large) > >>> file to a (potentially slow) device. If the implementation complexity > >>> and side effects of doing this are sufficiently small, then it's worth > >>> doing. These discussions are to quantify the side effects. > >> > >> In that case, why not use statfs(2) as a first class approximation? > >> You won't know for user how much fs metadata will be required, but for > >> the common case where someone trying to fit 10 pounds of horse manure > >> in a 5 pound bag, that can be caught very readily without needing to > >> use fallocate(2). > > > > Yeah, we do that too, if the fallocate call fails. > > That seems backwards to me; if fallocate fails, statfs(2) isn't going > to reveal more space, is it? (modulo metadata issues, anyway?) It might if fallocate fails with ENOSYS :-). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 23:04 ` Jeremy Allison @ 2011-11-29 23:19 ` Eric Sandeen 0 siblings, 0 replies; 31+ messages in thread From: Eric Sandeen @ 2011-11-29 23:19 UTC (permalink / raw) To: Jeremy Allison Cc: Ted Ts'o, Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On 11/29/11 5:04 PM, Jeremy Allison wrote: > On Tue, Nov 29, 2011 at 04:39:08PM -0600, Eric Sandeen wrote: >> On 11/28/11 2:49 PM, Jeremy Allison wrote: >>> On Mon, Nov 28, 2011 at 03:29:34PM -0500, Ted Ts'o wrote: >>>> On Mon, Nov 28, 2011 at 02:51:14PM +0000, Pádraig Brady wrote: >>>>> It would be better to indicate ENOSPC _before_ copying a (potentially large) >>>>> file to a (potentially slow) device. If the implementation complexity >>>>> and side effects of doing this are sufficiently small, then it's worth >>>>> doing. These discussions are to quantify the side effects. >>>> >>>> In that case, why not use statfs(2) as a first class approximation? >>>> You won't know for user how much fs metadata will be required, but for >>>> the common case where someone trying to fit 10 pounds of horse manure >>>> in a 5 pound bag, that can be caught very readily without needing to >>>> use fallocate(2). >>> >>> Yeah, we do that too, if the fallocate call fails. >> >> That seems backwards to me; if fallocate fails, statfs(2) isn't going >> to reveal more space, is it? (modulo metadata issues, anyway?) > > It might if fallocate fails with ENOSYS :-). Doh. Sorry, was not thinking of that failure. :) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
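[Editorial sketch of the fallback order this exchange settles on: try fallocate() first, and consult the free-space estimate only when the call itself is unsupported, since a plain ENOSPC from fallocate() is already the definitive answer. An illustration, not Samba's actual code; glibc may emulate posix_fallocate(3) on unsupported filesystems, so the fallback branch is rarely reached in practice.]

```python
import errno
import os

def reserve_or_estimate(fd, dir_path, size):
    """True if the space is (probably) there, False on ENOSPC or a
    too-small free-space estimate.  Helper name is illustrative."""
    try:
        os.posix_fallocate(fd, 0, size)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False                    # definitive: no space
        # call unsupported (ENOSYS/EOPNOTSUPP/...): coarse estimate
        st = os.statvfs(dir_path)
        return size <= st.f_bavail * st.f_frsize
```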
* Re: fallocate vs ENOSPC 2011-11-28 14:36 ` Theodore Tso 2011-11-28 14:51 ` Pádraig Brady @ 2011-11-28 18:49 ` Jeremy Allison 2011-11-29 0:26 ` Dave Chinner 1 sibling, 1 reply; 31+ messages in thread From: Jeremy Allison @ 2011-11-28 18:49 UTC (permalink / raw) To: Theodore Tso Cc: Pádraig Brady, Dave Chinner, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > mitigated by not fallocating < say 1MB, though I suppose file systems > > could be smarter here and not preallocate small chunks (or when > > otherwise not appropriate). We can already get ENOSPC from a write() > > after an fallocate() in certain edge cases, so it would probably make > > sense to expand those cases. > > I'm curious -- why are you so worried about ENOSPC? > > You need to check the error returns on write(2) anyway (and it's good > programming practice anyways --- don't forget to check on close because > some network file systems only push to the network on close, and in > some cases they might only get quota errors on the close), so I don't see > why using fallocate() to get an early ENOSPC is so interesting for you. Unfortunately for Samba, Windows clients will *only* report ENOSPC to the userspace apps if the initial fallocation fails. Most of the Windows apps don't bother to check for write() fails after the initial allocation succeeds. We check for and report them back to the Windows client anyway of course, but most Windows apps just silently corrupt their data in this case. That's why we use fallocate() in Samba :-(. Jeremy. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 18:49 ` Jeremy Allison @ 2011-11-29 0:26 ` Dave Chinner 2011-11-29 0:45 ` Jeremy Allison 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 0:26 UTC (permalink / raw) To: Jeremy Allison Cc: Theodore Tso, Pádraig Brady, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 10:49:40AM -0800, Jeremy Allison wrote: > On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > > mitigated by not fallocating < say 1MB, though I suppose file systems > > > could be smarter here and not preallocate small chunks (or when > > > otherwise not appropriate). We can already get ENOSPC from a write() > > > after an fallocate() in certain edge cases, so it would probably make > > > sense to expand those cases. > > > > I'm curious -- why are you so worried about ENOSPC? > > > > You need to check the error returns on write(2) anyway (and it's good > > programming practice anyways --- don't forget to check on close because > > some network file systems only push to the network on close, and in > > some cases they might only get quota errors on the close), so I don't see > > why using fallocate() to get an early ENOSPC is so interesting for you. > > Unfortunately for Samba, Windows clients will *only* report ENOSPC > to the userspace apps if the initial fallocation fails. Most of > the Windows apps don't bother to check for write() fails after > the initial allocation succeeds. > > We check for and report them back to the Windows client anyway of > course, but most Windows apps just silently corrupt their data in > this case. > > That's why we use fallocate() in Samba :-(. IOWs, what you really want is a space reservation mechanism. You've only got this preallocate hammer, so you use it, yes? Cheers, Dave. 
-- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 0:26 ` Dave Chinner @ 2011-11-29 0:45 ` Jeremy Allison 0 siblings, 0 replies; 31+ messages in thread From: Jeremy Allison @ 2011-11-29 0:45 UTC (permalink / raw) To: Dave Chinner Cc: Jeremy Allison, Theodore Tso, Pádraig Brady, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 11:26:29AM +1100, Dave Chinner wrote: > On Mon, Nov 28, 2011 at 10:49:40AM -0800, Jeremy Allison wrote: > > On Mon, Nov 28, 2011 at 09:36:18AM -0500, Theodore Tso wrote: > > > > > > On Nov 28, 2011, at 3:55 AM, Pádraig Brady wrote: > > > > > > > > > > > You lucidly detailed issues with 1. which I suppose could be somewhat > > > > mitigated by not fallocating < say 1MB, though I suppose file systems > > > > could be smarter here and not preallocate small chunks (or when > > > > otherwise not appropriate). We can already get ENOSPC from a write() > > > > after an fallocate() in certain edge cases, so it would probably make > > > > sense to expand those cases. > > > > > > I'm curious -- why are you so worried about ENOSPC? > > > > > > You need to check the error returns on write(2) anyway (and it's good > > > programming practice anyways --- don't forget to check on close because > > > some network file systems only push to the network on close, and in > > > some cases they might only get quota errors on the close), so I don't see > > > why using fallocate() to get an early ENOSPC is so interesting for you. > > > > Unfortunately for Samba, Windows clients will *only* report ENOSPC > > to the userspace apps if the initial fallocation fails. Most of > > the Windows apps don't bother to check for write() fails after > > the initial allocation succeeds. > > > > We check for and report them back to the Windows client anyway of > > course, but most Windows apps just silently corrupt their data in > > this case. > > > > That's why we use fallocate() in Samba :-(. > > IOWs, what you really want is a space reservation mechanism. 
You've > only got this preallocate hammer, so you use it, yes? Yes, absolutely. We're just trying to provide the Windows semantics the clients expect. Jeremy. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-28 8:55 ` Pádraig Brady 2011-11-28 10:41 ` tao.peng 2011-11-28 14:36 ` Theodore Tso @ 2011-11-29 0:24 ` Dave Chinner 2011-11-29 14:11 ` Pádraig Brady 2 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 0:24 UTC (permalink / raw) To: Pádraig Brady; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: > On 11/28/2011 05:10 AM, Dave Chinner wrote: > > Quite frankly, if system utilities like cp and tar start to abuse > > fallocate() by default so they can get "upfront ENOSPC detection", > > then I will seriously consider making XFS use delayed allocation for > > fallocate rather than unwritten extents so we don't lose the past 15 > > years worth of IO and aging optimisations that delayed allocation > > provides us with.... > > For the record I was considering fallocate() for these reasons. > > 1. Improved file layout for subsequent access > 2. Immediate indication of ENOSPC > 3. Efficient writing of NUL portions > > You lucidly detailed issues with 1. which I suppose could be somewhat > mitigated by not fallocating < say 1MB, though I suppose file systems > could be smarter here and not preallocate small chunks (or when > otherwise not appropriate). When you consider that some high end filesystem deployments have alignment characteristics over 50MB (e.g. so each uncompressed 4k resolution video frame is located on a different set of non-overlapping disks), arbitrary "don't fallocate below this amount" heuristics will always have unforeseen failure cases... In short: leave optimising general allocation strategies to the filesystems and their developers - there is no One True Solution for optimal file layout in a given filesystem, let alone across different filesystems. 
In fact, I don't even want to think about the mess fallocate() on everything would make of btrfs because of its COW structure - it seems to me to guarantee worse fragmentation than using delayed allocation... > We can already get ENOSPC from a write() > after an fallocate() in certain edge cases, so it would probably make > sense to expand those cases. fallocate is for preallocation, not for ENOSPC detection. If you want efficient and effective ENOSPC detection before writing anything, then you really want a space -reservation- extension to fallocate. Filesystems that use delayed allocation already have a space reservation subsystem - it's how they account for space that is reserved by delayed allocation prior to the real allocation being done. IMO, allowing userspace some level of access to those reservations would be more appropriate for early detection of ENOSPC than using preallocation for everything... As to efficient writing of NULL ranges - that's what sparse files are for - you do not need to write or even preallocate NULL ranges when copying files. Indeed, the most efficient way of dealing with NULL ranges is to punch a hole and let the filesystem deal with it..... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
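[Editorial sketch of the sparse-file point above, for a freshly created output file: an all-NUL block needs neither fallocate() nor write(), since seeking past it leaves a hole. For ranges that are already allocated, the FALLOC_FL_PUNCH_HOLE flag to fallocate(2) deallocates them in place. Helper name and policy are illustrative.]

```python
import os

def write_block_or_hole(fd, buf):
    """Write a data block, or seek past an all-NUL block to leave a
    hole.  Returns True if real data was written; if the file ends in
    a hole the caller must ftruncate() to fix the final size."""
    if not buf.strip(b"\0"):                  # all-NUL (or empty) block
        os.lseek(fd, len(buf), os.SEEK_CUR)   # hole instead of data
        return False
    view = memoryview(buf)
    while view:
        view = view[os.write(fd, view):]      # retry short writes
    return True
```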
* Re: fallocate vs ENOSPC 2011-11-29 0:24 ` Dave Chinner @ 2011-11-29 14:11 ` Pádraig Brady 2011-11-29 23:37 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Pádraig Brady @ 2011-11-29 14:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On 11/29/2011 12:24 AM, Dave Chinner wrote: > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: >> On 11/28/2011 05:10 AM, Dave Chinner wrote: >>> Quite frankly, if system utilities like cp and tar start to abuse >>> fallocate() by default so they can get "upfront ENOSPC detection", >>> then I will seriously consider making XFS use delayed allocation for >>> fallocate rather than unwritten extents so we don't lose the past 15 >>> years worth of IO and aging optimisations that delayed allocation >>> provides us with.... >> >> For the record I was considering fallocate() for these reasons. >> >> 1. Improved file layout for subsequent access >> 2. Immediate indication of ENOSPC >> 3. Efficient writing of NUL portions >> >> You lucidly detailed issues with 1. which I suppose could be somewhat >> mitigated by not fallocating < say 1MB, though I suppose file systems >> could be smarter here and not preallocate small chunks (or when >> otherwise not appropriate). > > When you consider that some high end filesystem deployments have alignment > characteristics over 50MB (e.g. so each uncompressed 4k resolution > video frame is located on a different set of non-overlapping disks), > arbitrary "don't fallocate below this amount" heuristics will always > have unforseen failure cases... So about this alignment policy, I don't understand the issues so I'm guessing here. You say delalloc packs files, while fallocate() will align on XFS according to the stripe config. Is that assuming that when writing lots of files, that they will be more likely to be read together, rather than independently. That's a big assumption if true. 
Also the converse is a big assumption, that fallocate() should be aligned, as that's more likely to be read independently. > In short: leave optimising general allocation strategies to the > filesytems and their developers - there is no One True Solution for > optimal file layout in a given filesystem, let alone across > different filesytems. In fact, I don't even want to think about the > mess fallocate() on everything would make of btrfs because of it's > COW structure - it seems to me to guarantee worse fragmentation than > using delayed allocation... > >> We can already get ENOSPC from a write() >> after an fallocate() in certain edge cases, so it would probably make >> sense to expand those cases. > > fallocate is for preallocation, not for ENOSPC detection. If you > want efficient and effective ENOSPC detection before writing > anything, then you really want a space -reservation- extension to > fallocate. Filesystems that use delayed allocation already have a > space reservation subsystem - it how they account for space that is > reserved by delayed allocation prior to the real allocation being > done. IMO, allowing userspace some level of access to those > reservations would be more appropriate for early detection of ENOSPC > than using preallocation for everything... Fair enough, so fallocate() would be a superset of reserve(), though I'm having a hard time thinking of why one might ever need to fallocate() then. > As to efficient writing of NULL ranges - that's what sparse files > are for - you do not need to write or even preallocate NULL ranges > when copying files. Indeed, the most efficient way of dealing with > NULL ranges is to punch a hole and let the filesystem deal with > it..... well not for `cp --sparse=never` which might be used so that processing of the copy will not result in ENOSPC. I'm also linking here to a related discussion. 
http://oss.sgi.com/archives/xfs/2011-06/msg00064.html Note also that the gold linker does fallocate() on output files by default. cheers, Pádraig. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fallocate vs ENOSPC 2011-11-29 14:11 ` Pádraig Brady @ 2011-11-29 23:37 ` Dave Chinner 2011-11-30 9:28 ` Pádraig Brady 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2011-11-29 23:37 UTC (permalink / raw) To: Pádraig Brady; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote: > On 11/29/2011 12:24 AM, Dave Chinner wrote: > > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote: > >> On 11/28/2011 05:10 AM, Dave Chinner wrote: > >>> Quite frankly, if system utilities like cp and tar start to abuse > >>> fallocate() by default so they can get "upfront ENOSPC detection", > >>> then I will seriously consider making XFS use delayed allocation for > >>> fallocate rather than unwritten extents so we don't lose the past 15 > >>> years worth of IO and aging optimisations that delayed allocation > >>> provides us with.... > >> > >> For the record I was considering fallocate() for these reasons. > >> > >> 1. Improved file layout for subsequent access > >> 2. Immediate indication of ENOSPC > >> 3. Efficient writing of NUL portions > >> > >> You lucidly detailed issues with 1. which I suppose could be somewhat > >> mitigated by not fallocating < say 1MB, though I suppose file systems > >> could be smarter here and not preallocate small chunks (or when > >> otherwise not appropriate). > > > > When you consider that some high end filesystem deployments have alignment > > characteristics over 50MB (e.g. so each uncompressed 4k resolution > > video frame is located on a different set of non-overlapping disks), > > arbitrary "don't fallocate below this amount" heuristics will always > > have unforseen failure cases... > > So about this alignment policy, I don't understand the issues so I'm guessing here. Which, IMO, is exactly why you shouldn't be using fallocate() by default. 
Every filesystem behaves differently, and optimises allocation differently, tuned for the filesystem's unique structure and capability. fallocate() is a big hammer that ensures filesystems cannot optimise allocation to match observed operational patterns. > You say delalloc packs files, while fallocate() will align on XFS according to > the stripe config. Is that assuming that when writing lots of files, that they > will be more likely to be read together, rather than independently. No, it's assuming that preallocation is used for enabling extremely high performance, high bandwidth IO. This is what it has been used for in XFS for the past 10+ years, and so that is what the implementation in XFS is optimised for (and will continue to be optimised for). In this environment, even when the file size is smaller than the alignment unit, we want allocation alignment to be done. A real world example for you: supporting multiple, concurrent, realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB per frame). Systems doing this sort of work are made from lots of HW RAID5/6 Luns (often spread across multiple arrays) that will have a stripe width of 14MB. XFS will be configured with a stripe unit of 14MB. 4-6 of these Luns will be striped together to give a stripe width of 56-84MB from a filesystem perspective. Each file that is preallocated needs to be aligned to a 16MB stripe unit so that each frame IO goes to a different RAID Lun. Each frame write can be done as a full stripe write without a RMW cycle in the back end array, and each frame read loads all the disks in the LUN evenly. i.e. the load is distributed evenly, optimally and deterministically across all the back end storage. This is the sort of application that cannot be done effectively without a lot of filesystem allocator support (indeed, XFS has the special filestreams allocation policy for this workload), and it's this sort of high performance application that we optimise preallocation for. 
In short, what XFS is doing here is optimising allocation patterns for high performance, RAID based storage. If your write pattern triggers repeated RMW cycles in a RAID array, your write performance will fall by an order of magnitude or more. Large files don't need packing because the writeback flusher threads can do full stripe writes which avoids RMW cycles in the RAID array if the files are aligned to the underlying RAID stripes. But small files need tight packing to enable them to be aggregated into full stripe writes in the elevator and/or RAID controller BBWC. This aggregation then avoids RMW cycles in the RAID array and hence writeback performance for both small and large files is similar (i.e. close to maximum IO bandwidth). If you don't pack small files tightly (and XFS won't if you use preallocation), then each file write will cause a RMW cycle in the RAID array and the throughput is effectively going to be about half the IOPS of a random write workload.... > That's a big assumption if true. Also the converse is a big assumption, that > fallocate() should be aligned, as that's more likely to be read independently. You're guessing, making assumptions, etc., all about how one filesystem works and what the impact of the change is going to be. What about ext4, or btrfs? They are very different structurally to XFS, and hence have different sets of issues when you start preallocating everything. It is not a simple problem: allocation optimisation is, IMO, the single most difficult and complex area of filesystems, with many different, non-obvious, filesystem specific trade-offs to be made.... > > fallocate is for preallocation, not for ENOSPC detection. If you > > want efficient and effective ENOSPC detection before writing > > anything, then you really want a space -reservation- extension to > > fallocate. 
> > Filesystems that use delayed allocation already have a
> > space reservation subsystem - it's how they account for space that is
> > reserved by delayed allocation prior to the real allocation being
> > done. IMO, allowing userspace some level of access to those
> > reservations would be more appropriate for early detection of ENOSPC
> > than using preallocation for everything...
>
> Fair enough, so fallocate() would be a superset of reserve(),
> though I'm having a hard time thinking of why one might ever need to
> fallocate() then.

Exactly my point - the number of applications that actually need
-preallocation- for performance reasons is actually quite small.

I'd suggest we implement a reservation mechanism as a separate
fallocate() flag, to tell fallocate() to reserve the space over the
given range rather than needing to preallocate it. I'd also suggest
that a reservation is not persistent (e.g. only guaranteed to last
for the life of the file descriptor the reservation was made for).
That would make it simple to implement in memory for all filesystems,
and provide you with the short-term ENOSPC-or-success style
reservation you are looking for...

Does that sound reasonable?

> > As to efficient writing of NUL ranges - that's what sparse files
> > are for - you do not need to write or even preallocate NUL ranges
> > when copying files. Indeed, the most efficient way of dealing with
> > NUL ranges is to punch a hole and let the filesystem deal with
> > it.....
>
> Well, not for `cp --sparse=never`, which might be used
> so that processing of the copy will not result in ENOSPC.
>
> I'm also linking here to a related discussion.
> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html

Right, and from that discussion you can see exactly why delayed
allocation in XFS significantly improves both data and metadata
allocation and IO patterns for operations like tar, cp, rsync, etc.,
whilst also minimising long term aging effects as compared to
preallocation:

http://oss.sgi.com/archives/xfs/2011-06/msg00092.html

> Note also that the gold linker does fallocate() on output files by default.

"He's doing it, so we should do it" is not a very convincing
technical argument.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-29 23:37 ` Dave Chinner
@ 2011-11-30  9:28 ` Pádraig Brady
  2011-11-30 15:32 ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-11-30 9:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: Theodore Tso, Christoph Hellwig, linux-fsdevel

On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection",
>>>>> then I will seriously consider making XFS use delayed allocation for
>>>>> fallocate rather than unwritten extents so we don't lose the past 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record, I was considering fallocate() for these reasons:
>>>>
>>>> 1. Improved file layout for subsequent access
>>>> 2. Immediate indication of ENOSPC
>>>> 3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1., which I suppose could be somewhat
>>>> mitigated by not fallocating < say 1MB, though I suppose file systems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have alignment
>>> characteristics over 50MB (e.g. so each uncompressed 4k resolution
>>> video frame is located on a different set of non-overlapping disks),
>>> arbitrary "don't fallocate below this amount" heuristics will always
>>> have unforeseen failure cases...
>>
>> So about this alignment policy, I don't understand the issues so I'm guessing here.
>
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default.
> Every filesystem behaves differently, and optimises
> allocation differently, tuned to the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
>
>> You say delalloc packs files, while fallocate() will align on XFS according to
>> the stripe config. Is that assuming that when writing lots of files, they
>> will be more likely to be read together, rather than independently?
>
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for). In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
>
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame). Systems doing this sort of work are made from lots of
> HW RAID5/6 LUNs (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these LUNs will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 14MB stripe unit so that each
> frame IO goes to a different RAID LUN. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly, i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
>
> This is the sort of application that cannot be done effectively
> without a lot of filesystem allocator support (indeed, XFS has the
> special filestreams allocation policy for this workload), and it's
> this sort of high performance application that we optimise
> preallocation for.
>
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage. If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more. Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes, which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes. But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC. This aggregation then
> avoids RMW cycles in the RAID array, and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth). If you don't pack small files tightly (and XFS won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array, and the throughput is effectively going to be
> about half the IOPS of a random write workload....
>
>> That's a big assumption if true. Also the converse is a big assumption, that
>> fallocate() should be aligned, as that's more likely to be read independently.
>
> You're guessing, making assumptions, etc., all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything. It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
>
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it's how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done. IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSPC
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
>
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
>
> I'd suggest we implement a reservation mechanism as a separate
> fallocate() flag, to tell fallocate() to reserve the space over the
> given range rather than needing to preallocate it. I'd also suggest
> that a reservation is not persistent (e.g. only guaranteed to last
> for the life of the file descriptor the reservation was made for).
> That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
>
> Does that sound reasonable?

But then posix_fallocate() would always be slow I think,
requiring one to actually write the NULs.

TBH, it sounds like the best/minimal change is to the uncommon case,
i.e. add an ALIGN flag to fallocate() which specialised apps like
those described above can use.

>>> As to efficient writing of NUL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NUL ranges
>>> when copying files.
>>> Indeed, the most efficient way of dealing with
>>> NUL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> Well, not for `cp --sparse=never`, which might be used
>> so that processing of the copy will not result in ENOSPC.
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
>
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc.,
> whilst also minimising long term aging effects as compared to
> preallocation:
>
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
>
>> Note also that the gold linker does fallocate() on output files by default.
>
> "He's doing it, so we should do it" is not a very convincing
> technical argument.

Just FYI.

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-30  9:28 ` Pádraig Brady
@ 2011-11-30 15:32 ` Ted Ts'o
  2011-11-30 16:11 ` Pádraig Brady
  0 siblings, 1 reply; 31+ messages in thread
From: Ted Ts'o @ 2011-11-30 15:32 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 09:28:32AM +0000, Pádraig Brady wrote:
>
> But then posix_fallocate() would always be slow I think,
> requiring one to actually write the NULs.

Almost no one should ever use posix_fallocate(); it can be a
performance disaster because you don't know whether or not the file
system will really do fallocate, or will do the slow "write zeros"
thing.

You really should use fallocate(), take the failure if the file system
doesn't support fallocate, and then you can decide what the
appropriate thing to do might be.

- Ted
* Re: fallocate vs ENOSPC
  2011-11-30 15:32 ` Ted Ts'o
@ 2011-11-30 16:11 ` Pádraig Brady
  2011-11-30 17:01 ` Ted Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-11-30 16:11 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On 11/30/2011 03:32 PM, Ted Ts'o wrote:
> On Wed, Nov 30, 2011 at 09:28:32AM +0000, Pádraig Brady wrote:
>>
>> But then posix_fallocate() would always be slow I think,
>> requiring one to actually write the NULs.
>
> Almost no one should ever use posix_fallocate(); it can be a
> performance disaster because you don't know whether or not the file
> system will really do fallocate, or will do the slow "write zeros"
> thing.
>
> You really should use fallocate(), take the failure if the file system
> doesn't support fallocate, and then you can decide what the
> appropriate thing to do might be.

s/posix_fallocate()/functionality provided by &/

I.e. `cp --sparse=never` could use that, and it would be beneficial
if it was as fast as possible.

I took a quick look at the XFS preallocation behaviour,
and it seems that these ioctls predate fallocate():
http://linux.die.net/man/3/xfsctl
I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
So fallocate() support was directly mapped on top of the existing ALLOCSP.

I think the specialised alignment behaviour should be restricted to
direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
Better would be to provide generic access to that functionality
through an ALIGN option to fallocate().

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-11-30 16:11 ` Pádraig Brady
@ 2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  0 siblings, 2 replies; 31+ messages in thread
From: Ted Ts'o @ 2011-11-30 17:01 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
> I took a quick look at the XFS preallocation behaviour,
> and it seems that these ioctls predate fallocate():
> http://linux.die.net/man/3/xfsctl
> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
> So fallocate() support was directly mapped on top of the existing ALLOCSP.
> I think the specialised alignment behaviour should be restricted to
> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
> Better would be to provide generic access to that functionality
> through an ALIGN option to fallocate().

Well, XFS_IOC_RESVSP is the same as fallocate with the
FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
attached to the inode --- which blocks out of the pool of free blocks
should be selected is decided at the time that you call fallocate()
with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
the way, works on any file system that supports fallocate on modern
kernels --- the kernel provides the translation from XFS_IOC_RESVSP
to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).

What Dave was talking about is something different. He's suggesting a
new call which reserves space, but which does not actually make the
block allocation decision until the time of the write. He suggested
tying it to the file descriptor, but I wonder if it's actually more
functional to tie it to the process --- that is, the process says,
"guarantee that I will be able to write 5MB", and writes made by that
process get counted against that 5MB reservation. When the process
exits, any reservation made by that process evaporates.
Whether we tie this space reservation to an fd or a process, we also
would need to decide up front whether this space shows up as
"missing" in statfs(2)/df or not.

- Ted
* Re: fallocate vs ENOSPC
  2011-11-30 17:01 ` Ted Ts'o
@ 2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  1 sibling, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2011-11-30 23:39 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Pádraig Brady, Christoph Hellwig, linux-fsdevel

On Wed, Nov 30, 2011 at 12:01:16PM -0500, Ted Ts'o wrote:
> What Dave was talking about is something different. He's suggesting a
> new call which reserves space, but which does not actually make the
> block allocation decision until the time of the write. He suggested
> tying it to the file descriptor, but I wonder if it's actually more
> functional to tie it to the process --- that is, the process says,
> "guarantee that I will be able to write 5MB", and writes made by that
> process get counted against that 5MB reservation. When the process
> exits, any reservation made by that process evaporates.

It needs to be tied to the inode in some way - there are metadata
reservations that need to be made per inode alongside the delayed
allocation reservations, to take into account the potential need to
allocate extent tree blocks as well. If we don't do this, then we'll
get ENOSPC reported for writes during writeback that should have
succeeded. And that is a Bad Thing.

Further, you need to track all the ranges that have space reserved,
like a special type of delayed allocation extent. That way, when the
write() comes along into the reserved range, you don't account for it
a second time as delayed allocation, as the space usage has already
been accounted for.

And then there is the problem of freeing space that you don't use.
Close the fd and you automatically terminate the reservation. fiemap
can be used to find unused reserved ranges. You could probably even
release them by punching the range.

If you have a per-process pool, how do you only use it for the
write() calls you want, on the file you want, over the range you
wanted reserved?
And when you have finished writing to that file, how do you release
any unused reservation? How do you know that you've got reservations
remaining?

Then the interesting questions start - how does a per-process
reservation interact with quotas? The quota needs to be checked when
the reservation is made, and without knowing what file it is being
made for this cannot be done sanely. Especially for project
quotas....

Also, per-process reservation pools can't really be managed through
existing APIs, so we'd need new ones. And then we'd be asking
application developers to use two different models for almost
identical functionality, which means they'll just use the one that is
most effective for their purpose (i.e. fallocate(), because they
already have an fd open on the file they are going to write to).

IOWs, all I see from an implementation perspective of per-process
reservation pools is complexity and nasty corner cases. And from the
user perspective, an API that doesn't match up with the operations at
hand, i.e. that of writing a file....

> Whether we tie this space reservation to a fd or a process, we also
> would need to decide up front whether this space shows up as "missing"
> by statfs(2)/df or not.

IMO, reserved space is used space - it's not free for just anyone to
use anymore, and it has to be checked and accounted against quotas
even before it gets used....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate vs ENOSPC
  2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
@ 2011-12-01  0:11 ` Pádraig Brady
  2011-12-07 11:42 ` Pádraig Brady
  1 sibling, 1 reply; 31+ messages in thread
From: Pádraig Brady @ 2011-12-01 0:11 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On 11/30/2011 05:01 PM, Ted Ts'o wrote:
> On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
>> I took a quick look at the XFS preallocation behaviour,
>> and it seems that these ioctls predate fallocate():
>> http://linux.die.net/man/3/xfsctl
>> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
>> So fallocate() support was directly mapped on top of the existing ALLOCSP.
>> I think the specialised alignment behaviour should be restricted to
>> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
>> Better would be to provide generic access to that functionality
>> through an ALIGN option to fallocate().
>
> Well, XFS_IOC_RESVSP is the same as fallocate with the
> FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
> attached to the inode --- which blocks out of the pool of free blocks
> should be selected is decided at the time that you call fallocate()
> with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
> the way, works on any file system that supports fallocate on modern
> kernels --- the kernel provides the translation from XFS_IOC_RESVSP
> to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).

Thanks for the clarification.
My main point is that these related ioctls existed before fallocate().

> What Dave was talking about is something different. He's suggesting a
> new call which reserves space, but which does not actually make the
> block allocation decision until the time of the write.

Yes, that was clear.
I'm still not sure it's needed, TBH.
The separation of functionality is needed for the reasons Dave
detailed, but it might be better to add an ALIGN flag to fallocate()
for that special use case. I'm not trying to reinforce my argument
with repetition here, just trying to be clear.

cheers,
Pádraig.
* Re: fallocate vs ENOSPC
  2011-12-01  0:11 ` Pádraig Brady
@ 2011-12-07 11:42 ` Pádraig Brady
  0 siblings, 0 replies; 31+ messages in thread
From: Pádraig Brady @ 2011-12-07 11:42 UTC (permalink / raw)
To: linux-fsdevel

On 12/01/2011 12:11 AM, Pádraig Brady wrote:
> On 11/30/2011 05:01 PM, Ted Ts'o wrote:
>> On Wed, Nov 30, 2011 at 04:11:27PM +0000, Pádraig Brady wrote:
>>> I took a quick look at the XFS preallocation behaviour,
>>> and it seems that these ioctls predate fallocate():
>>> http://linux.die.net/man/3/xfsctl
>>> I see XFS_IOC_ALLOCSP and XFS_IOC_RESVSP.
>>> So fallocate() support was directly mapped on top of the existing ALLOCSP.
>>> I think the specialised alignment behaviour should be restricted to
>>> direct calls to XFS_IOC_ALLOCSP, to be called by xfs_mkfile(1) or whatever.
>>> Better would be to provide generic access to that functionality
>>> through an ALIGN option to fallocate().
>>
>> Well, XFS_IOC_RESVSP is the same as fallocate with the
>> FALLOC_FL_KEEP_SIZE flag. That is to say, blocks are allocated and
>> attached to the inode --- which blocks out of the pool of free blocks
>> should be selected is decided at the time that you call fallocate()
>> with the KEEP_SIZE flag or use the XFS_IOC_RESVSP ioctl (which, by
>> the way, works on any file system that supports fallocate on modern
>> kernels --- the kernel provides the translation from XFS_IOC_RESVSP
>> to fallocate/KEEP_SIZE in fs/ioctl.c's ioctl_preallocate() function).
>
> Thanks for the clarification.
> My main point is that these related ioctls existed before fallocate().
>
>> What Dave was talking about is something different. He's suggesting a
>> new call which reserves space, but which does not actually make the
>> block allocation decision until the time of the write.
>
> Yes, that was clear.
> I'm still not sure it's needed, TBH.
> The separation of functionality is needed for the reasons Dave
> detailed, but it might be better to add an ALIGN flag to fallocate()
> for that special use case. I'm not trying to reinforce my argument
> with repetition here, just trying to be clear.

Is XFS the only file system that overloads this alignment behaviour
on fallocate()?

Why I ask is: if that were the case, then perhaps XFS could change to
using the FALLOC_FL_ALIGN flag for this (or its existing ioctl), and
so would not be negatively impacted by tools which start using
fallocate(), unaware of the subtle performance implications on XFS.

cheers,
Pádraig.
end of thread, other threads: [~2011-12-07 11:42 UTC | newest]

Thread overview: 31+ messages -- links below jump to the message on this page:
  2011-11-25 10:26 fallocate vs ENOSPC Pádraig Brady
  2011-11-25 10:40 ` Christoph Hellwig
  2011-11-27  3:14 ` Ted Ts'o
  2011-11-27 23:43 ` Dave Chinner
  2011-11-28  0:13 ` Pádraig Brady
  2011-11-28  3:51 ` Dave Chinner
  2011-11-28  0:40 ` Theodore Tso
  2011-11-28  5:10 ` Dave Chinner
  2011-11-28  8:55 ` Pádraig Brady
  2011-11-28 10:41 ` tao.peng
  2011-11-28 12:02 ` Pádraig Brady
  2011-11-28 14:36 ` Theodore Tso
  2011-11-28 14:51 ` Pádraig Brady
  2011-11-28 20:29 ` Ted Ts'o
  2011-11-28 20:49 ` Jeremy Allison
  2011-11-29 22:39 ` Eric Sandeen
  2011-11-29 23:04 ` Jeremy Allison
  2011-11-29 23:19 ` Eric Sandeen
  2011-11-28 18:49 ` Jeremy Allison
  2011-11-29  0:26 ` Dave Chinner
  2011-11-29  0:45 ` Jeremy Allison
  2011-11-29  0:24 ` Dave Chinner
  2011-11-29 14:11 ` Pádraig Brady
  2011-11-29 23:37 ` Dave Chinner
  2011-11-30  9:28 ` Pádraig Brady
  2011-11-30 15:32 ` Ted Ts'o
  2011-11-30 16:11 ` Pádraig Brady
  2011-11-30 17:01 ` Ted Ts'o
  2011-11-30 23:39 ` Dave Chinner
  2011-12-01  0:11 ` Pádraig Brady
  2011-12-07 11:42 ` Pádraig Brady