* e2fsprogs and blocks outside i_size @ 2008-07-18 12:11 Aneesh Kumar K.V 2008-07-18 12:37 ` Theodore Tso 0 siblings, 1 reply; 6+ messages in thread From: Aneesh Kumar K.V @ 2008-07-18 12:11 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Hi Ted, With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc space and if we hit ENOSPC when trying to insert the extent, we actually zero out the extent. That means we can have blocks outside i_size for an inode. I guess e2fsck currently doesn't handle this. Or should we fix kernel to update i_size to the right value if we do a zero out of the extent ? With fallocate if the prealloc area is small we also aggressively zeroout. This was needed so that a random write pattern on falloc area doesn't results in too many extents. That also can result in the above error on fsck. -aneesh ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: e2fsprogs and blocks outside i_size 2008-07-18 12:11 e2fsprogs and blocks outside i_size Aneesh Kumar K.V @ 2008-07-18 12:37 ` Theodore Tso 2008-07-21 5:08 ` Andreas Dilger 0 siblings, 1 reply; 6+ messages in thread From: Theodore Tso @ 2008-07-18 12:37 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: linux-ext4 On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote: > Hi Ted, > > With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc > space and if we hit ENOSPC when trying to insert the extent, > we actually zero out the extent. That means we can have blocks > outside i_size for an inode. I guess e2fsck currently doesn't > handle this. Or should we fix kernel to update i_size to the > right value if we do a zero out of the extent ? > > With fallocate if the prealloc area is small we also aggressively zeroout. > This was needed so that a random write pattern on falloc area doesn't > results in too many extents. That also can result in the above > error on fsck. It would seem to me that e2fsck should be fixed to not complain about blocks outside of i_size, *if* the blocks in question are marked as being unitialized. It also seems to me that updating i_size when the extent is zero'ed out is also not the right thing to do. Some applications depend on i_size. In the case where you hit ENOSPC when you need to grow the tree to insert an extent, that's a tough one. One approach would be to simply return an error in that case, although that's unfortunate since in addition to people using fallocate() to try to get better block layout, some people want to preallocate blocks to guarantee that the write will not fail! For people whose reason for using fallocate() are in the second camp, we can satisfy their need simply by stealing a block from the end of the preallocate segment and use it to grow the extent tree. I'm inclined to think the second solution might be the better one, but I'm sure that will be controversial. For your second case, where we aggressively zero out blocks, one of the reasons why we have to do that is because the kernel isn't coalescing extents aggressively. My inclination here is to *not* aggressively zero out blocks outside outside of i_size, and to split the extent in that case --- and then to make sure our extent management code is better about coalescing extents, so that when we convert an extent from INIT to UNINIT, we also check the previous and next extent, and if we can combine current extent with an adjacent extent, we should do so. I suppose the other hack we could do is have e2fsck check the blocks that are outside of i_size, and if they are all zero and extents are involved, that it's a case of pre-allocated blocks that needed to be zero'ed for some reason, as opposed to a corrupted i_size. That seems to be a really gross hack, though. - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: e2fsprogs and blocks outside i_size 2008-07-18 12:37 ` Theodore Tso @ 2008-07-21 5:08 ` Andreas Dilger 2008-07-21 5:59 ` Aneesh Kumar K.V 0 siblings, 1 reply; 6+ messages in thread From: Andreas Dilger @ 2008-07-21 5:08 UTC (permalink / raw) To: Theodore Tso; +Cc: Aneesh Kumar K.V, linux-ext4 On Jul 18, 2008 08:37 -0400, Theodore Ts'o wrote: > On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote: > > With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc > > space and if we hit ENOSPC when trying to insert the extent, > > we actually zero out the extent. That means we can have blocks > > outside i_size for an inode. To clarify, doesn't FALLOC_FL_KEEP_SIZE put the extent beyond i_size, regardless of whether the ENOSPC problem is hit? > > I guess e2fsck currently doesn't handle this. Or should we fix kernel > > to update i_size to the right value if we do a zero out of the extent ? > > > > With fallocate if the prealloc area is small we also aggressively zeroout. > > This was needed so that a random write pattern on falloc area doesn't > > results in too many extents. That also can result in the above > > error on fsck. > > It would seem to me that e2fsck should be fixed to not complain about > blocks outside of i_size, *if* the blocks in question are marked as > being unitialized. Yes, I think that is the right approach. > It also seems to me that updating i_size when the > extent is zero'ed out is also not the right thing to do. Some > applications depend on i_size. Yes, this was my thought exactly. Just because the fallocated extent is zeroed out doesn't mean the file size can suddenly change. This would appear as file corruption or "data loss" to many applications. > In the case where you hit ENOSPC when you need to grow the tree to > insert an extent, that's a tough one. One approach would be to simply I think you misunderstand Aneesh's comments - the ext4 fallocate code already handles all of these cases. If ENOSPC is hit during the extent split the error recovery will zero out the whole uninitialized extent. Slow, but effective. > For your second case, where we aggressively zero out blocks, one of > the reasons why we have to do that is because the kernel isn't > coalescing extents aggressively. My inclination here is to *not* > aggressively zero out blocks outside outside of i_size, and to split > the extent in that case That is only partly true. If an application is writing every other block in a 64MB fallocated extent we don't want to create 64MB/4kB separate extents. There is no guarantee that the application will fill in the rest of the unwritten extents and allow them to merge. Also, there is likely a net performance improvement by writing every block (half with data, the other half with zeroes) compared to doing write + seek over the entire file. > I suppose the other hack we could do is have e2fsck check the blocks > that are outside of i_size, and if they are all zero and extents are > involved, that it's a case of pre-allocated blocks that needed to be > zero'ed for some reason, as opposed to a corrupted i_size. That seems > to be a really gross hack, though. Yuck, with the added problem that there is no guarantee that these data blocks ARE all zero. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: e2fsprogs and blocks outside i_size 2008-07-21 5:08 ` Andreas Dilger @ 2008-07-21 5:59 ` Aneesh Kumar K.V 2008-07-21 12:34 ` Theodore Tso 0 siblings, 1 reply; 6+ messages in thread From: Aneesh Kumar K.V @ 2008-07-21 5:59 UTC (permalink / raw) To: Andreas Dilger; +Cc: Theodore Tso, linux-ext4 On Sun, Jul 20, 2008 at 11:08:25PM -0600, Andreas Dilger wrote: > On Jul 18, 2008 08:37 -0400, Theodore Ts'o wrote: > > On Fri, Jul 18, 2008 at 05:41:30PM +0530, Aneesh Kumar K.V wrote: > > > With fallocate FALLOC_FL_KEEP_SIZE option, when we write to prealloc > > > space and if we hit ENOSPC when trying to insert the extent, > > > we actually zero out the extent. That means we can have blocks > > > outside i_size for an inode. > > To clarify, doesn't FALLOC_FL_KEEP_SIZE put the extent beyond i_size, > regardless of whether the ENOSPC problem is hit? But the extent in that case would be marked as uninit using the extent len. So e2fsck can check for that. > > > > I guess e2fsck currently doesn't handle this. Or should we fix kernel > > > to update i_size to the right value if we do a zero out of the extent ? > > > > > > With fallocate if the prealloc area is small we also aggressively zeroout. > > > This was needed so that a random write pattern on falloc area doesn't > > > results in too many extents. That also can result in the above > > > error on fsck. > > > > It would seem to me that e2fsck should be fixed to not complain about > > blocks outside of i_size, *if* the blocks in question are marked as > > being unitialized. > > Yes, I think that is the right approach. That is fine for extents marked uninit. But when we zero out we zero out the full extent. So that means a write of few bytes can result in blocks being zeroed out outside i_size. My question was how e2fsck can handle this. Because the extent will no more be marked as uninit and there would be blocks outside i_size all carrying zero. > > > I suppose the other hack we could do is have e2fsck check the blocks > > that are outside of i_size, and if they are all zero and extents are > > involved, that it's a case of pre-allocated blocks that needed to be > > zero'ed for some reason, as opposed to a corrupted i_size. That seems > > to be a really gross hack, though. > > Yuck, with the added problem that there is no guarantee that these > data blocks ARE all zero. > -aneesh ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: e2fsprogs and blocks outside i_size 2008-07-21 5:59 ` Aneesh Kumar K.V @ 2008-07-21 12:34 ` Theodore Tso 2008-07-21 23:32 ` Andreas Dilger 0 siblings, 1 reply; 6+ messages in thread From: Theodore Tso @ 2008-07-21 12:34 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Andreas Dilger, linux-ext4 On Mon, Jul 21, 2008 at 11:29:18AM +0530, Aneesh Kumar K.V wrote: > > That is fine for extents marked uninit. But when we zero out we zero out > the full extent. So that means a write of few bytes can result in blocks > being zeroed out outside i_size. My question was how e2fsck can handle > this. Because the extent will no more be marked as uninit and there > would be blocks outside i_size all carrying zero. > Wel, as I said originally, we have four choices, only two of which are tenable: 1) Don't change i_size and leave e2fsck confused about whether i_size is confused or not; the next time e2fsck runs it can either fix it and change i_size, confusing applications that depend on i_size, or not fix it and in the case of a corrupted i_size, leave valid data inaccessible or do the hack to which Andreas reacted, "Yuck", and which Annesh quoted and I assume agree. (i.e., checking the data blocks to see if they are non-zero, and electing to to risk confusing the application in the case where they are non-zero). This is the current case. 2) Change i_size and always confuse applications that depend on i_size carrying some semantic meaning. 3) Don't aggressively zero-out (as it presents us with these two untenable options) and try to explit the extent instead. If the block application fails, return ENOSPC. 4) #3, except if the block allocation fails, try to steal a block that had been previously preallocated for some other logical block in that inode. Are there any other choices? I think #3 and #4 are the only options and #3 is certainly the simplest to implement, but it could lead to confusing results since the filesystem would be returning ENOSPC even though 'df' reports that space is available --- and some applications preallocate in order to guarantee no write failures. #4 is more complex, but it means that file might be more fragmented at the end, which would be bad for applications that depend on fallocate() to provide a more contiugous file. (Although fallocate never guaranteed perfect layout, just that it might provide a better one.) It also means that at the end, a file write might end up failing anyway, since we ended up stealing a block that was meant for use as a data bock. The one other thing I would note is that at least for non-root users, the reserved blocks will help save us most of the time, except for when users explicitly set the reserved blocks down to zero. But maybe this is one place where we just document that reserved blocks serve yet another purpose, which is to get us out of this mess, and that applications which depend on preallocated writes not failing need to either (a) not use insane write patterns, or (b) not run as root and to save a modest number of reserved blocks for this situation (or otherwise leave some "slack space" in the filesystem.) - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: e2fsprogs and blocks outside i_size 2008-07-21 12:34 ` Theodore Tso @ 2008-07-21 23:32 ` Andreas Dilger 0 siblings, 0 replies; 6+ messages in thread From: Andreas Dilger @ 2008-07-21 23:32 UTC (permalink / raw) To: Theodore Tso; +Cc: Aneesh Kumar K.V, linux-ext4 On Jul 21, 2008 08:34 -0400, Theodore Ts'o wrote: > Wel, as I said originally, we have four choices, only two of which are > tenable: > > 1) Don't change i_size and leave e2fsck confused about whether i_size > is confused or not; the next time e2fsck runs it can either fix it and > change i_size, confusing applications that depend on i_size, or not > fix it and in the case of a corrupted i_size, leave valid data > inaccessible or do the hack to which Andreas reacted, "Yuck", and > which Annesh quoted and I assume agree. (i.e., checking the data > blocks to see if they are non-zero, and electing to to risk confusing > the application in the case where they are non-zero). This is the > current case. > > 2) Change i_size and always confuse applications that depend on i_size > carrying some semantic meaning. > > 3) Don't aggressively zero-out (as it presents us with these two > untenable options) and try to explit the extent instead. If the block > application fails, return ENOSPC. > > 4) #3, except if the block allocation fails, try to steal a block that > had been previously preallocated for some other logical block in that > inode. 5) Add a flag to the inode which means "blocks beyond i_size" if fallocate() is called with "KEEP_SIZE" and allocation is actually beyond i_size and not just filling a hole) so that e2fsck won't "fix" the size, but allows the extent to be uninitialized. The flag is cleared (by kernel and/or e2fsck) if the size is extended to the last block. To avoid consuming our precious inode flags, we might consider to re-use the EXT3_DIRSYNC_FL or EXT3_TOPDIR_FL for this purpose, since the are definitely only having meaning for directories. I guess the question is whether we would need this for directories, but I don't think so as we could always just add empty directory blocks (at the expense of having to scan them later). > The one other thing I would note is that at least for non-root users, > the reserved blocks will help save us most of the time, except for > when users explicitly set the reserved blocks down to zero. Would the index block be allocated from the reserved space tough? This is also a good idea, but I'm not sure if that is what happens. I guess the "allocate index block" code path needs to check for "(uid == s_reserved_uid || is_metadata)"? Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2008-07-21 23:32 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-07-18 12:11 e2fsprogs and blocks outside i_size Aneesh Kumar K.V 2008-07-18 12:37 ` Theodore Tso 2008-07-21 5:08 ` Andreas Dilger 2008-07-21 5:59 ` Aneesh Kumar K.V 2008-07-21 12:34 ` Theodore Tso 2008-07-21 23:32 ` Andreas Dilger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox