* [PATCH 0/3] xfs: failed writes and stale delalloc blocks.
@ 2012-04-27 9:45 Dave Chinner
2012-04-27 9:45 ` [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure Dave Chinner
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Dave Chinner @ 2012-04-27 9:45 UTC (permalink / raw)
To: xfs
The first patch fixes the original off-by-one I found that led to assert
failures. The assert failures still happened, though, and that led to the
discovery that we can leave delalloc blocks inside the file too. That's a much
more complex fix, explained in the patch, and it effectively makes the first
patch redundant. I wanted to leave them as two separate patches because it
shows that there are two distinct classes of errors we have to handle
correctly.
These two patches don't solve all the xfs_getbmap() and evict() assert failures
relating to delayed allocation blocks. With all the debug and tracing turned on
and these patches applied, test 083 can run in a loop for an hour and not
fail, completing a successful test every ~20s.
However, one in every five test runs resulted in a test failure because the
check of the filesystem (using a 512 byte block size) would trigger a mount
warning, and that would cause a false detection of a failure. Hence the third
patch, which prevents that warning from occurring. I'm still seeing a less
regular check failure when not running with debug and tracing, but this error
message is no longer the cause.
Back to delayed allocation: when I remove either the debug or the tracing, I
still see fairly regular assert failures, but this time they are not a result of
failed writes. The inodes that trigger failures do not show up in the failed
write debug output, and the extent debug output always indicates there are
delalloc blocks beyond EOF and the tracing indicates that they were put there as
a result of speculative delayed allocation beyond EOF.
I believe the reason fsstress is tripping the bmap assert is that, unlike
fiemap and the xfs_io bmap command, it supplies a length for the range it
wants mapped and so we are trying to map beyond EOF. Of course, there are
delalloc blocks there, which triggers the assert. I need to modify xfs_io to see
if this really is the cause of the remaining assert failures....
However, in the meantime, these patches remove one source of assert failures
and make my xfstests runs much less likely to fail. With these patches the
majority of my auto group test runs complete, compared to a 50% failure rate
without them.
I hope everyone enjoys reviewing patch 2/3 as much as I enjoyed writing it - it
was a PITA until I found that set_buffer_new() bug in __xfs_get_blocks().... ;)
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure.
  2012-04-27  9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner
@ 2012-04-27  9:45 ` Dave Chinner
  2012-04-30 13:49   ` Christoph Hellwig
  2012-04-27  9:45 ` [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF Dave Chinner
  ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2012-04-27 9:45 UTC (permalink / raw)
To: xfs

From: Dave Chinner <dchinner@redhat.com>

I've been seeing regular ASSERT failures in xfstests when running
fsstress based tests over the past month. xfs_getbmap() has been
failing this test:

XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5650

where it is encountering a delayed allocation extent after writing
all the dirty data to disk and then walking the extent map
atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed
allocation extents from being created.

Test 083 on a 512 byte block size filesystem was used to reproduce
the problem, because it only had a 5s run time and would usually fail
every 3-4 runs. This test is exercising ENOSPC behaviour by running
fsstress on a nearly full filesystem.
The following trace extract shows the final few events on the inode
that tripped the assert:

 xfs_ilock:            flags ILOCK_EXCL caller xfs_setfilesize
 xfs_setfilesize:      isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680

file size updated to 0x180000 by IO completion

 xfs_ilock:            flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_iext_insert:      state  idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
 xfs_get_blocks_alloc: size 0x180000 offset 0x180000 count 512 type  startoff 0xc00 startblock -1 blockcount 0x1
 xfs_ilock:            flags ILOCK_EXCL caller __xfs_get_blocks

delalloc write, adding a single block at offset 0x180000

 xfs_delalloc_enospc:  isize 0x180000 disize 0x180000 offset 0x180200 count 512

ENOSPC trying to allocate a delalloc block at offset 0x180200

 xfs_ilock:            flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_get_blocks_alloc: size 0x180000 offset 0x180200 count 512 type  startoff 0xc00 startblock -1 blockcount 0x2

And succeeding on retry after flushing dirty inodes.

 xfs_ilock:            flags ILOCK_EXCL caller __xfs_get_blocks
 xfs_delalloc_enospc:  isize 0x180000 disize 0x180000 offset 0x180400 count 512

ENOSPC trying to allocate a delalloc block at offset 0x180400

 xfs_ilock:            flags ILOCK_EXCL caller xfs_iomap_write_delay
 xfs_delalloc_enospc:  isize 0x180000 disize 0x180000 offset 0x180400 count 512

And failing the retry, giving a real ENOSPC error.

 xfs_ilock:            flags ILOCK_EXCL caller xfs_vm_write_failed
                                                ^^^^^^^^^^^^^^^^^^^
The smoking gun - the write being failed and cleaning up delalloc
blocks beyond EOF allocated by the failed write.

 xfs_getattr:
 xfs_ilock:            flags IOLOCK_SHARED caller xfs_getbmap
 xfs_ilock:            flags ILOCK_SHARED caller xfs_ilock_map_shared

And that's where we died almost immediately afterwards:
xfs_bmapi_read() found a delalloc extent beyond the current in-memory
file size.
Some debug I added to xfs_getbmap() showed the state just before the
assert failure:

 ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
 start_fsb 0x106, end_fsb 0x638
 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
 ext 0: off 0x1fc, fsb 0x24782, len 0x254
 ext 1: off 0x450, fsb 0x40851, len 0x30
 ext 2: off 0x480, fsb 0xd99, len 0x1b8
 ext 3: off 0x92f, fsb 0x4099a, len 0x3b
 ext 4: off 0x96d, fsb 0x41844, len 0x98
 ext 5: off 0xbf1, fsb 0x408ab, len 0xf

which shows that we found a single delalloc block beyond EOF (first
line of output) when we were returning the map for a length somewhere
around 10^16 bytes long (second line), while the on-disk extents
showed they didn't go past EOF (last lines).

Further debug added to xfs_vm_write_failed() showed this happened when
punching out delalloc blocks beyond the end of the file after the
failed write:

[  132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
[  132.609573] start_fsb 0xc01, end_fsb 0xc08

It punched the range 0xc01 -> 0xc08, but the range we really need to
punch is 0xc00 -> 0xc07 (8 blocks starting at 0xc00), as this testing
was run on a 512 byte block size filesystem (8 blocks per page). So
end_fsb is correct, but start_fsb is wrong: we punch from start_fsb
for (end_fsb - start_fsb) blocks, so the punch has to start at 0xc00.
Hence we are not punching the delalloc block beyond EOF in this case.

The fix is simple - it's a silly off-by-one mistake in calculating the
range. It's especially silly because the macro used to calculate the
start_fsb already takes into account the case where the inode size is
an exact multiple of the filesystem block size...
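The arithmetic behind the off-by-one can be checked in isolation. A minimal standalone sketch (the helper names are mine, not the kernel macros), using the numbers from the debug output above: with i_size 0x180000 and 512 byte blocks, the first block beyond EOF is 0xc00, and the buggy `+ 1` starts the punch at 0xc01 so that block survives.

```c
#include <stdint.h>

/*
 * Round a byte offset up to filesystem blocks, the way XFS_B_TO_FSB
 * does. A minimal sketch of the conversion, not the kernel macro.
 */
uint64_t b_to_fsb(uint64_t bytes, uint32_t blksize)
{
	return (bytes + blksize - 1) / blksize;
}

/*
 * Number of blocks the failed-write cleanup punches, with the
 * off-by-one selectable so the broken and fixed behaviour can be
 * compared side by side.
 */
uint64_t punch_len(uint64_t isize, uint64_t to, uint32_t blksize, int buggy)
{
	uint64_t start_fsb = b_to_fsb(isize, blksize) + (buggy ? 1 : 0);
	uint64_t end_fsb = b_to_fsb(to, blksize);

	if (end_fsb <= start_fsb)
		return 0;
	return end_fsb - start_fsb;
}
```

With isize 0x180000 and to 0x181000, the buggy version punches 7 blocks starting at 0xc01 and misses block 0xc00, while the fixed version punches all 8 blocks from 0xc00 to 0xc07.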
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b66766a..64ed87a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1431,7 +1431,7 @@ xfs_vm_write_failed(
 		 * Check if there are any blocks that are outside of i_size
 		 * that need to be trimmed back.
 		 */
-		start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size) + 1;
+		start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
 		end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
 		if (end_fsb <= start_fsb)
 			return;
-- 
1.7.10

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure.
  2012-04-27  9:45 ` [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure Dave Chinner
@ 2012-04-30 13:49   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2012-04-30 13:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF.
  2012-04-27  9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner
  2012-04-27  9:45 ` [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure Dave Chinner
@ 2012-04-27  9:45 ` Dave Chinner
  2012-05-07 22:00   ` Ben Myers
  2012-04-27  9:45 ` [PATCH 3/3] xfs: prevent needless mount warning causing test failures Dave Chinner
  ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2012-04-27 9:45 UTC (permalink / raw)
To: xfs

From: Dave Chinner <dchinner@redhat.com>

When a partial write inside EOF fails, it can leave delayed
allocation blocks lying around because they don't get punched back
out. This leads to assert failures like:

XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847

when evicting inodes from the cache. This can be trivially triggered
by xfstests 083, which takes between 5 and 15 executions on a 512
byte block size filesystem to trip over this. Debugging shows a
failed write due to ENOSPC calling xfs_vm_write_failed such as:

[ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae

and no action is taken on it. This leaves behind a delayed
allocation extent that has no page covering it and no data in it:

[ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
[ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
[ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
[ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                    ^^^^^^^^^^^^^^^
[ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
[ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa

So the delayed allocation extent is one block long at offset
0x16c00.
Tracing shows that a bigger write:

xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags

allocates the block, and then fails with ENOSPC trying to allocate
the last block on the page, leading to a failed write with stale
delalloc blocks on it.

Because we've had an ENOSPC when trying to allocate 0x16e00, it
means that we are never going to call ->write_end on the page and
so the newly allocated buffer will not get marked dirty or have the
buffer_new state cleared. In other words, what the above write is
supposed to end up with is this mapping for the page:

+------+------+------+------+------+------+------+------+
  UMA    UMA    UMA    UMA    UMA    UMA    UND   FAIL

where:	U = uptodate
	M = mapped
	N = new
	A = allocated
	D = delalloc
	FAIL = block we ENOSPC'd on.

and the key point being the buffer_new() state for the newly
allocated delayed allocation block. Except it doesn't - we're not
marking buffers new correctly.

That buffer_new() problem goes back to the xfs_iomap removal days,
where xfs_iomap() used to return a "new" status for any map with
newly allocated blocks, so that __xfs_get_blocks() could call
set_buffer_new() on it. We still have the "new" variable and the
check for it in the set_buffer_new() logic - except we never set it
now!

Hence that newly allocated delalloc block doesn't have the new flag
set on it, so when the write fails we cannot tell which blocks we
are supposed to punch out. Why do we need the buffer_new flag? Well,
that's because we can have this case:

+------+------+------+------+------+------+------+------+
  UMD    UMD    UMD    UMD    UMD    UMD    UND   FAIL

where all the UMD buffers contain valid data from a previously
successful write() system call. We only want to punch the UND buffer
because that's the only one that we added in this write and it was
only this write that failed.
That implies that even the old buffer_new() logic was wrong -
because it would result in all those UMD buffers on the page having
set_buffer_new() called on them even though they aren't new. Hence
we should only be calling set_buffer_new() for delalloc buffers that
were allocated (i.e. were a hole before xfs_iomap_write_delay() was
called).

So, fix this set_buffer_new logic according to how we need it to
work for handling failed writes correctly. Also, restore the new
buffer logic handling for blocks allocated via
xfs_iomap_write_direct(), because it should still set the buffer_new
flag appropriately for newly allocated blocks, too.

So, now that we have buffer_new() being set appropriately in
__xfs_get_blocks(), we can detect the exact delalloc ranges that
we allocated in a failed write, and hence can now do a walk of the
buffers on a page to find them.

Except, it's not that easy. When block_write_begin() fails, it
unlocks and releases the page that we just had an error on, so we
can't use that page to handle errors anymore. We have to get access
to the page while it is still locked to walk the buffers. Hence we
have to open code block_write_begin() in xfs_vm_write_begin() to be
able to insert xfs_vm_write_failed() in the right place.

With that, we can pass the page and write range to
xfs_vm_write_failed() and walk the buffers on the page, looking for
delalloc buffers that are either new or beyond EOF and punch them
out. Handling buffers beyond EOF ensures we still handle the
existing case that xfs_vm_write_failed() handles.

Of special note is the truncate_pagecache() handling - that should
only be done for pages outside EOF, as pages within EOF can still
contain valid, dirty data, so we must not punch them out of the
cache.

That just leaves the xfs_vm_write_end() failure handling.
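The buffer walk described above boils down to a per-buffer decision: skip buffers outside the failed write, skip non-delalloc buffers, and leave old delalloc inside EOF alone. A minimal standalone model of that decision (parameter names are illustrative; this is not the kernel code, which works on buffer_heads):

```c
#include <stdint.h>

/*
 * A stripped-down model of the per-buffer decision made when walking
 * page buffers after a failed write covering [from, to) within the
 * page. delay and is_new model buffer_delay() and buffer_new().
 */
int should_punch(uint64_t block_start,  /* buffer offset within the page */
		 uint64_t block_offset, /* buffer offset within the file */
		 uint32_t size,         /* buffer size (fs block size) */
		 int delay,             /* buffer_delay() state */
		 int is_new,            /* buffer_new() state */
		 uint64_t from, uint64_t to, uint64_t isize)
{
	uint64_t block_end = block_start + size;

	if (block_end <= from)		/* buffer is before the write */
		return 0;
	if (block_start >= to)		/* buffer is after the write */
		return 0;
	if (!delay)			/* not delalloc, nothing to punch */
		return 0;
	/* old delalloc inside EOF holds valid data from earlier writes */
	if (!is_new && block_offset < isize)
		return 0;
	return 1;			/* new delalloc, or beyond EOF */
}
```

In the UMD/UND picture above, only the UND buffer (delalloc and new) gets punched; the UMD buffers inside EOF are skipped because they are not new.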
The only failure case here is that we didn't copy the entire range,
and generic_write_end() handles that by zeroing the region of the
page that wasn't copied, so we don't have to punch out blocks within
the file because they are guaranteed to contain zeros. Hence we only
have to handle the existing "beyond EOF" case and don't need access
to the buffers on the page. Hence it remains largely unchanged.

Note that xfs_getbmap() can still trip over delalloc blocks beyond
EOF that are left there by speculative delayed allocation. Hence
this bug fix does not solve all known issues with bmap vs delalloc,
but it does fix all the known accidental occurrences of the
problem.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_aops.c |  173 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 127 insertions(+), 46 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 64ed87a..ae31c31 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1184,11 +1184,18 @@ __xfs_get_blocks(
 					 &imap, nimaps);
 		if (error)
 			return -error;
+		new = 1;
 	} else {
 		/*
 		 * Delalloc reservations do not require a transaction,
-		 * we can go on without dropping the lock here.
+		 * we can go on without dropping the lock here. If we
+		 * are allocating a new delalloc block, make sure that
+		 * we set the new flag so that we mark the buffer new so
+		 * that we know that it is newly allocated if the write
+		 * fails.
 		 */
+		if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
+			new = 1;
 		error = xfs_iomap_write_delay(ip, offset, size, &imap);
 		if (error)
 			goto out_unlock;
@@ -1405,52 +1412,91 @@ out_destroy_ioend:
 	return ret;
 }
 
+/*
+ * Punch out the delalloc blocks we have already allocated.
+ *
+ * Don't bother with xfs_setattr given that nothing can have made it to disk yet
+ * as the page is still locked at this point.
+ */
+STATIC void
+xfs_vm_kill_delalloc_range(
+	struct inode		*inode,
+	loff_t			start,
+	loff_t			end)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	xfs_fileoff_t		start_fsb;
+	xfs_fileoff_t		end_fsb;
+	int			error;
+
+	start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
+	if (end_fsb <= start_fsb)
+		return;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+						end_fsb - start_fsb);
+	if (error) {
+		/* something screwed, just bail */
+		if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+			xfs_alert(ip->i_mount,
+		"xfs_vm_write_failed: unable to clean up ino %lld",
+					ip->i_ino);
+		}
+	}
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+}
+
 STATIC void
 xfs_vm_write_failed(
-	struct address_space	*mapping,
-	loff_t			to)
+	struct inode		*inode,
+	struct page		*page,
+	loff_t			pos,
+	unsigned		len)
 {
-	struct inode		*inode = mapping->host;
+	loff_t			block_offset = pos & PAGE_MASK;
+	loff_t			block_start;
+	loff_t			block_end;
+	loff_t			from = pos & (PAGE_CACHE_SIZE - 1);
+	loff_t			to = from + len;
+	struct buffer_head	*bh, *head;
 
-	if (to > inode->i_size) {
-		/*
-		 * Punch out the delalloc blocks we have already allocated.
-		 *
-		 * Don't bother with xfs_setattr given that nothing can have
-		 * made it to disk yet as the page is still locked at this
-		 * point.
-		 */
-		struct xfs_inode	*ip = XFS_I(inode);
-		xfs_fileoff_t		start_fsb;
-		xfs_fileoff_t		end_fsb;
-		int			error;
+	ASSERT(block_offset + from == pos);
 
-		truncate_pagecache(inode, to, inode->i_size);
+	head = page_buffers(page);
+	block_start = 0;
+	for (bh = head; bh != head || !block_start;
+	     bh = bh->b_this_page, block_start = block_end,
+				   block_offset += bh->b_size) {
+		block_end = block_start + bh->b_size;
 
-		/*
-		 * Check if there are any blocks that are outside of i_size
-		 * that need to be trimmed back.
-		 */
-		start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
-		end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
-		if (end_fsb <= start_fsb)
-			return;
-
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-							end_fsb - start_fsb);
-		if (error) {
-			/* something screwed, just bail */
-			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-				xfs_alert(ip->i_mount,
-			"xfs_vm_write_failed: unable to clean up ino %lld",
-						ip->i_ino);
-			}
-		}
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		/* skip buffers before the write */
+		if (block_end <= from)
+			continue;
+
+		/* if the buffer is after the write, we're done */
+		if (block_start >= to)
+			break;
+
+		if (!buffer_delay(bh))
+			continue;
+
+		if (!buffer_new(bh) && block_offset < i_size_read(inode))
+			continue;
+
+		xfs_vm_kill_delalloc_range(inode, block_offset,
+					   block_offset + bh->b_size);
 	}
+
 }
 
+/*
+ * This used to call block_write_begin(), but it unlocks and releases the page
+ * on error, and we need that page to be able to punch stale delalloc blocks out
+ * on failure. hence we copy-n-waste it here and call xfs_vm_write_failed() at
+ * the appropriate point.
+ */
 STATIC int
 xfs_vm_write_begin(
 	struct file		*file,
@@ -1461,15 +1507,40 @@ xfs_vm_write_begin(
 	struct page		**pagep,
 	void			**fsdata)
 {
-	int			ret;
+	pgoff_t			index = pos >> PAGE_CACHE_SHIFT;
+	struct page		*page;
+	int			status;
 
-	ret = block_write_begin(mapping, pos, len, flags | AOP_FLAG_NOFS,
-				pagep, xfs_get_blocks);
-	if (unlikely(ret))
-		xfs_vm_write_failed(mapping, pos + len);
-	return ret;
+	ASSERT(len <= PAGE_CACHE_SIZE);
+
+	page = grab_cache_page_write_begin(mapping, index,
+					   flags | AOP_FLAG_NOFS);
+	if (!page)
+		return -ENOMEM;
+
+	status = __block_write_begin(page, pos, len, xfs_get_blocks);
+	if (unlikely(status)) {
+		struct inode	*inode = mapping->host;
+
+		xfs_vm_write_failed(inode, page, pos, len);
+		unlock_page(page);
+
+		if (pos + len > i_size_read(inode))
+			truncate_pagecache(inode, pos + len, i_size_read(inode));
+
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	*pagep = page;
+	return status;
 }
 
+/*
+ * On failure, we only need to kill delalloc blocks beyond EOF because they
+ * will never be written. For blocks within EOF, generic_write_end() zeros them
+ * so they are safe to leave alone and be written with all the other valid data.
+ */
 STATIC int
 xfs_vm_write_end(
 	struct file		*file,
@@ -1482,9 +1553,19 @@ xfs_vm_write_end(
 {
 	int			ret;
 
+	ASSERT(len <= PAGE_CACHE_SIZE);
+
 	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (unlikely(ret < len))
-		xfs_vm_write_failed(mapping, pos + len);
+	if (unlikely(ret < len)) {
+		struct inode	*inode = mapping->host;
+		size_t		isize = i_size_read(inode);
+		loff_t		to = pos + len;
+
+		if (to > isize) {
+			truncate_pagecache(inode, to, isize);
+			xfs_vm_kill_delalloc_range(inode, isize, to);
+		}
+	}
 	return ret;
 }
-- 
1.7.10

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF.
  2012-04-27  9:45 ` [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF Dave Chinner
@ 2012-05-07 22:00   ` Ben Myers
  0 siblings, 0 replies; 19+ messages in thread
From: Ben Myers @ 2012-05-07 22:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Fri, Apr 27, 2012 at 07:45:21PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a partial write inside EOF fails, it can leave delayed
> allocation blocks lying around because they don't get punched back
> out. This leads to assert failures like:
> 
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
> 
> when evicting inodes from the cache. This can be trivially triggered
> by xfstests 083, which takes between 5 and 15 executions on a 512
> byte block size filesystem to trip over this. Debugging shows a
> failed write due to ENOSPC calling xfs_vm_write_failed such as:
> 
> [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
> 
> and no action is taken on it. This leaves behind a delayed
> allocation extent that has no page covering it and no data in it:
> 
> [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
> [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
> [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
> [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
>                                     ^^^^^^^^^^^^^^^
> [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
> [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
> 
> So the delayed allocation extent is one block long at offset
> 0x16c00. Tracing shows that a bigger write:
> 
> xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
> 
> allocates the block, and then fails with ENOSPC trying to allocate
> the last block on the page, leading to a failed write with stale
> delalloc blocks on it.
> 
> Because we've had an ENOSPC when trying to allocate 0x16e00, it
> means that we are never goinge to call ->write_end on the page and

going

> so the allocated new buffer will not get marked dirty or have the
> buffer_new state cleared. In other works, what the above write is
> supposed to end up with is this mapping for the page:
> 
> +------+------+------+------+------+------+------+------+
>   UMA    UMA    UMA    UMA    UMA    UMA    UND   FAIL
> 
> where:	U = uptodate
> 	M = mapped
> 	N = new
> 	A = allocated
> 	D = delalloc
> 	FAIL = block we ENOSPC'd on.
> 
> and the key point being the buffer_new() state for the newly
> allocated delayed allocation block. Except it doesn't - we're not
> marking buffers new correctly.
> 
> That buffer_new() problem goes back to the xfs_iomap removal days,
> where xfs_iomap() used to return a "new" status for any map with
> newly allocated blocks, so that __xfs_get_blocks() could call
> set_buffer_new() on it. We still have the "new" variable and the
> check for it in the set_buffer_new() logic - except we never set it
> now!
> 
> Hence that newly allocated delalloc block doesn't have the new flag
> set on it, so when the write fails we cannot tell which blocks we
> are supposed to punch out. WHy do we need the buffer_new flag? Well,

Why

> that's because we can have this case:
> 
> +------+------+------+------+------+------+------+------+
>   UMD    UMD    UMD    UMD    UMD    UMD    UND   FAIL
> 
> where all the UMD buffers contain valid data from a previously
> successful write() system call. We only want to punch the UND buffer
> because that's the only one that we added in this write and it was
> only this write that failed.
> 
> That implies that even the old buffer_new() logic was wrong -
> because it would result in all those UMD buffers on the page having
> set_buffer_new() called on them even though they aren't new. Hence
> we shoul donly be calling set_buffer_new() for delalloc buffers that

should only

> were allocated (i.e. were a hole before xfs_iomap_write_delay() was
> called).
> 
> So, fix this set_buffer_new logic according to how we need it to
> work for handling failed writes correctly. Also, restore the new
> buffer logic handling for blocks allocated via
> xfs_iomap_write_direct(), because it should still set the buffer_new
> flag appropriately for newly allocated blocks, too.
> 
> SO, now we have the buffer_new() being set appropriately in
> __xfs_get_blocks(), we can detect the exact delalloc ranges that
> we allocated in a failed write, and hence can now do a walk of the
> buffers on a page to find them.
> 
> Except, it's not that easy. When block_write_begin() fails, it
> unlocks and releases the page that we just had an error on, so we
> can't use that page to handle errors anymore. We have to get access
> to the page while it is still locked to walk the buffers. Hence we
> have to open code block_write_begin() in xfs_vm_write_begin() to be
> able to insert xfs_vm_write_failed() is the right place.
> 
> With that, we can pass the page and write range to
> xfs_vm_write_failed() and walk the buffers on the page, looking for
> delalloc buffers that are either new or beyond EOF and punch them
> out. Handling buffers beyond EOF ensures we still handle the
> existing case that xfs_vm_write_failed() handles.
> 
> Of special note is the truncate_pagecache() handling - that only
> should be done for pages outside EOF - pages within EOF can still
> contain valid, dirty data so we must not punch them out of the
> cache.
> 
> That just leaves the xfs_vm_write_end() failure handling.
> The only failure case here is that we didn't copy the entire range,
> and generic_write_end() handles that by zeroing the region of the
> page that wasn't copied,

Are you referring to xfs_vm_write_end
                       generic_write_end
                         block_write_end
                           page_zero_new_buffers?

> we don't have to punch out blocks within
> the file because they are guaranteed to contain zeros. Hence we only
> have to handle the existing "beyond EOF" case and don't need access
> to the buffers on the page. Hence it remains largely unchanged.
> 
> Note that xfs_getbmap() can still trip over delalloc blocks beyond
> EOF that are left there by speculative delayed allocation. Hence
> this bug fix does not solve all known issues with bmap vs delalloc,
> but it does fix all the the known accidental occurances of the
> problem.
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/xfs_aops.c |  173 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 127 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 64ed87a..ae31c31 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1184,11 +1184,18 @@ __xfs_get_blocks(
>  					 &imap, nimaps);
>  		if (error)
>  			return -error;
> +		new = 1;
>  	} else {
>  		/*
>  		 * Delalloc reservations do not require a transaction,
> -		 * we can go on without dropping the lock here.
> +		 * we can go on without dropping the lock here. If we
> +		 * are allocating a new delalloc block, make sure that
> +		 * we set the new flag so that we mark the buffer new so
> +		 * that we know that it is newly allocated if the write
> +		 * fails.
>  		 */
> +		if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
> +			new = 1;
>  		error = xfs_iomap_write_delay(ip, offset, size, &imap);
>  		if (error)
>  			goto out_unlock;
> @@ -1405,52 +1412,91 @@ out_destroy_ioend:
>  	return ret;
>  }
>  
> +/*
> + * Punch out the delalloc blocks we have already allocated.

This language is confusing. I suggest that delay blocks are reserved
and real blocks are allocated. Tomato Tomato.

> + *
> + * Don't bother with xfs_setattr given that nothing can have made it to disk yet
> + * as the page is still locked at this point.
> + */
> +STATIC void
> +xfs_vm_kill_delalloc_range(
> +	struct inode		*inode,
> +	loff_t			start,
> +	loff_t			end)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	xfs_fileoff_t		start_fsb;
> +	xfs_fileoff_t		end_fsb;
> +	int			error;
> +
> +	start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
> +	if (end_fsb <= start_fsb)
> +		return;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> +						end_fsb - start_fsb);
> +	if (error) {
> +		/* something screwed, just bail */
> +		if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +			xfs_alert(ip->i_mount,
> +		"xfs_vm_write_failed: unable to clean up ino %lld",

Consider updating the function name in this error message and printing
out the value of error.

> +					ip->i_ino);
> +		}
> +	}
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +}
> +
>  STATIC void
>  xfs_vm_write_failed(
> -	struct address_space	*mapping,
> -	loff_t			to)
> +	struct inode		*inode,
> +	struct page		*page,
> +	loff_t			pos,
> +	unsigned		len)
>  {
> -	struct inode		*inode = mapping->host;
> +	loff_t			block_offset = pos & PAGE_MASK;
> +	loff_t			block_start;
> +	loff_t			block_end;
> +	loff_t			from = pos & (PAGE_CACHE_SIZE - 1);
> +	loff_t			to = from + len;
> +	struct buffer_head	*bh, *head;
>  
> -	if (to > inode->i_size) {
> -		/*
> -		 * Punch out the delalloc blocks we have already allocated.
> -		 *
> -		 * Don't bother with xfs_setattr given that nothing can have
> -		 * made it to disk yet as the page is still locked at this
> -		 * point.
> -		 */
> -		struct xfs_inode	*ip = XFS_I(inode);
> -		xfs_fileoff_t		start_fsb;
> -		xfs_fileoff_t		end_fsb;
> -		int			error;
> +	ASSERT(block_offset + from == pos);
>  
> -		truncate_pagecache(inode, to, inode->i_size);
> +	head = page_buffers(page);
> +	block_start = 0;
> +	for (bh = head; bh != head || !block_start;
> +	     bh = bh->b_this_page, block_start = block_end,
> +				   block_offset += bh->b_size) {
> +		block_end = block_start + bh->b_size;
>  
> -		/*
> -		 * Check if there are any blocks that are outside of i_size
> -		 * that need to be trimmed back.
> -		 */
> -		start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
> -		end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
> -		if (end_fsb <= start_fsb)
> -			return;
> -
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -							end_fsb - start_fsb);
> -		if (error) {
> -			/* something screwed, just bail */
> -			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> -				xfs_alert(ip->i_mount,
> -			"xfs_vm_write_failed: unable to clean up ino %lld",
> -						ip->i_ino);
> -			}
> -		}
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		/* skip buffers before the write */
> +		if (block_end <= from)
> +			continue;
> +
> +		/* if the buffer is after the write, we're done */
> +		if (block_start >= to)

*blink* I was looking pretty hard at that for an off-by-one. Mark
straightened me out. Eesh.

> +			break;
> +
> +		if (!buffer_delay(bh))
> +			continue;
> +
> +		if (!buffer_new(bh) && block_offset < i_size_read(inode))
> +			continue;
> +
> +		xfs_vm_kill_delalloc_range(inode, block_offset,
> +					   block_offset + bh->b_size);
>  	}
> +
>  }
>  
> +/*
> + * This used to call block_write_begin(), but it unlocks and releases the page
> + * on error, and we need that page to be able to punch stale delalloc blocks out
> + * on failure. hence we copy-n-waste it here and call xfs_vm_write_failed() at
> + * the appropriate point.
> + */
>  STATIC int
>  xfs_vm_write_begin(
>  	struct file		*file,
> @@ -1461,15 +1507,40 @@ xfs_vm_write_begin(
>  	struct page		**pagep,
>  	void			**fsdata)
>  {
> -	int			ret;
> +	pgoff_t			index = pos >> PAGE_CACHE_SHIFT;
> +	struct page		*page;
> +	int			status;
>  
> -	ret = block_write_begin(mapping, pos, len, flags | AOP_FLAG_NOFS,
> -				pagep, xfs_get_blocks);
> -	if (unlikely(ret))
> -		xfs_vm_write_failed(mapping, pos + len);
> -	return ret;
> +	ASSERT(len <= PAGE_CACHE_SIZE);
> +
> +	page = grab_cache_page_write_begin(mapping, index,
> +					   flags | AOP_FLAG_NOFS);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> +	if (unlikely(status)) {
> +		struct inode	*inode = mapping->host;
> +
> +		xfs_vm_write_failed(inode, page, pos, len);
> +		unlock_page(page);

Consistent with block_write_begin.

> +
> +		if (pos + len > i_size_read(inode))
> +			truncate_pagecache(inode, pos + len, i_size_read(inode));
> +
> +		page_cache_release(page);
> +		page = NULL;
> +	}
> +
> +	*pagep = page;
> +	return status;
>  }
>  
> +/*
> + * On failure, we only need to kill delalloc blocks beyond EOF because they
> + * will never be written. For blocks within EOF, generic_write_end() zeros them
> + * so they are safe to leave alone and be written with all the other valid data.
> + */
>  STATIC int
>  xfs_vm_write_end(
>  	struct file		*file,
> @@ -1482,9 +1553,19 @@ xfs_vm_write_end(
>  {
>  	int			ret;
>  
> +	ASSERT(len <= PAGE_CACHE_SIZE);
> +
>  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (unlikely(ret < len))
> -		xfs_vm_write_failed(mapping, pos + len);
> +	if (unlikely(ret < len)) {
> +		struct inode	*inode = mapping->host;
> +		size_t		isize = i_size_read(inode);
> +		loff_t		to = pos + len;
> +
> +		if (to > isize) {
> +			truncate_pagecache(inode, to, isize);
> +			xfs_vm_kill_delalloc_range(inode, isize, to);
> +		}
> +	}
>  	return ret;
>  }

Aside from a few nits this is looking good.

Reviewed-by: Ben Myers <bpm@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH 3/3] xfs: prevent needless mount warning causing test failures 2012-04-27 9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner 2012-04-27 9:45 ` [PATCH 1/3] xfs: punch all delalloc blocks beyond EOF on write failure Dave Chinner 2012-04-27 9:45 ` [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF Dave Chinner @ 2012-04-27 9:45 ` Dave Chinner 2012-05-08 16:29 ` Ben Myers 2012-04-29 11:16 ` [PATCH 4/3] xfs: don't assert on delalloc regions beyond EOF Dave Chinner ` (2 subsequent siblings) 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2012-04-27 9:45 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> Often mounting small filesystem with small logs will emit a warning such as: XFS (vdb): Invalid block length (0x2000) for buffer during log recovery. This causes tests to randomly fail because this output causes the clean filesystem checks on test completion to think the filesystem is inconsistent. The cause of the error is simply that log recovery is asking for a buffer size that is larger than the log when zeroing the tail. This is because the buffer size is rounded up, and if the right head and tail conditions exist then the buffer size can be larger than the log. Limit the variable size xlog_get_bp() callers to requesting buffers smaller than the log. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_recover.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index d7abe5f..ca38690 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -441,6 +441,8 @@ xlog_find_verify_cycle( * a log sector, or we're out of luck. */ bufblks = 1 << ffs(nbblks); + while (bufblks > log->l_logBBsize) + bufblks >>= 1; while (!(bp = xlog_get_bp(log, bufblks))) { bufblks >>= 1; if (bufblks < log->l_sectBBsize) @@ -1226,6 +1228,8 @@ xlog_write_log_records( * log sector, or we're out of luck. 
*/ bufblks = 1 << ffs(blocks); + while (bufblks > log->l_logBBsize) + bufblks >>= 1; while (!(bp = xlog_get_bp(log, bufblks))) { bufblks >>= 1; if (bufblks < sectbb) -- 1.7.10 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] xfs: prevent needless mount warning causing test failures 2012-04-27 9:45 ` [PATCH 3/3] xfs: prevent needless mount warning causing test failures Dave Chinner @ 2012-05-08 16:29 ` Ben Myers 2012-05-08 22:42 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Ben Myers @ 2012-05-08 16:29 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Apr 27, 2012 at 07:45:22PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Often mounting small filesystem with small logs will emit a warning > such as: > > XFS (vdb): Invalid block length (0x2000) for buffer > > during log recovery. This causes tests to randomly fail because this > output causes the clean filesystem checks on test completion to > think the filesystem is inconsistent. > > The cause of the error is simply that log recovery is asking for a > buffer size that is larger than the log when zeroing the tail. This > is because the buffer size is rounded up, and if the right head and > tail conditions exist then the buffer size can be larger than the log. > Limit the variable size xlog_get_bp() callers to requesting buffers > smaller than the log. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_recover.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index d7abe5f..ca38690 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -441,6 +441,8 @@ xlog_find_verify_cycle( > * a log sector, or we're out of luck. > */ > bufblks = 1 << ffs(nbblks); > + while (bufblks > log->l_logBBsize) > + bufblks >>= 1; AFAICS you don't need a loop here. The following would be sufficient to make xlog_buf_bbcount_valid return 0. if (bufblks > log->l_logBBsize) bufblks = log->l_logBBsize; It is a bit more obviously correct. 
-Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] xfs: prevent needless mount warning causing test failures 2012-05-08 16:29 ` Ben Myers @ 2012-05-08 22:42 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2012-05-08 22:42 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Tue, May 08, 2012 at 11:29:42AM -0500, Ben Myers wrote: > On Fri, Apr 27, 2012 at 07:45:22PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Often mounting small filesystem with small logs will emit a warning > > such as: > > > > XFS (vdb): Invalid block length (0x2000) for buffer > > > > during log recovery. This causes tests to randomly fail because this > > output causes the clean filesystem checks on test completion to > > think the filesystem is inconsistent. > > > > The cause of the error is simply that log recovery is asking for a > > buffer size that is larger than the log when zeroing the tail. This > > is because the buffer size is rounded up, and if the right head and > > tail conditions exist then the buffer size can be larger than the log. > > Limit the variable size xlog_get_bp() callers to requesting buffers > > smaller than the log. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log_recover.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > > index d7abe5f..ca38690 100644 > > --- a/fs/xfs/xfs_log_recover.c > > +++ b/fs/xfs/xfs_log_recover.c > > @@ -441,6 +441,8 @@ xlog_find_verify_cycle( > > * a log sector, or we're out of luck. > > */ > > bufblks = 1 << ffs(nbblks); > > + while (bufblks > log->l_logBBsize) > > + bufblks >>= 1; > > AFAICS you don't need a loop here. The following would be sufficient to make > xlog_buf_bbcount_valid return 0. > > if (bufblks > log->l_logBBsize) > bufblks = log->l_logBBsize; Yes, I could do that, but then there is a different set of boundary conditions to test. 
I know that the >>=1 logic works, but I have no idea what new corner cases occur when bufblks == log->l_logBBsize. > It is a bit more obviously correct. It may be easier to read, but it's certainly more different from a verification point of view. Given how long and arduous the process was to find the source of the problem, I am very wary of changing logic to run in ways that are different and very difficult to actually test.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 4/3] xfs: don't assert on delalloc regions beyond EOF 2012-04-27 9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner ` (2 preceding siblings ...) 2012-04-27 9:45 ` [PATCH 3/3] xfs: prevent needless mount warning causing test failures Dave Chinner @ 2012-04-29 11:16 ` Dave Chinner 2012-05-08 17:26 ` Ben Myers 2012-04-29 12:43 ` [PATCH 5/3] xfs: limit specualtive delalloc to maxioffset Dave Chinner 2012-04-29 12:57 ` [PATCH 6/3] xfs: make largest supported offset less shouty Dave Chinner 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2012-04-29 11:16 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> When we are doing speculative delayed allocation beyond EOF, conversion of the region allocated beyond EOF is dependent on the largest free space extent available. If the largest free extent is smaller than the delalloc range, then after allocation we leave a delalloc extent that starts beyond EOF. This extent cannot *ever* be converted by flushing data, and so will remain there until either the EOF moves into the extent or it is truncated away. Hence if xfs_getbmap() runs on such an inode and is asked to return extents beyond EOF, it will assert fail on this extent even though there is nothing xfs_getbmap() can do to convert it to a real extent. Hence we should simply report these delalloc extents rather than assert that there should be none. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bmap.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c index 26ab256..478bce9 100644 --- a/fs/xfs/xfs_bmap.c +++ b/fs/xfs/xfs_bmap.c @@ -5620,8 +5620,20 @@ xfs_getbmap( XFS_FSB_TO_BB(mp, map[i].br_blockcount); out[cur_ext].bmv_unused1 = 0; out[cur_ext].bmv_unused2 = 0; - ASSERT(((iflags & BMV_IF_DELALLOC) != 0) || - (map[i].br_startblock != DELAYSTARTBLOCK)); + + /* + * delayed allocation extents that start beyond EOF can + * occur due to speculative EOF allocation when the + * delalloc extent is larger than the largest freespace + * extent at conversion time. These extents cannot be + * converted by data writeback, so can exist here even + * if we are not supposed to be finding delalloc + * extents. + */ + if (map[i].br_startblock == DELAYSTARTBLOCK && + map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) + ASSERT((iflags & BMV_IF_DELALLOC) != 0); + if (map[i].br_startblock == HOLESTARTBLOCK && whichfork == XFS_ATTR_FORK) { /* came to the end of attribute fork */ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 4/3] xfs: don't assert on delalloc regions beyond EOF 2012-04-29 11:16 ` [PATCH 4/3] xfs: don't assert on delalloc regions beyond EOF Dave Chinner @ 2012-05-08 17:26 ` Ben Myers 0 siblings, 0 replies; 19+ messages in thread From: Ben Myers @ 2012-05-08 17:26 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sun, Apr 29, 2012 at 09:16:17PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > When we are doing speculative delayed allocation beyond EOF, > conversion of the region allocated beyond EOF is dependent on the > largest free space extent available. If the largest free extent is > smaller than the delalloc range, then after allocation we leave > a delalloc extent that starts beyond EOF. This extent cannot *ever* > be converted by flushing data, and so will remain there until either > the EOF moves into the extent or it is truncated away. > > Hence if xfs_getbmap() runs on such an inode and is asked to return > extents beyond EOF, it will assert fail on this extent even though > there is nothing xfs_getbmap() can do to convert it to a real > extent. Hence we should simply report these delalloc extents rather > than assert that there should be none. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_bmap.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > index 26ab256..478bce9 100644 > --- a/fs/xfs/xfs_bmap.c > +++ b/fs/xfs/xfs_bmap.c > @@ -5620,8 +5620,20 @@ xfs_getbmap( > XFS_FSB_TO_BB(mp, map[i].br_blockcount); > out[cur_ext].bmv_unused1 = 0; > out[cur_ext].bmv_unused2 = 0; > - ASSERT(((iflags & BMV_IF_DELALLOC) != 0) || > - (map[i].br_startblock != DELAYSTARTBLOCK)); > + > + /* > + * delayed allocation extents that start beyond EOF can > + * occur due to speculative EOF allocation when the > + * delalloc extent is larger than the largest freespace > + * extent at conversion time. 
These extents cannot be > + * converted by data writeback, so can exist here even > + * if we are not supposed to be finding delalloc > + * extents. > + */ > + if (map[i].br_startblock == DELAYSTARTBLOCK && > + map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) > + ASSERT((iflags & BMV_IF_DELALLOC) != 0); > + Looks fine. This assert will no longer kick off for delay extents after eof, but will still catch any within the file. Reviewed-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 5/3] xfs: limit specualtive delalloc to maxioffset 2012-04-27 9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner ` (3 preceding siblings ...) 2012-04-29 11:16 ` [PATCH 4/3] xfs: don't assert on delalloc regions beyond EOF Dave Chinner @ 2012-04-29 12:43 ` Dave Chinner 2012-05-08 18:02 ` Ben Myers 2012-04-29 12:57 ` [PATCH 6/3] xfs: make largest supported offset less shouty Dave Chinner 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2012-04-29 12:43 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> Speculative delayed allocation beyond EOF near the maximum supported file offset can result in creating delalloc extents beyond mp->m_maxioffset (8EB). These can never be trimmed during xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and that results in assert failures in xfs_fs_destroy_inode() due to delalloc blocks still being present. xfstests 071 exposes this problem. Limit speculative delalloc to mp->m_maxioffset to avoid this problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_iomap.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 303c03a..4a08ea3 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -412,6 +412,15 @@ retry: return error; } + /* + * Make sure preallocation does not create extents beyond the range we + * actually support in this filesystem. + */ + if (last_fsb > XFS_B_TO_FSB(mp, mp->m_maxioffset)) + last_fsb = XFS_B_TO_FSB(mp, mp->m_maxioffset); + + ASSERT(last_fsb > offset_fsb); + nimaps = XFS_WRITE_IMAPS; error = xfs_bmapi_delay(ip, offset_fsb, last_fsb - offset_fsb, imap, &nimaps, XFS_BMAPI_ENTIRE); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 5/3] xfs: limit specualtive delalloc to maxioffset 2012-04-29 12:43 ` [PATCH 5/3] xfs: limit specualtive delalloc to maxioffset Dave Chinner @ 2012-05-08 18:02 ` Ben Myers 0 siblings, 0 replies; 19+ messages in thread From: Ben Myers @ 2012-05-08 18:02 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sun, Apr 29, 2012 at 10:43:19PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > Speculative delayed allocation beyond EOF near the maximum supported > file offset can result in creating delalloc extents beyond > mp->m_maxioffset (8EB). These can never be trimmed during > xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and > that results in assert failures in xfs_fs_destroy_inode() due to > delalloc blocks still being present. xfstests 071 exposes this > problem. > > Limit speculative delalloc to mp->m_maxioffset to avoid this > problem. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_iomap.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 303c03a..4a08ea3 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -412,6 +412,15 @@ retry: > return error; > } > > + /* > + * Make sure preallocation does not create extents beyond the range we > + * actually support in this filesystem. > + */ > + if (last_fsb > XFS_B_TO_FSB(mp, mp->m_maxioffset)) > + last_fsb = XFS_B_TO_FSB(mp, mp->m_maxioffset); > + > + ASSERT(last_fsb > offset_fsb); > + Yeah, looks good. xfs_iomap_prealloc_size isn't the only one who can push us up above m_maxioffset, so this is the right place for the check. Reviewed-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 6/3] xfs: make largest supported offset less shouty 2012-04-27 9:45 [PATCH 0/3] xfs: failed writes and stale delalloc blocks Dave Chinner ` (4 preceding siblings ...) 2012-04-29 12:43 ` [PATCH 5/3] xfs: limit specualtive delalloc to maxioffset Dave Chinner @ 2012-04-29 12:57 ` Dave Chinner 2012-04-29 21:58 ` Christoph Hellwig 2012-05-08 18:15 ` Ben Myers 5 siblings, 2 replies; 19+ messages in thread From: Dave Chinner @ 2012-04-29 12:57 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> XFS_MAXIOFFSET() is just a simple macro that resolves to mp->m_maxioffset. It doesn't need to exist, and it just makes the code unnecessarily loud and shouty. Make it quiet and easy to read. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bmap.c | 2 +- fs/xfs/xfs_file.c | 2 +- fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_iomap.c | 2 +- fs/xfs/xfs_mount.h | 2 -- fs/xfs/xfs_qm.c | 2 +- fs/xfs/xfs_vnodeops.c | 10 +++++----- 7 files changed, 10 insertions(+), 12 deletions(-) diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c index 478bce9..919e038 100644 --- a/fs/xfs/xfs_bmap.c +++ b/fs/xfs/xfs_bmap.c @@ -5517,7 +5517,7 @@ xfs_getbmap( if (xfs_get_extsz_hint(ip) || ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC|XFS_DIFLAG_APPEND)){ prealloced = 1; - fixlen = XFS_MAXIOFFSET(mp); + fixlen = mp->m_maxioffset; } else { prealloced = 0; fixlen = XFS_ISIZE(ip); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index a37e43d..2d99208 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -273,7 +273,7 @@ xfs_file_aio_read( } } - n = XFS_MAXIOFFSET(mp) - iocb->ki_pos; + n = mp->m_maxioffset - iocb->ki_pos; if (n <= 0 || size == 0) return 0; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index a59eea0..95e0d4a 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1226,7 +1226,7 @@ xfs_itruncate_extents( * then there is nothing to do. 
*/ first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size); - last_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_MAXIOFFSET(mp)); + last_block = XFS_B_TO_FSB(mp, mp->m_maxioffset); if (first_unmap_block == last_block) return 0; diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 4a08ea3..9172d80 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -285,7 +285,7 @@ xfs_iomap_eof_want_preallocate( * do any speculative allocation. */ start_fsb = XFS_B_TO_FSBT(mp, ((xfs_ufsize_t)(offset + count - 1))); - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_MAXIOFFSET(mp)); + count_fsb = XFS_B_TO_FSB(mp, mp->m_maxioffset); while (count_fsb > 0) { imaps = nimaps; firstblock = NULLFSBLOCK; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 401ca2e..d0c6a0c 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -297,8 +297,6 @@ xfs_preferred_iosize(xfs_mount_t *mp) PAGE_CACHE_SIZE)); } -#define XFS_MAXIOFFSET(mp) ((mp)->m_maxioffset) - #define XFS_LAST_UNMOUNT_WAS_CLEAN(mp) \ ((mp)->m_flags & XFS_MOUNT_WAS_CLEAN) #define XFS_FORCED_SHUTDOWN(mp) ((mp)->m_flags & XFS_MOUNT_FS_SHUTDOWN) diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 249db19..58cd3bd 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -940,7 +940,7 @@ xfs_qm_dqiterate( map = kmem_alloc(XFS_DQITER_MAP_SIZE * sizeof(*map), KM_SLEEP); lblkno = 0; - maxlblkcnt = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_MAXIOFFSET(mp)); + maxlblkcnt = XFS_B_TO_FSB(mp, mp->m_maxioffset); do { nmaps = XFS_DQITER_MAP_SIZE; /* diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c index 82b000f..e6580ea 100644 --- a/fs/xfs/xfs_vnodeops.c +++ b/fs/xfs/xfs_vnodeops.c @@ -174,7 +174,7 @@ xfs_free_eofblocks( * of the file. If not, then there is nothing to do. 
*/ end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)); - last_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_MAXIOFFSET(mp)); + last_fsb = XFS_B_TO_FSB(mp, mp->m_maxioffset); if (last_fsb <= end_fsb) return 0; map_len = last_fsb - end_fsb; @@ -2262,10 +2262,10 @@ xfs_change_file_space( llen = bf->l_len > 0 ? bf->l_len - 1 : bf->l_len; - if ( (bf->l_start < 0) - || (bf->l_start > XFS_MAXIOFFSET(mp)) - || (bf->l_start + llen < 0) - || (bf->l_start + llen > XFS_MAXIOFFSET(mp))) + if (bf->l_start < 0 || + bf->l_start > mp->m_maxioffset || + bf->l_start + llen < 0 || + bf->l_start + llen > mp->m_maxioffset) return XFS_ERROR(EINVAL); bf->l_whence = 0; _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-04-29 12:57 ` [PATCH 6/3] xfs: make largest supported offset less shouty Dave Chinner @ 2012-04-29 21:58 ` Christoph Hellwig 2012-04-30 1:11 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Christoph Hellwig @ 2012-04-29 21:58 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > XFS_MAXIOFFSET() is just a simple macro that resolves to > mp->m_maxioffset. It doesn't need to exist, and it just makes the > code unnecessarily loud and shouty. > > Make it quiet and easy to read. Do we actually need to keep around a value in our superblock? s_maxbytes in the VFS superblock already does this, and it seems like at least our checks in the read path are superfluous. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-04-29 21:58 ` Christoph Hellwig @ 2012-04-30 1:11 ` Dave Chinner 2012-04-30 3:03 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2012-04-30 1:11 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Sun, Apr 29, 2012 at 05:58:30PM -0400, Christoph Hellwig wrote: > On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > XFS_MAXIOFFSET() is just a simple macro that resolves to > > mp->m_maxioffset. It doesn't need to exist, and it just makes the > > code unnecessarily loud and shouty. > > > > Make it quiet and easy to read. > > Do we actually need to keep around a value in our superblock? > s_maxbytes in the VFS superblock already does this, and it seems like > at least our checks in the read path are superflous. Ah, we do indeed keep the same value in s_maxbytes - that's one step removed from m_maxioffset because it uses the same function to calculate it, and they are done a long way apart. Ok, it looks like I've got a couple more patches to write to finish off this cleanup. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-04-30 1:11 ` Dave Chinner @ 2012-04-30 3:03 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2012-04-30 3:03 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Mon, Apr 30, 2012 at 11:11:24AM +1000, Dave Chinner wrote: > On Sun, Apr 29, 2012 at 05:58:30PM -0400, Christoph Hellwig wrote: > > On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > XFS_MAXIOFFSET() is just a simple macro that resolves to > > > mp->m_maxioffset. It doesn't need to exist, and it just makes the > > > code unnecessarily loud and shouty. > > > > > > Make it quiet and easy to read. > > > > Do we actually need to keep around a value in our superblock? > > s_maxbytes in the VFS superblock already does this, and it seems like > > at least our checks in the read path are superflous. Actually, I can't find where the read path checks against s_maxbytes. It's not in generic_segment_check(), and there appears to be no other range checks in the VFS. So I think that the check we have in xfs_file_aio_read needs to remain.... > Ah, we do indeed keep the same value in s_maxbytes - that's one step > removed from m_maxioffset because it uses the same function to > calculate it, and they are done a long way apart. Ok, it looks like > I've got a couple more patches to write to finish off this cleanup. Still, we can now replace the copy-n-paste code in xfs_file_aio_read() with a call to generic_segment_check() seeing as it returns a sum of the iovec length now, and still kill m_maxioffset.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-04-29 12:57 ` [PATCH 6/3] xfs: make largest supported offset less shouty Dave Chinner 2012-04-29 21:58 ` Christoph Hellwig @ 2012-05-08 18:15 ` Ben Myers 2012-05-08 22:43 ` Dave Chinner 1 sibling, 1 reply; 19+ messages in thread From: Ben Myers @ 2012-05-08 18:15 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > XFS_MAXIOFFSET() is just a simple macro that resolves to > mp->m_maxioffset. It doesn't need to exist, and it just makes the > code unnecessarily loud and shouty. > > Make it quiet and easy to read. Yep, looks good. Reviewed-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-05-08 18:15 ` Ben Myers @ 2012-05-08 22:43 ` Dave Chinner 2012-05-09 19:14 ` Ben Myers 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2012-05-08 22:43 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Tue, May 08, 2012 at 01:15:40PM -0500, Ben Myers wrote: > On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > XFS_MAXIOFFSET() is just a simple macro that resolves to > > mp->m_maxioffset. It doesn't need to exist, and it just makes the > > code unnecessarily loud and shouty. > > > > Make it quiet and easy to read. > > Yep, looks good. > > Reviewed-by: Ben Myers <bpm@sgi.com> I've got new versions that use s_maxbytes which I need to post.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/3] xfs: make largest supported offset less shouty 2012-05-08 22:43 ` Dave Chinner @ 2012-05-09 19:14 ` Ben Myers 0 siblings, 0 replies; 19+ messages in thread From: Ben Myers @ 2012-05-09 19:14 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Wed, May 09, 2012 at 08:43:30AM +1000, Dave Chinner wrote: > On Tue, May 08, 2012 at 01:15:40PM -0500, Ben Myers wrote: > > On Sun, Apr 29, 2012 at 10:57:29PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > XFS_MAXIOFFSET() is just a simple macro that resolves to > > > mp->m_maxioffset. It doesn't need to exist, and it just makes the > > > code unnecessarily loud and shouty. > > > > > > Make it quiet and easy to read. > > > > Yep, looks good. > > > > Reviewed-by: Ben Myers <bpm@sgi.com> > > I've got new versions that use s_maxbytes whih I need to post.... Ok, I'll hold off on this one. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread