From: Lachlan McIlroy <lachlan@sgi.com>
To: lachlan@sgi.com
Cc: xfs@oss.sgi.com, xfs-dev <xfs-dev@sgi.com>
Subject: Re: [PATCH] Move vn_iowait() earlier in the reclaim path
Date: Fri, 08 Aug 2008 18:32:01 +1000 [thread overview]
Message-ID: <489C0481.4060204@sgi.com> (raw)
In-Reply-To: <489AB596.1010505@sgi.com>
Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Wed, Aug 06, 2008 at 04:10:53PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>>>>> Dave Chinner wrote:
>>>>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>>>>> Dave Chinner wrote:
>>>>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we
>>>>>>>>> have already
>>>>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode.
>>>>>>>>> Any I/O
>>>>>>>> xfs_free()? What's that?
>>>>>>> Sorry that should have been xfs_ifree() (we set the inode's mode to
>>>>>>> zero in there).
>>>>>>>
>>>>>>>>> completions still running (file size updates and unwritten
>>>>>>>>> extent conversions)
>>>>>>>>> may be working on an inode that is no longer valid.
>>>>>>>> The linux inode does not get freed until after ->clear_inode
>>>>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>>>>> in the ->clear_inode path.
>>>>>>> The problem I see is an assert in xfs_setfilesize() fail:
>>>>>>>
>>>>>>> ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>>>>
>>>>>>> The mode of the XFS inode is zero at this time.
>>>>>> Ok, so the question has to be why is there I/O still in progress
>>>>>> after the truncate is supposed to have already occurred and the
>>>>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>>>>
>>>>>> Something doesn't add up here - you can't be doing I/O on a file
>>>>>> with no extents or delalloc blocks, hence that means we should be
>>>>>> passing through the truncate path in xfs_inactive() before we
>>>>>> call xfs_ifree() and therefore doing the vn_iowait()..
>>>>>>
>>>>>> Hmmmm - the vn_iowait() is conditional based on:
>>>>>>
>>>>>> /* wait for the completion of any pending DIOs */
>>>>>> if (new_size < ip->i_size)
>>>>>> vn_iowait(ip);
>>>>>>
>>>>>> We are truncating to zero (new_size == 0), so the only case where
>>>>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>>>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>>>>> ensuring we call vn_iowait() if newsize == 0 as well would fix
>>>>>> the problem. If not, there's something much more subtle going
>>>>>> on here that we should understand....
>>>>> If we make the vn_iowait() unconditional we might re-introduce the
>>>>> NFS exclusivity bug that killed performance. That was through
>>>>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>>>> It won't reintroduce that problem because ->clear_inode()
>>>> is not called on every NFS write operation.
>>> Yes but xfs_itruncate_start() can be called from every NFS write so
>>> modifying the above code will re-introduce the problem.
>>
>> Ah, no. The case here is new_size == 0, which will almost never be
>> the case in the ->release call...
> True.
>
>>
>>>>> So if we leave the above code as is then we need another
>>>>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>>>>> items that we didn't wait for in xfs_itruncate_start().
>>>> How do we have any new *data* I/O at all in progress at this point?
>>> It's not new data I/O.
>>
>> Then why isn't is being caught by the vn_iowait() in the truncate
>> code?????
> No idea.
>
>>
>>> It's workqueue items that have been queued
>>> from previous I/Os that are still outstanding.
>>
>> The iocount is decremented when the completion is finished, not when it
>> is queued. Hence vn_iowait() should be taking into account this case.
> Hmmm. It should.
>
>>
>>>> That does not explain why we need an additional vn_iowait() call.
>>>> All I see from this is a truncate race that has somethign to do with
>>>> the vn_iowait() call being conditional.
>>>>
>>>> That is, if we truncate to zero, then the current code in
>>>> xfs_itruncate_start() should wait unconditinally for *all* I/O to
>>>> complete because, by definition, all that I/O is beyond the new EOF
>>>> and we have to wait for it to complete before truncating the file.
>>> That makes sense. If new_size is zero and ip->i_size is not then we
>>> will wait. If ip->i_size is also zero we will not wait but if the
>>> file size is already zero there should not be any I/Os in progress
>>> and therefore no workqueue items outstanding.
>>
>> I note from the debug below that the linux inode size is zero,
>> but you didn't include the dump of the xfs inode so we can't see
>> what the other variables are.
> I tried to dump the xfs inode at the time but kdb encountered a bad
> address. I'm pretty sure I was able to get a dump of the XFS inode
> on another crash - that's how I found the mode to be zero. I can't
> be sure now - I'll have to reproduce the problem again.
>
>>
>> Also, i_size is updated after write I/O is dispatched. If we are
>> doing synchronous I/O, then i_size is not updated until after the
>> I/O completes (in xfs_write()). Hence we could have the situation of
>> I/O being run while i_size = 0. This is why I wanted to know what
>> i_new_size is, because that gets set before the I/O is issued.
>>
>> if i_new_size is non-zero and i_size is zero,that tends to imply
>> the conditional vn_iowait() in the truncate path needs to take
>> MAX(ip->i_size, ip->i_new_size) for the check, not just ip->i_size...
>>
>> FWIW, from the dump below, we have:
>>
>> typedef struct xfs_ioend {
>> struct xfs_ioend *io_list = NULL
>> unsigned int io_type = 0x20 = IOMAP_UNWRITTEN
>> int io_error = 0
>> atomic_t io_remaining = 0;
>> struct inode *io_inode = 0xffff810054062048
>> struct buffer_head *io_buffer_head = NULL
>> struct buffer_head *io_buffer_tail = NULL
>> size_t io_size = 0x3400
>> xfs_off_t io_offset = 0xfe200
>> struct work_struct io_work; /* xfsdatad work queue */
>> } xfs_ioend_t;
>>
>> So the I/O was not even close to offset zero.
>>
>> Also, the fact that the stack trace says it came through the
>> written path, but the io_type says unwritten which says that there's
>> something fishy here. Either the stack trace is wrong, or there's
>> been a memory corruption....
> I'm pretty sure that's because it was a direct I/O write to a written
> extent. The I/O starts out as IOMAP_UNWRITTEN and if we didn't map to
> an unwritten extent it's completion handler is switched to the written
> one. Looking at the direct I/O write path I can see where the direct
> I/O write would wait for the bio's to complete but it's not waiting for
> the workqueue items to be flushed. Not sure if that's part of the
> problem though.
>
>>
>>>> If either the vn_iowait() in the truncate path is not sufficient, or
>>>> the truncate code is not being called, there is *some other bug*
>>>> that we don't yet understand. Adding an unconditional vn_iowait()
>>>> appear to me to be fixing a symptom, not the underlying cause of
>>>> the problem....
>>> I'm not adding a new call to vn_iowait(). I'm just moving the
>>> existing one from xfs_ireclaim() so that we wait for all I/O to
>>> complete before we tear the inode down.
>>
>> Yes, but that is there to catch inodes with non-zero link counts because
>> we are not doing a truncate in that case. We still need to get to the
>> bottom of why the truncate is not waiting for all I/O.
>
> I'm wondering if we have an extra decrement on the i_iocount atomic counter
> that tricked vn_iowait() into thinking that all I/Os have completed.
>
> Should this code in xfs_vm_direct_IO():
>
> if (unlikely(ret != -EIOCBQUEUED && iocb->private))
> xfs_destroy_ioend(iocb->private);
>
> be:
>
> if (unlikely(ret < 0 && ret != -EIOCBQUEUED && iocb->private))
> xfs_destroy_ioend(iocb->private);
>
> Ordinarily we'd drop the i_iocount reference by calling
> xfs_end_io_direct().
>
I think I'm able to reproduce this at will now.
The stacktrace from kdb has no arguments so I added a printk
to give us the key items.
inode->i_size, ip->i_size, ip->i_new_size and ip->i_d.di_size are all zero
ip->i_iocount is 1
inode->i_mode is valid but ip->i_d.di_mode is zero
This assert triggered when the filesystem became full. I suspect
that a direct I/O write failed due to ENOSPC (and that's why the
file size is still zero) but the ioend still got queued.
<5>XFS mounting filesystem dm-0
<7>Ending clean XFS mount for filesystem: dm-0
<7>xfs_setfilesize: ip 0xffff810fd3920c90, inode 0xffff810fd391fc18, iocount 1, ioend 0xffff810f96059be8
<4>Assertion failed: (ip->i_d.di_mode & S_IFMT) == S_IFREG, file: fs/xfs/linux-2.6/xfs_aops.c, line: 180
<0>------------[ cut here ]------------
<2>kernel BUG at fs/xfs/support/debug.c:81!
[1]kdb>
[1]kdb>
[1]kdb> xnode 0xffff810fd3920c90,
mount 0xffff81101acc0488 mnext 0xffff810f8e54c670 mprev 0xffff810f5b90a050 vnode 0xffff810fd391fc18
dev fe00000 ino 4048200[7:5c54:8]
blkno 0x1ee2a0 len 0x10 boffset 0x800
transp 0x0000000000000000 &itemp 0xffff810f32236240
&lock 0xffff810fd3920d40 &iolock 0xffff810fd3920da8 &flush 0xffff810fd3920e10 (1) pincount 0x0
udquotp 0x0000000000000000 gdquotp 0x0000000000000000
new_size 0
flags 0x0 <>
update_core 0 update size 0
gen 0x0 delayed blks 0size 0
trace 0xffff810fcc693620
bmap_trace 0xffff810fcc6935f0
bmbt trace 0xffff810fcc6935c0
rw trace 0xffff810fcc693590
ilock trace 0xffff810fcc693560
dir trace 0xffff810fcc693530
data fork
bytes 0x0 real_bytes 0x0 lastex 0x0 u1:extents 0x0000000000000000
broot 0x0000000000000000 broot_bytes 0x0 ext_max 9 flags 0x2 <extents >
u2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
attr fork empty
[1]more>
magic 0x494e mode 00 (?------------) version 0x2 format 0x2 (extents)
nlink 0 uid 0 gid 0 projid 0 flushiter 0
atime 1218176355:126020521 mtime 1218176355d:554017996 ctime 1218176355:554017996
size 0 nblocks 0 extsize 0x0 nextents 0x0 anextents 0x0
forkoff 0 aformat 0x2 (extents) dmevmask 0x0 dmstate 0x0 flags 0x0 <> gen 0x593e66c8
--> itrace @ 0xffff810fd3920c90/0xffff810fcc693620
ref @fs/xfs/xfs_vnodeops.c:1589(xfs_create+0x390) i_count = 1
cpu = 6 pid = 9502 ra = xfs_vn_mknod+0x144
hold @fs/xfs/xfs_vnodeops.c:1642(xfs_create+0x568) i_count 1 => 2
cpu = 6 pid = 9502 ra = xfs_vn_mknod+0x144
entry to xfs_iput i_count = 2
cpu = 6 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
rele @fs/xfs/xfs_iget.c:421(xfs_iput+0x42) i_count 2 => 1
cpu = 6 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
entry to xfs_vn_getattr i_count = 1
cpu = 1 pid = 9502 ra = vfs_fstat+0x29
entry to xfs_vn_getattr i_count = 1
cpu = 1 pid = 9502 ra = vfs_fstat+0x29
entry to xfs_ioctl i_count = 1
cpu = 1 pid = 9502 ra = xfs_file_ioctl+0x26
entry to xfs_vn_getattr i_count = 1
cpu = 1 pid = 9502 ra = vfs_fstat+0x29
[1]more>
entry to xfs_remove i_count = 2
cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
hold @fs/xfs/xfs_vnodeops.c:1958(xfs_remove+0x26d) i_count 2 => 3
cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
entry to xfs_iput i_count = 3
cpu = 1 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
rele @fs/xfs/xfs_iget.c:421(xfs_iput+0x42) i_count 3 => 2
cpu = 1 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
exit from xfs_remove i_count = 2
cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
entry to xfs_fs_clear_inode i_count = 0
cpu = 1 pid = 9502 ra = clear_inode+0x9c
entry to xfs_inactive i_count = 0
cpu = 1 pid = 9502 ra = xfs_fs_clear_inode+0xbd
entry to xfs_reclaim i_count = 0
cpu = 1 pid = 9502 ra = xfs_fs_clear_inode+0xde
[1]kdb> inode 0xffff810fd391fc18
struct inode at 0xffff810fd391fc18
i_ino = 4048200 i_count = 0 i_size 0
i_mode = 0100666 i_nlink = 0 i_rdev = 0x0
i_hash.nxt = 0x0000000000000000 i_hash.pprev = 0xffffc20005712980
i_list.nxt = 0xffff810fd391fc18 i_list.prv = 0xffff810fd391fc18
i_dentry.nxt = 0xffff810fd391fbb0 i_dentry.prv = 0xffff810fd391fbb0
i_sb = 0xffff810f1ca10148 i_op = 0xffffffff80690e20 i_data = 0xffff810fd391fe00 nrpages = 0
i_fop= 0xffffffff80690b40 i_flock = 0x0000000000000000 i_mapping = 0xffff810fd391fe00
i_flags 0x0 i_state 0x21 [I_DIRTY_SYNC I_FREEING] fs specific info @ 0xffff810fd391ffe8
[1]kdb>
[1]kdb> md8c20 0xffff810f96059be8
0xffff810f96059be8 0000000000000000 0000000000000020 ........ .......
0xffff810f96059bf8 5a5a5a5a00000000 ffff810fd391fc18 ....ZZZZ........
0xffff810f96059c08 0000000000000000 0000000000000000 ................
0xffff810f96059c18 0000000000001800 000000000002f800 ................
0xffff810f96059c28 ffff8110270654d0 ffff810f96059c30 .T.'....0.......
0xffff810f96059c38 ffff810f96059c30 ffffffff803b0b69 0.......i.;.....
0xffff810f96059c48 ffffffff80b7f698 0000000000000000 ................
0xffff810f96059c58 ffffffff807c0b92 d84156c5635688c0 ..|.......Vc.VA.
0xffff810f96059c68 ffffffff8025c2b6 09f911029d74e35b ..%.....[.t.....
0xffff810f96059c78 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
[1]kdb> bt
Stack traceback for pid 379
0xffff811029484680 379 2 1 1 R 0xffff8110294849c8 *xfsdatad/1
sp ip Function (args)
0xffff81102708be28 0xffffffff803bbf40 assfail+0x1a (invalid, invalid, invalid)
kdb_bb: address 0x0000000000010286 not recognised
Using old style backtrace, unreliable with no arguments
sp ip Function (args)
0xffff81102708bdb8 0xffffffff803b0b69 xfs_end_bio_written
0xffff81102708be28 0xffffffff803bbf40 assfail+0x1a
0xffff81102708be58 0xffffffff803b0a50 xfs_setfilesize+0x6a
0xffff81102708be78 0xffffffff803b0b79 xfs_end_bio_written+0x10
0xffff81102708be88 0xffffffff8023bb02 run_workqueue+0xdf
0xffff81102708bed8 0xffffffff8023c5bd worker_thread+0xd8
0xffff81102708bef0 0xffffffff8023ef6e autoremove_wake_function
0xffff81102708bf20 0xffffffff8023c4e5 worker_thread
0xffff81102708bf28 0xffffffff8023ee5c kthread+0x47
0xffff81102708bf30 0xffffffff802294a8 schedule_tail+0x27
0xffff81102708bf48 0xffffffff8020bea8 child_rip+0xa
As suggested, I'm now running with this patch:
--- fs/xfs/xfs_inode.c_1.518 2008-08-08 17:31:02.000000000 +1000
+++ fs/xfs/xfs_inode.c 2008-08-08 17:35:11.000000000 +1000
@@ -1414,7 +1414,7 @@ xfs_itruncate_start(
mp = ip->i_mount;
/* wait for the completion of any pending DIOs */
- if (new_size < ip->i_size)
+ if (new_size == 0 || new_size < ip->i_size)
vn_iowait(ip);
/*
and so far so good.
prev parent reply other threads:[~2008-08-08 8:24 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-08-05 6:43 [PATCH] Move vn_iowait() earlier in the reclaim path Lachlan McIlroy
2008-08-05 7:37 ` Dave Chinner
2008-08-05 7:44 ` Dave Chinner
2008-08-05 7:52 ` Lachlan McIlroy
2008-08-05 8:42 ` Dave Chinner
2008-08-06 2:28 ` Lachlan McIlroy
2008-08-06 5:20 ` Dave Chinner
2008-08-06 6:10 ` Lachlan McIlroy
2008-08-06 9:38 ` Dave Chinner
2008-08-07 8:43 ` Lachlan McIlroy
2008-08-08 8:32 ` Lachlan McIlroy [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=489C0481.4060204@sgi.com \
--to=lachlan@sgi.com \
--cc=xfs-dev@sgi.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox