* [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-05  6:43 UTC
  To: xfs, xfs-dev

Currently by the time we get to vn_iowait() in xfs_reclaim() we have
already gone through xfs_inactive()/xfs_free() and recycled the inode.
Any I/O completions still running (file size updates and unwritten
extent conversions) may be working on an inode that is no longer
valid.

--- a/fs/xfs/linux-2.6/xfs_super.c	2008-08-05 16:27:57.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_super.c	2008-08-05 13:54:41.000000000 +1000
@@ -935,6 +935,7 @@ xfs_fs_clear_inode(
 	XFS_STATS_INC(vn_reclaim);
 	XFS_STATS_DEC(vn_active);
 
+	vn_iowait(ip);
 	xfs_inactive(ip);
 	xfs_iflags_clear(ip, XFS_IMODIFIED);
 	if (xfs_reclaim(ip))
--- a/fs/xfs/xfs_vnodeops.c	2008-08-05 16:27:57.000000000 +1000
+++ b/fs/xfs/xfs_vnodeops.c	2008-08-05 13:54:43.000000000 +1000
@@ -2802,8 +2802,6 @@ xfs_reclaim(
 		return 0;
 	}
 
-	vn_iowait(ip);
-
 	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
 
 	/*
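For context, vn_iowait() is the barrier both hunks above move around:
it blocks until the inode's count of in-flight I/Os, including queued
completion work items, drains to zero. A minimal sketch of the idea
(the waitqueue name i_iowait is hypothetical; the real XFS
implementation differs in detail):

	/*
	 * Sketch only: submission bumps ip->i_iocount and the count is
	 * only dropped after the completion handler has finished, so a
	 * waiter also waits out queued xfsdatad work items.
	 * i_iowait is an illustrative name, not the actual field.
	 */
	void
	vn_iowait(
		xfs_inode_t	*ip)
	{
		wait_event(ip->i_iowait,
			   atomic_read(&ip->i_iocount) == 0);
	}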
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Dave Chinner @ 2008-08-05  7:37 UTC
  To: Lachlan McIlroy; +Cc: xfs, xfs-dev

On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O

xfs_free()? What's that?

> completions still running (file size updates and unwritten extent conversions)
> may be working on an inode that is no longer valid.

The linux inode does not get freed until after ->clear_inode
completes, hence it is perfectly valid to reference it anywhere
in the ->clear_inode path.

My bet is that you are seeing I/O completion mark an inode dirty
that is being freed. i.e. calling mark_inode_dirty_sync() in the I/O
completion blindly assumes that the linux inode is still valid,
when it may be in the 'being freed' path. e.g. we can put it back on the
superblock dirty list just before it gets freed for real...

I came across this about a week ago when tracking down a QA failure
with a combined linux/XFS inode patch. The fix is to make I/O
completion call xfs_mark_inode_dirty_sync() so we check that this
linux inode is not in the process of being freed before we try to
mark it dirty.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

XFS: Never call mark_inode_dirty_sync() directly

Once the Linux inode and the XFS inode are combined, we cannot rely
on just checking if the linux inode exists as a method of determining
if it is valid or not. Hence we should always call
xfs_mark_inode_dirty_sync() as it does the correct checks to determine
if the linux inode is in a valid state or not before marking it dirty.

---
 fs/xfs/linux-2.6/xfs_aops.c  |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c  |    4 ++--
 fs/xfs/linux-2.6/xfs_super.c |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 0b211cb..45c53a7 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -192,7 +192,7 @@ xfs_setfilesize(
 		ip->i_d.di_size = isize;
 		ip->i_update_core = 1;
 		ip->i_update_size = 1;
-		mark_inode_dirty_sync(ioend->io_inode);
+		xfs_mark_inode_dirty_sync(ip);
 	}
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 1240b73..379f19b 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -133,7 +133,7 @@ xfs_ichgtime(
 	SYNCHRONIZE();
 	ip->i_update_core = 1;
 	if (!(inode->i_state & I_NEW))
-		mark_inode_dirty_sync(inode);
+		xfs_mark_inode_dirty_sync(ip);
 }
 
 /*
@@ -178,7 +178,7 @@ xfs_ichgtime_fast(
 	SYNCHRONIZE();
 	ip->i_update_core = 1;
 	if (!(inode->i_state & I_NEW))
-		mark_inode_dirty_sync(inode);
+		xfs_mark_inode_dirty_sync(ip);
 }
 
 /*
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index fc00499..e9cb0e9 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -963,7 +963,7 @@ xfs_fs_write_inode(
 	 * it dirty again so we'll try again later.
 	 */
 	if (error)
-		mark_inode_dirty_sync(inode);
+		xfs_mark_inode_dirty_sync(XFS_I(inode));
 
 	return -error;
 }
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Dave Chinner @ 2008-08-05  7:44 UTC
  To: Lachlan McIlroy, xfs, xfs-dev

On Tue, Aug 05, 2008 at 05:37:11PM +1000, Dave Chinner wrote:
> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
> > Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
> > gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>
> xfs_free()? What's that?
>
> > completions still running (file size updates and unwritten extent conversions)
> > may be working on an inode that is no longer valid.
>
> The linux inode does not get freed until after ->clear_inode
> completes, hence it is perfectly valid to reference it anywhere
> in the ->clear_inode path.
>
> My bet is that you are seeing I/O completion mark an inode dirty
> that is being freed. i.e. calling mark_inode_dirty_sync() in the I/O
> completion blindly assumes that the linux inode is still valid,
> when it may be in the 'being freed' path. e.g. we can put it back on the
> superblock dirty list just before it gets freed for real...
>
> I came across this about a week ago when tracking down a QA failure
> with a combined linux/XFS inode patch. The fix is to make I/O
> completion call xfs_mark_inode_dirty_sync() so we check that this
> linux inode is not in the process of being freed before we try to
> mark it dirty.

Oh, I should point out that xfs_mark_inode_dirty_sync() looked like
this prior to the patch I posted:

/*
 * If the linux inode is valid, mark it dirty.
 * Used when committing a dirty inode into a transaction so that
 * the inode will get written back by the linux code.
 */
void
xfs_mark_inode_dirty_sync(
	xfs_inode_t	*ip)
{
	struct inode	*inode = VFS_I(ip);

	if (!(inode->i_state & (I_WILL_FREE|I_FREEING|I_CLEAR)))
		mark_inode_dirty_sync(inode);
}

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-05  7:52 UTC
  To: Lachlan McIlroy, xfs, xfs-dev

Dave Chinner wrote:
> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>
> xfs_free()? What's that?
Sorry, that should have been xfs_ifree() (we set the inode's mode to
zero in there).

>
>> completions still running (file size updates and unwritten extent conversions)
>> may be working on an inode that is no longer valid.
>
> The linux inode does not get freed until after ->clear_inode
> completes, hence it is perfectly valid to reference it anywhere
> in the ->clear_inode path.
The problem I see is an assert in xfs_setfilesize() failing:

	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);

The mode of the XFS inode is zero at this time.

>
> My bet is that you are seeing I/O completion mark an inode dirty
> that is being freed. i.e. calling mark_inode_dirty_sync() in the I/O
> completion blindly assumes that the linux inode is still valid,
> when it may be in the 'being freed' path. e.g. we can put it back on the
> superblock dirty list just before it gets freed for real...
>
> I came across this about a week ago when tracking down a QA failure
> with a combined linux/XFS inode patch. The fix is to make I/O
> completion call xfs_mark_inode_dirty_sync() so we check that this
> linux inode is not in the process of being freed before we try to
> mark it dirty.
>
> Cheers,
>
> Dave.
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Dave Chinner @ 2008-08-05  8:42 UTC
  To: Lachlan McIlroy; +Cc: xfs, xfs-dev

On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>
>> xfs_free()? What's that?
> Sorry, that should have been xfs_ifree() (we set the inode's mode to
> zero in there).
>
>>> completions still running (file size updates and unwritten extent conversions)
>>> may be working on an inode that is no longer valid.
>>
>> The linux inode does not get freed until after ->clear_inode
>> completes, hence it is perfectly valid to reference it anywhere
>> in the ->clear_inode path.
> The problem I see is an assert in xfs_setfilesize() failing:
>
> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>
> The mode of the XFS inode is zero at this time.

Ok, so the question has to be why is there I/O still in progress
after the truncate is supposed to have already occurred and the
vn_iowait() in xfs_itruncate_start() been executed.

Something doesn't add up here - you can't be doing I/O on a file
with no extents or delalloc blocks, hence that means we should be
passing through the truncate path in xfs_inactive() before we
call xfs_ifree() and therefore doing the vn_iowait()...

Hmmmm - the vn_iowait() is conditional based on:

	/* wait for the completion of any pending DIOs */
	if (new_size < ip->i_size)
		vn_iowait(ip);

We are truncating to zero (new_size == 0), so the only case where
this would not wait is if ip->i_size == 0. Still - I can't see
how we'd be doing I/O on an inode with a zero i_size. I suspect
ensuring we call vn_iowait() if new_size == 0 as well would fix
the problem. If not, there's something much more subtle going
on here that we should understand....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
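The change Dave suggests here amounts to one extra clause in the
conditional; a sketch of what the check in xfs_itruncate_start() would
become (this is the experiment Lachlan ends up running at the end of
the thread):

	/* wait for the completion of any pending DIOs */
	if (new_size == 0 || new_size < ip->i_size)
		vn_iowait(ip);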
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-06  2:28 UTC
  To: Lachlan McIlroy, xfs, xfs-dev

Dave Chinner wrote:
> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>> xfs_free()? What's that?
>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>> zero in there).
>>
>>>> completions still running (file size updates and unwritten extent conversions)
>>>> may be working on an inode that is no longer valid.
>>> The linux inode does not get freed until after ->clear_inode
>>> completes, hence it is perfectly valid to reference it anywhere
>>> in the ->clear_inode path.
>> The problem I see is an assert in xfs_setfilesize() failing:
>>
>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>
>> The mode of the XFS inode is zero at this time.
>
> Ok, so the question has to be why is there I/O still in progress
> after the truncate is supposed to have already occurred and the
> vn_iowait() in xfs_itruncate_start() been executed.
>
> Something doesn't add up here - you can't be doing I/O on a file
> with no extents or delalloc blocks, hence that means we should be
> passing through the truncate path in xfs_inactive() before we
> call xfs_ifree() and therefore doing the vn_iowait()...
>
> Hmmmm - the vn_iowait() is conditional based on:
>
> 	/* wait for the completion of any pending DIOs */
> 	if (new_size < ip->i_size)
> 		vn_iowait(ip);
>
> We are truncating to zero (new_size == 0), so the only case where
> this would not wait is if ip->i_size == 0. Still - I can't see
> how we'd be doing I/O on an inode with a zero i_size. I suspect
> ensuring we call vn_iowait() if new_size == 0 as well would fix
> the problem. If not, there's something much more subtle going
> on here that we should understand....

If we make the vn_iowait() unconditional we might re-introduce the
NFS exclusivity bug that killed performance. That was through
xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().

So if we leave the above code as is then we need another vn_iowait()
in xfs_inactive() to catch any remaining workqueue items that we
didn't wait for in xfs_itruncate_start(). In that case the last call
to vn_iowait() should be inside xfs_inactive() after the truncate but
before the call to xfs_ifree().
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Dave Chinner @ 2008-08-06  5:20 UTC
  To: Lachlan McIlroy; +Cc: xfs, xfs-dev

On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>>> xfs_free()? What's that?
>>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>>> zero in there).
>>>
>>>>> completions still running (file size updates and unwritten extent conversions)
>>>>> may be working on an inode that is no longer valid.
>>>> The linux inode does not get freed until after ->clear_inode
>>>> completes, hence it is perfectly valid to reference it anywhere
>>>> in the ->clear_inode path.
>>> The problem I see is an assert in xfs_setfilesize() failing:
>>>
>>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>
>>> The mode of the XFS inode is zero at this time.
>>
>> Ok, so the question has to be why is there I/O still in progress
>> after the truncate is supposed to have already occurred and the
>> vn_iowait() in xfs_itruncate_start() been executed.
>>
>> Something doesn't add up here - you can't be doing I/O on a file
>> with no extents or delalloc blocks, hence that means we should be
>> passing through the truncate path in xfs_inactive() before we
>> call xfs_ifree() and therefore doing the vn_iowait()...
>>
>> Hmmmm - the vn_iowait() is conditional based on:
>>
>> 	/* wait for the completion of any pending DIOs */
>> 	if (new_size < ip->i_size)
>> 		vn_iowait(ip);
>>
>> We are truncating to zero (new_size == 0), so the only case where
>> this would not wait is if ip->i_size == 0. Still - I can't see
>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>> the problem. If not, there's something much more subtle going
>> on here that we should understand....
>
> If we make the vn_iowait() unconditional we might re-introduce the
> NFS exclusivity bug that killed performance. That was through
> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().

It won't reintroduce that problem because ->clear_inode()
is not called on every NFS write operation.

> So if we leave the above code as is then we need another
> vn_iowait() in xfs_inactive() to catch any remaining workqueue
> items that we didn't wait for in xfs_itruncate_start().

How do we have any new *data* I/O at all in progress at this point?
That does not explain why we need an additional vn_iowait() call.
All I see from this is a truncate race that has something to do with
the vn_iowait() call being conditional.

That is, if we truncate to zero, then the current code in
xfs_itruncate_start() should wait unconditionally for *all* I/O to
complete because, by definition, all that I/O is beyond the new EOF
and we have to wait for it to complete before truncating the file.

Seeing as we are in ->clear_inode(), no new data I/O can start
while we are deep in this code, hence we should not be seeing
I/O completions after the truncate starts and vn_iowait() has
completed.

Hence we need to know, firstly, if the truncate code has been
called; secondly, what the value of i_size and i_new_size was when
the truncate was started; and, finally, whether ip->i_iocount was
non-zero when the truncate was run. That is, we need to gather
enough data to determine whether we should have waited in the
truncate but didn't.

If either the vn_iowait() in the truncate path is not sufficient, or
the truncate code is not being called, there is *some other bug*
that we don't yet understand. Adding an unconditional vn_iowait()
appears to me to be fixing a symptom, not the underlying cause of
the problem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
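Dave is asking for three data points captured at truncate time. A
hypothetical instrumentation sketch that would record them, using the
field names discussed in this thread (Lachlan later adds a similar
printk to xfs_setfilesize()):

	/* sketch: log the truncate-time state before the conditional wait */
	printk(KERN_DEBUG
	       "xfs_itruncate_start: ino %llu new_size %lld i_size %lld "
	       "i_new_size %lld iocount %d\n",
	       (unsigned long long)ip->i_ino,
	       (long long)new_size,
	       (long long)ip->i_size,
	       (long long)ip->i_new_size,
	       atomic_read(&ip->i_iocount));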
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-06  6:10 UTC
  To: Lachlan McIlroy, xfs, xfs-dev

Dave Chinner wrote:
> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>>>> xfs_free()? What's that?
>>>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>>>> zero in there).
>>>>
>>>>>> completions still running (file size updates and unwritten extent conversions)
>>>>>> may be working on an inode that is no longer valid.
>>>>> The linux inode does not get freed until after ->clear_inode
>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>> in the ->clear_inode path.
>>>> The problem I see is an assert in xfs_setfilesize() failing:
>>>>
>>>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>
>>>> The mode of the XFS inode is zero at this time.
>>> Ok, so the question has to be why is there I/O still in progress
>>> after the truncate is supposed to have already occurred and the
>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>
>>> Something doesn't add up here - you can't be doing I/O on a file
>>> with no extents or delalloc blocks, hence that means we should be
>>> passing through the truncate path in xfs_inactive() before we
>>> call xfs_ifree() and therefore doing the vn_iowait()...
>>>
>>> Hmmmm - the vn_iowait() is conditional based on:
>>>
>>> 	/* wait for the completion of any pending DIOs */
>>> 	if (new_size < ip->i_size)
>>> 		vn_iowait(ip);
>>>
>>> We are truncating to zero (new_size == 0), so the only case where
>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>>> the problem. If not, there's something much more subtle going
>>> on here that we should understand....
>> If we make the vn_iowait() unconditional we might re-introduce the
>> NFS exclusivity bug that killed performance. That was through
>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>
> It won't reintroduce that problem because ->clear_inode()
> is not called on every NFS write operation.
Yes, but xfs_itruncate_start() can be called from every NFS write so
modifying the above code will re-introduce the problem.

>
>> So if we leave the above code as is then we need another
>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>> items that we didn't wait for in xfs_itruncate_start().
>
> How do we have any new *data* I/O at all in progress at this point?
It's not new data I/O. It's workqueue items that have been queued
from previous I/Os that are still outstanding.

> That does not explain why we need an additional vn_iowait() call.
> All I see from this is a truncate race that has something to do with
> the vn_iowait() call being conditional.
>
> That is, if we truncate to zero, then the current code in
> xfs_itruncate_start() should wait unconditionally for *all* I/O to
> complete because, by definition, all that I/O is beyond the new EOF
> and we have to wait for it to complete before truncating the file.
That makes sense. If new_size is zero and ip->i_size is not then we
will wait. If ip->i_size is also zero we will not wait but if the
file size is already zero there should not be any I/Os in progress
and therefore no workqueue items outstanding.

> Seeing as we are in ->clear_inode(), no new data I/O can start
> while we are deep in this code, hence we should not be seeing
> I/O completions after the truncate starts and vn_iowait() has
> completed.
>
> Hence we need to know, firstly, if the truncate code has been
> called; secondly, what the value of i_size and i_new_size was when
> the truncate was started; and, finally, whether ip->i_iocount was
> non-zero when the truncate was run. That is, we need to gather
> enough data to determine whether we should have waited in the
> truncate but didn't.
>
> If either the vn_iowait() in the truncate path is not sufficient, or
> the truncate code is not being called, there is *some other bug*
> that we don't yet understand. Adding an unconditional vn_iowait()
> appears to me to be fixing a symptom, not the underlying cause of
> the problem....

I'm not adding a new call to vn_iowait(). I'm just moving the
existing one from xfs_ireclaim() so that we wait for all I/O to
complete before we tear the inode down.

Here's some info from the bug:

Stack traceback for pid 272
0xffff81007f3ea600      272        2  1    3   R  0xffff81007f3ea950 *xfsdatad/3
 rsp                rip                Function (args)
0xffff81007e275e28 0xffffffff803b8cd0 assfail+0x1a (invalid, invalid, invalid)
0xffff81007e275e58 0xffffffff803adaad xfs_setfilesize+0x3d (0xffff8100383de7f8)
0xffff81007e275e78 0xffffffff803adc28 xfs_end_bio_written+0x10 (invalid)
0xffff81007e275e88 0xffffffff8023cf9a run_workqueue+0xdf (0xffff81007e7d0070)
0xffff81007e275ed8 0xffffffff8023da8f worker_thread+0xd8 (0xffff81007e7d0070)
0xffff81007e275f28 0xffffffff80240314 kthread+0x47 (invalid)
0xffff81007e275f48 0xffffffff8020bd08 child_rip+0xa (invalid, invalid)

<5>Filesystem "sdb1": Disabling barriers, not supported with external log device
<5>XFS mounting filesystem sdb1
<7>Ending clean XFS mount for filesystem: sdb1
<4>Assertion failed: (ip->i_d.di_mode & S_IFMT) == S_IFREG, file: fs/xfs/linux-2.6/xfs_aops.c, line: 178
<0>------------[ cut here ]------------
<2>kernel BUG at fs/xfs/support/debug.c:81!
<0>invalid opcode: 0000 [1] SMP

[3]kdb> md8c20 0xffff8100383de7f8
0xffff8100383de7f8 0000000000000000 0000000000000020   ........ .......
0xffff8100383de808 5a5a5a5a00000000 ffff810054062048   ....ZZZZH .T....
0xffff8100383de818 0000000000000000 0000000000000000   ................
0xffff8100383de828 0000000000003400 00000000000fe200   .4..............
0xffff8100383de838 ffff81007e7d0070 ffff8100383de840   p.}~....@.=8....
0xffff8100383de848 ffff8100383de840 ffffffff803adc18   @.=8......:.....
0xffff8100383de858 ffffffff80b162a0 0000000000000000   .b..............
0xffff8100383de868 ffffffff80784c51 d84156c5635688c0   QLx.......Vc.VA.
0xffff8100383de878 ffffffff8025d1f6 09f911029d74e35b   ..%.....[.t.....
0xffff8100383de888 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk

I think io_type here is 0x20 which is IOMAP_UNWRITTEN. Since we've
come through xfs_end_bio_written() (not xfs_end_bio_unwritten()) this
must be a direct I/O write to a written extent.

[3]kdb> inode ffff810054062048
struct inode at 0xffff810054062048
 i_ino = 80848951 i_count = 0 i_size 0
 i_mode = 0100666  i_nlink = 0  i_rdev = 0x0
 i_hash.nxt = 0x0000000000000000 i_hash.pprev = 0xffffc20000208518
 i_list.nxt = 0xffff810054062048 i_list.prv = 0xffff810054062048
 i_dentry.nxt = 0xffff810054061fe0 i_dentry.prv = 0xffff810054061fe0
 i_sb = 0xffff81006d5b0508 i_op = 0xffffffff806647a0 i_data = 0xffff810054062230 nrpages = 0
 i_fop= 0xffffffff806644e0 i_flock = 0x0000000000000000 i_mapping = 0xffff810054062230
 i_flags 0x0 i_state 0x21 [I_DIRTY_SYNC I_FREEING]  fs specific info @ 0xffff810054062418

Mode is 0100666. S_IFREG is 0100000 so the linux inode would not have
failed the assert. So why did the XFS inode fail it?
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Dave Chinner @ 2008-08-06  9:38 UTC
  To: Lachlan McIlroy; +Cc: xfs, xfs-dev

On Wed, Aug 06, 2008 at 04:10:53PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>>> Dave Chinner wrote:
>>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>>>>> xfs_free()? What's that?
>>>>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>>>>> zero in there).
>>>>>
>>>>>>> completions still running (file size updates and unwritten extent conversions)
>>>>>>> may be working on an inode that is no longer valid.
>>>>>> The linux inode does not get freed until after ->clear_inode
>>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>>> in the ->clear_inode path.
>>>>> The problem I see is an assert in xfs_setfilesize() failing:
>>>>>
>>>>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>>
>>>>> The mode of the XFS inode is zero at this time.
>>>> Ok, so the question has to be why is there I/O still in progress
>>>> after the truncate is supposed to have already occurred and the
>>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>>
>>>> Something doesn't add up here - you can't be doing I/O on a file
>>>> with no extents or delalloc blocks, hence that means we should be
>>>> passing through the truncate path in xfs_inactive() before we
>>>> call xfs_ifree() and therefore doing the vn_iowait()...
>>>>
>>>> Hmmmm - the vn_iowait() is conditional based on:
>>>>
>>>> 	/* wait for the completion of any pending DIOs */
>>>> 	if (new_size < ip->i_size)
>>>> 		vn_iowait(ip);
>>>>
>>>> We are truncating to zero (new_size == 0), so the only case where
>>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>>>> the problem. If not, there's something much more subtle going
>>>> on here that we should understand....
>>> If we make the vn_iowait() unconditional we might re-introduce the
>>> NFS exclusivity bug that killed performance. That was through
>>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>>
>> It won't reintroduce that problem because ->clear_inode()
>> is not called on every NFS write operation.
> Yes, but xfs_itruncate_start() can be called from every NFS write so
> modifying the above code will re-introduce the problem.

Ah, no. The case here is new_size == 0, which will almost never be
the case in the ->release call...

>
>>
>>> So if we leave the above code as is then we need another
>>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>>> items that we didn't wait for in xfs_itruncate_start().
>>
>> How do we have any new *data* I/O at all in progress at this point?
> It's not new data I/O.

Then why isn't it being caught by the vn_iowait() in the truncate
code?????

> It's workqueue items that have been queued
> from previous I/Os that are still outstanding.

The iocount is decremented when the completion is finished, not when it
is queued. Hence vn_iowait() should be taking into account this case.

>> That does not explain why we need an additional vn_iowait() call.
>> All I see from this is a truncate race that has something to do with
>> the vn_iowait() call being conditional.
>>
>> That is, if we truncate to zero, then the current code in
>> xfs_itruncate_start() should wait unconditionally for *all* I/O to
>> complete because, by definition, all that I/O is beyond the new EOF
>> and we have to wait for it to complete before truncating the file.
> That makes sense. If new_size is zero and ip->i_size is not then we
> will wait. If ip->i_size is also zero we will not wait but if the
> file size is already zero there should not be any I/Os in progress
> and therefore no workqueue items outstanding.

I note from the debug below that the linux inode size is zero,
but you didn't include the dump of the xfs inode so we can't see
what the other variables are.

Also, i_size is updated after write I/O is dispatched. If we are
doing synchronous I/O, then i_size is not updated until after the
I/O completes (in xfs_write()). Hence we could have the situation of
I/O being run while i_size = 0. This is why I wanted to know what
i_new_size is, because that gets set before the I/O is issued.

If i_new_size is non-zero and i_size is zero, that tends to imply
the conditional vn_iowait() in the truncate path needs to take
MAX(ip->i_size, ip->i_new_size) for the check, not just ip->i_size...

FWIW, from the dump below, we have:

typedef struct xfs_ioend {
	struct xfs_ioend	*io_list = NULL
	unsigned int		io_type = 0x20 = IOMAP_UNWRITTEN
	int			io_error = 0
	atomic_t		io_remaining = 0
	struct inode		*io_inode = 0xffff810054062048
	struct buffer_head	*io_buffer_head = NULL
	struct buffer_head	*io_buffer_tail = NULL
	size_t			io_size = 0x3400
	xfs_off_t		io_offset = 0xfe200
	struct work_struct	io_work;	/* xfsdatad work queue */
} xfs_ioend_t;

So the I/O was not even close to offset zero.

Also, the stack trace says the I/O came through the written path, but
the io_type says unwritten, which tells me there's something fishy
here. Either the stack trace is wrong, or there's been a memory
corruption....

>> If either the vn_iowait() in the truncate path is not sufficient, or
>> the truncate code is not being called, there is *some other bug*
>> that we don't yet understand. Adding an unconditional vn_iowait()
>> appears to me to be fixing a symptom, not the underlying cause of
>> the problem....
>
> I'm not adding a new call to vn_iowait(). I'm just moving the
> existing one from xfs_ireclaim() so that we wait for all I/O to
> complete before we tear the inode down.

Yes, but that is there to catch inodes with non-zero link counts because
we are not doing a truncate in that case. We still need to get to the
bottom of why the truncate is not waiting for all I/O.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
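A sketch of the widened check Dave is proposing, assuming i_new_size
holds the size an in-flight write will extend the file to
(illustrative, not a tested patch):

	/* wait for the completion of any pending DIOs */
	if (new_size < MAX(ip->i_size, ip->i_new_size))
		vn_iowait(ip);

Note this still would not wait when both sizes are zero, which is
exactly the state reported later in the thread; the new_size == 0
variant at the end of the thread is the one that gets tested.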
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-07  8:43 UTC
  To: Lachlan McIlroy, xfs, xfs-dev

Dave Chinner wrote:
> On Wed, Aug 06, 2008 at 04:10:53PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>>>> Dave Chinner wrote:
>>>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>>>>>> xfs_free()? What's that?
>>>>>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>>>>>> zero in there).
>>>>>>
>>>>>>>> completions still running (file size updates and unwritten extent conversions)
>>>>>>>> may be working on an inode that is no longer valid.
>>>>>>> The linux inode does not get freed until after ->clear_inode
>>>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>>>> in the ->clear_inode path.
>>>>>> The problem I see is an assert in xfs_setfilesize() failing:
>>>>>>
>>>>>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>>>
>>>>>> The mode of the XFS inode is zero at this time.
>>>>> Ok, so the question has to be why is there I/O still in progress
>>>>> after the truncate is supposed to have already occurred and the
>>>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>>>
>>>>> Something doesn't add up here - you can't be doing I/O on a file
>>>>> with no extents or delalloc blocks, hence that means we should be
>>>>> passing through the truncate path in xfs_inactive() before we
>>>>> call xfs_ifree() and therefore doing the vn_iowait()...
>>>>>
>>>>> Hmmmm - the vn_iowait() is conditional based on:
>>>>>
>>>>> 	/* wait for the completion of any pending DIOs */
>>>>> 	if (new_size < ip->i_size)
>>>>> 		vn_iowait(ip);
>>>>>
>>>>> We are truncating to zero (new_size == 0), so the only case where
>>>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>>>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>>>>> the problem. If not, there's something much more subtle going
>>>>> on here that we should understand....
>>>> If we make the vn_iowait() unconditional we might re-introduce the
>>>> NFS exclusivity bug that killed performance. That was through
>>>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>>> It won't reintroduce that problem because ->clear_inode()
>>> is not called on every NFS write operation.
>> Yes, but xfs_itruncate_start() can be called from every NFS write so
>> modifying the above code will re-introduce the problem.
>
> Ah, no. The case here is new_size == 0, which will almost never be
> the case in the ->release call...
True.

>
>>>> So if we leave the above code as is then we need another
>>>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>>>> items that we didn't wait for in xfs_itruncate_start().
>>> How do we have any new *data* I/O at all in progress at this point?
>> It's not new data I/O.
>
> Then why isn't it being caught by the vn_iowait() in the truncate
> code?????
No idea.

>
>> It's workqueue items that have been queued
>> from previous I/Os that are still outstanding.
>
> The iocount is decremented when the completion is finished, not when it
> is queued. Hence vn_iowait() should be taking into account this case.
Hmmm. It should.

>
>>> That does not explain why we need an additional vn_iowait() call.
>>> All I see from this is a truncate race that has something to do with
>>> the vn_iowait() call being conditional.
>>>
>>> That is, if we truncate to zero, then the current code in
>>> xfs_itruncate_start() should wait unconditionally for *all* I/O to
>>> complete because, by definition, all that I/O is beyond the new EOF
>>> and we have to wait for it to complete before truncating the file.
>> That makes sense. If new_size is zero and ip->i_size is not then we
>> will wait. If ip->i_size is also zero we will not wait but if the
>> file size is already zero there should not be any I/Os in progress
>> and therefore no workqueue items outstanding.
>
> I note from the debug below that the linux inode size is zero,
> but you didn't include the dump of the xfs inode so we can't see
> what the other variables are.
I tried to dump the xfs inode at the time but kdb encountered a bad
address. I'm pretty sure I was able to get a dump of the XFS inode
on another crash - that's how I found the mode to be zero. I can't
be sure now - I'll have to reproduce the problem again.

>
> Also, i_size is updated after write I/O is dispatched. If we are
> doing synchronous I/O, then i_size is not updated until after the
> I/O completes (in xfs_write()). Hence we could have the situation of
> I/O being run while i_size = 0. This is why I wanted to know what
> i_new_size is, because that gets set before the I/O is issued.
>
> If i_new_size is non-zero and i_size is zero, that tends to imply
> the conditional vn_iowait() in the truncate path needs to take
> MAX(ip->i_size, ip->i_new_size) for the check, not just ip->i_size...
>
> FWIW, from the dump below, we have:
>
> typedef struct xfs_ioend {
> 	struct xfs_ioend	*io_list = NULL
> 	unsigned int		io_type = 0x20 = IOMAP_UNWRITTEN
> 	int			io_error = 0
> 	atomic_t		io_remaining = 0
> 	struct inode		*io_inode = 0xffff810054062048
> 	struct buffer_head	*io_buffer_head = NULL
> 	struct buffer_head	*io_buffer_tail = NULL
> 	size_t			io_size = 0x3400
> 	xfs_off_t		io_offset = 0xfe200
> 	struct work_struct	io_work;	/* xfsdatad work queue */
> } xfs_ioend_t;
>
> So the I/O was not even close to offset zero.
>
> Also, the stack trace says the I/O came through the written path, but
> the io_type says unwritten, which tells me there's something fishy
> here. Either the stack trace is wrong, or there's been a memory
> corruption....
I'm pretty sure that's because it was a direct I/O write to a written
extent. The I/O starts out as IOMAP_UNWRITTEN and if we didn't map to
an unwritten extent its completion handler is switched to the written
one. Looking at the direct I/O write path I can see where the direct
I/O write would wait for the bios to complete but it's not waiting for
the workqueue items to be flushed. Not sure if that's part of the
problem though.

>
>>> If either the vn_iowait() in the truncate path is not sufficient, or
>>> the truncate code is not being called, there is *some other bug*
>>> that we don't yet understand. Adding an unconditional vn_iowait()
>>> appears to me to be fixing a symptom, not the underlying cause of
>>> the problem....
>> I'm not adding a new call to vn_iowait(). I'm just moving the
>> existing one from xfs_ireclaim() so that we wait for all I/O to
>> complete before we tear the inode down.
>
> Yes, but that is there to catch inodes with non-zero link counts because
> we are not doing a truncate in that case. We still need to get to the
> bottom of why the truncate is not waiting for all I/O.

I'm wondering if we have an extra decrement on the i_iocount atomic
counter that tricked vn_iowait() into thinking that all I/Os have
completed.

Should this code in xfs_vm_direct_IO():

	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
		xfs_destroy_ioend(iocb->private);

be:

	if (unlikely(ret < 0 && ret != -EIOCBQUEUED && iocb->private))
		xfs_destroy_ioend(iocb->private);

Ordinarily we'd drop the i_iocount reference by calling
xfs_end_io_direct().
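The concern here is a double drop of the submission reference. A
hypothetical sketch of the invariant Lachlan is describing, with the
function roles as stated in the thread (not the actual XFS source):

	/*
	 * Each direct I/O takes one i_iocount reference at submission;
	 * exactly one of these paths must drop it, never both.
	 */
	if (ret == -EIOCBQUEUED || ret >= 0) {
		/* completion runs: xfs_end_io_direct() drops the reference */
	} else if (iocb->private) {
		/* hard failure, completion never runs: drop it here */
		xfs_destroy_ioend(iocb->private);
	}

If a successful synchronous return also destroyed the ioend after
xfs_end_io_direct() had already run, the counter would be decremented
twice and vn_iowait() could return while a completion work item was
still queued.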
* Re: [PATCH] Move vn_iowait() earlier in the reclaim path
  From: Lachlan McIlroy @ 2008-08-08  8:32 UTC
  To: lachlan; +Cc: xfs, xfs-dev

Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Wed, Aug 06, 2008 at 04:10:53PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>>>>> Dave Chinner wrote:
>>>>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>>>>> Dave Chinner wrote:
>>>>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we
>>>>>>>>> have already
>>>>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode.
>>>>>>>>> Any I/O
>>>>>>>> xfs_free()? What's that?
>>>>>>> Sorry, that should have been xfs_ifree() (we set the inode's mode to
>>>>>>> zero in there).
>>>>>>>
>>>>>>>>> completions still running (file size updates and unwritten
>>>>>>>>> extent conversions)
>>>>>>>>> may be working on an inode that is no longer valid.
>>>>>>>> The linux inode does not get freed until after ->clear_inode
>>>>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>>>>> in the ->clear_inode path.
>>>>>>> The problem I see is an assert in xfs_setfilesize() failing:
>>>>>>>
>>>>>>> 	ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>>>>
>>>>>>> The mode of the XFS inode is zero at this time.
>>>>>> Ok, so the question has to be why is there I/O still in progress
>>>>>> after the truncate is supposed to have already occurred and the
>>>>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>>>>
>>>>>> Something doesn't add up here - you can't be doing I/O on a file
>>>>>> with no extents or delalloc blocks, hence that means we should be
>>>>>> passing through the truncate path in xfs_inactive() before we
>>>>>> call xfs_ifree() and therefore doing the vn_iowait()...
>>>>>>
>>>>>> Hmmmm - the vn_iowait() is conditional based on:
>>>>>>
>>>>>> 	/* wait for the completion of any pending DIOs */
>>>>>> 	if (new_size < ip->i_size)
>>>>>> 		vn_iowait(ip);
>>>>>>
>>>>>> We are truncating to zero (new_size == 0), so the only case where
>>>>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>>>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>>>>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>>>>>> the problem. If not, there's something much more subtle going
>>>>>> on here that we should understand....
>>>>> If we make the vn_iowait() unconditional we might re-introduce the
>>>>> NFS exclusivity bug that killed performance. That was through
>>>>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>>>> It won't reintroduce that problem because ->clear_inode()
>>>> is not called on every NFS write operation.
>>> Yes, but xfs_itruncate_start() can be called from every NFS write so
>>> modifying the above code will re-introduce the problem.
>>
>> Ah, no. The case here is new_size == 0, which will almost never be
>> the case in the ->release call...
> True.
>
>>
>>>>> So if we leave the above code as is then we need another
>>>>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>>>>> items that we didn't wait for in xfs_itruncate_start().
>>>> How do we have any new *data* I/O at all in progress at this point?
>>> It's not new data I/O.
>>
>> Then why isn't it being caught by the vn_iowait() in the truncate
>> code?????
> No idea.
>
>>
>>> It's workqueue items that have been queued
>>> from previous I/Os that are still outstanding.
>>
>> The iocount is decremented when the completion is finished, not when it
>> is queued. Hence vn_iowait() should be taking into account this case.
> Hmmm. It should.
>
>>
>>>> That does not explain why we need an additional vn_iowait() call.
>>>> All I see from this is a truncate race that has something to do with
>>>> the vn_iowait() call being conditional.
>>>>
>>>> That is, if we truncate to zero, then the current code in
>>>> xfs_itruncate_start() should wait unconditionally for *all* I/O to
>>>> complete because, by definition, all that I/O is beyond the new EOF
>>>> and we have to wait for it to complete before truncating the file.
>>> That makes sense. If new_size is zero and ip->i_size is not then we
>>> will wait. If ip->i_size is also zero we will not wait but if the
>>> file size is already zero there should not be any I/Os in progress
>>> and therefore no workqueue items outstanding.
>>
>> I note from the debug below that the linux inode size is zero,
>> but you didn't include the dump of the xfs inode so we can't see
>> what the other variables are.
> I tried to dump the xfs inode at the time but kdb encountered a bad
> address. I'm pretty sure I was able to get a dump of the XFS inode
> on another crash - that's how I found the mode to be zero. I can't
> be sure now - I'll have to reproduce the problem again.
>
>>
>> Also, i_size is updated after write I/O is dispatched. If we are
>> doing synchronous I/O, then i_size is not updated until after the
>> I/O completes (in xfs_write()). Hence we could have the situation of
>> I/O being run while i_size = 0. This is why I wanted to know what
>> i_new_size is, because that gets set before the I/O is issued.
>>
>> If i_new_size is non-zero and i_size is zero, that tends to imply
>> the conditional vn_iowait() in the truncate path needs to take
>> MAX(ip->i_size, ip->i_new_size) for the check, not just ip->i_size...
>>
>> FWIW, from the dump below, we have:
>>
>> typedef struct xfs_ioend {
>> 	struct xfs_ioend	*io_list = NULL
>> 	unsigned int		io_type = 0x20 = IOMAP_UNWRITTEN
>> 	int			io_error = 0
>> 	atomic_t		io_remaining = 0
>> 	struct inode		*io_inode = 0xffff810054062048
>> 	struct buffer_head	*io_buffer_head = NULL
>> 	struct buffer_head	*io_buffer_tail = NULL
>> 	size_t			io_size = 0x3400
>> 	xfs_off_t		io_offset = 0xfe200
>> 	struct work_struct	io_work;	/* xfsdatad work queue */
>> } xfs_ioend_t;
>>
>> So the I/O was not even close to offset zero.
>>
>> Also, the stack trace says the I/O came through the written path, but
>> the io_type says unwritten, which tells me there's something fishy
>> here. Either the stack trace is wrong, or there's been a memory
>> corruption....
> I'm pretty sure that's because it was a direct I/O write to a written
> extent. The I/O starts out as IOMAP_UNWRITTEN and if we didn't map to
> an unwritten extent its completion handler is switched to the written
> one. Looking at the direct I/O write path I can see where the direct
> I/O write would wait for the bios to complete but it's not waiting for
> the workqueue items to be flushed. Not sure if that's part of the
> problem though.
>
>>
>>>> If either the vn_iowait() in the truncate path is not sufficient, or
>>>> the truncate code is not being called, there is *some other bug*
>>>> that we don't yet understand. Adding an unconditional vn_iowait()
>>>> appears to me to be fixing a symptom, not the underlying cause of
>>>> the problem....
>>> I'm not adding a new call to vn_iowait(). I'm just moving the
>>> existing one from xfs_ireclaim() so that we wait for all I/O to
>>> complete before we tear the inode down.
>>
>> Yes, but that is there to catch inodes with non-zero link counts because
>> we are not doing a truncate in that case. We still need to get to the
>> bottom of why the truncate is not waiting for all I/O.
>
> I'm wondering if we have an extra decrement on the i_iocount atomic
> counter that tricked vn_iowait() into thinking that all I/Os have
> completed.
>
> Should this code in xfs_vm_direct_IO():
>
> 	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
> 		xfs_destroy_ioend(iocb->private);
>
> be:
>
> 	if (unlikely(ret < 0 && ret != -EIOCBQUEUED && iocb->private))
> 		xfs_destroy_ioend(iocb->private);
>
> Ordinarily we'd drop the i_iocount reference by calling
> xfs_end_io_direct().

I think I'm able to reproduce this at will now. The stacktrace from
kdb has no arguments so I added a printk to give us the key items:

  inode->i_size, ip->i_size, ip->i_new_size and ip->i_d.di_size are all zero
  ip->i_iocount is 1
  inode->i_mode is valid but ip->i_d.di_mode is zero

This assert triggered when the filesystem became full. I suspect that
a direct I/O write failed due to ENOSPC (and that's why the file size
is still zero) but the ioend still got queued.

<5>XFS mounting filesystem dm-0
<7>Ending clean XFS mount for filesystem: dm-0
<7>xfs_setfilesize: ip 0xffff810fd3920c90, inode 0xffff810fd391fc18, iocount 1, ioend 0xffff810f96059be8
<4>Assertion failed: (ip->i_d.di_mode & S_IFMT) == S_IFREG, file: fs/xfs/linux-2.6/xfs_aops.c, line: 180
<0>------------[ cut here ]------------
<2>kernel BUG at fs/xfs/support/debug.c:81!

[1]kdb> xnode 0xffff810fd3920c90
 mount 0xffff81101acc0488 mnext 0xffff810f8e54c670 mprev 0xffff810f5b90a050
 vnode 0xffff810fd391fc18 dev fe00000 ino 4048200[7:5c54:8]
 blkno 0x1ee2a0 len 0x10 boffset 0x800
 transp 0x0000000000000000 &itemp 0xffff810f32236240
 &lock 0xffff810fd3920d40 &iolock 0xffff810fd3920da8
 &flush 0xffff810fd3920e10 (1) pincount 0x0
 udquotp 0x0000000000000000 gdquotp 0x0000000000000000
 new_size 0 flags 0x0 <> update_core 0 update size 0
 gen 0x0 delayed blks 0 size 0
 trace 0xffff810fcc693620 bmap_trace 0xffff810fcc6935f0
 bmbt trace 0xffff810fcc6935c0 rw trace 0xffff810fcc693590
 ilock trace 0xffff810fcc693560 dir trace 0xffff810fcc693530
 data fork
  bytes 0x0 real_bytes 0x0 lastex 0x0 u1:extents 0x0000000000000000
  broot 0x0000000000000000 broot_bytes 0x0 ext_max 9 flags 0x2 <extents >
  u2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
 attr fork empty
 magic 0x494e mode 00 (?------------) version 0x2 format 0x2 (extents)
 nlink 0 uid 0 gid 0 projid 0 flushiter 0
 atime 1218176355:126020521 mtime 1218176355:554017996 ctime 1218176355:554017996
 size 0 nblocks 0 extsize 0x0 nextents 0x0 anextents 0x0
 forkoff 0 aformat 0x2 (extents) dmevmask 0x0 dmstate 0x0 flags 0x0 <>
 gen 0x593e66c8

--> itrace @ 0xffff810fd3920c90/0xffff810fcc693620
ref  @fs/xfs/xfs_vnodeops.c:1589(xfs_create+0x390) i_count = 1 cpu = 6 pid = 9502 ra = xfs_vn_mknod+0x144
hold @fs/xfs/xfs_vnodeops.c:1642(xfs_create+0x568) i_count 1 => 2 cpu = 6 pid = 9502 ra = xfs_vn_mknod+0x144
entry to xfs_iput i_count = 2 cpu = 6 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
rele @fs/xfs/xfs_iget.c:421(xfs_iput+0x42) i_count 2 => 1 cpu = 6 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
entry to xfs_vn_getattr i_count = 1 cpu = 1 pid = 9502 ra = vfs_fstat+0x29
entry to xfs_vn_getattr i_count = 1 cpu = 1 pid = 9502 ra = vfs_fstat+0x29
entry to xfs_ioctl i_count = 1 cpu = 1 pid = 9502 ra = xfs_file_ioctl+0x26
entry to xfs_vn_getattr i_count = 1 cpu = 1 pid = 9502 ra = vfs_fstat+0x29
entry to xfs_remove i_count = 2 cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
hold @fs/xfs/xfs_vnodeops.c:1958(xfs_remove+0x26d) i_count 2 => 3 cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
entry to xfs_iput i_count = 3 cpu = 1 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
rele @fs/xfs/xfs_iget.c:421(xfs_iput+0x42) i_count 3 => 2 cpu = 1 pid = 9502 ra = xfs_trans_unlock_chunk+0x77
exit from xfs_remove i_count = 2 cpu = 1 pid = 9502 ra = xfs_vn_unlink+0x37
entry to xfs_fs_clear_inode i_count = 0 cpu = 1 pid = 9502 ra = clear_inode+0x9c
entry to xfs_inactive i_count = 0 cpu = 1 pid = 9502 ra = xfs_fs_clear_inode+0xbd
entry to xfs_reclaim i_count = 0 cpu = 1 pid = 9502 ra = xfs_fs_clear_inode+0xde

[1]kdb> inode 0xffff810fd391fc18
struct inode at 0xffff810fd391fc18
 i_ino = 4048200 i_count = 0 i_size 0
 i_mode = 0100666  i_nlink = 0  i_rdev = 0x0
 i_hash.nxt = 0x0000000000000000 i_hash.pprev = 0xffffc20005712980
 i_list.nxt = 0xffff810fd391fc18 i_list.prv = 0xffff810fd391fc18
 i_dentry.nxt = 0xffff810fd391fbb0 i_dentry.prv = 0xffff810fd391fbb0
 i_sb = 0xffff810f1ca10148 i_op = 0xffffffff80690e20 i_data = 0xffff810fd391fe00 nrpages = 0
 i_fop= 0xffffffff80690b40 i_flock = 0x0000000000000000 i_mapping = 0xffff810fd391fe00
 i_flags 0x0 i_state 0x21 [I_DIRTY_SYNC I_FREEING]  fs specific info @ 0xffff810fd391ffe8

[1]kdb> md8c20 0xffff810f96059be8
0xffff810f96059be8 0000000000000000 0000000000000020   ........ .......
0xffff810f96059bf8 5a5a5a5a00000000 ffff810fd391fc18   ....ZZZZ........
0xffff810f96059c08 0000000000000000 0000000000000000   ................
0xffff810f96059c18 0000000000001800 000000000002f800   ................
0xffff810f96059c28 ffff8110270654d0 ffff810f96059c30   .T.'....0.......
0xffff810f96059c38 ffff810f96059c30 ffffffff803b0b69   0.......i.;.....
0xffff810f96059c48 ffffffff80b7f698 0000000000000000   ................
0xffff810f96059c58 ffffffff807c0b92 d84156c5635688c0   ..|.......Vc.VA.
0xffff810f96059c68 ffffffff8025c2b6 09f911029d74e35b   ..%.....[.t.....
0xffff810f96059c78 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk

[1]kdb> bt
Stack traceback for pid 379
0xffff811029484680      379        2  1    1   R  0xffff8110294849c8 *xfsdatad/1
 sp                 ip                 Function (args)
0xffff81102708be28 0xffffffff803bbf40 assfail+0x1a (invalid, invalid, invalid)
kdb_bb: address 0x0000000000010286 not recognised
Using old style backtrace, unreliable with no arguments
 sp                 ip                 Function (args)
0xffff81102708bdb8 0xffffffff803b0b69 xfs_end_bio_written
0xffff81102708be28 0xffffffff803bbf40 assfail+0x1a
0xffff81102708be58 0xffffffff803b0a50 xfs_setfilesize+0x6a
0xffff81102708be78 0xffffffff803b0b79 xfs_end_bio_written+0x10
0xffff81102708be88 0xffffffff8023bb02 run_workqueue+0xdf
0xffff81102708bed8 0xffffffff8023c5bd worker_thread+0xd8
0xffff81102708bef0 0xffffffff8023ef6e autoremove_wake_function
0xffff81102708bf20 0xffffffff8023c4e5 worker_thread
0xffff81102708bf28 0xffffffff8023ee5c kthread+0x47
0xffff81102708bf30 0xffffffff802294a8 schedule_tail+0x27
0xffff81102708bf48 0xffffffff8020bea8 child_rip+0xa

As suggested, I'm now running with this patch:

--- fs/xfs/xfs_inode.c_1.518	2008-08-08 17:31:02.000000000 +1000
+++ fs/xfs/xfs_inode.c	2008-08-08 17:35:11.000000000 +1000
@@ -1414,7 +1414,7 @@ xfs_itruncate_start(
 	mp = ip->i_mount;
 
 	/* wait for the completion of any pending DIOs */
-	if (new_size < ip->i_size)
+	if (new_size == 0 || new_size < ip->i_size)
 		vn_iowait(ip);
 
 	/*

and so far so good.