From: Lachlan McIlroy <lachlan@sgi.com>
To: Lachlan McIlroy <lachlan@sgi.com>,
xfs@oss.sgi.com, xfs-dev <xfs-dev@sgi.com>
Subject: Re: [PATCH] Move vn_iowait() earlier in the reclaim path
Date: Thu, 07 Aug 2008 18:43:02 +1000
Message-ID: <489AB596.1010505@sgi.com>
In-Reply-To: <20080806093844.GZ6119@disturbed>
Dave Chinner wrote:
> On Wed, Aug 06, 2008 at 04:10:53PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Wed, Aug 06, 2008 at 12:28:30PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Tue, Aug 05, 2008 at 05:52:34PM +1000, Lachlan McIlroy wrote:
>>>>>> Dave Chinner wrote:
>>>>>>> On Tue, Aug 05, 2008 at 04:43:29PM +1000, Lachlan McIlroy wrote:
>>>>>>>> Currently by the time we get to vn_iowait() in xfs_reclaim() we have already
>>>>>>>> gone through xfs_inactive()/xfs_free() and recycled the inode. Any I/O
>>>>>>> xfs_free()? What's that?
>>>>>> Sorry that should have been xfs_ifree() (we set the inode's mode to
>>>>>> zero in there).
>>>>>>
>>>>>>>> completions still running (file size updates and unwritten extent conversions)
>>>>>>>> may be working on an inode that is no longer valid.
>>>>>>> The linux inode does not get freed until after ->clear_inode
>>>>>>> completes, hence it is perfectly valid to reference it anywhere
>>>>>>> in the ->clear_inode path.
>>>>>> The problem I see is an assert in xfs_setfilesize() fail:
>>>>>>
>>>>>> ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
>>>>>>
>>>>>> The mode of the XFS inode is zero at this time.
>>>>> Ok, so the question has to be why is there I/O still in progress
>>>>> after the truncate is supposed to have already occurred and the
>>>>> vn_iowait() in xfs_itruncate_start() been executed.
>>>>>
>>>>> Something doesn't add up here - you can't be doing I/O on a file
>>>>> with no extents or delalloc blocks, hence that means we should be
>>>>> passing through the truncate path in xfs_inactive() before we
>>>>> call xfs_ifree() and therefore doing the vn_iowait()..
>>>>>
>>>>> Hmmmm - the vn_iowait() is conditional based on:
>>>>>
>>>>> 	/* wait for the completion of any pending DIOs */
>>>>> 	if (new_size < ip->i_size)
>>>>> 		vn_iowait(ip);
>>>>>
>>>>> We are truncating to zero (new_size == 0), so the only case where
>>>>> this would not wait is if ip->i_size == 0. Still - I can't see
>>>>> how we'd be doing I/O on an inode with a zero i_size. I suspect
>>>>> ensuring we call vn_iowait() if new_size == 0 as well would fix
>>>>> the problem. If not, there's something much more subtle going
>>>>> on here that we should understand....
>>>> If we make the vn_iowait() unconditional we might re-introduce the
>>>> NFS exclusivity bug that killed performance. That was through
>>>> xfs_release()->xfs_free_eofblocks()->xfs_itruncate_start().
>>> It won't reintroduce that problem because ->clear_inode()
>>> is not called on every NFS write operation.
>> Yes, but xfs_itruncate_start() can be called from every NFS write, so
>> modifying the above code will re-introduce the problem.
>
> Ah, no. The case here is new_size == 0, which will almost never be
> the case in the ->release call...
True.
>
>>>> So if we leave the above code as is then we need another
>>>> vn_iowait() in xfs_inactive() to catch any remaining workqueue
>>>> items that we didn't wait for in xfs_itruncate_start().
>>> How do we have any new *data* I/O at all in progress at this point?
>> It's not new data I/O.
>
> Then why isn't it being caught by the vn_iowait() in the truncate
> code?????
No idea.
>
>> It's workqueue items that have been queued
>> from previous I/Os that are still outstanding.
>
> The iocount is decremented when the completion is finished, not when it
> is queued. Hence vn_iowait() should be taking into account this case.
Hmmm. It should.
>
>>> That does not explain why we need an additional vn_iowait() call.
>>> All I see from this is a truncate race that has something to do with
>>> the vn_iowait() call being conditional.
>>>
>>> That is, if we truncate to zero, then the current code in
>>> xfs_itruncate_start() should wait unconditionally for *all* I/O to
>>> complete because, by definition, all that I/O is beyond the new EOF
>>> and we have to wait for it to complete before truncating the file.
>> That makes sense. If new_size is zero and ip->i_size is not then we
>> will wait. If ip->i_size is also zero we will not wait but if the
>> file size is already zero there should not be any I/Os in progress
>> and therefore no workqueue items outstanding.
>
> I note from the debug below that the linux inode size is zero,
> but you didn't include the dump of the xfs inode so we can't see
> what the other variables are.
I tried to dump the xfs inode at the time but kdb encountered a bad
address. I'm pretty sure I was able to get a dump of the XFS inode
on another crash - that's how I found the mode to be zero. I can't
be sure now - I'll have to reproduce the problem again.
>
> Also, i_size is updated after write I/O is dispatched. If we are
> doing synchronous I/O, then i_size is not updated until after the
> I/O completes (in xfs_write()). Hence we could have the situation of
> I/O being run while i_size = 0. This is why I wanted to know what
> i_new_size is, because that gets set before the I/O is issued.
>
> If i_new_size is non-zero and i_size is zero, that tends to imply
> the conditional vn_iowait() in the truncate path needs to take
> MAX(ip->i_size, ip->i_new_size) for the check, not just ip->i_size...
>
> FWIW, from the dump below, we have:
>
> typedef struct xfs_ioend {
> 	struct xfs_ioend	*io_list = NULL
> 	unsigned int		io_type = 0x20 = IOMAP_UNWRITTEN
> 	int			io_error = 0
> 	atomic_t		io_remaining = 0
> 	struct inode		*io_inode = 0xffff810054062048
> 	struct buffer_head	*io_buffer_head = NULL
> 	struct buffer_head	*io_buffer_tail = NULL
> 	size_t			io_size = 0x3400
> 	xfs_off_t		io_offset = 0xfe200
> 	struct work_struct	io_work;	/* xfsdatad work queue */
> } xfs_ioend_t;
>
> So the I/O was not even close to offset zero.
>
> Also, the fact that the stack trace says it came through the
> written path, but the io_type says unwritten which says that there's
> something fishy here. Either the stack trace is wrong, or there's
> been a memory corruption....
I'm pretty sure that's because it was a direct I/O write to a written
extent. The I/O starts out as IOMAP_UNWRITTEN and, if we didn't map to
an unwritten extent, its completion handler is switched to the written
one. Looking at the direct I/O write path I can see where the direct
I/O write would wait for the bios to complete, but it's not waiting for
the workqueue items to be flushed. Not sure if that's part of the
problem though.
>
>>> If either the vn_iowait() in the truncate path is not sufficient, or
>>> the truncate code is not being called, there is *some other bug*
>>> that we don't yet understand. Adding an unconditional vn_iowait()
>>> appear to me to be fixing a symptom, not the underlying cause of
>>> the problem....
>> I'm not adding a new call to vn_iowait(). I'm just moving the
>> existing one from xfs_ireclaim() so that we wait for all I/O to
>> complete before we tear the inode down.
>
> Yes, but that is there to catch inodes with non-zero link counts because
> we are not doing a truncate in that case. We still need to get to the
> bottom of why the truncate is not waiting for all I/O.
I'm wondering if we have an extra decrement on the i_iocount atomic counter
that tricked vn_iowait() into thinking that all I/Os have completed.
Should this code in xfs_vm_direct_IO():

	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
		xfs_destroy_ioend(iocb->private);

be:

	if (unlikely(ret < 0 && ret != -EIOCBQUEUED && iocb->private))
		xfs_destroy_ioend(iocb->private);
Ordinarily we'd drop the i_iocount reference by calling xfs_end_io_direct().
Thread overview: 11+ messages
2008-08-05 6:43 [PATCH] Move vn_iowait() earlier in the reclaim path Lachlan McIlroy
2008-08-05 7:37 ` Dave Chinner
2008-08-05 7:44 ` Dave Chinner
2008-08-05 7:52 ` Lachlan McIlroy
2008-08-05 8:42 ` Dave Chinner
2008-08-06 2:28 ` Lachlan McIlroy
2008-08-06 5:20 ` Dave Chinner
2008-08-06 6:10 ` Lachlan McIlroy
2008-08-06 9:38 ` Dave Chinner
2008-08-07 8:43 ` Lachlan McIlroy [this message]
2008-08-08 8:32 ` Lachlan McIlroy