From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
Date: Fri, 24 Apr 2020 07:14:21 -0400
Message-ID: <20200424111421.GB53325@bfoster>
In-Reply-To: <20200423213839.GQ27860@dread.disaster.area>
On Fri, Apr 24, 2020 at 07:38:39AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > > The log item failed flag is used to indicate a log item was flushed
> > > > but the underlying buffer was not successfully written to disk. If
> > > > the error configuration allows for writeback retries, xfsaild uses
> > > > the flag to resubmit the underlying buffer without acquiring the
> > > > flush lock of the item.
> > > >
> > > > The flag is currently set and cleared under the AIL lock and buffer
> > > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > > completion time. The flag is checked at xfsaild push time under AIL
> > > > lock and cleared under buffer lock before resubmission. If I/O
> > > > eventually succeeds, the flag is cleared in the _done() handler for
> > > > the associated item type, again under AIL lock and buffer lock.
> > >
> > > Actually, I think you've missed the relevant lock here: the item's
> > > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > > state is protected by the flush lock. When the item has been flushed
> > > and attached to the buffer for completion callbacks, the flush lock
> > > context gets handed to the buffer.
> > >
> > > i.e. the buffer owns the flush lock and so while that buffer is
> > > locked for IO we know that the item flush state (and hence the
> > > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > > lock.
> > >
> >
> > Right..
> >
> > > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > > lock then walks the items on the buffer and changes the callback
> > > functions because those items are flush locked and hence holding the
> > > buffer lock gives exclusive access to the flush state of those
> > > items....)
> > >
> > > > As far as I can tell, the only reason for holding the AIL lock
> > > > across sets/clears is to manage consistency between the log item
> > > > bitop state and the temporary buffer pointer that is attached to the
> > > > log item. The bit itself is used to manage consistency of the
> > > > attached buffer, but is not enough to guarantee the buffer is still
> > > > attached by the time xfsaild attempts to access it.
> > >
> > > Correct. The guarantee that the buffer is still attached to the log
> > > item is what the AIL lock provides us with.
> > >
> > > > However, since
> > > > failure state is always set or cleared under buffer lock (either via
> > > > I/O completion or xfsaild), this particular case can be handled at
> > > > item push time by verifying failure state once under buffer lock.
> > >
> > > In theory, yes, but there's a problem before you get that buffer
> > > lock. That is: what serialises access to lip->li_buf?
> > >
> >
> > That's partly what I was referring to above by the push time changes.
> > This patch was an attempt to replace reliance on ail_lock with a push
> > time sequence that would serialize access to a failed buffer
> > (essentially relying on the failed bit). Whether it's correct or not is
> > another matter... ;)
> >
> > > The xfsaild does not hold a reference to the buffer and, without the
> > > AIL lock to provide synchronisation, the log item reference to the
> > > buffer can be dropped asynchronously by buffer IO completion. Hence
> > > the buffer could be freed between the xfsaild reading lip->li_buf
> > > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > > use-after-free race condition on the initial buffer access.
> > >
> >
> > Hmm.. the log item holds a temporary reference to the buffer when the
> > failed state is set. On submission, xfsaild queues the failed buffer
> > (new ref for the delwri queue), clears the failed state and drops the
> > failure reference of every failed item that is attached. xfsaild is also
> > the only task that knows how to process LI_FAILED items, so I don't
> > think we'd ever race with a failed buffer becoming "unfailed" from
> > xfsaild (which is the scenario where a buffer could disappear from I/O
> > completion).
>
> Sure we can. Every inode on the buffer has XFS_LI_FAILED set on it,
> so the first inode in the AIL triggers the resubmit and starts the
> IO. The AIL runs again, coming across another inode on the buffer
> further into the AIL. That inode has been modified in memory while
> the flush was in progress, so it no longer gets removed from the AIL
> by the IO completion. Instead, xfs_clear_li_failed() is called from
> xfs_iflush_done() and the buffer is removed from the log item and
> the reference dropped.
>
When xfsaild encounters the first failed item, the buffer resubmit path
(xfsaild_resubmit_item(), as of this series) calls xfs_clear_li_failed()
on every inode attached to the buffer. AFAICT, that means xfsaild will
see any associated inode as FLUSHING (or PINNED, if another in-core
modification has come through) once the buffer is requeued.
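For reference, the resubmit path as of patch 1 of this series looks
roughly like the following (paraphrased from memory, so treat it as a
sketch rather than the exact code):

	static uint
	xfsaild_resubmit_item(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		struct xfs_buf		*bp = lip->li_buf;

		if (!xfs_buf_trylock(bp))
			return XFS_ITEM_LOCKED;

		/* queue for delwri submission; this takes a new buffer ref */
		if (!xfs_buf_delwri_queue(bp, buffer_list)) {
			xfs_buf_unlock(bp);
			return XFS_ITEM_FLUSHING;
		}

		/* drop failed state (and the failure ref) of every attached item */
		list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
			xfs_clear_li_failed(lip);

		xfs_buf_unlock(bp);
		return XFS_ITEM_SUCCESS;
	}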
> > In some sense, clearing LI_FAILED of an item is essentially
> > retaking ownership of the item flush lock and the hold on the
> > underlying buffer, but the existing code is certainly not written
> > that way.
>
> Only IO completion clears the LI_FAILED state of the item, not the
> xfsaild. IO completion already owns the flush lock - the xfsaild
> only holds it long enough to flush an inode to the buffer and then
> give the flush lock to the buffer.
>
> Also, we clear LI_FAILED when removing the item from the AIL under
> the AIL lock, so the only case here that we are concerned about is
> the above "inode was relogged while being flushed" situation. This
> situation is rare, failed buffers are rare, and so requiring the AIL
> lock to be held is not a performance limiting factor here...
>
Yep, though because of the above I think I/O completion clearing the
failed state might require a more subtle dance between xfsaild and
another submission context, such as when reclaim happens to flush an
inode that wasn't already attached to a previously failed buffer.
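To be clear about what "clearing the failed state" entails today, the
helper in xfs_trans_priv.h is roughly the following (a from-memory
sketch); note that it asserts the AIL lock, which is exactly the
dependency this patch was trying to remove:

	static inline void
	xfs_clear_li_failed(
		struct xfs_log_item	*lip)
	{
		struct xfs_buf		*bp = lip->li_buf;

		ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
		lockdep_assert_held(&lip->li_ailp->ail_lock);

		if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
			/* drop the failure reference taken at set time */
			lip->li_buf = NULL;
			xfs_buf_rele(bp);
		}
	}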
Eh, it doesn't matter that much in any event because...
> > The issue I was wrestling with wrt this patch was primarily
> > maintaining consistency between the bit and the assignment of the
> > pointer on a failing I/O. E.g., the buffer pointer is set by the task
> > that sets the bit, but xfsaild checking the bit doesn't guarantee the
> > pointer has been set yet. I did add a post-buffer lock LI_FAILED check
> > as well, but that probably could have been an assert.
>
> So you've been focussing on races that may occur when setting the
> li_buf pointer, but I'm looking at races involving the clearing of
> the li_buf pointer. :)
>
> > > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > > set/clear operations are protected by the same lock the xfsaild
> > > holds. The xfsaild holds neither the buffer lock, a buffer reference,
> > > nor an item flush lock, hence it's the AIL lock or nothing...
> > >
> >
> > Yeah, that was my impression from reading the code. I realize from this
> > discussion that this doesn't actually simplify the logic. That was the
> > primary goal, so I think I need to revisit the approach here. Even if
> > this is correct (which I'm still not totally sure of), it's fragile in
> > the sense that it partly relies on single-threadedness of xfsaild. I
> > initially thought about adding a log item spinlock for this kind of
> > thing, but it didn't (and still doesn't) seem appropriate to burden
> > every log item in the system with a spinlock for the rare case of buffer
> > I/O errors.
>
> I feel this is all largely moot, because my patches make this
> whole problem of setting/clearing the li_buf for failed buffers
> go away altogether. Hence I would suggest that this patch gets
> dropped for now, because getting rid of the AIL lock is much less
> troublesome once LI_FAILED is no longer dependent on the inode/dquot
> log item taking temporary references to the underlying buffer....
>
... I'm good with that. ;) This was intended to be a low-level cleanup
to eliminate a fairly minor ail_lock abuse. Working around the ->li_buf
lifecycle is the primary challenge, so if you have a broader rework that
addresses it as a side effect, then that presumably makes things
easier. I'll revisit this if necessary once your work is further
along...
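For the record, the temporary reference I'm referring to is taken when
the failed state is first set. The current set-side helper is roughly
(again a from-memory sketch):

	static inline void
	xfs_set_li_failed(
		struct xfs_log_item	*lip,
		struct xfs_buf		*bp)
	{
		lockdep_assert_held(&lip->li_ailp->ail_lock);

		if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
			/* hold the buffer so ->li_buf stays valid */
			xfs_buf_hold(bp);
			lip->li_buf = bp;
		}
	}

It's that hold/rele pairing, and the fact that the rele side can run
from I/O completion, that makes lockless ->li_buf access awkward.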
Brian
> >
> > I mentioned in an earlier patch that it would be nice to consider
> > removing ->li_buf entirely but hadn't thought it through.
>
> I'm going the opposite way entirely, such that LI_FAILED doesn't
> have any special state associated with it at all...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>