Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
Date: Fri, 24 Apr 2020 07:14:21 -0400	[thread overview]
Message-ID: <20200424111421.GB53325@bfoster> (raw)
In-Reply-To: <20200423213839.GQ27860@dread.disaster.area>

On Fri, Apr 24, 2020 at 07:38:39AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > > The log item failed flag is used to indicate a log item was flushed
> > > > but the underlying buffer was not successfully written to disk. If
> > > > the error configuration allows for writeback retries, xfsaild uses
> > > > the flag to resubmit the underlying buffer without acquiring the
> > > > flush lock of the item.
> > > > 
> > > > The flag is currently set and cleared under the AIL lock and buffer
> > > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > > completion time. The flag is checked at xfsaild push time under AIL
> > > > lock and cleared under buffer lock before resubmission. If I/O
> > > > eventually succeeds, the flag is cleared in the _done() handler for
> > > > the associated item type, again under AIL lock and buffer lock.
> > > 
> > > Actually, I think you've missed the relevant lock here: the item's
> > > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > > state is protected by the flush lock. When the item has been flushed
> > > and attached to the buffer for completion callbacks, the flush lock
> > > context gets handed to the buffer.
> > > 
> > > i.e. the buffer owns the flush lock and so while that buffer is
> > > locked for IO we know that the item flush state (and hence the
> > > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > > lock.
> > > 
> > 
> > Right..
> > 
> > > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > > lock then walks the items on the buffer and changes the callback
> > > functions because those items are flush locked and hence holding the
> > > buffer lock gives exclusive access to the flush state of those
> > > items....)
> > > 
> > > > As far as I can tell, the only reason for holding the AIL lock
> > > > across sets/clears is to manage consistency between the log item
> > > > bitop state and the temporary buffer pointer that is attached to the
> > > > log item. The bit itself is used to manage consistency of the
> > > > attached buffer, but is not enough to guarantee the buffer is still
> > > > attached by the time xfsaild attempts to access it.
> > > 
> > > Correct. The guarantee that the buffer is still attached to the log
> > > item is what the AIL lock provides us with.
> > > 
> > > > However since
> > > > failure state is always set or cleared under buffer lock (either via
> > > > I/O completion or xfsaild), this particular case can be handled at
> > > > item push time by verifying failure state once under buffer lock.
> > > 
> > > In theory, yes, but there's a problem before you get that buffer
> > > lock. That is: what serialises access to lip->li_buf?
> > > 
> > 
> > That's partly what I was referring to above by the push time changes.
> > This patch was an attempt to replace reliance on ail_lock with a push
> > time sequence that would serialize access to a failed buffer
> > (essentially relying on the failed bit). Whether it's correct or not is
> > another matter... ;)
> > 
> > > The xfsaild does not hold a reference to the buffer and, without the
> > > AIL lock to provide synchronisation, the log item reference to the
> > > buffer can be dropped asynchronously by buffer IO completion. Hence
> > > the buffer could be freed between the xfsaild reading lip->li_buf
> > > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > > use-after-free race condition on the initial buffer access.
> > > 
> > 
> > Hmm.. the log item holds a temporary reference to the buffer when the
> > failed state is set. On submission, xfsaild queues the failed buffer
> > (new ref for the delwri queue), clears the failed state and drops the
> > failure reference of every failed item that is attached. xfsaild is also
> > the only task that knows how to process LI_FAILED items, so I don't
> > think we'd ever race with a failed buffer becoming "unfailed" from
> > xfsaild (which is the scenario where a buffer could disappear from I/O
> > completion).
> 
> Sure we can. every inode on the buffer has XFS_LI_FAILED set on it,
> so the first inode in the AIL triggers the resubmit and starts the
> IO. the AIL runs again, coming across another inode on the buffer
> further into the AIL. That inode has been modified in memory while
> the flush was in progress, so it no longer gets removed from the AIL
> by the IO completion. Instead, xfs_clear_li_failed() is called from
> xfs_iflush_done() and the buffer is removed from the logitem and the
> reference dropped.
> 

When xfsaild encounters the first failed item, the buffer resubmit path
(xfsaild_resubmit_item(), as of this series) calls xfs_clear_li_failed()
on every inode attached to the buffer. AFAICT, that means xfsaild will
see any associated inode as FLUSHING (or PINNED, if another in-core
modification has come through) once the buffer is requeued.

> > In some sense, clearing LI_FAILED of an item is essentially
> > retaking ownership of the item flush lock and hold of the underlying
> > buffer, but the existing code is certainly not written that way.
> 
> Only IO completion clears the LI_FAILED state of the item, not the
> xfsaild. IO completion already owns the flush lock - the xfsaild
> only holds it long enough to flush an inode to the buffer and then
> give the flush lock to the buffer.
> 
> Also, we clear LI_FAILED when removing the item from the AIL under
> the AIL lock, so the only case here that we are concerned about is
> the above "inode was relogged while being flushed" situation. THis
> situation is rare, failed buffers are rare, and so requiring the AIL
> lock to be held is a performance limiting factor here...
> 

Yep, though because of the above I think I/O completion clearing the
failed state might require a more subtle dance between xfsaild and
another submission context, such as if reclaim happens to flush an inode
that wasn't already attached to a previously failed buffer (for
example).

Eh, it doesn't matter that much in any event because..

> > The issue I was wrestling with wrt to this patch was primarily
> > maintaining consistency between the bit and the assignment of the
> > pointer on a failing I/O. E.g., the buffer pointer is set by the task
> > that sets the bit, but xfsaild checking the bit doesn't guarantee the
> > pointer had been set yet. I did add a post-buffer lock LI_FAILED check
> > as well, but that probably could have been an assert.
> 
> SO you've been focussing on races that may occur when the setting of
> the li_buf, but I'm looking at races involving the clearing of the
> li_buf pointer. :)
> 
> > > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > > set/clear operations are protected by the same lock the xfsaild
> > > holds. The xfsaild holds neither the buffer lock, a buffer reference
> > > or an item flush lock, hence it's the AIL lock or nothing...
> > > 
> > 
> > Yeah, that was my impression from reading the code. I realize from this
> > discussion that this doesn't actually simplify the logic. That was the
> > primary goal, so I think I need to revisit the approach here. Even if
> > this is correct (which I'm still not totally sure of), it's fragile in
> > the sense that it partly relies on single-threadedness of xfsaild. I
> > initially thought about adding a log item spinlock for this kind of
> > thing, but it didn't (and still doesn't) seem appropriate to burden
> > every log item in the system with a spinlock for the rare case of buffer
> > I/O errors.
> 
> I feel this is all largely a moot, because my patches result in this
> this whole problem of setting/clearing the li_buf for failed buffers
> go away altogether. Hence I would suggest that this patch gets
> dropped for now, because getting rid of the AIL lock is much less
> troublesome once LI_FAILED is no long dependent on the inode/dquot
> log item taking temporary references to the underlying buffer....
> 

... I'm good with that. ;) This was intended to be a low level cleanup
to eliminate a fairly minor ail_lock abuse. Working around the ->li_buf
thing is the primary challenge, so if you have a broader rework that
addresses that as a side effect then that presumably makes things
easier.  I'll revisit this if necessary once your work is further
along...

Brian

> > 
> > I mentioned in an earlier patch that it would be nice to consider
> > removing ->li_buf entirely but hadn't thought it through.
> 
> I'm going the opposite way entirely, such that LI_FAILED doesn't
> have any special state associated with it at all...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

next prev parent reply	other threads:[~2020-04-24 11:14 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
2020-04-23  4:09   ` Dave Chinner
2020-04-25 17:21   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
2020-04-23  4:10   ` Dave Chinner
2020-04-25 17:23   ` Christoph Hellwig
2020-04-27 11:11     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
2020-04-23  4:18   ` Dave Chinner
2020-04-23 14:28     ` Brian Foster
2020-04-25 17:26   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
2020-04-25 17:27   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
2020-04-23  4:46   ` Dave Chinner
2020-04-23 14:29     ` Brian Foster
2020-04-23 21:14       ` Dave Chinner
2020-04-24 11:12         ` Brian Foster
2020-04-24 22:08           ` Dave Chinner
2020-04-27 11:11             ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
2020-04-23  4:47   ` Dave Chinner
2020-04-25 17:28   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
2020-04-25 17:30   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
2020-04-23  5:59   ` Dave Chinner
2020-04-23 14:36     ` Brian Foster
2020-04-23 21:38       ` Dave Chinner
2020-04-24 11:14         ` Brian Foster [this message]
2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
2020-04-23  4:54   ` Dave Chinner
2020-04-25 17:37   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
2020-04-23  4:55   ` Dave Chinner
2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
2020-04-25 17:37   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
2020-04-23  5:11   ` Dave Chinner
2020-04-25 17:38   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
2020-04-23  5:13   ` Dave Chinner
2020-04-25 17:39   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200424111421.GB53325@bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.