linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message
Date: Fri, 24 Apr 2020 07:12:32 -0400	[thread overview]
Message-ID: <20200424111232.GA53325@bfoster> (raw)
In-Reply-To: <20200423211437.GP27860@dread.disaster.area>

On Fri, Apr 24, 2020 at 07:14:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > > At unmount time, XFS emits a warning for every in-core buffer that
> > > > might have undergone a write error. In practice this behavior is
> > > > probably reasonable given that the filesystem is likely short lived
> > > > once I/O errors begin to occur consistently. Under certain test or
> > > > otherwise expected error conditions, this can spam the logs and slow
> > > > down the unmount.
> > > > 
> > > > We already have a ratelimit state defined for buffers failing
> > > > writeback. Fold this state into the buftarg and reuse it for the
> > > > unmount time errors.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > 
> > > Looks fine, but I suspect we both missed something here:
> > > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > > cycle:
> > > 
> > > void
> > > xfs_buf_ioerror_alert(
> > >         struct xfs_buf          *bp,
> > >         xfs_failaddr_t          func)
> > > {
> > >         xfs_alert_ratelimited(bp->b_mount,
> > > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> > >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> > >                         -bp->b_error);
> > > }
> > > 
> > 
> > Yeah, I hadn't noticed that.
> > 
> > > Hence I think all these buffer error alerts can be brought under the
> > > same rate limiting variable. Something like this in xfs_message.c:
> > > 
> > 
> > One thing to note is that xfs_alert_ratelimited() ultimately uses
> > the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> > here uses 30s (both use a burst of 10). That seems reasonable enough to
> > me for I/O errors so I'm good with the changes below.
> > 
> > FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> > from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> > of a "warn." I'm not immediately aware of a reason to use one over the
> > other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> > hear an objection.
> 
> SOunds fine to me.
> 
> > The xfs_wait_buftarg() ratelimit presumably remains
> > open coded because it's two separate calls and we probably don't want
> > them to individually count against the limit.
> 
> That's why I suggested dropping the second "run xfs_repair" message
> and triggering a shutdown after the wait loop. That way we don't
> issue "run xfs_repair" for every single failed buffer (largely
> noise!), and we get a non-rate-limited common "run xfs-repair"
> message once we processed all the failed writes.
> 

Sorry, must have missed that in the last reply. I don't think we want to
shut down here because XBF_WRITE_FAIL only reflects that the internal
async write retry (e.g. the one historically used to mitigate transient
I/O errors) has occurred on the buffer, not necessarily that the
immediately previous I/O has failed. For that reason I've kind of looked
at this particular instance as more of a warning that I/O errors have
occurred in the past and the user might want to verify it didn't result
in unexpected damage. FWIW, I've observed plenty of these messages long
after I've disabled error injection and allowed I/O to proceed and the
fs to unmount cleanly.

Looking at it again, the wording of the message is much stronger than a
warning (i.e. "Corruption Alert", "permanent write failures"), so
perhaps we should revisit what we're actually trying to accomplish with
this message. Note that we do have the buffer push time alert to
indicate that I/O errors and retries are occurring, as well as error
handling logic to shutdown on I/O failure while the fs is unmounting
(xfs_buf_iodone_callback_error()), so both of those cases are
fundamentally covered already.

ISTM that leaves at least a few simple options for the
xfs_wait_buftarg() message:

1.) Remove it.
2.) Massage it into more of a "Warning: buffer I/O errors have occurred
in the past" type of message. This could perhaps retain the existing
XBF_WRITE_FAIL logic, but use a flag to lift it out of the loop and thus
avoid the need to rate limit it at all.
3.) Fix up the logic to only dump (ratelimited) specifics on buffers
that have actually failed I/O. This means we use the existing message
but perhaps check ->b_error instead of XBF_WRITE_FAIL. We still don't
need to shut down (I'd probably add an assert), but we could also lift
the second xfs_repair line of the message out of the loop to reduce the
noise.

We could also look into clearing XBF_WRITE_RETRY on successful I/O
completion, FWIW. The existing logic kind of bugs me regardless.
Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


  reply	other threads:[~2020-04-24 11:12 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-22 17:54 [PATCH v2 00/13] xfs: flush related error handling cleanups Brian Foster
2020-04-22 17:54 ` [PATCH v2 01/13] xfs: refactor failed buffer resubmission into xfsaild Brian Foster
2020-04-23  4:09   ` Dave Chinner
2020-04-25 17:21   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 02/13] xfs: factor out buffer I/O failure simulation code Brian Foster
2020-04-23  4:10   ` Dave Chinner
2020-04-25 17:23   ` Christoph Hellwig
2020-04-27 11:11     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 03/13] xfs: fallthru to buffer attach on error and simplify error handling Brian Foster
2020-04-23  4:18   ` Dave Chinner
2020-04-23 14:28     ` Brian Foster
2020-04-25 17:26   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 04/13] xfs: remove unnecessary shutdown check from xfs_iflush() Brian Foster
2020-04-25 17:27   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message Brian Foster
2020-04-23  4:46   ` Dave Chinner
2020-04-23 14:29     ` Brian Foster
2020-04-23 21:14       ` Dave Chinner
2020-04-24 11:12         ` Brian Foster [this message]
2020-04-24 22:08           ` Dave Chinner
2020-04-27 11:11             ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 06/13] xfs: fix duplicate verification from xfs_qm_dqflush() Brian Foster
2020-04-23  4:47   ` Dave Chinner
2020-04-25 17:28   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 07/13] xfs: abort consistently on dquot flush failure Brian Foster
2020-04-25 17:30   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking Brian Foster
2020-04-23  5:59   ` Dave Chinner
2020-04-23 14:36     ` Brian Foster
2020-04-23 21:38       ` Dave Chinner
2020-04-24 11:14         ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 09/13] xfs: clean up AIL log item removal functions Brian Foster
2020-04-23  4:54   ` Dave Chinner
2020-04-25 17:37   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 10/13] xfs: combine xfs_trans_ail_[remove|delete]() Brian Foster
2020-04-23  4:55   ` Dave Chinner
2020-04-22 17:54 ` [PATCH v2 11/13] xfs: remove unused iflush stale parameter Brian Foster
2020-04-25 17:37   ` Christoph Hellwig
2020-04-22 17:54 ` [PATCH v2 12/13] xfs: random buffer write failure errortag Brian Foster
2020-04-23  5:11   ` Dave Chinner
2020-04-25 17:38   ` Christoph Hellwig
2020-04-27 11:12     ` Brian Foster
2020-04-22 17:54 ` [PATCH v2 13/13] xfs: remove unused shutdown types Brian Foster
2020-04-23  5:13   ` Dave Chinner
2020-04-25 17:39   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200424111232.GA53325@bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).