From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Subject: Re: XFS hung task in xfs_ail_push_all_sync() when unmounting FS after disk failure/recovery
Date: Fri, 25 Mar 2016 08:56:03 +1100
Message-ID: <20160324215603.GD11812@dastard>
In-Reply-To: <20160324165244.GA17555@redhat.com>

On Thu, Mar 24, 2016 at 05:52:44PM +0100, Carlos Maiolino wrote:
> I can now reproduce it, or at least part of the problem.
> 
> Regarding your question Dave, yes, it can be unmounted after I issue xfs_io shutdown
> command. But, if a umount is issued before that, then we can't find the
> mountpoint anymore.
> 
> I'm not sure if I'm correct, but what it looks like to me, as you already
> mentioned, is that we keep getting IO errors but we never actually shut down
> the filesystem while doing async metadata writes.

*nod*

> I believe I've found the problem. So, I will try to explain it, so you guys
> can review and let me know if I'm right or not
> 
> I was looking at the code, and to me it looks like async retries are designed
> to keep retrying forever, and rely on some other part of the filesystem to
> actually shut it down.

*nod*

[snip description of metadata IO error behaviour]

Yes, that is exactly how the code is expected to behave - in fact,
that's how it was originally designed to behave.
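
To make that design concrete, here is a toy userspace model of it. This is
only a sketch under stated assumptions, not the XFS code: struct fs_model,
submit_write() and push_async_write() are hypothetical names invented for
illustration, and the "log force every N retries" counter stands in for the
periodic log force. A failed async write is simply retried, and only a
failing log write is allowed to shut the filesystem down.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct fs_model {
	bool device_failed;   /* while set, every write fails with EIO */
	bool shut_down;       /* set once the filesystem gives up */
	int  log_force_every; /* a log force runs every N retries; must be > 0 */
};

/* Pretend to submit a metadata write: 0 on success, -EIO on failure. */
static int submit_write(const struct fs_model *fs)
{
	return fs->device_failed ? -EIO : 0;
}

/*
 * Retry an async metadata write until it succeeds or the filesystem has
 * been shut down. The write path itself never triggers a shutdown; only
 * the periodic log force does, and only if it observes the error
 * (log_force_sees_error).
 *
 * Returns the number of write attempts made, or -1 if max_attempts ran
 * out with the write still failing and no shutdown (the "hang" case).
 */
int push_async_write(struct fs_model *fs, bool log_force_sees_error,
		     int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (fs->shut_down)
			return attempt - 1;

		if (submit_write(fs) == 0)
			return attempt;

		/* Async write error: nobody is waiting on it, so just retry. */
		if (attempt % fs->log_force_every == 0 && log_force_sees_error)
			fs->shut_down = true;
	}
	return -1;
}
```

In this model, a failed device with a log force that sees the error ends the
retry loop after the first log force. If the log force never sees the error,
the loop only ends when max_attempts runs out, which corresponds to the hang
at unmount.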

> Looks like, somebody already noticed it:
> 
>         /*
>          * If the write was asynchronous then no one will be looking for the
>          * error.  Clear the error state and write the buffer out again.
>          *
>          * XXX: This helps against transient write errors, but we need to find
>          * a way to shut the filesystem down if the writes keep failing.
>          *
>          * In practice we'll shut the filesystem down soon as non-transient
>          * errors tend to affect the whole device and a failing log write
>          * will make us give up.  But we really ought to do better here.
>          */
> 
> 
> So, if I'm right about how we hit this problem, and IIRC, Dave's patchset for
> setting limits to IO errors can be slightly modified to fix this issue too, but,

The patchset I have doesn't need modification to fix this issue - it
has a patch specifically to address this, and it changes the default
behaviour to "fail async writes at unmount":

http://oss.sgi.com/archives/xfs/2015-08/msg00092.html

> the problem is that the user must set it BEFORE he tries to unmount the
> filesystem, otherwise it will get stuck here.

Yes, but that doesn't answer the big question: why don't the
periodic log forces that are failing with EIO cause a filesystem
shutdown? We issue a log force every 30s even during unmount, and a
failed log IO must cause the filesystem to shut down. So why aren't
these causing the filesystem to shut down as we'd expect when the
device has been pulled?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
