From: Dave Chinner <david@fromorbit.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>, Jens Axboe <axboe@kernel.dk>,
	tomaz.solc@tablix.org, aaron.lu@intel.com,
	linux-kernel@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Fengguang Wu <fengguang.wu@intel.com>
Subject: Re: Writeback threads and freezable
Date: Thu, 19 Dec 2013 15:08:21 +1100
Message-ID: <20131219040821.GW31386@dastard>
In-Reply-To: <20131218114343.GA4324@htj.dyndns.org>

On Wed, Dec 18, 2013 at 06:43:43AM -0500, Tejun Heo wrote:
> Hello, Dave.
> 
> On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote:
> > Perhaps the function "invalidate_partition()" is badly named. To
> > state the obvious, fsync != invalidation. What it does is:
> > 
> > 	1. sync filesystem
> > 	2. shrink the dcache
> > 	3. invalidates inodes and kills dirty inodes
> > 	4. invalidates block device (removes cached bdev pages)
> > 
> > Basically, the first step is "flush", the remainder is "invalidate".
> > 
> > Indeed, step 3 throws away dirty inodes, so why on earth would we
> > even bother with step 1 to try to clean them in the first place?
> > IOWs, the flush is irrelevant in the hot-unplug case as it will
> > fail to flush stuff, and then we just throw the stuff we
> > failed to write away.
> >
> > But in attempting to flush all the dirty data and metadata, we can
> > cause all sorts of other potential re-entrancy based deadlocks due
> > to attempting to issue IO. Whether they be freezer based or through
> > IO error handling triggering device removal or some other means, it
> > is irrelevant - it is the flush that causes all the problems.
> 
> Isn't the root cause there hotunplug reentering anything above it in
> the first place?  The problem with your proposal is that filesystem
> isn't the only place where this could happen.  Even with no filesystem
> involved, block device could still be dirty and IOs pending in
> whatever form - dirty bdi, bios queued in dm, requests queued in
> request_queue, whatever really - and if the hotunplug path reenters
> any of the higher layers in a way which blocks IO processing, it will
> deadlock.

Entirely possible.

> If knowing that the underlying device has gone away somehow helps
> filesystem, maybe we can expose that interface and avoid flushing
> after hotunplug but that merely hides the possible deadlock scenario
> that you're concerned about.  Nothing is really solved.

Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
suppress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.

Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....

> We can try to do the same thing at each layer and implement quick exit
> path for hot unplug all the way down to the driver but that kinda
> sounds complex and fragile to me.  It's a lot larger surface to cover
> when the root cause is hotunplug allowed to reenter anything at all
> from IO path.  This is especially true because hotunplug can trivially
> be made fully asynchronous in most cases.  In terms of destruction of
> higher level objects, warm and hot unplugs can and should behave
> identical.

I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....

> > We need to either get rid of the flush on device failure/hot-unplug,
> > or turn it into a callout for the superblock to take an appropriate
> > action (e.g. shutting down the filesystem) rather than trying to
> > issue IO. i.e. allow the filesystem to take appropriate action of
> > shutting down the filesystem and invalidating its caches.
> 
> There could be cases where some optimizations for hot unplug could be
> useful.  Maybe suppressing pointless duplicate warning messages or
> whatnot but I'm highly doubtful anything will be actually fixed that
> way.  We'll be most likely making bugs just less reproducible.
> 
> > Indeed, in XFS there's several other caches that could contain dirty
> > metadata that isn't invalidated by invalidate_partition(), and so
> > unless the filesystem is shut down it can continue to try to issue
> > IO on those buffers to the removed device until the filesystem is
> > shut down or unmounted.
> 
> Do you mean xfs never gives up after IO failures?

There's this thing called a transient IO failure which we have to
handle, e.g. multipath taking several minutes to detect a path
failure and fail over, whilst in the meantime IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.

If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.
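
As a rough illustration of that policy, here is a userspace sketch. It
is not the XFS code: every name in it is hypothetical, and the retry
limit of 3 is an assumption made only so the example terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the error policy described above. */
enum write_kind { WRITE_ASYNC_METADATA, WRITE_SYNC, WRITE_JOURNAL };
enum io_action  { IO_RETRY_LATER, IO_SHUTDOWN };

#define MAX_RETRIES 3	/* assumed limit, not taken from XFS */

struct buf_state {
	enum write_kind kind;
	unsigned int failures;	/* consecutive failed writes so far */
};

/* Decide what to do after a write on this buffer has failed. */
static enum io_action handle_write_error(struct buf_state *b)
{
	assert(b != NULL);

	/* Sync and journal writes cannot tolerate even transient errors. */
	if (b->kind != WRITE_ASYNC_METADATA)
		return IO_SHUTDOWN;

	/* Async metadata: treat the error as transient and try again,
	 * until continual failure forces a shutdown. */
	b->failures++;
	if (b->failures > MAX_RETRIES)
		return IO_SHUTDOWN;
	return IO_RETRY_LATER;
}
```

The point of the sketch is only that a single IO error on an async
metadata write does not, by itself, tell the filesystem that the device
is gone for good.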

> > Seriously, Tejun, the assumption that invalidate_partition() knows
> > enough about filesystems to safely "invalidate" them is just broken.
> > These days, filesystems often reference multiple block devices, and
> > so the way hotplug currently treats them as "one device, one
> > filesystem" is also fundamentally wrong.
> > 
> > So there's many ways in which the hot-unplug code is broken in its
> > use of invalidate_partition(), the least of which is the
> > dependencies caused by re-entrancy. We really need a
> > "sb->shutdown()" style callout as step one in the above process, not
> > fsync_bdev().
> 
> If filesystems need an indication that the underlying device is no
> longer functional, please go ahead and add it, but please keep in mind
> all these are completely asynchronous.  Nothing guarantees you that
> such events would happen in any specific order.  IOW, you can be at
> *ANY* point in your warm unplug path and the device is hot unplugged,
> which essentially forces all the code paths to be ready for the worst,
> and that's exactly why there isn't much effort in trying to separate
> out warm and hot unplug paths.

I'm not concerned about the problems that might happen if you hot
unplug during a warm unplug. All I care about is when a device is
invalidated the filesystem on top of it can take appropriate action.
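
To make the shape of that concrete, here is a small userspace model of
a shutdown callout taking the place of the flush when the device is
already gone. It is a sketch under assumptions, not kernel code: the
struct layout, the ops table and all function names are hypothetical
stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct super_block;

/* Hypothetical per-filesystem callout table. */
struct sb_ops {
	void (*shutdown)(struct super_block *sb);
};

/* Hypothetical, heavily reduced model of a mounted filesystem. */
struct super_block {
	const struct sb_ops *ops;
	bool shut_down;
	int flush_attempts;	/* how often we tried to issue IO */
};

/* Stand-in for the "flush" step: it would issue IO. */
static void flush_model(struct super_block *sb)
{
	sb->flush_attempts++;
}

/* Example callout: mark the filesystem dead, issue no IO. */
static void example_shutdown(struct super_block *sb)
{
	sb->shut_down = true;
}

/* Step one of invalidation: tell the filesystem instead of flushing
 * when the backing device has gone away. */
static void invalidate_model(struct super_block *sb, bool device_gone)
{
	assert(sb != NULL);

	if (device_gone && sb->ops && sb->ops->shutdown)
		sb->ops->shutdown(sb);
	else
		flush_model(sb);

	/* The cache invalidation steps would follow here. */
}
```

With the device gone, the model never attempts a flush, so there is no
IO to fail or to re-enter anything; the filesystem just learns that it
has to shut down.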

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
