From: Hannes Reinecke <hare@suse.de>
To: Al Viro <viro@ZenIV.linux.org.uk>,
Dan Williams <dan.j.williams@intel.com>
Cc: xfs@oss.sgi.com, linux-block@vger.kernel.org,
linux-nvdimm@ml01.01.org, Dave Chinner <david@fromorbit.com>,
Jens Axboe <axboe@fb.com>, Jan Kara <jack@suse.com>,
linux-fsdevel@vger.kernel.org,
Matthew Wilcox <willy@linux.intel.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
Date: Mon, 11 Jan 2016 16:24:37 +0100
Message-ID: <5693C935.3060701@suse.de>
In-Reply-To: <20160109075414.GA5008@ZenIV.linux.org.uk>

On 01/09/2016 08:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem specific heuristics to
>> attempt to guess when a block device is gone. Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past the
>> lifetime of the hosting block device. However, Dave points out that
>> these shutdown operations are needed in other scenarios. Quoting Dave:
>>
>> For example, if we detect a free space corruption during allocation,
>> it is not safe to trust *any active mapping* because we can't trust
>> that we haven't handed out the same block to multiple owners. Hence
>> on such a filesystem shutdown, we have to prevent any new DAX
>> mapping from occurring and invalidate all existing mappings as we
>> cannot allow userspace to modify any data or metadata until we've
>> resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic. We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by that
>> time we have deleted partitions and lost the ability to find all the
>> super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>> notifications to all the filesystems hosted on the disk. Where
>> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
>> and ->bdi_gone() is a set of actions to take after the backing device
>> is known to be permanently dead.
>
> Would you mind explaining what the hell is _the_ backing device
> of a filesystem? What does that translate into in case of e.g. btrfs
> spanning several disks? Or ext4 with journal on a different device, for
> that matter?
>
> If anything, I would argue that filesystem is out of place here -
> general situation is "IO on X may require IO on device Y and X needs to do
> something when Y goes away". Consider e.g. /dev/loop backed by a device
> that went away. Or by a file on fs that has run down the curtain and joined
> the bleedin choir invisible. With another fs partially hosted by that
> loopback device. Or by RAID0 containing said device.
>
> You are given Y and attempt to locate the affected X. _Then_
> you assume that X is a filesystem and has "something to be done" independent
> from the role Y played for it, so you can pick that action from superblock
> method.
>
> IMO you are placing the burden in the wrong place. _Recipient_
> knows what it depends upon and what should be done for each source of
> trouble. So make it recipient's responsibility to request notifications.
> At which point the superblock method goes away, along with the requirement
> to handle all sources of trouble the same way, etc.
>
> What's more, things like RAID5 (also interested in knowing when
> a component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
> I'd look into something along the lines of notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

Fully support this. I was planning on something similar to transport
device changes (resizing, topology change etc).

And it might even be an idea to convert the block device events to a
notifier chain, too.
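
To make the per-gendisk notifier idea quoted above a bit more concrete, here
is a rough userspace sketch in plain C. It is not kernel code and not a
proposed interface: every name in it (struct disk, disk_register_notifier(),
disk_notify(), the DISK_* events, struct mirror, struct fs) is made up for
illustration. It only shows the shape: the dependent registers a callback on
each device it cares about, decides per device what to do, and a redundant
layer may absorb an event instead of passing it on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical events a disk can announce to its dependents. */
enum disk_event { DISK_QUIESCE, DISK_GONE };

struct disk_notifier {
    void (*call)(struct disk_notifier *nb, enum disk_event ev);
    void *owner;                  /* whoever registered this callback */
    struct disk_notifier *next;
};

/* Stand-in for a gendisk: a name plus the chain of interested parties. */
struct disk {
    const char *name;
    struct disk_notifier *chain;
};

/* A dependent asks to be told about events on disk d. */
void disk_register_notifier(struct disk *d, struct disk_notifier *nb)
{
    assert(nb->call != NULL);
    nb->next = d->chain;
    d->chain = nb;
}

/* Walk the chain; returns how many callbacks were invoked. */
int disk_notify(struct disk *d, enum disk_event ev)
{
    int called = 0;
    for (struct disk_notifier *nb = d->chain; nb; nb = nb->next) {
        nb->call(nb, ev);
        called++;
    }
    return called;
}

/* Two-leg mirror: itself a disk, and a dependent of both legs. */
struct mirror {
    struct disk self;
    struct disk_notifier leg_nb[2];
    int legs_alive;
};

static void mirror_leg_event(struct disk_notifier *nb, enum disk_event ev)
{
    struct mirror *m = nb->owner;
    if (ev != DISK_GONE)
        return;
    /* Redundancy: only report upwards once the last leg is lost. */
    if (--m->legs_alive == 0)
        disk_notify(&m->self, DISK_GONE);
}

void mirror_init(struct mirror *m, const char *name,
                 struct disk *leg0, struct disk *leg1)
{
    struct disk *legs[2] = { leg0, leg1 };
    m->self.name = name;
    m->self.chain = NULL;
    m->legs_alive = 2;
    for (int i = 0; i < 2; i++) {
        m->leg_nb[i].call = mirror_leg_event;
        m->leg_nb[i].owner = m;
        m->leg_nb[i].next = NULL;
        disk_register_notifier(legs[i], &m->leg_nb[i]);
    }
}

/* Filesystem with a data device and a separate journal device. */
struct fs {
    int shut_down;
    int journal_lost;
    struct disk_notifier data_nb, journal_nb;
};

static void fs_data_event(struct disk_notifier *nb, enum disk_event ev)
{
    if (ev == DISK_GONE)
        ((struct fs *)nb->owner)->shut_down = 1;
}

static void fs_journal_event(struct disk_notifier *nb, enum disk_event ev)
{
    if (ev == DISK_GONE)
        ((struct fs *)nb->owner)->journal_lost = 1;
}

/* The filesystem picks a different reaction for each device it uses. */
void fs_attach(struct fs *fs, struct disk *data, struct disk *journal)
{
    fs->shut_down = 0;
    fs->journal_lost = 0;
    fs->data_nb = (struct disk_notifier){ fs_data_event, fs, NULL };
    fs->journal_nb = (struct disk_notifier){ fs_journal_event, fs, NULL };
    disk_register_notifier(data, &fs->data_nb);
    disk_register_notifier(journal, &fs->journal_nb);
}
```

With a filesystem attached to such a mirror, losing the first leg leaves the
filesystem untouched and only losing the second one shuts it down, while
losing the journal device takes a separate path. Unregistration, locking and
repeated events for the same leg are left out of the sketch on purpose.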

Dan, can you keep me in the loop here?

Thanks.

Cheers,

Hannes
--
Dr. Hannes Reinecke                zSeries & Storage
hare@suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Thread overview (17+ messages):
2016-01-04 18:20 [resend PATCH 0/3] fs, bdev: handle end of life Dan Williams
2016-01-04 18:20 ` [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life Dan Williams
2016-01-05 3:51 ` Dave Chinner
2016-01-05 4:25 ` Dan Williams
2016-01-05 22:32 ` Dave Chinner
2016-01-09 7:54 ` Al Viro
2016-01-09 14:17 ` Dan Williams
2016-01-11 7:15 ` Hannes Reinecke
2016-01-11 15:24 ` Hannes Reinecke [this message]
2016-01-11 15:55 ` Dan Williams
2016-01-04 18:20 ` [resend PATCH 2/3] xfs: handle shutdown notifications Dan Williams
2016-01-05 4:03 ` Dave Chinner
2016-01-04 18:20 ` [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty Dan Williams
2016-01-05 4:23 ` Dave Chinner
2016-01-05 19:59 ` Dan Williams
2016-01-05 21:10 ` Dave Chinner
2016-01-05 21:29 ` Dan Williams