Subject: Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
From: Hannes Reinecke
Date: Mon, 11 Jan 2016 16:24:37 +0100
Message-ID: <5693C935.3060701@suse.de>
In-Reply-To: <20160109075414.GA5008@ZenIV.linux.org.uk>
To: Al Viro, Dan Williams
Cc: Jens Axboe, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com, linux-block@vger.kernel.org, Jan Kara, linux-fsdevel@vger.kernel.org, Matthew Wilcox, Ross Zwisler

On 01/09/2016 08:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem-specific heuristics to
>> attempt to guess when a block device is gone. Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past
>> the lifetime of the hosting block device. However, Dave points out that
>> these shutdown operations are needed in other scenarios. Quoting Dave:
>>
>>     For example, if we detect a free space corruption during
>>     allocation, it is not safe to trust *any active mapping* because
>>     we can't trust that we haven't handed out the same block to
>>     multiple owners. Hence on such a filesystem shutdown, we have to
>>     prevent any new DAX mapping from occurring and invalidate all
>>     existing mappings as we cannot allow userspace to modify any data
>>     or metadata until we've resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic. We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by
>> that time we have deleted partitions and lost the ability to find all
>> the super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>> notifications to all the filesystems hosted on the disk, where
>> ->quiesce() covers 'shutdown' operations while the bdev may still be
>> alive, and ->bdi_gone() is a set of actions to take after the backing
>> device is known to be permanently dead.
>
> Would you mind explaining what the hell is _the_ backing device
> of a filesystem? What does that translate into in the case of e.g.
> btrfs spanning several disks? Or ext4 with its journal on a different
> device, for that matter?
>
> If anything, I would argue that "filesystem" is out of place here -
> the general situation is "IO on X may require IO on device Y, and X
> needs to do something when Y goes away". Consider e.g. /dev/loop backed
> by a device that went away. Or by a file on an fs that has run down the
> curtain and joined the bleedin' choir invisible.
> With another fs partially hosted by that loopback device. Or by RAID0
> containing said device.
>
> You are given Y and attempt to locate the affected X. _Then_ you
> assume that X is a filesystem and has "something to be done"
> independent of the role Y played for it, so you can pick that action
> from a superblock method.
>
> IMO you are placing the burden in the wrong place. The _recipient_
> knows what it depends upon and what should be done for each source of
> trouble. So make it the recipient's responsibility to request
> notifications. At which point the superblock method goes away, along
> with the requirement to handle all sources of trouble the same way,
> etc.
>
> What's more, things like RAID5 (also interested in knowing when a
> component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
> I'd look into something along the lines of a notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

I fully support this. I was planning on something similar to transport
device changes (resizing, topology changes, etc.). And it might even be
an idea to convert the block device events to a notifier chain, too.

Dan, can you keep me in the loop here?

Thanks.

Cheers,

Hannes
--
Dr. Hannes Reinecke                          zSeries & Storage
hare@suse.de                                 +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs