From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vishal Verma
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Date: Tue, 26 Apr 2016 08:58:51 -0600
Message-ID: <1461682731.26226.20.camel@kernel.org>
References: <1459303190-20072-1-git-send-email-vishal.l.verma@intel.com>
 <1459303190-20072-6-git-send-email-vishal.l.verma@intel.com>
 <20160420205923.GA24797@infradead.org>
 <1461434916.3695.7.camel@intel.com>
 <20160425083114.GA27556@infradead.org>
 <1461604476.3106.12.camel@intel.com>
 <20160425232552.GD18496@dastard>
 <1461628381.1421.24.camel@intel.com>
 <20160426004155.GF18496@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: "hch@infradead.org", "jack@suse.cz", "axboe@fb.com",
 "linux-nvdimm@ml01.01.org", "linux-kernel@vger.kernel.org",
 "xfs@oss.sgi.com", "linux-block@vger.kernel.org", "linux-mm@kvack.org",
 "viro@zeniv.linux.org.uk", "linux-fsdevel@vger.kernel.org",
 "akpm@linux-foundation.org", "linux-ext4@vger.kernel.org",
 "Wilcox, Matthew R"
To: Dave Chinner, "Verma, Vishal L"
Return-path:
In-Reply-To: <20160426004155.GF18496@dastard>
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>
>
> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses its
> files. Users, backup programs, data replicators, filesystem
> re-organisers (e.g. defragmenters) etc all may access the files and
> they may throw errors. What then?

In this scenario, backup applications etc that try to read that data
before it has been replaced will just hit the errors and fail.
>

<>

> > The data that was lost is gone -- this assumes the application has
> > some ability to recover using a journal/log or other redundancy -
> > yes, at the application layer. If it doesn't have this sort of
> > capability, the only option is to restore files from a
> > backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
>
> What about filesystems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
>
> If so, shouldn't that be the focus of development rather than
> placing the burden on userspace apps to handle storage repair
> situations?

Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the
until-then solution that gets us most of the way there? I think we
just want to ensure that current-DAX has some way to deal with errors,
and these patches provide an admin-intervention recovery path and
possibly another if the app wants to try something fancy for recovery.

<>
>
> >
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> >   - hits badblock
> >   - figures out it is able to recover the data
> >   - handles SIGBUS or EIO
> >   - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

Ah, when I said "figures out" I was only thinking of whether the
application has some redundancy/journalling, and whether it can recover
using that -- not additional recovery mechanisms at the block/fs layer.

>
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
>
> >
> > 2. Application doesn't have any inbuilt recovery mechanism
> >   - hits badblock
> >   - gets SIGBUS (or EIO) and crashes
> >   - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g. multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably....

Agreed that the pmem stack could use more redundancy and error
correction, perhaps by enabling md-raid to raid pmem devices and then
enabling DAX on top of that, so we'd have a better chance to handle
errors. But that level of recovery isn't what these patches are aiming
for -- that is obviously a longer term effort. These simply aim to
provide that disaster recovery path when a single sector failure does
take down the service.

Today, on a dax enabled filesystem, if/when the app hits an error and
crashes, dax is simply disabled till the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks.) These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver, which can tell the device to do some remapping
etc.

So, yes, this is very much a DR case in our current pmem+dax
architecture, and we should probably design more robust handling at the
block/md/fs layer, but with these, you at least get to crash the app,
delete its files and restore them from out-of-band backups and continue
with DAX.

>
> Cheers,
>
> Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
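
For readers of the archive, case 1 above (application has inbuilt
recovery) can be sketched in C for the read()/EIO path. This is only an
illustration under stated assumptions, not code from the patch series:
read_with_repair(), sector_align(), the in-memory replica buffer and the
fixed 512-byte SECTOR_SIZE are all hypothetical, and the sketch assumes,
as the thread proposes, that a sector-aligned write() goes through the
driver and clears the bad block. The mmap/SIGBUS path is not covered.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 512-byte sectors. Real code should query the device. */
#define SECTOR_SIZE 512

/* Round the byte range [off, off + len) outwards to sector boundaries. */
void sector_align(off_t off, size_t len, off_t *a_off, size_t *a_len)
{
    assert(off >= 0);
    off_t start = off - (off % SECTOR_SIZE);
    off_t end = off + (off_t)len;
    off_t rem = end % SECTOR_SIZE;

    if (rem != 0)
        end += SECTOR_SIZE - rem;
    *a_off = start;
    *a_len = (size_t)(end - start);
}

/*
 * Hypothetical helper. "replica" is the application's own redundant copy
 * of the file contents, indexed by file offset and padded to a whole
 * number of sectors.
 *
 * Returns 0 if the read succeeded, 1 if it failed with EIO and the range
 * was rewritten from the replica and re-read, -1 on any other failure.
 */
int read_with_repair(int fd, void *buf, size_t len, off_t off,
                     const unsigned char *replica)
{
    ssize_t n = pread(fd, buf, len, off);

    if (n == (ssize_t)len)
        return 0;
    if (n >= 0 || errno != EIO)
        return -1;              /* short read or unrelated error */

    /* Hit a bad block: restore the whole sector-aligned range. */
    off_t a_off;
    size_t a_len;
    sector_align(off, len, &a_off, &a_len);

    if (pwrite(fd, replica + a_off, a_len, a_off) != (ssize_t)a_len)
        return -1;
    if (fsync(fd) != 0)
        return -1;

    /* Retry the original read now that the range has been rewritten. */
    return pread(fd, buf, len, off) == (ssize_t)len ? 1 : -1;
}
```

An application without a replica has nothing to pass in here, which is
case 2: it fails on the EIO and the file is restored from backup.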