From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vishal Verma <vishal@kernel.org>
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Date: Tue, 26 Apr 2016 08:58:51 -0600
Message-ID: <1461682731.26226.20.camel@kernel.org>
References: <1459303190-20072-1-git-send-email-vishal.l.verma@intel.com>
	 <1459303190-20072-6-git-send-email-vishal.l.verma@intel.com>
	 <x49twj26edj.fsf@segfault.boston.devel.redhat.com>
	 <20160420205923.GA24797@infradead.org> <1461434916.3695.7.camel@intel.com>
	 <20160425083114.GA27556@infradead.org> <1461604476.3106.12.camel@intel.com>
	 <20160425232552.GD18496@dastard> <1461628381.1421.24.camel@intel.com>
	 <20160426004155.GF18496@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: "hch@infradead.org" <hch@infradead.org>, "jack@suse.cz" <jack@suse.cz>,
 "axboe@fb.com" <axboe@fb.com>, "linux-nvdimm@ml01.01.org"
 <linux-nvdimm@ml01.01.org>,  "linux-kernel@vger.kernel.org"
 <linux-kernel@vger.kernel.org>, "xfs@oss.sgi.com" <xfs@oss.sgi.com>,
 "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
 "linux-mm@kvack.org" <linux-mm@kvack.org>,  "viro@zeniv.linux.org.uk"
 <viro@zeniv.linux.org.uk>, "linux-fsdevel@vger.kernel.org"
 <linux-fsdevel@vger.kernel.org>, "akpm@linux-foundation.org"
 <akpm@linux-foundation.org>, "linux-ext4@vger.kernel.org"
 <linux-ext4@vger.kernel.org>, "Wilcox, Matthew R"
 <matthew.r.wilcox@intel.com>
To: Dave Chinner <david@fromorbit.com>, "Verma, Vishal L"
	 <vishal.l.verma@intel.com>
Return-path: <owner-linux-mm@kvack.org>
In-Reply-To: <20160426004155.GF18496@dastard>
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>

> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses its
> files. Users, backup programs, data replicators, filesystem
> re-organisers (e.g. defragmenters) etc. all may access the files and
> they may throw errors. What then?

In this scenario, backup applications etc. that try to read that data
before it has been replaced will just hit the errors and fail.

>

<>

> > The data that was lost is gone -- this assumes the application has
> > some ability to recover using a journal/log or other redundancy -
> > yes, at the application layer. If it doesn't have this sort of
> > capability, the only option is to restore files from a
> > backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
>
> What about filesystems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
>
> If so, shouldn't that be the focus of development rather than
> placing the burden on userspace apps to handle storage repair
> situations?

Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the
until-then solution that gets us most of the way there? I think we just
want to ensure that current DAX has some way to deal with errors. These
patches provide an admin-intervention recovery path, and possibly
another if the app wants to try something fancy for recovery.

<>
>
> >
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> >   - hits badblock
> >   - figures out it is able to recover the data
> >   - handles SIGBUS or EIO
> >   - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

Ah, when I said "figures out" I was only thinking of whether the
application has some redundancy/journalling and can recover using
that -- not of additional recovery mechanisms at the block/fs layer.
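
As a rough illustration of case 1 (not part of these patches), the
sketch below shows what application-level recovery might look like. It
assumes the application keeps its own redundant copy in a second file
(the `replica_fd` argument is hypothetical), that a read of the damaged
range fails with EIO rather than SIGBUS, and that sectors are 512
bytes. The helper names are made up for the example.

```python
import errno
import os

SECTOR = 512  # assumed sector size; real code should query the device


def aligned_range(offset, length, sector=SECTOR):
    """Round [offset, offset + length) outwards to sector boundaries."""
    start = (offset // sector) * sector
    end = -(-(offset + length) // sector) * sector  # ceiling division
    return start, end - start


def read_or_repair(fd, replica_fd, offset, length):
    """Read from fd; on EIO, rewrite the aligned range from replica_fd."""
    try:
        return os.pread(fd, length, offset)
    except OSError as e:
        if e.errno != errno.EIO:
            raise
    # The read hit a bad block: fetch the whole sector-aligned range
    # from the application's own redundant copy.
    start, span = aligned_range(offset, length)
    good = os.pread(replica_fd, span, start)
    if len(good) != span:
        raise OSError(errno.EIO, "replica cannot cover the damaged range")
    # A sector-aligned write() is what gives the driver a chance to
    # clear or remap the bad block.
    if os.pwrite(fd, good, start) != span:
        raise OSError(errno.EIO, "short write while restoring data")
    os.fsync(fd)
    return good[offset - start:offset - start + length]
```

If the replica cannot supply the full aligned range, this falls through
to case 2: the error propagates and the file has to be restored from
backup.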

>
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
>
> >
> > 2. Application doesn't have any inbuilt recovery mechanism
> >   - hits badblock
> >   - gets SIGBUS (or EIO) and crashes
> >   - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g. multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably....

Agreed that the pmem stack could use more redundancy and error
correction. Perhaps enabling md-raid to RAID pmem devices, and then
enabling DAX on top of that, would give us a better chance of handling
errors. But that level of recovery isn't what these patches are aiming
for -- that is obviously a longer term effort. These simply aim to
provide a disaster recovery path when a single sector failure does
take down the service.

Today, on a DAX enabled filesystem, if/when the app hits an error and
crashes, DAX is simply disabled until the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks.) These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver, which can then tell the device to do some
remapping etc.

So, yes, this is very much a DR case in our current pmem+dax
architecture, and we should probably design more robust handling at the
block/md/fs layer. But with these patches, you at least get to crash the
app, delete its files, restore them from out-of-band backups, and
continue with DAX.
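
For the admin-intervention path, something like the following could
help an administrator see which ranges of a file are unreadable before
deciding to restore it. This is only a sketch under stated assumptions:
it assumes bad sectors surface as EIO from a plain pread(), it assumes
512-byte sectors, and `merge_ranges` and `find_bad_ranges` are
illustrative names rather than existing tools.

```python
import errno
import os

SECTOR = 512  # assumed sector size


def merge_ranges(bad_offsets, sector=SECTOR):
    """Coalesce sector offsets into contiguous (offset, length) ranges."""
    ranges = []
    for off in sorted(bad_offsets):
        if ranges and ranges[-1][0] + ranges[-1][1] == off:
            # Adjacent to the previous range: extend it by one sector.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + sector)
        else:
            ranges.append((off, sector))
    return ranges


def find_bad_ranges(path, sector=SECTOR):
    """Read a file sector by sector and report ranges failing with EIO."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for off in range(0, size, sector):
            try:
                os.pread(fd, sector, off)
            except OSError as e:
                if e.errno != errno.EIO:
                    raise
                bad.append(off)
    finally:
        os.close(fd)
    return merge_ranges(bad, sector)
```

An empty result means every sector was readable. A non-empty one lists
the ranges that would need to be rewritten, or tells the admin that the
file should be deleted and restored from backup.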

>
> Cheers,
>
> Dave.

