From: Vishal Verma <vishal@kernel.org>
To: Dave Chinner <david@fromorbit.com>,
"Verma, Vishal L" <vishal.l.verma@intel.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
"jack@suse.cz" <jack@suse.cz>, "axboe@fb.com" <axboe@fb.com>,
"linux-nvdimm@ml01.01.org" <linux-nvdimm@ml01.01.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
"Wilcox, Matthew R" <matthew.r.wilcox@intel.com>
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Date: Tue, 26 Apr 2016 08:58:51 -0600
Message-ID: <1461682731.26226.20.camel@kernel.org>
In-Reply-To: <20160426004155.GF18496@dastard>

On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>
> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses its
> files. Users, backup programs, data replicators, filesystem
> re-organisers (e.g. defragmenters) etc. all may access the files and
> they may throw errors. What then?

In this scenario, backup applications etc. that try to read that data
before it has been replaced will just hit the errors and fail.
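
For reference, the fiemap step mentioned above could look roughly like
the sketch below. It is only an illustration, not part of these patches.
It assumes the application already has a bad range (bad_start, bad_len)
expressed in bytes in the same address space as fe_physical; how that
range is obtained is not covered here. The helper names and MAX_EXTENTS
are made up for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 64	/* assumption: real code would loop for more */

/* True if byte ranges [a, a + alen) and [b, b + blen) intersect. */
int ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
	return alen > 0 && blen > 0 && a < b + blen && b < a + alen;
}

/*
 * Count the extents of the file behind fd whose physical range
 * overlaps the given bad byte range. Returns -1 if the ioctl fails.
 * Only the first MAX_EXTENTS extents are examined.
 */
int count_bad_extents(int fd, uint64_t bad_start, uint64_t bad_len)
{
	size_t sz = sizeof(struct fiemap) +
		    MAX_EXTENTS * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);
	int hits = 0;

	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush before mapping */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		free(fm);
		return -1;
	}

	for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];

		if (ranges_overlap(e->fe_physical, e->fe_length,
				   bad_start, bad_len))
			hits++;
	}

	free(fm);
	return hits;
}
```

This only helps the application that owns the files; it does nothing
for the other readers described in the quoted text.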
>
<>
> > The data that was lost is gone -- this assumes the application has
> > some ability to recover using a journal/log or other redundancy -
> > yes, at the application layer. If it doesn't have this sort of
> > capability, the only option is to restore files from a backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
>
> What about filesystems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
>
> If so, shouldn't that be the focus of development rather than
> placing the burden on userspace apps to handle storage repair
> situations?

Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the
until-then solution that gets us most of the way there? I think we just
want to ensure that current DAX has some way to deal with errors. These
patches provide an admin-intervention recovery path, and possibly
another if the app wants to try something fancy for recovery.

<>
>
> >
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> >   - hits badblock
> >   - figures out it is able to recover the data
> >   - handles SIGBUS or EIO
> >   - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

Ah, when I said "figures out" I was only thinking of whether the
application has some redundancy/journalling, and whether it can recover
using that -- not additional recovery mechanisms at the block/fs layer.
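
To make case 1 concrete, here is a rough sketch of what such an
application might do on the read()/EIO path. It is an illustration
under several assumptions, not something these patches provide: the
application keeps its own redundant copy at the same offsets in a
second file (replica_fd, hypothetical), errors are tracked per 512-byte
sector, and file sizes are a multiple of the sector size. The SIGBUS
(mmap) path is not covered.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512	/* assumption: errors are tracked per sector */

/* Expand the byte range [off, off + len) outwards to sector boundaries. */
void sector_span(off_t off, size_t len, off_t *start, size_t *span)
{
	off_t end = off + (off_t)len;
	off_t aligned_end = (end + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;

	*start = off - off % SECTOR_SIZE;
	*span = (size_t)(aligned_end - *start);
}

/* pread() until len bytes have arrived; a short file gives EINVAL. */
int pread_full(int fd, void *buf, size_t len, off_t off)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = pread(fd, p, len, off);

		if (n < 0)
			return -1;
		if (n == 0) {
			errno = EINVAL;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}

/* pwrite() until len bytes have been written. */
int pwrite_full(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0)
			return -1;
		if (n == 0) {
			errno = EIO;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Read len bytes at off. On EIO, rewrite the enclosing sectors from
 * the replica with a sector aligned write, then retry the read once.
 * Returns 0 on success, -1 with errno set on failure.
 */
int read_with_repair(int fd, int replica_fd, void *buf, size_t len, off_t off)
{
	off_t start;
	size_t span;
	char *tmp;
	int ret, saved;

	if (pread_full(fd, buf, len, off) == 0)
		return 0;
	if (errno != EIO)
		return -1;

	sector_span(off, len, &start, &span);
	tmp = malloc(span);
	if (!tmp)
		return -1;

	ret = pread_full(replica_fd, tmp, span, start);
	if (ret == 0)
		ret = pwrite_full(fd, tmp, span, start);
	saved = errno;
	free(tmp);
	errno = saved;

	return ret < 0 ? -1 : pread_full(fd, buf, len, off);
}
```

If the replica read or the repair write fails, the application falls
through to case 2.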
>
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
>
> >
> > 2. Application doesn't have any inbuilt recovery mechanism
> >   - hits badblock
> >   - gets SIGBUS (or EIO) and crashes
> >   - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g. multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably....

Agreed that the pmem stack could use more redundancy and error
correction. Perhaps we could enable md-raid to RAID pmem devices and
then enable DAX on top of that, which would give us a better chance of
handling errors. But that level of recovery isn't what these patches
are aiming for -- that is obviously a longer term effort. These simply
aim to provide a disaster recovery path when a single sector failure
does take down the service.

Today, on a DAX-enabled filesystem, if/when the app hits an error and
crashes, DAX is simply disabled till the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks.) These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver, which can tell the device to do some remapping
etc.

So, yes, this is very much a DR case in our current pmem+dax
architecture, and we should probably design more robust handling at the
block/md/fs layer. But with these patches, you at least get to crash
the app, delete its files, restore them from out-of-band backups, and
continue with DAX.

>
> Cheers,
>
> Dave.