From: Dave Chinner <david@fromorbit.com>
To: Phil Terry <pterry@inphi.com>
Cc: Jeff Moyer <jmoyer@redhat.com>, Arnd Bergmann <arnd@arndb.de>,
	linux-nvdimm <linux-nvdimm@ml01.01.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Christoph Hellwig <hch@infradead.org>,
	linux-mm <linux-mm@kvack.org>, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
Date: Fri, 26 Feb 2016 08:39:19 +1100
Message-ID: <20160225213919.GC30721@dastard>
In-Reply-To: <56CF6D4C.1020101@inphi.com>
On Thu, Feb 25, 2016 at 01:08:28PM -0800, Phil Terry wrote:
> On 02/25/2016 12:15 PM, Dave Chinner wrote:
> >On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> >>Jeff Moyer <jmoyer@redhat.com> writes:
> >>
> >>>>The big issue we have right now is that we haven't made the DAX/pmem
> >>>>infrastructure work correctly and reliably for general use. Hence
> >>>>adding new APIs to work around cases where we haven't yet provided
> >>>>correct behaviour, let alone optimised for performance, is, quite
> >>>>frankly, a clear case of premature optimisation.
> >>>Again, I see the two things as separate issues. You need both.
> >>>Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> >>>issue of making existing applications work safely.
> >>I want to add one more thing to this discussion, just for the sake of
> >>clarity. When I talk about existing applications and pmem, I mean
> >>applications that already know how to detect and recover from torn
> >>sectors. Any application that assumes hardware does not tear sectors
> >>should be run on a file system layered on top of the btt.
> >Which turns off DAX, and hence makes this a moot discussion because
> >mmap is then buffered through the page cache and hence applications
> >*must use msync/fsync* to provide data integrity. Which also makes
> >them safe to use with DAX if we have a working fsync.
> >
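To illustrate the msync/fsync requirement in the quoted paragraph above, here is a minimal userspace sketch. The helper name store_durably and its error handling are assumptions made for illustration; the only real interfaces used are the POSIX calls open(), ftruncate(), mmap(), msync(), munmap() and close().

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): copy msg to the start of
 * the file at path through a shared mapping, then ask the kernel to
 * make it durable with msync(MS_SYNC).
 * Returns 0 on success, -1 on failure.
 */
int store_durably(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    size_t len = strlen(msg) + 1;           /* include the trailing NUL */
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Size the file so the mapped range is backed by file data. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The store itself gives no durability guarantee. */
    memcpy(map, msg, len);

    /* The integrity point: only after this returns 0 may the
     * application assume the data has reached stable storage. */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc;
}
```

A caller would treat a non-zero return from store_durably() as "data not known to be durable" and retry or report the error.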
> >Keep in mind that existing storage technologies tear filesystem data
> >writes, too, because user data writes are filesystem block sized and
> >not atomic at the device level (i.e. typical is 512 byte sector, 4k
> >filesystem block size, so there are 7 points in a single write where
> >a tear can occur on a crash).
> Is that really true? Storage to date is on the PCIE/SATA etc IO
> chain. The locks and application crash scenarios when traversing
> down this chain are such that the device will not have its DMA
> programmed until the whole 4K etc page is flushed to memory, pinned
That has nothing to do with DMA semantics. The storage devices we have
to deal with have volatile write caches, and we can't assume anything
about what they write when power fails except that single sector
writes are atomic.
> In both cases, btt is not indirecting the buffer (as for a DMA
> master IO type device) but is simply using the same pmem api
> primitives to manage its own meta data about the filesystem writes
> to detect and recover from tears after the event. In what sense is
> DAX disabled for this?
BTT is, IIRC, using write-ahead logging to stage every IO into pmem
so that after a crash the entire write can be recovered and replayed
to overwrite any torn sectors. This requires buffering at the page cache
level, as direct writes to the pmem will not get logged. Hence DAX
cannot be used on BTT devices. Indeed:
static const struct block_device_operations btt_fops = {
	.owner = THIS_MODULE,
	.rw_page = btt_rw_page,
	.getgeo = btt_getgeo,
	.revalidate_disk = nvdimm_revalidate_disk,
};
There's no .direct_access method implemented for btt devices, so
it's clear that filesystems on BTT devices cannot enable DAX.
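The reasoning above (no direct access method, hence no DAX) can be sketched with a small userspace mock. This is not kernel code: struct mock_bdev_ops, mock_btt_ops, mock_dax_capable_ops and mock_can_enable_dax() are all hypothetical names, and the callback signatures are simplified placeholders rather than the real block_device_operations prototypes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mock of an ops table; NOT the kernel's struct. */
struct mock_bdev_ops {
    int  (*rw_page)(void);
    long (*direct_access)(void);
};

/* Placeholder callbacks so the tables have something to point at. */
static int  mock_rw_page(void)       { return 0; }
static long mock_direct_access(void) { return 0; }

/* Mirrors the shape of btt_fops: direct_access is left unset (NULL). */
const struct mock_bdev_ops mock_btt_ops = {
    .rw_page = mock_rw_page,
};

/* A hypothetical device that does provide direct access. */
const struct mock_bdev_ops mock_dax_capable_ops = {
    .rw_page       = mock_rw_page,
    .direct_access = mock_direct_access,
};

/* A filesystem can only consider DAX if the device offers the hook. */
bool mock_can_enable_dax(const struct mock_bdev_ops *ops)
{
    return ops != NULL && ops->direct_access != NULL;
}
```

With these tables, mock_can_enable_dax(&mock_btt_ops) is false and mock_can_enable_dax(&mock_dax_capable_ops) is true, which is the same distinction the missing .direct_access entry makes for btt.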
> So I think (please correct me if I'm wrong) but actually the
> hardware/firmware guys have been fixing the torn sector problem for
I was not talking about torn /sectors/. I was talking about a user
data write being made up of *multiple sectors*, and so there is no
atomicity guarantee for a user data write on existing storage when
the filesystem block size (user data IO size) is larger than the
device sector size.
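As a worked example of the arithmetic, the sketch below counts the sector boundaries inside one filesystem block at which a crash could leave the write partially applied. The function name tear_points is an assumption for illustration, and it assumes, as stated earlier in this message, that single sector writes are atomic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of internal sector boundaries in one
 * filesystem block, i.e. the points where a multi-sector write can be
 * torn if each individual sector write is atomic.
 */
size_t tear_points(size_t block_size, size_t sector_size)
{
    /* The block must be a whole number of sectors. */
    assert(sector_size > 0 && block_size % sector_size == 0);

    size_t sectors = block_size / sector_size;
    return sectors > 0 ? sectors - 1 : 0;
}
```

For the typical case quoted above, tear_points(4096, 512) gives 7, and it drops to 0 only when the filesystem block size equals the device sector size.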
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com