From: Phil Terry <pterry@inphi.com>
To: Dave Chinner <david@fromorbit.com>, Jeff Moyer <jmoyer@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>,
linux-nvdimm <linux-nvdimm@ml01.01.org>,
Oleg Nesterov <oleg@redhat.com>,
Christoph Hellwig <hch@infradead.org>,
linux-mm <linux-mm@kvack.org>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
Date: Thu, 25 Feb 2016 13:08:28 -0800
Message-ID: <56CF6D4C.1020101@inphi.com>
In-Reply-To: <20160225201517.GA30721@dastard>
On 02/25/2016 12:15 PM, Dave Chinner wrote:
> On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
>> Jeff Moyer <jmoyer@redhat.com> writes:
>>
>>>> The big issue we have right now is that we haven't made the DAX/pmem
>>>> infrastructure work correctly and reliably for general use. Hence
>>>> adding new APIs to work around cases where we haven't yet provided
>>>> correct behaviour, let alone optimised for performance, is, quite
>>>> frankly, a clear case of premature optimisation.
>>> Again, I see the two things as separate issues. You need both.
>>> Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
>>> issue of making existing applications work safely.
>> I want to add one more thing to this discussion, just for the sake of
>> clarity. When I talk about existing applications and pmem, I mean
>> applications that already know how to detect and recover from torn
>> sectors. Any application that assumes hardware does not tear sectors
>> should be run on a file system layered on top of the btt.
> Which turns off DAX, and hence makes this a moot discussion because
> mmap is then buffered through the page cache and hence applications
> *must use msync/fsync* to provide data integrity. Which also makes
> them safe to use with DAX if we have a working fsync.
>
> Keep in mind that existing storage technologies tear filesystem data
> writes, too, because user data writes are filesystem block sized and
> not atomic at the device level (i.e. typical is 512 byte sector, 4k
> filesystem block size, so there are 7 points in a single write where
> a tear can occur on a crash).
Is that really true? Storage to date sits on the PCIe/SATA etc. IO chain.
The locking and application crash scenarios on the way down this
chain are such that the device will not have its DMA programmed until
the whole 4K (etc.) page has been flushed to memory, pinned for DMA, and
so on. Only then is the DMA to the device kicked off. If power fails
during the DMA, either the device is supercapped or battery backed so it
can flush its write cache, and/or its firmware will discard the damaged
results of the torn DMA during the device's internal metadata recovery
when power is restored. (For years the hardware/firmware on an HDD has
been far more complex than the simple mental model might lead one to
expect.) All of this is wrapped inside filesystem transaction semantics.

This is a crucial difference for "storage class memory" on the DRAM bus.
NVDIMMs cannot be DMA masters; they passively receive cache-line
writes. A "buffered DIMM", as alluded to in the pmem.io Device
Writers Guide, might have intelligence on the DIMM to detect, map and
recover from tearing via the Block Window Aperture driver interface, but
on a PMEM interface it cannot do so. Hence btt on the host, with full
transparency to manage the memory on the NVDIMM, is required for the
PMEM driver. Given this, it doesn't make sense to try to put it on the
device for the BW driver either.

In both cases, btt is not indirecting the buffer (as for a DMA master IO
type device) but is simply using the same pmem API primitives to manage
its own metadata about the filesystem writes, so that it can detect and
recover from tears after the event. In what sense is DAX disabled for this?

So I think (please correct me if I'm wrong) that the
hardware/firmware guys have been fixing the torn sector problem for the
last 30 years, and "storage on the memory channel" has reintroduced
the problem. To use an SSD analogy, you fix this problem with the
FTL, and as we've seen with recent software-defined flash and
open-channel approaches, you can have the FTL either on the device or on
the host. The absence of bus master DMA on a DIMM (even with the BW
aperture software) makes a device-based solution problematic, so the host
solution a la btt is required for both PMEM and BW.
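
To make the host-side idea concrete, here is a toy model of it. This is
only a sketch and not the kernel's btt code or on-media layout. The names
Pmem, ToyBtt and demo are made up for illustration, the 64-byte cacheline
and 512-byte block are assumed sizes, and the map update is assumed to be a
single atomic store. It contrasts an in-place write that loses power
partway through with the same write made to a spare block and committed by
flipping one map entry.

```python
CACHELINE = 64   # assumed tear granularity for stores to pmem
BLOCK = 512      # assumed sector size presented to the filesystem


class PowerFail(Exception):
    """Raised by the toy device to simulate losing power mid-write."""


class Pmem:
    """Toy persistent memory: data lands one cacheline at a time."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.fail_after = None  # cachelines to persist before "power loss"

    def store(self, off, data):
        for i in range(0, len(data), CACHELINE):
            if self.fail_after is not None:
                if self.fail_after == 0:
                    raise PowerFail()
                self.fail_after -= 1
            chunk = data[i:i + CACHELINE]
            self.mem[off + i:off + i + len(chunk)] = chunk


class ToyBtt:
    """Hypothetical host-side map: logical block -> physical block, plus one spare."""

    def __init__(self, pmem, nblocks):
        self.pmem = pmem
        self.map = list(range(nblocks))  # stands in for metadata kept on the media
        self.free = nblocks              # the spare physical block

    def write(self, lba, data):
        if len(data) != BLOCK:
            raise ValueError("write must be exactly one block")
        # The data may tear here, but only inside the spare block.
        self.pmem.store(self.free * BLOCK, data)
        # Commit point: one map entry changes (assumed atomic in this model).
        self.map[lba], self.free = self.free, self.map[lba]

    def read(self, lba):
        off = self.map[lba] * BLOCK
        return bytes(self.pmem.mem[off:off + BLOCK])


def demo():
    old, new = b"A" * BLOCK, b"B" * BLOCK

    # In-place overwrite, power lost after 3 cachelines: the block is torn.
    raw = Pmem(BLOCK)
    raw.store(0, old)
    raw.fail_after = 3
    try:
        raw.store(0, new)
    except PowerFail:
        pass
    torn = bytes(raw.mem)

    # Same failure through the toy map: the old block is still intact.
    pm = Pmem(2 * BLOCK)
    btt = ToyBtt(pm, 1)
    btt.write(0, old)
    pm.fail_after = 3
    try:
        btt.write(0, new)
    except PowerFail:
        pass
    return torn, btt.read(0)
```

Calling demo() returns the torn in-place block (part new data, part old)
and the block read back through the map, which is still entirely the old
data because the map entry was never flipped.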
>
> IOWs existing storage already has the capability of tearing user
> data on crash and has been doing so for at least the last 30 years.
> Hence I really don't see any fundamental difference here with
> pmem+DAX - the only difference is that the tear granularity is
> smaller (CPU cacheline rather than sector).
>
> Cheers,
>
> Dave.
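
For scale, the number of tear points in the two cases quoted above can be
counted directly. This is a small sketch of the arithmetic only; the
function name tear_points is made up, and the 64-byte cacheline is an
assumption (the quoted text gives only the 512-byte sector and 4k block).

```python
def tear_points(write_size, atomic_unit):
    """Count the internal boundaries where a write can be torn.

    Assumes the medium persists atomic_unit bytes atomically and that
    write_size is a whole multiple of atomic_unit.
    """
    if write_size % atomic_unit:
        raise ValueError("write_size must be a multiple of atomic_unit")
    return write_size // atomic_unit - 1


sector_tears = tear_points(4096, 512)    # 4k block made of 512-byte sectors
cacheline_tears = tear_points(4096, 64)  # same block, assumed 64-byte cachelines
```

With these inputs the sector case gives the 7 points mentioned in the
quoted text, and the cacheline case gives 63.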