Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Howard Chu <hyc@symas.com>
To: Dave Chinner <david@fromorbit.com>,
	Andy Lutomirski <luto@amacapital.net>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	lsf-pc@lists.linux-foundation.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
Date: Wed, 22 Jan 2014 00:13:15 -0800	[thread overview]
Message-ID: <52DF7D9B.20904@symas.com> (raw)
In-Reply-To: <20140121230333.GH13997@dastard>

Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>>>>>> Andy Lutomirski wrote:
>>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>>>>>>> Andy Lutomirski wrote:
>>>>>>>>> I'm interested in a persistent memory track.  There seems to be plenty
>>>>>>>>> of other emails about this, but here's my take:
>>>>>>>>
>>>>>>>> I'm also interested in this track. I'm not up on FS development these
>>>>>>>> days, the last time I wrote filesystem code was nearly 20 years ago. But
>>>>>>>> persistent memory is a topic near and dear to my heart, and of great
>>>>>>>> relevance to my current pet project, the LMDB memory-mapped database.
>>>>>>>>
>>>>>>>> In a previous era I also developed block device drivers for
>>>>>>>> battery-backed external DRAM disks. (My ideal would have been systems
>>>>>>>> where all of RAM was persistent. I suppose we can just about get there
>>>>>>>> with mobile phones and tablets these days.)
>>>>>>>>
>>>>>>>> In the context of database engines, I'm interested in leveraging
>>>>>>>> persistent memory for write-back caching and how user level code can be
>>>>>>>> made aware of it. (If all your cache is persistent and guaranteed to
>>>>>>>> eventually reach stable store then you never need to fsync() a
>>>>>>>> transaction.)
>>>>>
>>>>> I don't think that is true -  your still going to need fsync to get
>>>>> the CPU to flush it's caches and filesystem metadata into the
>>>>> persistent domain....
>>>>
>>>> I think that this depends on the technology in question.
>>>>
>>>> I suspect (I don't know for sure) that, if the mapping is WT or UC,
>>>> that it would be possible to get the data fully flushed to persistent
>>>> storage by doing something like a UC read from any appropriate type of
>>>> I/O space (someone from Intel would have to confirm).
>>>
>>> And what of the filesystem metadata that is necessary to reference
>>> that data? What flushes that? e.g. using mmap of sparse files to
>>> dynamically allocate persistent memory space requires fdatasync() at
>>> minimum....

Why are you talking about fdatasync(), which is used to *avoid* flushing metadata?
For reference, we've found that we get highest DB performance using ext2fs 
with a preallocated file. In that case, we can use fdatasync() and then 
there's no metadata updates whatsoever. This also means we can ignore the 
question of FS corruption on a crash.

>> If we're using dm-crypt using an NV-DIMM "block" device as cache and a
>> real disk as backing store, then ideally mmap would map the NV-DIMM
>> directly if the data in question lives there.
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them are operating at the
> filesystem level or are aware of the difference between fileystem
> metadata and user data.

Why should that layer need to be aware? A page is a page, as far as they're 
concerned.

> But talking about non-existent block layer
> functionality doesn't answer my the question about keeping user data
> and filesystem metadata needed to reference that user data
> coherent in persistent memory...

One of the very useful tools for PCs in the '80s was reset-survivable 
RAMdisks. Given the existence of persistent memory in a machine, this is a 
pretty obvious feature to provide.

>> If that's happening,
>> then, assuming that there are no metadata changes, you could just
>> flush the relevant hw caches.  This assumes, of course, no dm-crypt,
>> no btrfs-style checksumming, and, in general, nothing else that would
>> require stable pages or similar things.
>
> Well yes. Data IO path transformations are another reason why we'll
> need the volatile page cache involved in the persistent memory IO
> path. It follows immediately from this that applicaitons will still
> require fsync() and other data integrity operations because they
> have no idea where the persistence domain boundary lives in the IO
> stack.

And my point, stated a few times now, is there should be a way for 
applications to discover the existence and characteristics of persistent 
memory being used in the system.

>>> And then there's things like encrypted persistent memory when means
>>> applications can't directly access it and so mmap() will be buffered
>>> by the page cache just like a normal block device...
>>>
>>>> All of this suggests to me that a vsyscall "sync persistent memory"
>>>> might be better than a real syscall.
>>>
>>> Perhaps, but that implies some method other than a filesystem to
>>> manage access to persistent memory.
>>
>> It should be at least as good as fdatasync if using XIP or something like pmfs.
>>
>> For my intended application, I want to use pmfs or something similar
>> directly.  This means that I want really fast synchronous flushes, and
>> I suspect that the usual set of fs calls that handle fdatasync are
>> already quite a bit slower than a vsyscall would be, assuming that no
>> MSR write is needed.
>
> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the applications address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.

>
>
>>>> For what it's worth, some of the NV-DIMM systems are supposed to be
>>>> configured in such a way that, if power fails, an NMI, SMI, or even
>>>> (not really sure) a hardwired thing in the memory controller will
>>>> trigger the requisite flush.  I don't personally believe in this if
>>>> L2/L3 cache are involved (they're too big), but for the little write
>>>> buffers and memory controller things, this seems entirely plausible.
>>>
>>> Right - at the moment we have to assume the persistence domain
>>> starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
>>> I have no idea if/when we'll be seeing CPUs that have persistent
>>> caches, so we have to assume that data is still volatile and can be
>>> lost unless it has been specifically synced to persistent memory.
>>> i.e. persistent memory does not remove the need for fsync and
>>> friends...
>>
>> I have (NDAed and not entirely convincing) docs indicating a way (on
>> hardware that I don't have access to) to make the caches be part of
>> the persistence domain.
>
> Every platform will implement persistence domain
> mangement differently. So we can't assume that what works on one
> platform is going to work or be compatible with any other
> platform....
>
>> I also have non-NDA'd docs that suggest that
>> it's really very fast to flush things through the memory controller.
>> (I would need to time it, though.  I do have this hardware, and it
>> more or less works.)
>
> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?
>
> Cheers,
>
> Dave.
>


-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

next prev parent reply	other threads:[~2014-01-22  8:13 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-17  0:56 [LSF/MM TOPIC] [ATTEND] Persistent memory Andy Lutomirski
2014-01-17  4:17 ` Howard Chu
2014-01-17 19:22   ` Andy Lutomirski
2014-01-21  7:38     ` Howard Chu
2014-01-21 11:17       ` [Lsf-pc] " Dave Chinner
2014-01-21 13:57         ` Howard Chu
2014-01-21 20:20           ` Dave Chinner
2014-01-21 16:48         ` Andy Lutomirski
2014-01-21 20:36           ` Dave Chinner
2014-01-21 20:59             ` Andy Lutomirski
2014-01-21 23:03               ` Dave Chinner
2014-01-21 23:22                 ` Andy Lutomirski
2014-01-22  8:13                 ` Howard Chu [this message]
2014-01-23 19:54                   ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52DF7D9B.20904@symas.com \
    --to=hyc@symas.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=luto@amacapital.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).