From: Howard Chu <hyc@symas.com>
To: Dave Chinner <david@fromorbit.com>,
Andy Lutomirski <luto@amacapital.net>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
lsf-pc@lists.linux-foundation.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
Date: Wed, 22 Jan 2014 00:13:15 -0800 [thread overview]
Message-ID: <52DF7D9B.20904@symas.com> (raw)
In-Reply-To: <20140121230333.GH13997@dastard>
Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>>>>>> Andy Lutomirski wrote:
>>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>>>>>>> Andy Lutomirski wrote:
>>>>>>>>> I'm interested in a persistent memory track. There seems to be plenty
>>>>>>>>> of other emails about this, but here's my take:
>>>>>>>>
>>>>>>>> I'm also interested in this track. I'm not up on FS development these
>>>>>>>> days, the last time I wrote filesystem code was nearly 20 years ago. But
>>>>>>>> persistent memory is a topic near and dear to my heart, and of great
>>>>>>>> relevance to my current pet project, the LMDB memory-mapped database.
>>>>>>>>
>>>>>>>> In a previous era I also developed block device drivers for
>>>>>>>> battery-backed external DRAM disks. (My ideal would have been systems
>>>>>>>> where all of RAM was persistent. I suppose we can just about get there
>>>>>>>> with mobile phones and tablets these days.)
>>>>>>>>
>>>>>>>> In the context of database engines, I'm interested in leveraging
>>>>>>>> persistent memory for write-back caching and how user level code can be
>>>>>>>> made aware of it. (If all your cache is persistent and guaranteed to
>>>>>>>> eventually reach stable store then you never need to fsync() a
>>>>>>>> transaction.)
>>>>>
>>>>> I don't think that is true - your still going to need fsync to get
>>>>> the CPU to flush it's caches and filesystem metadata into the
>>>>> persistent domain....
>>>>
>>>> I think that this depends on the technology in question.
>>>>
>>>> I suspect (I don't know for sure) that, if the mapping is WT or UC,
>>>> that it would be possible to get the data fully flushed to persistent
>>>> storage by doing something like a UC read from any appropriate type of
>>>> I/O space (someone from Intel would have to confirm).
>>>
>>> And what of the filesystem metadata that is necessary to reference
>>> that data? What flushes that? e.g. using mmap of sparse files to
>>> dynamically allocate persistent memory space requires fdatasync() at
>>> minimum....
Why are you talking about fdatasync(), which is used to *avoid* flushing metadata?
For reference, we've found that we get highest DB performance using ext2fs
with a preallocated file. In that case, we can use fdatasync() and then
there's no metadata updates whatsoever. This also means we can ignore the
question of FS corruption on a crash.
>> If we're using dm-crypt using an NV-DIMM "block" device as cache and a
>> real disk as backing store, then ideally mmap would map the NV-DIMM
>> directly if the data in question lives there.
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them are operating at the
> filesystem level or are aware of the difference between fileystem
> metadata and user data.
Why should that layer need to be aware? A page is a page, as far as they're
concerned.
> But talking about non-existent block layer
> functionality doesn't answer my the question about keeping user data
> and filesystem metadata needed to reference that user data
> coherent in persistent memory...
One of the very useful tools for PCs in the '80s was reset-survivable
RAMdisks. Given the existence of persistent memory in a machine, this is a
pretty obvious feature to provide.
>> If that's happening,
>> then, assuming that there are no metadata changes, you could just
>> flush the relevant hw caches. This assumes, of course, no dm-crypt,
>> no btrfs-style checksumming, and, in general, nothing else that would
>> require stable pages or similar things.
>
> Well yes. Data IO path transformations are another reason why we'll
> need the volatile page cache involved in the persistent memory IO
> path. It follows immediately from this that applicaitons will still
> require fsync() and other data integrity operations because they
> have no idea where the persistence domain boundary lives in the IO
> stack.
And my point, stated a few times now, is there should be a way for
applications to discover the existence and characteristics of persistent
memory being used in the system.
>>> And then there's things like encrypted persistent memory when means
>>> applications can't directly access it and so mmap() will be buffered
>>> by the page cache just like a normal block device...
>>>
>>>> All of this suggests to me that a vsyscall "sync persistent memory"
>>>> might be better than a real syscall.
>>>
>>> Perhaps, but that implies some method other than a filesystem to
>>> manage access to persistent memory.
>>
>> It should be at least as good as fdatasync if using XIP or something like pmfs.
>>
>> For my intended application, I want to use pmfs or something similar
>> directly. This means that I want really fast synchronous flushes, and
>> I suspect that the usual set of fs calls that handle fdatasync are
>> already quite a bit slower than a vsyscall would be, assuming that no
>> MSR write is needed.
>
> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the applications address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.
>
>
>>>> For what it's worth, some of the NV-DIMM systems are supposed to be
>>>> configured in such a way that, if power fails, an NMI, SMI, or even
>>>> (not really sure) a hardwired thing in the memory controller will
>>>> trigger the requisite flush. I don't personally believe in this if
>>>> L2/L3 cache are involved (they're too big), but for the little write
>>>> buffers and memory controller things, this seems entirely plausible.
>>>
>>> Right - at the moment we have to assume the persistence domain
>>> starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
>>> I have no idea if/when we'll be seeing CPUs that have persistent
>>> caches, so we have to assume that data is still volatile and can be
>>> lost unless it has been specifically synced to persistent memory.
>>> i.e. persistent memory does not remove the need for fsync and
>>> friends...
>>
>> I have (NDAed and not entirely convincing) docs indicating a way (on
>> hardware that I don't have access to) to make the caches be part of
>> the persistence domain.
>
> Every platform will implement persistence domain
> mangement differently. So we can't assume that what works on one
> platform is going to work or be compatible with any other
> platform....
>
>> I also have non-NDA'd docs that suggest that
>> it's really very fast to flush things through the memory controller.
>> (I would need to time it, though. I do have this hardware, and it
>> more or less works.)
>
> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?
>
> Cheers,
>
> Dave.
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
WARNING: multiple messages have this Message-ID (diff)
From: Howard Chu <hyc@symas.com>
To: Dave Chinner <david@fromorbit.com>,
Andy Lutomirski <luto@amacapital.net>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
lsf-pc@lists.linux-foundation.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
Date: Wed, 22 Jan 2014 00:13:15 -0800 [thread overview]
Message-ID: <52DF7D9B.20904@symas.com> (raw)
In-Reply-To: <20140121230333.GH13997@dastard>
Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>>>>>> Andy Lutomirski wrote:
>>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>>>>>>> Andy Lutomirski wrote:
>>>>>>>>> I'm interested in a persistent memory track. There seems to be plenty
>>>>>>>>> of other emails about this, but here's my take:
>>>>>>>>
>>>>>>>> I'm also interested in this track. I'm not up on FS development these
>>>>>>>> days, the last time I wrote filesystem code was nearly 20 years ago. But
>>>>>>>> persistent memory is a topic near and dear to my heart, and of great
>>>>>>>> relevance to my current pet project, the LMDB memory-mapped database.
>>>>>>>>
>>>>>>>> In a previous era I also developed block device drivers for
>>>>>>>> battery-backed external DRAM disks. (My ideal would have been systems
>>>>>>>> where all of RAM was persistent. I suppose we can just about get there
>>>>>>>> with mobile phones and tablets these days.)
>>>>>>>>
>>>>>>>> In the context of database engines, I'm interested in leveraging
>>>>>>>> persistent memory for write-back caching and how user level code can be
>>>>>>>> made aware of it. (If all your cache is persistent and guaranteed to
>>>>>>>> eventually reach stable store then you never need to fsync() a
>>>>>>>> transaction.)
>>>>>
>>>>> I don't think that is true - your still going to need fsync to get
>>>>> the CPU to flush it's caches and filesystem metadata into the
>>>>> persistent domain....
>>>>
>>>> I think that this depends on the technology in question.
>>>>
>>>> I suspect (I don't know for sure) that, if the mapping is WT or UC,
>>>> that it would be possible to get the data fully flushed to persistent
>>>> storage by doing something like a UC read from any appropriate type of
>>>> I/O space (someone from Intel would have to confirm).
>>>
>>> And what of the filesystem metadata that is necessary to reference
>>> that data? What flushes that? e.g. using mmap of sparse files to
>>> dynamically allocate persistent memory space requires fdatasync() at
>>> minimum....
Why are you talking about fdatasync(), which is used to *avoid* flushing metadata?
For reference, we've found that we get highest DB performance using ext2fs
with a preallocated file. In that case, we can use fdatasync() and then
there's no metadata updates whatsoever. This also means we can ignore the
question of FS corruption on a crash.
>> If we're using dm-crypt using an NV-DIMM "block" device as cache and a
>> real disk as backing store, then ideally mmap would map the NV-DIMM
>> directly if the data in question lives there.
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them are operating at the
> filesystem level or are aware of the difference between fileystem
> metadata and user data.
Why should that layer need to be aware? A page is a page, as far as they're
concerned.
> But talking about non-existent block layer
> functionality doesn't answer my the question about keeping user data
> and filesystem metadata needed to reference that user data
> coherent in persistent memory...
One of the very useful tools for PCs in the '80s was reset-survivable
RAMdisks. Given the existence of persistent memory in a machine, this is a
pretty obvious feature to provide.
>> If that's happening,
>> then, assuming that there are no metadata changes, you could just
>> flush the relevant hw caches. This assumes, of course, no dm-crypt,
>> no btrfs-style checksumming, and, in general, nothing else that would
>> require stable pages or similar things.
>
> Well yes. Data IO path transformations are another reason why we'll
> need the volatile page cache involved in the persistent memory IO
> path. It follows immediately from this that applicaitons will still
> require fsync() and other data integrity operations because they
> have no idea where the persistence domain boundary lives in the IO
> stack.
And my point, stated a few times now, is there should be a way for
applications to discover the existence and characteristics of persistent
memory being used in the system.
>>> And then there's things like encrypted persistent memory when means
>>> applications can't directly access it and so mmap() will be buffered
>>> by the page cache just like a normal block device...
>>>
>>>> All of this suggests to me that a vsyscall "sync persistent memory"
>>>> might be better than a real syscall.
>>>
>>> Perhaps, but that implies some method other than a filesystem to
>>> manage access to persistent memory.
>>
>> It should be at least as good as fdatasync if using XIP or something like pmfs.
>>
>> For my intended application, I want to use pmfs or something similar
>> directly. This means that I want really fast synchronous flushes, and
>> I suspect that the usual set of fs calls that handle fdatasync are
>> already quite a bit slower than a vsyscall would be, assuming that no
>> MSR write is needed.
>
> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the applications address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.
>
>
>>>> For what it's worth, some of the NV-DIMM systems are supposed to be
>>>> configured in such a way that, if power fails, an NMI, SMI, or even
>>>> (not really sure) a hardwired thing in the memory controller will
>>>> trigger the requisite flush. I don't personally believe in this if
>>>> L2/L3 cache are involved (they're too big), but for the little write
>>>> buffers and memory controller things, this seems entirely plausible.
>>>
>>> Right - at the moment we have to assume the persistence domain
>>> starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
>>> I have no idea if/when we'll be seeing CPUs that have persistent
>>> caches, so we have to assume that data is still volatile and can be
>>> lost unless it has been specifically synced to persistent memory.
>>> i.e. persistent memory does not remove the need for fsync and
>>> friends...
>>
>> I have (NDAed and not entirely convincing) docs indicating a way (on
>> hardware that I don't have access to) to make the caches be part of
>> the persistence domain.
>
> Every platform will implement persistence domain
> mangement differently. So we can't assume that what works on one
> platform is going to work or be compatible with any other
> platform....
>
>> I also have non-NDA'd docs that suggest that
>> it's really very fast to flush things through the memory controller.
>> (I would need to time it, though. I do have this hardware, and it
>> more or less works.)
>
> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?
>
> Cheers,
>
> Dave.
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-01-22 8:13 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-17 0:56 [LSF/MM TOPIC] [ATTEND] Persistent memory Andy Lutomirski
2014-01-17 0:56 ` Andy Lutomirski
2014-01-17 4:17 ` Howard Chu
2014-01-17 19:22 ` Andy Lutomirski
2014-01-17 19:22 ` Andy Lutomirski
2014-01-21 7:38 ` Howard Chu
2014-01-21 7:38 ` Howard Chu
2014-01-21 11:17 ` [Lsf-pc] " Dave Chinner
2014-01-21 13:57 ` Howard Chu
2014-01-21 20:20 ` Dave Chinner
2014-01-21 16:48 ` Andy Lutomirski
2014-01-21 20:36 ` Dave Chinner
2014-01-21 20:36 ` Dave Chinner
2014-01-21 20:59 ` Andy Lutomirski
2014-01-21 20:59 ` Andy Lutomirski
2014-01-21 23:03 ` Dave Chinner
2014-01-21 23:03 ` Dave Chinner
2014-01-21 23:22 ` Andy Lutomirski
2014-01-22 8:13 ` Howard Chu [this message]
2014-01-22 8:13 ` Howard Chu
2014-01-23 19:54 ` Andy Lutomirski
2014-01-23 19:54 ` Andy Lutomirski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52DF7D9B.20904@symas.com \
--to=hyc@symas.com \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=luto@amacapital.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.