* [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-17  0:56 UTC
To: Linux FS Devel, lsf-pc, linux-mm@kvack.org

I'm interested in a persistent memory track.  There seem to be plenty
of other emails about this, but here's my take:

First, I'm not an FS expert.  I've never written an FS or touched an
on-disk (or on-persistent-memory) FS format.  I have, however, mucked
with some low-level x86 details, and I'm a heavy abuser of the Linux
page cache.

I'm an upcoming user of persistent memory -- I have some (in the form
of NV-DIMMs) and I have an application (HFT and a memory-backed
database thing) that I'll port to run on pmfs or ext4 w/ XIP once
everything is ready.

I'm also interested in some of the implementation details.  For this
stuff to be reliable on anything resembling commodity hardware, there
will be some caching issues to deal with.  For example, I think it
would be handy to run things like pmfs on top of write-through
mappings.  This is currently barely supportable (and only by using
MTRRs), but it's not terribly complicated (on new enough hardware) to
support real write-through PAT entries.

I've written an i2c-imc driver (currently in limbo on the i2c list),
which will likely be used for control operations on NV-DIMMs plugged
into Intel-based server boards.

In principle, I could even bring a working NV-DIMM system to the
summit -- it's nearby, and this thing isn't *that* large :)

--Andy
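[Editorial aside: a minimal sketch of the write-through mapping idea
above. It assumes a hypothetical ioremap_wt() helper -- no such helper
existed in mainline at the time of this thread (only UC and WC ioremap
variants did) -- and a made-up NV-DIMM physical range. It is not
Andy's implementation, only the shape of where a WT PAT entry would
plug in.]

#include <linux/io.h>
#include <linux/module.h>

/* Illustrative, made-up physical range for an NV-DIMM. */
#define NVDIMM_PHYS_BASE  0x100000000ULL
#define NVDIMM_SIZE       (64UL << 20)

static void __iomem *pmem_base;

static int __init pmem_wt_init(void)
{
	/*
	 * ioremap_wt() is the hypothetical piece: map the range with
	 * a real write-through PAT entry, so stores reach the NV-DIMM
	 * on every write while reads can still hit the CPU cache.
	 */
	pmem_base = ioremap_wt(NVDIMM_PHYS_BASE, NVDIMM_SIZE);
	if (!pmem_base)
		return -ENOMEM;

	writel(0x1234abcd, pmem_base);	/* store goes straight through */
	return 0;
}

static void __exit pmem_wt_exit(void)
{
	iounmap(pmem_base);
}

module_init(pmem_wt_init);
module_exit(pmem_wt_exit);
MODULE_LICENSE("GPL");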
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-17  4:17 UTC
To: Linux FS Devel, lsf-pc, linux-mm@kvack.org

Andy Lutomirski wrote:
> I'm interested in a persistent memory track.  There seem to be plenty
> of other emails about this, but here's my take:

I'm also interested in this track. I'm not up on FS development these
days; the last time I wrote filesystem code was nearly 20 years ago.
But persistent memory is a topic near and dear to my heart, and of
great relevance to my current pet project, the LMDB memory-mapped
database.

In a previous era I also developed block device drivers for
battery-backed external DRAM disks. (My ideal would have been systems
where all of RAM was persistent. I suppose we can just about get there
with mobile phones and tablets these days.)

In the context of database engines, I'm interested in leveraging
persistent memory for write-back caching and how user-level code can
be made aware of it. (If all your cache is persistent and guaranteed
to eventually reach stable store, then you never need to fsync() a
transaction.)

> [...]

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
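[Editorial aside: a minimal userspace sketch of the "never fsync() a
transaction" idea, assuming a direct (XIP-style) mapping and assuming
that flushed cache lines land inside the persistence domain -- exactly
the assumptions debated later in this thread. pmem_commit is an
illustrative name, not an existing API.]

#include <emmintrin.h>	/* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/*
 * Commit a record without a syscall: copy it into the directly
 * mapped persistent region, then push each dirtied cache line out of
 * the CPU cache and fence so the flushes complete before anything
 * that depends on them (e.g. writing a commit record).
 */
static void pmem_commit(void *dst, const void *src, size_t len)
{
	uintptr_t p;

	memcpy(dst, src, len);
	for (p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
	     p < (uintptr_t)dst + len; p += CACHELINE)
		_mm_clflush((void *)p);
	_mm_sfence();
}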
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-17 19:22 UTC
To: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On 01/16/2014 08:17 PM, Howard Chu wrote:
> Andy Lutomirski wrote:
>> I'm interested in a persistent memory track.  There seem to be plenty
>> of other emails about this, but here's my take:
>
> I'm also interested in this track. [...]
>
> In the context of database engines, I'm interested in leveraging
> persistent memory for write-back caching and how user-level code can
> be made aware of it. (If all your cache is persistent and guaranteed
> to eventually reach stable store, then you never need to fsync() a
> transaction.)

Hmm.  Presumably that would work by actually allocating cache pages in
persistent memory.  I don't think that anything like the current XIP
interfaces can do that, but it's certainly an interesting thought for
(complicated) future work.

This might not be pretty in conjunction with something like my
write-through mapping idea -- read(2) and write(2) would be fine
(well, write(2) might need to use streaming stores), but mmap users
who weren't expecting it might have truly awful performance.  That
especially includes things like databases that aren't expecting this
behavior.

--Andy
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-21  7:38 UTC
To: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

Andy Lutomirski wrote:
> On 01/16/2014 08:17 PM, Howard Chu wrote:
>> [...]
>> In the context of database engines, I'm interested in leveraging
>> persistent memory for write-back caching and how user-level code can
>> be made aware of it. (If all your cache is persistent and guaranteed
>> to eventually reach stable store, then you never need to fsync() a
>> transaction.)
>
> Hmm.  Presumably that would work by actually allocating cache pages in
> persistent memory.  I don't think that anything like the current XIP
> interfaces can do that, but it's certainly an interesting thought for
> (complicated) future work.
>
> This might not be pretty in conjunction with something like my
> write-through mapping idea -- read(2) and write(2) would be fine
> (well, write(2) might need to use streaming stores), but mmap users
> who weren't expecting it might have truly awful performance.  That
> especially includes things like databases that aren't expecting this
> behavior.

At the moment all I can suggest is a new mmap() flag, e.g.
MAP_PERSISTENT. Not sure how a user or app should discover that it's
supported, though.

> [...]

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 11:17 UTC
To: Howard Chu
Cc: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
> Andy Lutomirski wrote:
>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>> [...]
>>> In the context of database engines, I'm interested in leveraging
>>> persistent memory for write-back caching and how user-level code can
>>> be made aware of it. (If all your cache is persistent and guaranteed
>>> to eventually reach stable store, then you never need to fsync() a
>>> transaction.)

I don't think that is true - you're still going to need fsync to get
the CPU to flush its caches and filesystem metadata into the
persistent domain....

>> Hmm.  Presumably that would work by actually allocating cache pages in
>> persistent memory.  I don't think that anything like the current XIP
>> interfaces can do that, but it's certainly an interesting thought for
>> (complicated) future work.
>>
>> This might not be pretty in conjunction with something like my
>> write-through mapping idea -- read(2) and write(2) would be fine
>> (well, write(2) might need to use streaming stores), but mmap users
>> who weren't expecting it might have truly awful performance.  That
>> especially includes things like databases that aren't expecting this
>> behavior.
>
> At the moment all I can suggest is a new mmap() flag, e.g.
> MAP_PERSISTENT. Not sure how a user or app should discover that it's
> supported, though.

The point of using the XIP interface with filesystems that are
backed by persistent memory is that mmap() gives userspace
applications direct access to the persistent memory without
needing any modifications. It's just a really, really fast file...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
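[Editorial aside: a minimal sketch of the direct-access model Dave
describes -- an unmodified application mmap()s a file on an
XIP-capable, pmem-backed filesystem and stores to it directly. The
mount point and file name are illustrative; the msync() call
anticipates the data-integrity debate that follows.]

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/db.file", O_RDWR);
	if (fd < 0)
		return 1;

	/*
	 * On an XIP mapping these pages *are* the persistent memory;
	 * there is no page-cache copy in between.
	 */
	uint64_t *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 42;			/* a direct store to pmem */
	msync(p, 4096, MS_SYNC);	/* still needed: see the
					   fsync discussion below */

	munmap(p, 4096);
	close(fd);
	return 0;
}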
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-21 13:57 UTC
To: Dave Chinner
Cc: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

Dave Chinner wrote:
> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>> [...]
>> At the moment all I can suggest is a new mmap() flag, e.g.
>> MAP_PERSISTENT. Not sure how a user or app should discover that it's
>> supported, though.
>
> The point of using the XIP interface with filesystems that are
> backed by persistent memory is that mmap() gives userspace
> applications direct access to the persistent memory without
> needing any modifications. It's just a really, really fast file...

OK, I see that now. But that only works well when your persistent
memory size is >= the size of the file(s) you want to work with. If
you use persistent memory for the page cache, then you can use it
with any filesystem of any arbitrary size.

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 20:20 UTC
To: Howard Chu
Cc: Linux FS Devel, linux-mm@kvack.org, lsf-pc, Andy Lutomirski

On Tue, Jan 21, 2014 at 05:57:14AM -0800, Howard Chu wrote:
> Dave Chinner wrote:
>> [...]
>> The point of using the XIP interface with filesystems that are
>> backed by persistent memory is that mmap() gives userspace
>> applications direct access to the persistent memory without
>> needing any modifications. It's just a really, really fast file...
>
> OK, I see that now. But that only works well when your persistent
> memory size is >= the size of the file(s) you want to work with.

It assumes that you have a persistent memory block device. If you
have a persistent memory block device, then if you want persistent
caching on top of the filesystem, use dm-cache or bcache to stack
the persistent memory on top of the slow block device. i.e. we
already have solutions to this problem.

> If you use persistent memory for the page cache, then you can use it
> with any filesystem of any arbitrary size.

We don't actually need (or, IMO, want) the page cache to have to be
aware of persistent memory state. If the page cache is persistent,
then we need to store that persistent state somewhere so that when
the machine crashes and reboots, we can bring the persistent page
cache back up. That involves metadata to hold state, crash recovery,
etc. We've already got all that persistence management in our
filesystem implementations. IOWs, persistent data and its state
belong in the filesystem domain, not the page cache domain.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 16:48 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>> [...]
>> In the context of database engines, I'm interested in leveraging
>> persistent memory for write-back caching and how user-level code can
>> be made aware of it. (If all your cache is persistent and guaranteed
>> to eventually reach stable store, then you never need to fsync() a
>> transaction.)
>
> I don't think that is true - you're still going to need fsync to get
> the CPU to flush its caches and filesystem metadata into the
> persistent domain....

I think that this depends on the technology in question.

I suspect (I don't know for sure) that, if the mapping is WT or UC,
it would be possible to get the data fully flushed to persistent
storage by doing something like a UC read from any appropriate type of
I/O space (someone from Intel would have to confirm).  There's a
chipset register you're probably supposed to frob (it's well buried in
the public chipset docs), but I don't know how necessary it is.  In
any event, that type of flush is systemwide (or at least
package-wide), so fsyncing a file should be overkill.

Even if caching is on, clflush may be faster than a syscall.  (It's
sad that x86 doesn't have writeback-but-don't-invalidate.  PPC FTW.)

All of this suggests to me that a vsyscall "sync persistent memory"
might be better than a real syscall.

For what it's worth, some of the NV-DIMM systems are supposed to be
configured in such a way that, if power fails, an NMI, SMI, or even
(not really sure) a hardwired thing in the memory controller will
trigger the requisite flush.  I don't personally believe in this if
L2/L3 caches are involved (they're too big), but for the little write
buffers and memory controller things, this seems entirely plausible.

> The point of using the XIP interface with filesystems that are
> backed by persistent memory is that mmap() gives userspace
> applications direct access to the persistent memory without
> needing any modifications. It's just a really, really fast file...

I think this was talking about using persistent memory as a
limited-size cache.  In that case, XIP (as currently designed) has no
provision for removing cache pages, so the kernel isn't ready for
this.

--Andy
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 20:36 UTC
To: Andy Lutomirski
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> I don't think that is true - you're still going to need fsync to get
>> the CPU to flush its caches and filesystem metadata into the
>> persistent domain....
>
> I think that this depends on the technology in question.
>
> I suspect (I don't know for sure) that, if the mapping is WT or UC,
> it would be possible to get the data fully flushed to persistent
> storage by doing something like a UC read from any appropriate type of
> I/O space (someone from Intel would have to confirm).

And what of the filesystem metadata that is necessary to reference
that data? What flushes that? e.g. using mmap of sparse files to
dynamically allocate persistent memory space requires fdatasync() at
minimum....

And then there are things like encrypted persistent memory, which
means applications can't directly access it and so mmap() will be
buffered by the page cache just like a normal block device...

> All of this suggests to me that a vsyscall "sync persistent memory"
> might be better than a real syscall.

Perhaps, but that implies some method other than a filesystem to
manage access to persistent memory.

> For what it's worth, some of the NV-DIMM systems are supposed to be
> configured in such a way that, if power fails, an NMI, SMI, or even
> (not really sure) a hardwired thing in the memory controller will
> trigger the requisite flush.  I don't personally believe in this if
> L2/L3 caches are involved (they're too big), but for the little write
> buffers and memory controller things, this seems entirely plausible.

Right - at the moment we have to assume the persistence domain
starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
I have no idea if/when we'll be seeing CPUs that have persistent
caches, so we have to assume that data is still volatile and can be
lost unless it has been specifically synced to persistent memory.
i.e. persistent memory does not remove the need for fsync and
friends...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 20:59 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>> [...]
>
> And what of the filesystem metadata that is necessary to reference
> that data? What flushes that? e.g. using mmap of sparse files to
> dynamically allocate persistent memory space requires fdatasync() at
> minimum....

If we're using dm-crypt using an NV-DIMM "block" device as cache and a
real disk as backing store, then ideally mmap would map the NV-DIMM
directly if the data in question lives there.  If that's happening,
then, assuming that there are no metadata changes, you could just
flush the relevant hw caches.  This assumes, of course, no dm-crypt,
no btrfs-style checksumming, and, in general, nothing else that would
require stable pages or similar things.

> And then there are things like encrypted persistent memory, which
> means applications can't directly access it and so mmap() will be
> buffered by the page cache just like a normal block device...
>
>> All of this suggests to me that a vsyscall "sync persistent memory"
>> might be better than a real syscall.
>
> Perhaps, but that implies some method other than a filesystem to
> manage access to persistent memory.

It should be at least as good as fdatasync if using XIP or something
like pmfs.

For my intended application, I want to use pmfs or something similar
directly.  This means that I want really fast synchronous flushes, and
I suspect that the usual set of fs calls that handle fdatasync are
already quite a bit slower than a vsyscall would be, assuming that no
MSR write is needed.

>> For what it's worth, some of the NV-DIMM systems are supposed to be
>> configured in such a way that, if power fails, an NMI, SMI, or even
>> (not really sure) a hardwired thing in the memory controller will
>> trigger the requisite flush.  [...]
>
> Right - at the moment we have to assume the persistence domain
> starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
> I have no idea if/when we'll be seeing CPUs that have persistent
> caches, so we have to assume that data is still volatile and can be
> lost unless it has been specifically synced to persistent memory.
> i.e. persistent memory does not remove the need for fsync and
> friends...

I have (NDAed and not entirely convincing) docs indicating a way (on
hardware that I don't have access to) to make the caches be part of
the persistence domain.  I also have non-NDA'd docs that suggest that
it's really very fast to flush things through the memory controller.
(I would need to time it, though.  I do have this hardware, and it
more or less works.)

--Andy
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 23:03 UTC
To: Andy Lutomirski
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> And what of the filesystem metadata that is necessary to reference
>> that data? What flushes that? e.g. using mmap of sparse files to
>> dynamically allocate persistent memory space requires fdatasync() at
>> minimum....
>
> If we're using dm-crypt using an NV-DIMM "block" device as cache and a
> real disk as backing store, then ideally mmap would map the NV-DIMM
> directly if the data in question lives there.

dm-crypt does not use any block device as a cache. You're thinking
about dm-cache or bcache. And neither of them are operating at the
filesystem level or are aware of the difference between filesystem
metadata and user data. But talking about non-existent block layer
functionality doesn't answer my question about keeping user data
and the filesystem metadata needed to reference that user data
coherent in persistent memory...

> If that's happening,
> then, assuming that there are no metadata changes, you could just
> flush the relevant hw caches.  This assumes, of course, no dm-crypt,
> no btrfs-style checksumming, and, in general, nothing else that would
> require stable pages or similar things.

Well, yes. Data IO path transformations are another reason why we'll
need the volatile page cache involved in the persistent memory IO
path. It follows immediately from this that applications will still
require fsync() and other data integrity operations, because they
have no idea where the persistence domain boundary lives in the IO
stack.

> It should be at least as good as fdatasync if using XIP or something
> like pmfs.
>
> For my intended application, I want to use pmfs or something similar
> directly.  This means that I want really fast synchronous flushes, and
> I suspect that the usual set of fs calls that handle fdatasync are
> already quite a bit slower than a vsyscall would be, assuming that no
> MSR write is needed.

What you are saying is that you want a fixed, allocated range of
persistent memory mapped into the application's address space that
you have direct control of. Yes, we can do that through the
filesystem XIP interface (zero the file via memset() rather than via
unwritten extents) and then fsync the file. The metadata on the file
will then never change, and you can do what you want via mmap from
then onwards. I'd suggest at this point that msync() is the
operation that should then be used to flush the data pages in the
mapped range into the persistence domain.

> I have (NDAed and not entirely convincing) docs indicating a way (on
> hardware that I don't have access to) to make the caches be part of
> the persistence domain.

Every platform will implement persistence domain management
differently. So we can't assume that what works on one platform is
going to work or be compatible with any other platform....

> I also have non-NDA'd docs that suggest that
> it's really very fast to flush things through the memory controller.
> (I would need to time it, though.  I do have this hardware, and it
> more or less works.)

It still takes non-zero time, so there is still scope for data loss
on power failure, or even CPU failure.

Hmmm, now there's something I hadn't really thought about - how does
CPU failure, hotplug and/or power management affect persistence
domains if the CPU cache contains persistent data and it's no longer
accessible?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
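[Editorial aside: a minimal userspace sketch of the recipe Dave
outlines -- zero the file through the mapping so every block is
allocated, fsync once so the metadata never changes again, then flush
data with msync() on just the dirty range. File name and sizes are
illustrative.]

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_SIZE (16UL << 20)

int main(void)
{
	int fd = open("/mnt/pmem/db.file", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, DB_SIZE) < 0)
		return 1;

	char *base = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	memset(base, 0, DB_SIZE);	/* force real block allocation,
					   no unwritten extents */
	fsync(fd);			/* pin the now-stable metadata */

	/* Steady state: mutate in place, flush only the dirty range. */
	memcpy(base + 4096, "record", 6);
	msync(base + 4096, 4096, MS_SYNC);

	munmap(base, DB_SIZE);
	close(fd);
	return 0;
}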
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 23:22 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 3:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> [...]
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them are operating at the
> filesystem level or are aware of the difference between filesystem
> metadata and user data. But talking about non-existent block layer
> functionality doesn't answer my question about keeping user data
> and the filesystem metadata needed to reference that user data
> coherent in persistent memory...

Wow -- apparently I can't write coherently today.

What I'm saying is: if dm-cache (not dm-crypt) had magic
not-currently-existing functionality that allowed an XIP-capable cache
device to be mapped directly, and userspace knew it was mapped
directly, and userspace could pin that mapping there, then userspace
could avoid calling fsync.

This is (to me, and probably to everyone else, too) far less
interesting than the case of having the whole fs live in persistent
memory.

> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the application's address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.

I think you're insufficiently ambitious about how fast you want this
to be. :)  I want it to be at least possible for the whole sync
operation to be considerably faster than, say, anything involving
mmap_sem or vma walking.  But yes, the memset thing is what I want.

> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.

Not if the hardware does the flush for us.  (But yes, you're right, we
can't assume that *all* persistent memory hardware can do that.)

> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?

Given that NV-DIMMs are literally DIMMs that are mapped more or less
like any other system memory, this presumably works for the same
reason that hot-unplugging a CPU that has dirty cachelines pointing at
page cache doesn't corrupt the page cache.  That is, someone
(presumably the OS arch code) is responsible for flushing the caches.
Just because L2/L3 cache might be in the persistence domain doesn't
mean that you can't clflush or wbinvd it just like any other memory.

Another reason to be a bit careful about caching: it should be
possible to write a few MB to persistent memory in a tight loop
without blowing everything else out of cache.  I wonder if the default
behavior for non-mmapped writes to these things should be to use
non-temporal / streaming hints where available.

--Andy
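[Editorial aside: a minimal sketch of the non-temporal idea in Andy's
last paragraph -- copying a buffer to persistent memory with streaming
stores so a multi-megabyte write doesn't evict everything else from
the cache. Assumes SSE2 and 16-byte-aligned buffers whose length is a
multiple of 16; pmem_memcpy_nt is an illustrative name.]

#include <emmintrin.h>	/* _mm_stream_si128, _mm_load_si128, _mm_sfence */
#include <stddef.h>

static void pmem_memcpy_nt(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	/* Streaming stores bypass the cache hierarchy. */
	for (size_t i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

	/*
	 * Streaming stores are weakly ordered; fence before anything
	 * (e.g. a commit record) that depends on them being visible.
	 */
	_mm_sfence();
}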
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory 2014-01-21 23:03 ` Dave Chinner 2014-01-21 23:22 ` Andy Lutomirski @ 2014-01-22 8:13 ` Howard Chu 2014-01-23 19:54 ` Andy Lutomirski 1 sibling, 1 reply; 14+ messages in thread From: Howard Chu @ 2014-01-22 8:13 UTC (permalink / raw) To: Dave Chinner, Andy Lutomirski; +Cc: Linux FS Devel, lsf-pc, linux-mm@kvack.org Dave Chinner wrote: > On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote: >> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote: >>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote: >>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote: >>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote: >>>>>> Andy Lutomirski wrote: >>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote: >>>>>>>> Andy Lutomirski wrote: >>>>>>>>> I'm interested in a persistent memory track. There seems to be plenty >>>>>>>>> of other emails about this, but here's my take: >>>>>>>> >>>>>>>> I'm also interested in this track. I'm not up on FS development these >>>>>>>> days, the last time I wrote filesystem code was nearly 20 years ago. But >>>>>>>> persistent memory is a topic near and dear to my heart, and of great >>>>>>>> relevance to my current pet project, the LMDB memory-mapped database. >>>>>>>> >>>>>>>> In a previous era I also developed block device drivers for >>>>>>>> battery-backed external DRAM disks. (My ideal would have been systems >>>>>>>> where all of RAM was persistent. I suppose we can just about get there >>>>>>>> with mobile phones and tablets these days.) >>>>>>>> >>>>>>>> In the context of database engines, I'm interested in leveraging >>>>>>>> persistent memory for write-back caching and how user level code can be >>>>>>>> made aware of it. (If all your cache is persistent and guaranteed to >>>>>>>> eventually reach stable store then you never need to fsync() a >>>>>>>> transaction.) >>>>> >>>>> I don't think that is true - your still going to need fsync to get >>>>> the CPU to flush it's caches and filesystem metadata into the >>>>> persistent domain.... >>>> >>>> I think that this depends on the technology in question. >>>> >>>> I suspect (I don't know for sure) that, if the mapping is WT or UC, >>>> that it would be possible to get the data fully flushed to persistent >>>> storage by doing something like a UC read from any appropriate type of >>>> I/O space (someone from Intel would have to confirm). >>> >>> And what of the filesystem metadata that is necessary to reference >>> that data? What flushes that? e.g. using mmap of sparse files to >>> dynamically allocate persistent memory space requires fdatasync() at >>> minimum.... Why are you talking about fdatasync(), which is used to *avoid* flushing metadata? For reference, we've found that we get highest DB performance using ext2fs with a preallocated file. In that case, we can use fdatasync() and then there's no metadata updates whatsoever. This also means we can ignore the question of FS corruption on a crash. >> If we're using dm-crypt using an NV-DIMM "block" device as cache and a >> real disk as backing store, then ideally mmap would map the NV-DIMM >> directly if the data in question lives there. > > dm-crypt does not use any block device as a cache. You're thinking > about dm-cache or bcache. And neither of them are operating at the > filesystem level or are aware of the difference between fileystem > metadata and user data. Why should that layer need to be aware? 
A page is a page, as far as they're concerned. > But talking about non-existent block layer > functionality doesn't answer my the question about keeping user data > and filesystem metadata needed to reference that user data > coherent in persistent memory... One of the very useful tools for PCs in the '80s was reset-survivable RAMdisks. Given the existence of persistent memory in a machine, this is a pretty obvious feature to provide. >> If that's happening, >> then, assuming that there are no metadata changes, you could just >> flush the relevant hw caches. This assumes, of course, no dm-crypt, >> no btrfs-style checksumming, and, in general, nothing else that would >> require stable pages or similar things. > > Well yes. Data IO path transformations are another reason why we'll > need the volatile page cache involved in the persistent memory IO > path. It follows immediately from this that applicaitons will still > require fsync() and other data integrity operations because they > have no idea where the persistence domain boundary lives in the IO > stack. And my point, stated a few times now, is there should be a way for applications to discover the existence and characteristics of persistent memory being used in the system. >>> And then there's things like encrypted persistent memory when means >>> applications can't directly access it and so mmap() will be buffered >>> by the page cache just like a normal block device... >>> >>>> All of this suggests to me that a vsyscall "sync persistent memory" >>>> might be better than a real syscall. >>> >>> Perhaps, but that implies some method other than a filesystem to >>> manage access to persistent memory. >> >> It should be at least as good as fdatasync if using XIP or something like pmfs. >> >> For my intended application, I want to use pmfs or something similar >> directly. This means that I want really fast synchronous flushes, and >> I suspect that the usual set of fs calls that handle fdatasync are >> already quite a bit slower than a vsyscall would be, assuming that no >> MSR write is needed. > > What you are saying is that you want a fixed, allocated range of > persistent memory mapped into the applications address space that > you have direct control of. Yes, we can do that through the > filesystem XIP interface (zero the file via memset() rather than via > unwritten extents) and then fsync the file. The metadata on the file > will then never change, and you can do what you want via mmap from > then onwards. I'd suggest at this point that msync() is the > operation that should then be used to flush the data pages in the > mapped range into the persistence domain. > > >>>> For what it's worth, some of the NV-DIMM systems are supposed to be >>>> configured in such a way that, if power fails, an NMI, SMI, or even >>>> (not really sure) a hardwired thing in the memory controller will >>>> trigger the requisite flush. I don't personally believe in this if >>>> L2/L3 cache are involved (they're too big), but for the little write >>>> buffers and memory controller things, this seems entirely plausible. >>> >>> Right - at the moment we have to assume the persistence domain >>> starts at the NVDIMM and doesn't cover the CPU's internal L* caches. >>> I have no idea if/when we'll be seeing CPUs that have persistent >>> caches, so we have to assume that data is still volatile and can be >>> lost unless it has been specifically synced to persistent memory. >>> i.e. persistent memory does not remove the need for fsync and >>> friends... 
>> I have (NDAed and not entirely convincing) docs indicating a way (on hardware that I don't have access to) to make the caches be part of the persistence domain.
>
> Every platform will implement persistence domain management differently. So we can't assume that what works on one platform is going to work on or be compatible with any other platform....
>
>> I also have non-NDA'd docs that suggest that it's really very fast to flush things through the memory controller. (I would need to time it, though. I do have this hardware, and it more or less works.)
>
> It still takes non-zero time, so there is still scope for data loss on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does CPU failure, hotplug and/or power management affect persistence domains if the CPU cache contains persistent data and it's no longer accessible?
>
> Cheers,
>
> Dave.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
  2014-01-22  8:13 ` Howard Chu
@ 2014-01-23 19:54 ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2014-01-23 19:54 UTC (permalink / raw)
To: Howard Chu; +Cc: Dave Chinner, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Wed, Jan 22, 2014 at 12:13 AM, Howard Chu <hyc@symas.com> wrote:
> Dave Chinner wrote:
>> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>
>>> If we're using dm-crypt with an NV-DIMM "block" device as cache and a real disk as backing store, then ideally mmap would map the NV-DIMM directly if the data in question lives there.
>>
>> dm-crypt does not use any block device as a cache. You're thinking about dm-cache or bcache. And neither of them operates at the filesystem level or is aware of the difference between filesystem metadata and user data.
>
> Why should that layer need to be aware? A page is a page, as far as they're concerned.

I think that, ideally, the awareness would go the other way. dm-cache (where the backing store is a normal disk but the cache storage is persistent memory) should not care what kind of page it's caching. On the other hand, the filesystem sitting on top of dm-cache should be able to tell when a page (in the device exposed by dm-cache) is actually CPU-addressable so it can avoid allocating yet another copy in pagecache. Similarly, it should be able to be notified when that page is about to stop being CPU-addressable.

This might be an argument to have, in addition to (or as a replacement for) direct_access, XIP ops that ask for a reference to a page and are permitted to fail (i.e. say "sorry, not CPU addressable right now"), plus a way to be notified when a page is going away. (This is totally unnecessary if using something like an NV-DIMM directly -- it's only important for more complicated things like dm-cache.)
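As a strawman, something like this -- purely hypothetical, none of these names exist in the kernel, and the stand-in types are only there to make the fragment self-contained:

typedef unsigned long sector_t;          /* stand-in for the kernel type */
struct block_device;                     /* opaque here */

struct xip_backing_ops {
	/*
	 * Try to get a direct CPU mapping for the page at 'sector'.
	 * Fills *kaddr and *pfn on success; returns -EAGAIN for
	 * "sorry, not CPU addressable right now", e.g. the block
	 * currently lives on the dm-cache backing disk.
	 */
	int (*get_xip_page)(struct block_device *bdev, sector_t sector,
			    void **kaddr, unsigned long *pfn);

	/*
	 * Register a callback the device invokes before a page it
	 * handed out stops being CPU addressable, so the filesystem
	 * can drop direct references and fall back to buffered I/O.
	 */
	int (*register_revoke_cb)(struct block_device *bdev,
				  void (*revoke)(sector_t sector, void *arg),
				  void *arg);
};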
>> But talking about non-existent block layer functionality doesn't answer my question about keeping user data and the filesystem metadata needed to reference that user data coherent in persistent memory...
>
> One of the very useful tools for PCs in the '80s was reset-survivable RAMdisks. Given the existence of persistent memory in a machine, this is a pretty obvious feature to provide.

I think that a file on pmfs or ext4-xip will work like this. Ideally /dev/loop will be XIP-capable if the file it's sitting on top of is XIP.

>>> If that's happening, then, assuming that there are no metadata changes, you could just flush the relevant hw caches. This assumes, of course, no dm-crypt, no btrfs-style checksumming, and, in general, nothing else that would require stable pages or similar things.
>>
>> Well yes. Data IO path transformations are another reason why we'll need the volatile page cache involved in the persistent memory IO path. It follows immediately from this that applications will still require fsync() and other data integrity operations because they have no idea where the persistence domain boundary lives in the IO stack.
>
> And my point, stated a few times now, is that there should be a way for applications to discover the existence and characteristics of persistent memory being used in the system.

Agreed. Or maybe just some very low-level library that exposes a more useful interface (e.g. sync this domain) to applications.
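The shape I have in mind is something like this (an entirely invented interface, just to illustrate; nothing like it exists today):

#include <stdbool.h>
#include <stddef.h>

struct pmem_domain;                      /* opaque handle */

/* Returns NULL if [addr, addr + len) is not persistent memory. */
struct pmem_domain *pmem_domain_of(const void *addr, size_t len);

/* True if the CPU caches are inside the persistence domain, i.e. a
 * plain store is durable once it leaves the store buffer. */
bool pmem_domain_covers_caches(const struct pmem_domain *d);

/* Make all prior stores to [addr, addr + len) durable. Intended to
 * compile down to cache-line flushes or a vsyscall rather than a
 * real syscall per transaction. */
int pmem_domain_sync(struct pmem_domain *d, const void *addr, size_t len);

An application could then stop calling fsync() exactly when pmem_domain_covers_caches() says it's safe to, instead of guessing.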
--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads: [~2014-01-23 19:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-17  0:56 [LSF/MM TOPIC] [ATTEND] Persistent memory Andy Lutomirski
2014-01-17  4:17 ` Howard Chu
2014-01-17 19:22 ` Andy Lutomirski
2014-01-21  7:38 ` Howard Chu
2014-01-21 11:17 ` [Lsf-pc] " Dave Chinner
2014-01-21 13:57 ` Howard Chu
2014-01-21 20:20 ` Dave Chinner
2014-01-21 16:48 ` Andy Lutomirski
2014-01-21 20:36 ` Dave Chinner
2014-01-21 20:59 ` Andy Lutomirski
2014-01-21 23:03 ` Dave Chinner
2014-01-21 23:22 ` Andy Lutomirski
2014-01-22  8:13 ` Howard Chu
2014-01-23 19:54 ` Andy Lutomirski