* [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-17  0:56 UTC
To: Linux FS Devel, lsf-pc, linux-mm@kvack.org

I'm interested in a persistent memory track.  There seem to be plenty
of other emails about this, but here's my take:

First, I'm not an FS expert.  I've never written an FS or touched an
on-disk (or on-persistent-memory) FS format.  I have, however, mucked
with some low-level x86 details, and I'm a heavy abuser of the Linux
page cache.

I'm an upcoming user of persistent memory -- I have some (in the form
of NV-DIMMs) and I have an application (HFT and a memory-backed
database thing) that I'll port to run on pmfs or ext4 w/ XIP once
everything is ready.

I'm also interested in some of the implementation details.  For this
stuff to be reliable on anything resembling commodity hardware, there
will be some caching issues to deal with.  For example, I think it
would be handy to run things like pmfs on top of write-through
mappings.  This is currently barely supportable (and only by using
MTRRs), but it's not terribly complicated (on new enough hardware) to
support real write-through PAT entries.

I've written an i2c-imc driver (currently in limbo on the i2c list),
which will likely be used for control operations on NV-DIMMs plugged
into Intel-based server boards.

In principle, I could even bring a working NV-DIMM system to the
summit -- it's nearby, and this thing isn't *that* large :)

--Andy
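[Editorial aside: a minimal sketch of the write-through mapping idea
above. It assumes a hypothetical ioremap_wt() helper -- no such helper
existed in mainline at the time of this thread (only UC and WC ioremap
variants did) -- and a made-up NV-DIMM physical range. It is not
Andy's implementation, only the shape of where a WT PAT entry would
plug in.]

#include <linux/io.h>
#include <linux/module.h>

/* Illustrative, made-up physical range for an NV-DIMM. */
#define NVDIMM_PHYS_BASE  0x100000000ULL
#define NVDIMM_SIZE       (64UL << 20)

static void __iomem *pmem_base;

static int __init pmem_wt_init(void)
{
	/*
	 * ioremap_wt() is the hypothetical piece: map the range with
	 * a real write-through PAT entry, so stores reach the NV-DIMM
	 * on every write while reads can still hit the CPU cache.
	 */
	pmem_base = ioremap_wt(NVDIMM_PHYS_BASE, NVDIMM_SIZE);
	if (!pmem_base)
		return -ENOMEM;

	writel(0x1234abcd, pmem_base);	/* store goes straight through */
	return 0;
}

static void __exit pmem_wt_exit(void)
{
	iounmap(pmem_base);
}

module_init(pmem_wt_init);
module_exit(pmem_wt_exit);
MODULE_LICENSE("GPL");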
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-17  4:17 UTC
To: Linux FS Devel, lsf-pc, linux-mm@kvack.org

Andy Lutomirski wrote:
> I'm interested in a persistent memory track.  There seem to be plenty
> of other emails about this, but here's my take:

I'm also interested in this track. I'm not up on FS development these
days; the last time I wrote filesystem code was nearly 20 years ago.
But persistent memory is a topic near and dear to my heart, and of
great relevance to my current pet project, the LMDB memory-mapped
database.

In a previous era I also developed block device drivers for
battery-backed external DRAM disks. (My ideal would have been systems
where all of RAM was persistent. I suppose we can just about get there
with mobile phones and tablets these days.)

In the context of database engines, I'm interested in leveraging
persistent memory for write-back caching and how user-level code can
be made aware of it. (If all your cache is persistent and guaranteed
to eventually reach stable store, then you never need to fsync() a
transaction.)

> [...]

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
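[Editorial aside: a minimal userspace sketch of the "never fsync() a
transaction" idea, assuming a direct (XIP-style) mapping and assuming
that flushed cache lines land inside the persistence domain -- exactly
the assumptions debated later in this thread. pmem_commit is an
illustrative name, not an existing API.]

#include <emmintrin.h>	/* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/*
 * Commit a record without a syscall: copy it into the directly
 * mapped persistent region, then push each dirtied cache line out of
 * the CPU cache and fence so the flushes complete before anything
 * that depends on them (e.g. writing a commit record).
 */
static void pmem_commit(void *dst, const void *src, size_t len)
{
	uintptr_t p;

	memcpy(dst, src, len);
	for (p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
	     p < (uintptr_t)dst + len; p += CACHELINE)
		_mm_clflush((void *)p);
	_mm_sfence();
}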
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-17 19:22 UTC
To: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On 01/16/2014 08:17 PM, Howard Chu wrote:
> Andy Lutomirski wrote:
>> I'm interested in a persistent memory track.  There seem to be plenty
>> of other emails about this, but here's my take:
>
> I'm also interested in this track. [...]
>
> In the context of database engines, I'm interested in leveraging
> persistent memory for write-back caching and how user-level code can
> be made aware of it. (If all your cache is persistent and guaranteed
> to eventually reach stable store, then you never need to fsync() a
> transaction.)

Hmm.  Presumably that would work by actually allocating cache pages in
persistent memory.  I don't think that anything like the current XIP
interfaces can do that, but it's certainly an interesting thought for
(complicated) future work.

This might not be pretty in conjunction with something like my
write-through mapping idea -- read(2) and write(2) would be fine
(well, write(2) might need to use streaming stores), but mmap users
who weren't expecting it might have truly awful performance.  That
especially includes things like databases that aren't expecting this
behavior.

--Andy
* Re: [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-21  7:38 UTC
To: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

Andy Lutomirski wrote:
> On 01/16/2014 08:17 PM, Howard Chu wrote:
>> [...]
>> In the context of database engines, I'm interested in leveraging
>> persistent memory for write-back caching and how user-level code can
>> be made aware of it. (If all your cache is persistent and guaranteed
>> to eventually reach stable store, then you never need to fsync() a
>> transaction.)
>
> Hmm.  Presumably that would work by actually allocating cache pages in
> persistent memory.  I don't think that anything like the current XIP
> interfaces can do that, but it's certainly an interesting thought for
> (complicated) future work.
>
> This might not be pretty in conjunction with something like my
> write-through mapping idea -- read(2) and write(2) would be fine
> (well, write(2) might need to use streaming stores), but mmap users
> who weren't expecting it might have truly awful performance.  That
> especially includes things like databases that aren't expecting this
> behavior.

At the moment all I can suggest is a new mmap() flag, e.g.
MAP_PERSISTENT. Not sure how a user or app should discover that it's
supported, though.

> [...]

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 11:17 UTC
To: Howard Chu
Cc: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
> Andy Lutomirski wrote:
>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>> [...]
>>> In the context of database engines, I'm interested in leveraging
>>> persistent memory for write-back caching and how user-level code can
>>> be made aware of it. (If all your cache is persistent and guaranteed
>>> to eventually reach stable store, then you never need to fsync() a
>>> transaction.)

I don't think that is true - you're still going to need fsync to get
the CPU to flush its caches and filesystem metadata into the
persistent domain....

>> Hmm.  Presumably that would work by actually allocating cache pages in
>> persistent memory.  I don't think that anything like the current XIP
>> interfaces can do that, but it's certainly an interesting thought for
>> (complicated) future work.
>>
>> This might not be pretty in conjunction with something like my
>> write-through mapping idea -- read(2) and write(2) would be fine
>> (well, write(2) might need to use streaming stores), but mmap users
>> who weren't expecting it might have truly awful performance.  That
>> especially includes things like databases that aren't expecting this
>> behavior.
>
> At the moment all I can suggest is a new mmap() flag, e.g.
> MAP_PERSISTENT. Not sure how a user or app should discover that it's
> supported, though.

The point of using the XIP interface with filesystems that are
backed by persistent memory is that mmap() gives userspace
applications direct access to the persistent memory without
needing any modifications. It's just a really, really fast file...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
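[Editorial aside: a minimal sketch of the direct-access model Dave
describes -- an unmodified application mmap()s a file on an
XIP-capable, pmem-backed filesystem and stores to it directly. The
mount point and file name are illustrative; the msync() call
anticipates the data-integrity debate that follows.]

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/db.file", O_RDWR);
	if (fd < 0)
		return 1;

	/*
	 * On an XIP mapping these pages *are* the persistent memory;
	 * there is no page-cache copy in between.
	 */
	uint64_t *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 42;			/* a direct store to pmem */
	msync(p, 4096, MS_SYNC);	/* still needed: see the
					   fsync discussion below */

	munmap(p, 4096);
	close(fd);
	return 0;
}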
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Howard Chu @ 2014-01-21 13:57 UTC
To: Dave Chinner
Cc: Andy Lutomirski, Linux FS Devel, lsf-pc, linux-mm@kvack.org

Dave Chinner wrote:
> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>> [...]
>> At the moment all I can suggest is a new mmap() flag, e.g.
>> MAP_PERSISTENT. Not sure how a user or app should discover that it's
>> supported, though.
>
> The point of using the XIP interface with filesystems that are
> backed by persistent memory is that mmap() gives userspace
> applications direct access to the persistent memory without
> needing any modifications. It's just a really, really fast file...

OK, I see that now. But that only works well when your persistent
memory size is >= the size of the file(s) you want to work with. If
you use persistent memory for the page cache, then you can use it
with any filesystem of any arbitrary size.

--
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 20:20 UTC
To: Howard Chu
Cc: Linux FS Devel, linux-mm@kvack.org, lsf-pc, Andy Lutomirski

On Tue, Jan 21, 2014 at 05:57:14AM -0800, Howard Chu wrote:
> Dave Chinner wrote:
>> [...]
>> The point of using the XIP interface with filesystems that are
>> backed by persistent memory is that mmap() gives userspace
>> applications direct access to the persistent memory without
>> needing any modifications. It's just a really, really fast file...
>
> OK, I see that now. But that only works well when your persistent
> memory size is >= the size of the file(s) you want to work with.

It assumes that you have a persistent memory block device. If you
have a persistent memory block device, then if you want persistent
caching on top of the filesystem, use dm-cache or bcache to stack
the persistent memory on top of the slow block device. i.e. we
already have solutions to this problem.

> If you use persistent memory for the page cache, then you can use it
> with any filesystem of any arbitrary size.

We don't actually need (or, IMO, want) the page cache to have to be
aware of persistent memory state. If the page cache is persistent,
then we need to store that persistent state somewhere so that when
the machine crashes and reboots, we can bring the persistent page
cache back up. That involves metadata to hold state, crash recovery,
etc. We've already got all that persistence management in our
filesystem implementations. IOWs, persistent data and its state
belong in the filesystem domain, not the page cache domain.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 16:48 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>> [...]
>> In the context of database engines, I'm interested in leveraging
>> persistent memory for write-back caching and how user-level code can
>> be made aware of it. (If all your cache is persistent and guaranteed
>> to eventually reach stable store, then you never need to fsync() a
>> transaction.)
>
> I don't think that is true - you're still going to need fsync to get
> the CPU to flush its caches and filesystem metadata into the
> persistent domain....

I think that this depends on the technology in question.

I suspect (I don't know for sure) that, if the mapping is WT or UC,
it would be possible to get the data fully flushed to persistent
storage by doing something like a UC read from any appropriate type of
I/O space (someone from Intel would have to confirm).  There's a
chipset register you're probably supposed to frob (it's well buried in
the public chipset docs), but I don't know how necessary it is.  In
any event, that type of flush is systemwide (or at least
package-wide), so fsyncing a file should be overkill.

Even if caching is on, clflush may be faster than a syscall.  (It's
sad that x86 doesn't have writeback-but-don't-invalidate.  PPC FTW.)

All of this suggests to me that a vsyscall "sync persistent memory"
might be better than a real syscall.

For what it's worth, some of the NV-DIMM systems are supposed to be
configured in such a way that, if power fails, an NMI, SMI, or even
(not really sure) a hardwired thing in the memory controller will
trigger the requisite flush.  I don't personally believe in this if
L2/L3 caches are involved (they're too big), but for the little write
buffers and memory controller things, this seems entirely plausible.

> The point of using the XIP interface with filesystems that are
> backed by persistent memory is that mmap() gives userspace
> applications direct access to the persistent memory without
> needing any modifications. It's just a really, really fast file...

I think this was talking about using persistent memory as a
limited-size cache.  In that case, XIP (as currently designed) has no
provision for removing cache pages, so the kernel isn't ready for
this.

--Andy
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 20:36 UTC
To: Andy Lutomirski
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> I don't think that is true - you're still going to need fsync to get
>> the CPU to flush its caches and filesystem metadata into the
>> persistent domain....
>
> I think that this depends on the technology in question.
>
> I suspect (I don't know for sure) that, if the mapping is WT or UC,
> it would be possible to get the data fully flushed to persistent
> storage by doing something like a UC read from any appropriate type of
> I/O space (someone from Intel would have to confirm).

And what of the filesystem metadata that is necessary to reference
that data? What flushes that? e.g. using mmap of sparse files to
dynamically allocate persistent memory space requires fdatasync() at
minimum....

And then there are things like encrypted persistent memory, which
means applications can't directly access it and so mmap() will be
buffered by the page cache just like a normal block device...

> All of this suggests to me that a vsyscall "sync persistent memory"
> might be better than a real syscall.

Perhaps, but that implies some method other than a filesystem to
manage access to persistent memory.

> For what it's worth, some of the NV-DIMM systems are supposed to be
> configured in such a way that, if power fails, an NMI, SMI, or even
> (not really sure) a hardwired thing in the memory controller will
> trigger the requisite flush.  I don't personally believe in this if
> L2/L3 caches are involved (they're too big), but for the little write
> buffers and memory controller things, this seems entirely plausible.

Right - at the moment we have to assume the persistence domain
starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
I have no idea if/when we'll be seeing CPUs that have persistent
caches, so we have to assume that data is still volatile and can be
lost unless it has been specifically synced to persistent memory.
i.e. persistent memory does not remove the need for fsync and
friends...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 20:59 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>> [...]
>
> And what of the filesystem metadata that is necessary to reference
> that data? What flushes that? e.g. using mmap of sparse files to
> dynamically allocate persistent memory space requires fdatasync() at
> minimum....

If we're using dm-crypt using an NV-DIMM "block" device as cache and a
real disk as backing store, then ideally mmap would map the NV-DIMM
directly if the data in question lives there.  If that's happening,
then, assuming that there are no metadata changes, you could just
flush the relevant hw caches.  This assumes, of course, no dm-crypt,
no btrfs-style checksumming, and, in general, nothing else that would
require stable pages or similar things.

> And then there are things like encrypted persistent memory, which
> means applications can't directly access it and so mmap() will be
> buffered by the page cache just like a normal block device...
>
>> All of this suggests to me that a vsyscall "sync persistent memory"
>> might be better than a real syscall.
>
> Perhaps, but that implies some method other than a filesystem to
> manage access to persistent memory.

It should be at least as good as fdatasync if using XIP or something
like pmfs.

For my intended application, I want to use pmfs or something similar
directly.  This means that I want really fast synchronous flushes, and
I suspect that the usual set of fs calls that handle fdatasync are
already quite a bit slower than a vsyscall would be, assuming that no
MSR write is needed.

>> For what it's worth, some of the NV-DIMM systems are supposed to be
>> configured in such a way that, if power fails, an NMI, SMI, or even
>> (not really sure) a hardwired thing in the memory controller will
>> trigger the requisite flush.  [...]
>
> Right - at the moment we have to assume the persistence domain
> starts at the NVDIMM and doesn't cover the CPU's internal L* caches.
> I have no idea if/when we'll be seeing CPUs that have persistent
> caches, so we have to assume that data is still volatile and can be
> lost unless it has been specifically synced to persistent memory.
> i.e. persistent memory does not remove the need for fsync and
> friends...

I have (NDAed and not entirely convincing) docs indicating a way (on
hardware that I don't have access to) to make the caches be part of
the persistence domain.  I also have non-NDA'd docs that suggest that
it's really very fast to flush things through the memory controller.
(I would need to time it, though.  I do have this hardware, and it
more or less works.)

--Andy
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Dave Chinner @ 2014-01-21 23:03 UTC
To: Andy Lutomirski
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> And what of the filesystem metadata that is necessary to reference
>> that data? What flushes that? e.g. using mmap of sparse files to
>> dynamically allocate persistent memory space requires fdatasync() at
>> minimum....
>
> If we're using dm-crypt using an NV-DIMM "block" device as cache and a
> real disk as backing store, then ideally mmap would map the NV-DIMM
> directly if the data in question lives there.

dm-crypt does not use any block device as a cache. You're thinking
about dm-cache or bcache. And neither of them are operating at the
filesystem level or are aware of the difference between filesystem
metadata and user data. But talking about non-existent block layer
functionality doesn't answer my question about keeping user data
and the filesystem metadata needed to reference that user data
coherent in persistent memory...

> If that's happening,
> then, assuming that there are no metadata changes, you could just
> flush the relevant hw caches.  This assumes, of course, no dm-crypt,
> no btrfs-style checksumming, and, in general, nothing else that would
> require stable pages or similar things.

Well, yes. Data IO path transformations are another reason why we'll
need the volatile page cache involved in the persistent memory IO
path. It follows immediately from this that applications will still
require fsync() and other data integrity operations, because they
have no idea where the persistence domain boundary lives in the IO
stack.

> It should be at least as good as fdatasync if using XIP or something
> like pmfs.
>
> For my intended application, I want to use pmfs or something similar
> directly.  This means that I want really fast synchronous flushes, and
> I suspect that the usual set of fs calls that handle fdatasync are
> already quite a bit slower than a vsyscall would be, assuming that no
> MSR write is needed.

What you are saying is that you want a fixed, allocated range of
persistent memory mapped into the application's address space that
you have direct control of. Yes, we can do that through the
filesystem XIP interface (zero the file via memset() rather than via
unwritten extents) and then fsync the file. The metadata on the file
will then never change, and you can do what you want via mmap from
then onwards. I'd suggest at this point that msync() is the
operation that should then be used to flush the data pages in the
mapped range into the persistence domain.

> I have (NDAed and not entirely convincing) docs indicating a way (on
> hardware that I don't have access to) to make the caches be part of
> the persistence domain.

Every platform will implement persistence domain management
differently. So we can't assume that what works on one platform is
going to work or be compatible with any other platform....

> I also have non-NDA'd docs that suggest that
> it's really very fast to flush things through the memory controller.
> (I would need to time it, though.  I do have this hardware, and it
> more or less works.)

It still takes non-zero time, so there is still scope for data loss
on power failure, or even CPU failure.

Hmmm, now there's something I hadn't really thought about - how does
CPU failure, hotplug and/or power management affect persistence
domains if the CPU cache contains persistent data and it's no longer
accessible?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
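[Editorial aside: a minimal userspace sketch of the recipe Dave
outlines -- zero the file through the mapping so every block is
allocated, fsync once so the metadata never changes again, then flush
data with msync() on just the dirty range. File name and sizes are
illustrative.]

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_SIZE (16UL << 20)

int main(void)
{
	int fd = open("/mnt/pmem/db.file", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, DB_SIZE) < 0)
		return 1;

	char *base = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	memset(base, 0, DB_SIZE);	/* force real block allocation,
					   no unwritten extents */
	fsync(fd);			/* pin the now-stable metadata */

	/* Steady state: mutate in place, flush only the dirty range. */
	memcpy(base + 4096, "record", 6);
	msync(base + 4096, 4096, MS_SYNC);

	munmap(base, DB_SIZE);
	close(fd);
	return 0;
}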
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
From: Andy Lutomirski @ 2014-01-21 23:22 UTC
To: Dave Chinner
Cc: Howard Chu, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Tue, Jan 21, 2014 at 3:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> [...]
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them are operating at the
> filesystem level or are aware of the difference between filesystem
> metadata and user data. But talking about non-existent block layer
> functionality doesn't answer my question about keeping user data
> and the filesystem metadata needed to reference that user data
> coherent in persistent memory...

Wow -- apparently I can't write coherently today.

What I'm saying is: if dm-cache (not dm-crypt) had magic
not-currently-existing functionality that allowed an XIP-capable cache
device to be mapped directly, and userspace knew it was mapped
directly, and userspace could pin that mapping there, then userspace
could avoid calling fsync.

This is (to me, and probably to everyone else, too) far less
interesting than the case of having the whole fs live in persistent
memory.

> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the application's address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.

I think you're insufficiently ambitious about how fast you want this
to be. :)  I want it to be at least possible for the whole sync
operation to be considerably faster than, say, anything involving
mmap_sem or vma walking.  But yes, the memset thing is what I want.

> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.

Not if the hardware does the flush for us.  (But yes, you're right, we
can't assume that *all* persistent memory hardware can do that.)

> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?

Given that NV-DIMMs are literally DIMMs that are mapped more or less
like any other system memory, this presumably works for the same
reason that hot-unplugging a CPU that has dirty cachelines pointing at
page cache doesn't corrupt the page cache.  That is, someone
(presumably the OS arch code) is responsible for flushing the caches.
Just because L2/L3 cache might be in the persistence domain doesn't
mean that you can't clflush or wbinvd it just like any other memory.

Another reason to be a bit careful about caching: it should be
possible to write a few MB to persistent memory in a tight loop
without blowing everything else out of cache.  I wonder if the default
behavior for non-mmapped writes to these things should be to use
non-temporal / streaming hints where available.

--Andy
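[Editorial aside: a minimal sketch of the non-temporal idea in Andy's
last paragraph -- copying a buffer to persistent memory with streaming
stores so a multi-megabyte write doesn't evict everything else from
the cache. Assumes SSE2 and 16-byte-aligned buffers whose length is a
multiple of 16; pmem_memcpy_nt is an illustrative name.]

#include <emmintrin.h>	/* _mm_stream_si128, _mm_load_si128, _mm_sfence */
#include <stddef.h>

static void pmem_memcpy_nt(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	/* Streaming stores bypass the cache hierarchy. */
	for (size_t i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

	/*
	 * Streaming stores are weakly ordered; fence before anything
	 * (e.g. a commit record) that depends on them being visible.
	 */
	_mm_sfence();
}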
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory 2014-01-21 23:03 ` Dave Chinner 2014-01-21 23:22 ` Andy Lutomirski @ 2014-01-22 8:13 ` Howard Chu 2014-01-23 19:54 ` Andy Lutomirski 1 sibling, 1 reply; 14+ messages in thread From: Howard Chu @ 2014-01-22 8:13 UTC (permalink / raw) To: Dave Chinner, Andy Lutomirski; +Cc: Linux FS Devel, lsf-pc, linux-mm@kvack.org Dave Chinner wrote: > On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote: >> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote: >>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote: >>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner <david@fromorbit.com> wrote: >>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote: >>>>>> Andy Lutomirski wrote: >>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote: >>>>>>>> Andy Lutomirski wrote: >>>>>>>>> I'm interested in a persistent memory track. There seems to be plenty >>>>>>>>> of other emails about this, but here's my take: >>>>>>>> >>>>>>>> I'm also interested in this track. I'm not up on FS development these >>>>>>>> days, the last time I wrote filesystem code was nearly 20 years ago. But >>>>>>>> persistent memory is a topic near and dear to my heart, and of great >>>>>>>> relevance to my current pet project, the LMDB memory-mapped database. >>>>>>>> >>>>>>>> In a previous era I also developed block device drivers for >>>>>>>> battery-backed external DRAM disks. (My ideal would have been systems >>>>>>>> where all of RAM was persistent. I suppose we can just about get there >>>>>>>> with mobile phones and tablets these days.) >>>>>>>> >>>>>>>> In the context of database engines, I'm interested in leveraging >>>>>>>> persistent memory for write-back caching and how user level code can be >>>>>>>> made aware of it. (If all your cache is persistent and guaranteed to >>>>>>>> eventually reach stable store then you never need to fsync() a >>>>>>>> transaction.) >>>>> >>>>> I don't think that is true - your still going to need fsync to get >>>>> the CPU to flush it's caches and filesystem metadata into the >>>>> persistent domain.... >>>> >>>> I think that this depends on the technology in question. >>>> >>>> I suspect (I don't know for sure) that, if the mapping is WT or UC, >>>> that it would be possible to get the data fully flushed to persistent >>>> storage by doing something like a UC read from any appropriate type of >>>> I/O space (someone from Intel would have to confirm). >>> >>> And what of the filesystem metadata that is necessary to reference >>> that data? What flushes that? e.g. using mmap of sparse files to >>> dynamically allocate persistent memory space requires fdatasync() at >>> minimum.... Why are you talking about fdatasync(), which is used to *avoid* flushing metadata? For reference, we've found that we get highest DB performance using ext2fs with a preallocated file. In that case, we can use fdatasync() and then there's no metadata updates whatsoever. This also means we can ignore the question of FS corruption on a crash. >> If we're using dm-crypt using an NV-DIMM "block" device as cache and a >> real disk as backing store, then ideally mmap would map the NV-DIMM >> directly if the data in question lives there. > > dm-crypt does not use any block device as a cache. You're thinking > about dm-cache or bcache. And neither of them are operating at the > filesystem level or are aware of the difference between fileystem > metadata and user data. Why should that layer need to be aware? 
A page is a page, as far as they're concerned. > But talking about non-existent block layer > functionality doesn't answer my the question about keeping user data > and filesystem metadata needed to reference that user data > coherent in persistent memory... One of the very useful tools for PCs in the '80s was reset-survivable RAMdisks. Given the existence of persistent memory in a machine, this is a pretty obvious feature to provide. >> If that's happening, >> then, assuming that there are no metadata changes, you could just >> flush the relevant hw caches. This assumes, of course, no dm-crypt, >> no btrfs-style checksumming, and, in general, nothing else that would >> require stable pages or similar things. > > Well yes. Data IO path transformations are another reason why we'll > need the volatile page cache involved in the persistent memory IO > path. It follows immediately from this that applicaitons will still > require fsync() and other data integrity operations because they > have no idea where the persistence domain boundary lives in the IO > stack. And my point, stated a few times now, is there should be a way for applications to discover the existence and characteristics of persistent memory being used in the system. >>> And then there's things like encrypted persistent memory when means >>> applications can't directly access it and so mmap() will be buffered >>> by the page cache just like a normal block device... >>> >>>> All of this suggests to me that a vsyscall "sync persistent memory" >>>> might be better than a real syscall. >>> >>> Perhaps, but that implies some method other than a filesystem to >>> manage access to persistent memory. >> >> It should be at least as good as fdatasync if using XIP or something like pmfs. >> >> For my intended application, I want to use pmfs or something similar >> directly. This means that I want really fast synchronous flushes, and >> I suspect that the usual set of fs calls that handle fdatasync are >> already quite a bit slower than a vsyscall would be, assuming that no >> MSR write is needed. > > What you are saying is that you want a fixed, allocated range of > persistent memory mapped into the applications address space that > you have direct control of. Yes, we can do that through the > filesystem XIP interface (zero the file via memset() rather than via > unwritten extents) and then fsync the file. The metadata on the file > will then never change, and you can do what you want via mmap from > then onwards. I'd suggest at this point that msync() is the > operation that should then be used to flush the data pages in the > mapped range into the persistence domain. > > >>>> For what it's worth, some of the NV-DIMM systems are supposed to be >>>> configured in such a way that, if power fails, an NMI, SMI, or even >>>> (not really sure) a hardwired thing in the memory controller will >>>> trigger the requisite flush. I don't personally believe in this if >>>> L2/L3 cache are involved (they're too big), but for the little write >>>> buffers and memory controller things, this seems entirely plausible. >>> >>> Right - at the moment we have to assume the persistence domain >>> starts at the NVDIMM and doesn't cover the CPU's internal L* caches. >>> I have no idea if/when we'll be seeing CPUs that have persistent >>> caches, so we have to assume that data is still volatile and can be >>> lost unless it has been specifically synced to persistent memory. >>> i.e. persistent memory does not remove the need for fsync and >>> friends... 
>> I have (NDAed and not entirely convincing) docs indicating a way (on hardware that I don't have access to) to make the caches be part of the persistence domain.
>
> Every platform will implement persistence domain management differently. So we can't assume that what works on one platform is going to work on or be compatible with any other platform....
>
>> I also have non-NDA'd docs that suggest that it's really very fast to flush things through the memory controller. (I would need to time it, though. I do have this hardware, and it more or less works.)
>
> It still takes non-zero time, so there is still scope for data loss on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does CPU failure, hotplug and/or power management affect persistence domains if the CPU cache contains persistent data and it's no longer accessible?
>
> Cheers,
>
> Dave.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
  2014-01-22  8:13 ` Howard Chu
@ 2014-01-23 19:54 ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2014-01-23 19:54 UTC (permalink / raw)
To: Howard Chu; +Cc: Dave Chinner, Linux FS Devel, lsf-pc, linux-mm@kvack.org

On Wed, Jan 22, 2014 at 12:13 AM, Howard Chu <hyc@symas.com> wrote:
> Dave Chinner wrote:
>> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>
>>> If we're using dm-crypt with an NV-DIMM "block" device as cache and a real disk as backing store, then ideally mmap would map the NV-DIMM directly if the data in question lives there.
>>
>> dm-crypt does not use any block device as a cache. You're thinking about dm-cache or bcache. And neither of them operates at the filesystem level or is aware of the difference between filesystem metadata and user data.
>
> Why should that layer need to be aware? A page is a page, as far as they're concerned.

I think that, ideally, the awareness would go the other way. dm-cache (where the backing store is a normal disk but the cache storage is persistent memory) should not care what kind of page it's caching. On the other hand, the filesystem sitting on top of dm-cache should be able to tell when a page (in the device exposed by dm-cache) is actually CPU-addressable so it can avoid allocating yet another copy in pagecache. Similarly, it should be able to be notified when that page is about to stop being CPU-addressable.

This might be an argument to have, in addition to (or as a replacement for) direct_access, XIP ops that ask for a reference to a page and are permitted to fail (i.e. say "sorry, not CPU addressable right now"), plus a way to be notified when a page is going away. (This is totally unnecessary if using something like an NV-DIMM directly -- it's only important for more complicated things like dm-cache.)
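As a strawman, something like this -- purely hypothetical, none of these names exist in the kernel, and the stand-in types are only there to make the fragment self-contained:

typedef unsigned long sector_t;          /* stand-in for the kernel type */
struct block_device;                     /* opaque here */

struct xip_backing_ops {
	/*
	 * Try to get a direct CPU mapping for the page at 'sector'.
	 * Fills *kaddr and *pfn on success; returns -EAGAIN for
	 * "sorry, not CPU addressable right now", e.g. the block
	 * currently lives on the dm-cache backing disk.
	 */
	int (*get_xip_page)(struct block_device *bdev, sector_t sector,
			    void **kaddr, unsigned long *pfn);

	/*
	 * Register a callback the device invokes before a page it
	 * handed out stops being CPU addressable, so the filesystem
	 * can drop direct references and fall back to buffered I/O.
	 */
	int (*register_revoke_cb)(struct block_device *bdev,
				  void (*revoke)(sector_t sector, void *arg),
				  void *arg);
};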
>> But talking about non-existent block layer functionality doesn't answer my question about keeping user data and the filesystem metadata needed to reference that user data coherent in persistent memory...
>
> One of the very useful tools for PCs in the '80s was reset-survivable RAMdisks. Given the existence of persistent memory in a machine, this is a pretty obvious feature to provide.

I think that a file on pmfs or ext4-xip will work like this. Ideally /dev/loop will be XIP-capable if the file it's sitting on top of is XIP.

>>> If that's happening, then, assuming that there are no metadata changes, you could just flush the relevant hw caches. This assumes, of course, no dm-crypt, no btrfs-style checksumming, and, in general, nothing else that would require stable pages or similar things.
>>
>> Well yes. Data IO path transformations are another reason why we'll need the volatile page cache involved in the persistent memory IO path. It follows immediately from this that applications will still require fsync() and other data integrity operations because they have no idea where the persistence domain boundary lives in the IO stack.
>
> And my point, stated a few times now, is that there should be a way for applications to discover the existence and characteristics of persistent memory being used in the system.

Agreed. Or maybe just some very low-level library that exposes a more useful interface (e.g. sync this domain) to applications.
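The shape I have in mind is something like this (an entirely invented interface, just to illustrate; nothing like it exists today):

#include <stdbool.h>
#include <stddef.h>

struct pmem_domain;                      /* opaque handle */

/* Returns NULL if [addr, addr + len) is not persistent memory. */
struct pmem_domain *pmem_domain_of(const void *addr, size_t len);

/* True if the CPU caches are inside the persistence domain, i.e. a
 * plain store is durable once it leaves the store buffer. */
bool pmem_domain_covers_caches(const struct pmem_domain *d);

/* Make all prior stores to [addr, addr + len) durable. Intended to
 * compile down to cache-line flushes or a vsyscall rather than a
 * real syscall per transaction. */
int pmem_domain_sync(struct pmem_domain *d, const void *addr, size_t len);

An application could then stop calling fsync() exactly when pmem_domain_covers_caches() says it's safe to, instead of guessing.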
--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads: [~2014-01-23 19:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-17  0:56 [LSF/MM TOPIC] [ATTEND] Persistent memory Andy Lutomirski
2014-01-17  4:17 ` Howard Chu
2014-01-17 19:22 ` Andy Lutomirski
2014-01-21  7:38 ` Howard Chu
2014-01-21 11:17 ` [Lsf-pc] " Dave Chinner
2014-01-21 13:57 ` Howard Chu
2014-01-21 20:20 ` Dave Chinner
2014-01-21 16:48 ` Andy Lutomirski
2014-01-21 20:36 ` Dave Chinner
2014-01-21 20:59 ` Andy Lutomirski
2014-01-21 23:03 ` Dave Chinner
2014-01-21 23:22 ` Andy Lutomirski
2014-01-22  8:13 ` Howard Chu
2014-01-23 19:54 ` Andy Lutomirski