* NVM Mapping API @ 2012-05-15 13:34 Matthew Wilcox 2012-05-15 17:46 ` Greg KH ` (5 more replies) 0 siblings, 6 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-15 13:34 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-kernel There are a number of interesting non-volatile memory (NVM) technologies being developed. Some of them promise DRAM-comparable latencies and bandwidths. At Intel, we've been thinking about various ways to present those to software. This is a first draft of an API that supports the operations we see as necessary. Patches can follow easily enough once we've settled on an API. We think the appropriate way to present directly addressable NVM to in-kernel users is through a filesystem. Different technologies may want to use different filesystems, or maybe some forms of directly addressable NVM will want to use the same filesystem as each other. For mapping regions of NVM into the kernel address space, we think we need map, unmap, protect and sync operations; see kerneldoc for them below. We also think we need read and write operations (to copy to/from DRAM). The kernel_read() function already exists, and I don't think it would be unreasonable to add its kernel_write() counterpart. We aren't yet proposing a mechanism for carving up the NVM into regions. vfs_truncate() seems like a reasonable API for resizing an NVM region. filp_open() also seems reasonable for turning a name into a file pointer. What we'd really like is for people to think about how they might use fast NVM inside the kernel. There's likely to be a lot of it (at least in servers); all the technologies are promising cheaper per-bit prices than DRAM, so it's likely to be sold in larger capacities than DRAM is today. Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or something else), but I bet there are more radical things we can do with it. What if we stored the inode cache in it? Would booting with a hot inode cache improve boot times? How about storing the tree of 'struct devices' in it so we don't have to rescan the busses at startup? /** * @nvm_filp: The NVM file pointer * @start: The starting offset within the NVM region to be mapped * @length: The number of bytes to map * @protection: Protection bits * @return Pointer to virtual mapping or PTR_ERR on failure * * This call maps a file to a virtual memory address. The start and length * should be page aligned. * * Errors: * EINVAL if start and length are not page aligned. * ENODEV if the file pointer does not point to a mappable file */ void *nvm_map(struct file *nvm_filp, off_t start, size_t length, pgprot_t protection); /** * @addr: The address returned by nvm_map() * * Unmaps a region previously mapped by nvm_map. */ void nvm_unmap(const void *addr); /** * @addr: The first byte to affect * @length: The number of bytes to affect * @protection: The new protection to use * * Updates the protection bits for the corresponding pages. * The start and length must be page aligned, but need not be the entirety * of the mapping. */ void nvm_protect(const void *addr, size_t length, pgprot_t protection); /** * @nvm_filp: The kernel file pointer * @addr: The first byte to sync * @length: The number of bytes to sync * @returns Zero on success, -errno on failure * * Flushes changes made to the in-core copy of a mapped file back to NVM. */ int nvm_sync(struct file *nvm_filp, void *addr, size_t length); ^ permalink raw reply [flat|nested] 27+ messages in thread
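To make the proposed calling sequence concrete, here is a minimal sketch of how an in-kernel consumer might use the four calls above together with filp_open()/filp_close(). The mount point "/nvm", the file name "cache0" and the wrapper function names are invented for illustration, error handling is abbreviated, and nothing here should be read as a settled interface.

#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/string.h>

static struct file *nvm_filp;
static void *nvm_base;

static int nvm_cache_attach(size_t length)
{
	/* "/nvm/cache0" is a made-up region file under a mounted NVM filesystem */
	nvm_filp = filp_open("/nvm/cache0", O_RDWR, 0);
	if (IS_ERR(nvm_filp))
		return PTR_ERR(nvm_filp);

	/* start and length must be page aligned */
	nvm_base = nvm_map(nvm_filp, 0, length, PAGE_KERNEL);
	if (IS_ERR(nvm_base)) {
		int err = PTR_ERR(nvm_base);

		filp_close(nvm_filp, NULL);
		return err;
	}
	return 0;
}

static int nvm_cache_update(off_t offset, const void *data, size_t len)
{
	memcpy((char *)nvm_base + offset, data, len);

	/* make the store durable before reporting success to anyone */
	return nvm_sync(nvm_filp, (char *)nvm_base + offset, len);
}

static void nvm_cache_detach(void)
{
	nvm_unmap(nvm_base);
	filp_close(nvm_filp, NULL);
}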
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox @ 2012-05-15 17:46 ` Greg KH 2012-05-16 15:57 ` Matthew Wilcox 2012-05-15 23:02 ` Andy Lutomirski ` (4 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Greg KH @ 2012-05-15 17:46 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. This is a first draft of an API that supports the > operations we see as necessary. Patches can follow easily enough once > we've settled on an API. > > We think the appropriate way to present directly addressable NVM to > in-kernel users is through a filesystem. Different technologies may want > to use different filesystems, or maybe some forms of directly addressable > NVM will want to use the same filesystem as each other. > > For mapping regions of NVM into the kernel address space, we think we need > map, unmap, protect and sync operations; see kerneldoc for them below. > We also think we need read and write operations (to copy to/from DRAM). > The kernel_read() function already exists, and I don't think it would > be unreasonable to add its kernel_write() counterpart. > > We aren't yet proposing a mechanism for carving up the NVM into regions. > vfs_truncate() seems like a reasonable API for resizing an NVM region. > filp_open() also seems reasonable for turning a name into a file pointer. > > What we'd really like is for people to think about how they might use > fast NVM inside the kernel. There's likely to be a lot of it (at least in > servers); all the technologies are promising cheaper per-bit prices than > DRAM, so it's likely to be sold in larger capacities than DRAM is today. > > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or > something else), but I bet there are more radical things we can do > with it. What if we stored the inode cache in it? Would booting with > a hot inode cache improve boot times? How about storing the tree of > 'struct devices' in it so we don't have to rescan the busses at startup? Rescanning the busses at startup are required anyway, as devices can be added and removed when the power is off, and I would be amazed if that is actually taking any measurable time. Do you have any numbers for this for different busses? What about pramfs for the nvram? I have a recent copy of the patches, and I think they are clean enough for acceptance, there was no complaints the last time it was suggested. Can you use that for this type of hardware? thanks, greg k-h ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-15 17:46 ` Greg KH @ 2012-05-16 15:57 ` Matthew Wilcox 2012-05-18 12:07 ` Marco Stornelli 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2012-05-16 15:57 UTC (permalink / raw) To: Greg KH; +Cc: linux-fsdevel, linux-kernel On Tue, May 15, 2012 at 10:46:39AM -0700, Greg KH wrote: > On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote: > > What we'd really like is for people to think about how they might use > > fast NVM inside the kernel. There's likely to be a lot of it (at least in > > servers); all the technologies are promising cheaper per-bit prices than > > DRAM, so it's likely to be sold in larger capacities than DRAM is today. > > > > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or > > something else), but I bet there are more radical things we can do > > with it. What if we stored the inode cache in it? Would booting with > > a hot inode cache improve boot times? How about storing the tree of > > 'struct devices' in it so we don't have to rescan the busses at startup? > > Rescanning the busses at startup are required anyway, as devices can be > added and removed when the power is off, and I would be amazed if that > is actually taking any measurable time. Do you have any numbers for > this for different busses? Hi Greg, I wasn't particularly serious about this example ... I did once time the scan of a PCIe bus and it took a noticable number of milliseconds (which is why we now only scan the first device for the downstream "bus" of root ports and downstream ports). I'm just trying to stimulate a bit of discussion of possible usages for persistent memory. > What about pramfs for the nvram? I have a recent copy of the patches, > and I think they are clean enough for acceptance, there was no > complaints the last time it was suggested. Can you use that for this > type of hardware? pramfs is definitely one filesystem that's under investigation. I know there will be types of NVM for which it won't be suitable, so rather than people calling pramfs-specific functions, the notion is to get a core API in the VFS that can call into the various different filesystems that can handle the vagaries of different types of NVM. Thanks. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 15:57 ` Matthew Wilcox @ 2012-05-18 12:07 ` Marco Stornelli 0 siblings, 0 replies; 27+ messages in thread From: Marco Stornelli @ 2012-05-18 12:07 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Greg KH, linux-fsdevel, linux-kernel 2012/5/16 Matthew Wilcox <willy@linux.intel.com>: > On Tue, May 15, 2012 at 10:46:39AM -0700, Greg KH wrote: >> On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote: >> > What we'd really like is for people to think about how they might use >> > fast NVM inside the kernel. There's likely to be a lot of it (at least in >> > servers); all the technologies are promising cheaper per-bit prices than >> > DRAM, so it's likely to be sold in larger capacities than DRAM is today. >> > >> > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or >> > something else), but I bet there are more radical things we can do >> > with it. What if we stored the inode cache in it? Would booting with >> > a hot inode cache improve boot times? How about storing the tree of >> > 'struct devices' in it so we don't have to rescan the busses at startup? >> >> Rescanning the busses at startup are required anyway, as devices can be >> added and removed when the power is off, and I would be amazed if that >> is actually taking any measurable time. Do you have any numbers for >> this for different busses? > > Hi Greg, > > I wasn't particularly serious about this example ... I did once time > the scan of a PCIe bus and it took a noticable number of milliseconds > (which is why we now only scan the first device for the downstream "bus" > of root ports and downstream ports). > > I'm just trying to stimulate a bit of discussion of possible usages for > persistent memory. > >> What about pramfs for the nvram? I have a recent copy of the patches, >> and I think they are clean enough for acceptance, there was no >> complaints the last time it was suggested. Can you use that for this >> type of hardware? > > pramfs is definitely one filesystem that's under investigation. I know > there will be types of NVM for which it won't be suitable, so rather For example? > than people calling pramfs-specific functions, the notion is to get a > core API in the VFS that can call into the various different filesystems > that can handle the vagaries of different types of NVM. > The idea could be good but I have doubt about it. Any fs is designed for a specific environment, to provide VFS api to manage NVM is not enough. I mean, a fs designed to reduce the seek time on hd, it adds not needed complexity for this kind of environment. Maybe the goal could be only for a "specific" support, for the journal for example. Marco ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox 2012-05-15 17:46 ` Greg KH @ 2012-05-15 23:02 ` Andy Lutomirski 2012-05-16 16:02 ` Matthew Wilcox 2012-05-16 6:24 ` Vyacheslav Dubeyko ` (3 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2012-05-15 23:02 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On 05/15/2012 06:34 AM, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. This is a first draft of an API that supports the > operations we see as necessary. Patches can follow easily enough once > we've settled on an API. > > We think the appropriate way to present directly addressable NVM to > in-kernel users is through a filesystem. Different technologies may want > to use different filesystems, or maybe some forms of directly addressable > NVM will want to use the same filesystem as each other. > What we'd really like is for people to think about how they might use > fast NVM inside the kernel. There's likely to be a lot of it (at least in > servers); all the technologies are promising cheaper per-bit prices than > DRAM, so it's likely to be sold in larger capacities than DRAM is today. > > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or > something else), but I bet there are more radical things we can do > with it. What if we stored the inode cache in it? Would booting with > a hot inode cache improve boot times? How about storing the tree of > 'struct devices' in it so we don't have to rescan the busses at startup? > I would love to use this from userspace. If I could carve out a little piece of NVM as a file (or whatever) and mmap it, I could do all kinds of fun things with that. It would be nice if it had well-defined, or at least configurable or discoverable, caching properties (e.g. WB, WT, WC, UC, etc.). (Even better would be a way to make a clone of an fd that only allows mmap, but that's a mostly unrelated issue.) --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-15 23:02 ` Andy Lutomirski @ 2012-05-16 16:02 ` Matthew Wilcox 2012-05-31 17:53 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2012-05-16 16:02 UTC (permalink / raw) To: Andy Lutomirski; +Cc: linux-fsdevel, linux-kernel On Tue, May 15, 2012 at 04:02:01PM -0700, Andy Lutomirski wrote: > I would love to use this from userspace. If I could carve out a little > piece of NVM as a file (or whatever) and mmap it, I could do all kinds > of fun things with that. It would be nice if it had well-defined, or at > least configurable or discoverable, caching properties (e.g. WB, WT, WC, > UC, etc.). Yes, usage from userspace is definitely planned; again through a filesystem interface. Treating it like a regular file will work as expected; the question is how to expose the interesting properties (eg is there a lighter weight mechanism than calling msync()). My hope was that by having a discussion of how to use this stuff within the kernel, we might come up with some usage models that would inform how we design a user space library. > (Even better would be a way to make a clone of an fd that only allows > mmap, but that's a mostly unrelated issue.) O_MMAP_ONLY? And I'm not sure why you'd want to forbid reads and writes. ^ permalink raw reply [flat|nested] 27+ messages in thread
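For reference, the userspace model being described in the reply above is just the ordinary file API. A short sketch, assuming an NVM-backed filesystem mounted at an invented path /mnt/nvm; the only open question it leaves is the one raised above, namely whether something lighter than msync() is possible.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_LEN (1UL << 20)

int nvm_example(void)
{
	/* "/mnt/nvm/log" is an invented file on an NVM-backed filesystem */
	int fd = open("/mnt/nvm/log", O_RDWR);
	char *p;

	if (fd < 0)
		return -1;

	p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	strcpy(p, "this should survive a reboot");

	/*
	 * msync() is the portable way to ask for durability today; the open
	 * question in the thread is whether something lighter-weight exists.
	 */
	msync(p, REGION_LEN, MS_SYNC);

	munmap(p, REGION_LEN);
	close(fd);
	return 0;
}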
* Re: NVM Mapping API 2012-05-16 16:02 ` Matthew Wilcox @ 2012-05-31 17:53 ` Andy Lutomirski 0 siblings, 0 replies; 27+ messages in thread From: Andy Lutomirski @ 2012-05-31 17:53 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 9:02 AM, Matthew Wilcox <willy@linux.intel.com> wrote: > On Tue, May 15, 2012 at 04:02:01PM -0700, Andy Lutomirski wrote: >> I would love to use this from userspace. If I could carve out a little >> piece of NVM as a file (or whatever) and mmap it, I could do all kinds >> of fun things with that. It would be nice if it had well-defined, or at >> least configurable or discoverable, caching properties (e.g. WB, WT, WC, >> UC, etc.). > > Yes, usage from userspace is definitely planned; again through a > filesystem interface. Treating it like a regular file will work as > expected; the question is how to expose the interesting properties > (eg is there a lighter weight mechanism than calling msync()). clflush? vdso system call? If there's a proliferation of different technologies like this, we could have an opaque struct nvm_mapping and a vdso call like void __vdso_nvm_flush_writes(struct nvm_mapping *mapping, void *address, size_t len); that would read the struct nvm_mapping to figure out whether it should do a clflush, sfence, mfence, posting read, or whatever else the particular device needs. (This would also give a much better chance of portability to architectures other than x86.) > > My hope was that by having a discussion of how to use this stuff within > the kernel, we might come up with some usage models that would inform > how we design a user space library. > >> (Even better would be a way to make a clone of an fd that only allows >> mmap, but that's a mostly unrelated issue.) > > O_MMAP_ONLY? And I'm not sure why you'd want to forbid reads and writes. I don't want to forbid reads and writes; I want to forbid ftruncate. That way I don't need to worry about malicious / obnoxious programs sharing the fd causing SIGBUS. --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
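To illustrate the kind of lighter-weight flush being discussed: on x86 with a write-back mapping, making a store durable roughly amounts to flushing the affected cache lines and fencing. Below is a sketch of the pattern such a vdso helper might hide behind a device-specific choice. The function name is invented, a 64-byte cache line is assumed, and this is not a proposed interface.

#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed; real code would query the CPU */

static void nvm_flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	/* write each touched cache line back to the memory device */
	for (; p < end; p += CACHELINE)
		asm volatile("clflush %0" : "+m" (*(volatile char *)p));

	/* order the flushes before any later store that publishes the data */
	asm volatile("sfence" ::: "memory");
}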
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox 2012-05-15 17:46 ` Greg KH 2012-05-15 23:02 ` Andy Lutomirski @ 2012-05-16 6:24 ` Vyacheslav Dubeyko 2012-05-16 16:10 ` Matthew Wilcox 2012-05-16 21:58 ` Benjamin LaHaise 2012-05-16 9:52 ` James Bottomley ` (2 subsequent siblings) 5 siblings, 2 replies; 27+ messages in thread From: Vyacheslav Dubeyko @ 2012-05-16 6:24 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel Hi, On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. Could you please share your vision of these NVM technologies in more detail? What capacity in bytes of one NVM unit can we expect? What about bad blocks and any other reliability issues of such NVM technologies? I think that a deeper understanding of this would make it possible to imagine the niche such NVM units could occupy in a future memory subsystem architecture. With the best regards, Vyacheslav Dubeyko. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 6:24 ` Vyacheslav Dubeyko @ 2012-05-16 16:10 ` Matthew Wilcox 2012-05-17 9:06 ` Vyacheslav Dubeyko 2012-05-16 21:58 ` Benjamin LaHaise 1 sibling, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2012-05-16 16:10 UTC (permalink / raw) To: Vyacheslav Dubeyko; +Cc: linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 10:24:13AM +0400, Vyacheslav Dubeyko wrote: > On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > > being developed. Some of them promise DRAM-comparable latencies and > > bandwidths. At Intel, we've been thinking about various ways to present > > those to software. > > Could you please share vision of these NVM technologies in more details? > What capacity in bytes of of one NVM unit do we can expect? What about > bad blocks and any other reliability issues of such NVM technologies? No, I can't comment on any of that. This isn't about any particular piece of technology; it's an observation that there are a lot of technologies that seem to fit in this niche; some of them are even available to buy today. No statement of mine should be taken as an indication of any future Intel product plans :-) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 16:10 ` Matthew Wilcox @ 2012-05-17 9:06 ` Vyacheslav Dubeyko 0 siblings, 0 replies; 27+ messages in thread From: Vyacheslav Dubeyko @ 2012-05-17 9:06 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel Hi, > No, I can't comment on any of that. This isn't about any particular piece > of technology; it's an observation that there are a lot of technologies > that seem to fit in this niche; some of them are even available to > buy today. > > No statement of mine should be taken as an indication of any future > Intel product plans :-) > Ok. I understand. :-) > > There are a number of interesting non-volatile memory (NVM) technologies > > > being developed. Some of them promise DRAM-comparable latencies and > > > bandwidths. At Intel, we've been thinking about various ways to present > > > those to software. > > We can be more and more radical in the case of new NVM technologies, I think. The non-volatile random access memory with DRAM-comparable read and write operations' latencies can change computer world dramatically. Just imagine a computer system with only NVM memory subsystem (for example, it can be very promising mobile solution). It means that we can forget about specified RAM and persistent storage solutions. We can keep run-time and persistent information in one place and operate it on the fly. Moreover, it means that we can keep any internal OS's state persistently without any special efforts. I think that it can open new very interesting opportunities. The initial purpose of a filesystem is to distinguish run-time and persistent information. Usually, we have slow persistent memory subsystem (HDD) and fast run-time memory subsystem (DRAM). Filesystem is a technique of synchronization a slow persistent memory subsystem with fast run-time memory subsystem. But if we will have a fast memory that can keep run-time and persistent information then it means a revolutionary new approach in memory architecture. It means that two different entities (run-time and persistent) can be one union. But for such joined information entity traditional filesystems' and OS's internal techniques are not adequate approaches. We need in revolutionary new approaches. From NVM technology point of view, we can be without filesystem completely, but, from usual user point of view, modern computer system can't be imagined without filesystem. We need in filesystem as a catalogue of our persistent information. But OS can be represented as catalogue of run-time information. Then, with NVM technologies, the OS and filesystem can be a union entity that keeps as persistent as run-time information in one catalogue structure. But such representation needs in dramatically reworking of OS internal techniques. It means that traditional hierarchy of folders and files is obsolete. We need in a new information structure approaches. Theoretically, it is possible to reinterpret all information as run-time and to use OS's technique of internal objects structure. But it is impossible situation from end users point of view. So, we need in filesystem layer anyway as layer which represent user information and structure of it. If we can operate and keep internal OS representation of information then it means that we can reject file abstraction. We can operate by information itself and keep information without using different files' formats. But it is known that all in Linux is a file. Then, factually, we can talk about completely new OS. 
Actually, NVM technologies could make it possible not to boot the OS at all. Why boot if any OS state can be kept in memory persistently? I think OS booting could become an obsolete thing. Moreover, swapping could be dropped completely because all of our memory can be persistent. And for a system with only NVM memory, the request queue and I/O scheduler could become obsolete as well. I think the kernel's memory page approach could be redesigned significantly, too. Shared libraries could become unnecessary because all code can stay resident in memory. So, I think everything I have said may sound like pure fantasy. But maybe what needs discussing is a new OS instead of a new filesystem. :-) With the best regards, Vyacheslav Dubeyko. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 6:24 ` Vyacheslav Dubeyko 2012-05-16 16:10 ` Matthew Wilcox @ 2012-05-16 21:58 ` Benjamin LaHaise 2012-05-17 19:06 ` Matthew Wilcox 1 sibling, 1 reply; 27+ messages in thread From: Benjamin LaHaise @ 2012-05-16 21:58 UTC (permalink / raw) To: Vyacheslav Dubeyko; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 10:24:13AM +0400, Vyacheslav Dubeyko wrote: > Could you please share vision of these NVM technologies in more details? > What capacity in bytes of of one NVM unit do we can expect? What about > bad blocks and any other reliability issues of such NVM technologies? > > I think that some more deep understanding of this can give possibility > to imagine more deeply possible niche of such NVM units in future memory > subsystem architecture. Try having a look at the various articles on ReRAM, PRAM, FeRAM, MRAM... There are a number of technologies being actively developed. For some quick info, Samsung has presented data on an 8Gbit 20nm device (see http://www.eetimes.com/electronics-news/4230958/ISSCC--Samsung-preps-8-Gbit-phase-change-memory ). It's hard to predict who will be first to market with a real production volume product, though. The big question I have is what the actual interface for these types of memory will be. If they're like actual RAM and can be mmap()ed into user space, it will be preferable to avoid as much of the overhead of the existing block infrastructure that most current day filesystems are built on top of. If the devices have only modest endurance limits, we may need to stick the kernel in the middle to prevent malicious code from wearing out a user's memory cells. -ben ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 21:58 ` Benjamin LaHaise @ 2012-05-17 19:06 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-17 19:06 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Vyacheslav Dubeyko, linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 05:58:49PM -0400, Benjamin LaHaise wrote: > The big question I have is what the actual interface for these types of > memory will be. If they're like actual RAM and can be mmap()ed into user > space, it will be preferable to avoid as much of the overhead of the existing > block infrastructure that most current day filesystems are built on top of. Yes. I'm hoping that filesystem developers will indicate enthusiasm for moving to new APIs. If not the ones I've proposed, then at least ones which can be implemented more efficiently with a device that looks like DRAM. > If the devices have only modest endurance limits, we may need to stick the > kernel in the middle to prevent malicious code from wearing out a user's > memory cells. Yes, or if the device has long write latencies or poor write bandwidth, we'll also want to buffer writes in DRAM. My theory is that this is doable transparently to the user; we can map it read-only, and handle the fault by copying from NVM to DRAM, then changing the mapping and restarting the instruction. The page would be written back to NVM on a sync call, or when memory pressure or elapsed time dictates. ^ permalink raw reply [flat|nested] 27+ messages in thread
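One simplified way to express the copy step described above is a vm_operations fault handler that hands back a DRAM copy of the NVM page, with the sync path copying dirty pages back. In the sketch below, struct nvm_region and both function names are invented, and all locking, page-cache bookkeeping, mapping coherence and error handling are omitted; it only shows where the two memcpy()s would live.

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

struct nvm_region {		/* invented for this sketch */
	void *nvm_base;		/* kernel mapping of the raw NVM */
};

static int nvmbuf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct nvm_region *r = vma->vm_private_data;
	struct page *page;
	void *dst;

	page = alloc_page(GFP_KERNEL);
	if (!page)
		return VM_FAULT_OOM;

	/* fill the DRAM page from the slower NVM backing store */
	dst = kmap(page);
	memcpy(dst, (char *)r->nvm_base + (vmf->pgoff << PAGE_SHIFT), PAGE_SIZE);
	kunmap(page);

	vmf->page = page;	/* the mapping now points at DRAM, not NVM */
	return 0;
}

/* called for each dirty DRAM page on nvm_sync()/msync() or under pressure */
static void nvmbuf_writeback(struct nvm_region *r, struct page *page,
			     pgoff_t index)
{
	void *src = kmap(page);

	memcpy((char *)r->nvm_base + (index << PAGE_SHIFT), src, PAGE_SIZE);
	kunmap(page);
}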
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox ` (2 preceding siblings ...) 2012-05-16 6:24 ` Vyacheslav Dubeyko @ 2012-05-16 9:52 ` James Bottomley 2012-05-16 17:35 ` Matthew Wilcox 2012-05-16 13:04 ` Boaz Harrosh 2012-05-18 9:33 ` Arnd Bergmann 5 siblings, 1 reply; 27+ messages in thread From: James Bottomley @ 2012-05-16 9:52 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. This is a first draft of an API that supports the > operations we see as necessary. Patches can follow easily enough once > we've settled on an API. If we start from first principles, does this mean it's usable as DRAM? Meaning do we even need a non-memory API for it? The only difference would be that some pieces of our RAM become non-volatile. Or is there some impediment (like durability, or degradation on rewrite) which makes this unsuitable as a complete DRAM replacement? > We think the appropriate way to present directly addressable NVM to > in-kernel users is through a filesystem. Different technologies may want > to use different filesystems, or maybe some forms of directly addressable > NVM will want to use the same filesystem as each other. If it's actually DRAM, I'd present it as DRAM and figure out how to label the non volatile property instead. Alternatively, if it's not really DRAM, I think the UNIX file abstraction makes sense (it's a piece of memory presented as something like a filehandle with open, close, seek, read, write and mmap), but it's less clear that it should be an actual file system. The reason is that to present a VFS interface, you have to already have fixed the format of the actual filesystem on the memory because we can't nest filesystems (well, not without doing artificial loopbacks). Again, this might make sense if there's some architectural reason why the flash region has to have a specific layout, but your post doesn't shed any light on this. James ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 9:52 ` James Bottomley @ 2012-05-16 17:35 ` Matthew Wilcox 2012-05-16 19:58 ` Christian Stroetmann 2012-05-17 9:54 ` James Bottomley 0 siblings, 2 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-16 17:35 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote: > On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > > being developed. Some of them promise DRAM-comparable latencies and > > bandwidths. At Intel, we've been thinking about various ways to present > > those to software. This is a first draft of an API that supports the > > operations we see as necessary. Patches can follow easily enough once > > we've settled on an API. > > If we start from first principles, does this mean it's usable as DRAM? > Meaning do we even need a non-memory API for it? The only difference > would be that some pieces of our RAM become non-volatile. I'm not talking about a specific piece of technology, I'm assuming that one of the competing storage technologies will eventually make it to widespread production usage. Let's assume what we have is DRAM with a giant battery on it. So, while we can use it just as DRAM, we're not taking advantage of the persistent aspect of it if we don't have an API that lets us find the data we wrote before the last reboot. And that sounds like a filesystem to me. > Or is there some impediment (like durability, or degradation on rewrite) > which makes this unsuitable as a complete DRAM replacement? The idea behind using a different filesystem for different NVM types is that we can hide those kinds of impediments in the filesystem. By the way, did you know DRAM degrades on every write? I think it's on the order of 10^20 writes (and CPU caches hide many writes to heavily-used cache lines), so it's a long way away from MLC or even SLC rates, but it does exist. > Alternatively, if it's not really DRAM, I think the UNIX file > abstraction makes sense (it's a piece of memory presented as something > like a filehandle with open, close, seek, read, write and mmap), but > it's less clear that it should be an actual file system. The reason is > that to present a VFS interface, you have to already have fixed the > format of the actual filesystem on the memory because we can't nest > filesystems (well, not without doing artificial loopbacks). Again, this > might make sense if there's some architectural reason why the flash > region has to have a specific layout, but your post doesn't shed any > light on this. We can certainly present a block interface to allow using unmodified standard filesystems on top of chunks of this NVM. That's probably not the optimum way for a filesystem to use it though; there's really no point in constructing a bio to carry data down to a layer that's simply going to do a memcpy(). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 17:35 ` Matthew Wilcox @ 2012-05-16 19:58 ` Christian Stroetmann 2012-05-19 22:19 ` Christian Stroetmann 2012-05-17 9:54 ` James Bottomley 1 sibling, 1 reply; 27+ messages in thread From: Christian Stroetmann @ 2012-05-16 19:58 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel Hello Hardcore Coders, I wanted to step into the discussion already yesterday, but ... I was afraid to be rude in doing so. On We, May 16, 2012 at 19:35, Matthew Wilcox wrote: > On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote: >> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: >>> There are a number of interesting non-volatile memory (NVM) technologies >>> being developed. Some of them promise DRAM-comparable latencies and >>> bandwidths. At Intel, we've been thinking about various ways to present >>> those to software. This is a first draft of an API that supports the >>> operations we see as necessary. Patches can follow easily enough once >>> we've settled on an API. >> If we start from first principles, does this mean it's usable as DRAM? >> Meaning do we even need a non-memory API for it? The only difference >> would be that some pieces of our RAM become non-volatile. > I'm not talking about a specific piece of technology, I'm assuming that > one of the competing storage technologies will eventually make it to > widespread production usage. Let's assume what we have is DRAM with a > giant battery on it. Our ST-RAM (see [1] for the original source of its description) is a concept based on the combination of a writable volatile Random-Access Memory (RAM) chip and a capacitor. Either an adapter, which has a capacitor, is placed between a motherboard and a memory modul, the memory chip is simply connected with a capacitor, or a RAM chip is directly integrated with a chip capacitor. Also, the capacitor could be an element that is integrated directly with the rest of a RAM chip. While a computer system is running, the capacitor is charged with electric power, so that after a computing system is switched off the memory module will still be supported with needed power out of the capacitor and in this way the content of the memory is not lost. In this way a computing system has not to be booted in most of the normal use cases after it is switched on again. Boaz asked: "What is the difference from say a PCIE DRAM card with battery"? It sits in the RAM slot. > > So, while we can use it just as DRAM, we're not taking advantage of the > persistent aspect of it if we don't have an API that lets us find the > data we wrote before the last reboot. And that sounds like a filesystem > to me. No and yes. 1. In the first place it is just a normal DRAM. 2. But due to its nature it has also many aspects of a flash memory. So the use case is for point 1. as a normal RAM module, and for point 2. as a file system, which again can be used 2.1 directly by the kernel as a normal file system, 2.2 directly by the kernel by the PRAMFS 2.3 by the proposed NVMFS, maybe as a shortcut for optimization, and 2.4 from the userspace, most potentially by using the standard VFS. Maybe this version 2.4 is the same as point 2.2. >> Or is there some impediment (like durability, or degradation on rewrite) >> which makes this unsuitable as a complete DRAM replacement? > The idea behind using a different filesystem for different NVM types is > that we can hide those kinds of impediments in the filesystem. By the > way, did you know DRAM degrades on every write? 
I think it's on the > order of 10^20 writes (and CPU caches hide many writes to heavily-used > cache lines), so it's a long way away from MLC or even SLC rates, but > it does exist. As I said before, a filesystem for the different NVM types would not be enough. These things are more complex due the possibility that they can be used very flexbily. > >> Alternatively, if it's not really DRAM, I think the UNIX file >> abstraction makes sense (it's a piece of memory presented as something >> like a filehandle with open, close, seek, read, write and mmap), but >> it's less clear that it should be an actual file system. The reason is >> that to present a VFS interface, you have to already have fixed the >> format of the actual filesystem on the memory because we can't nest >> filesystems (well, not without doing artificial loopbacks). Again, this >> might make sense if there's some architectural reason why the flash >> region has to have a specific layout, but your post doesn't shed any >> light on this. > We can certainly present a block interface to allow using unmodified > standard filesystems on top of chunks of this NVM. That's probably not > the optimum way for a filesystem to use it though; there's really no > point in constructing a bio to carry data down to a layer that's simply > going to do a memcpy(). > -- I also saw the use cases by Boaz that are Journals of other FS, which could be done on top of the NVMFS for example, but is not really what I have in mind, and Execute in place, for which an Elf loader feature is needed. Obviously, this use case was envisioned by me as well. For direct rebooting the checkpointing of standard RAM is also a needed function. The decision what is trashed and what is marked as persistent RAM content has to be made by the RAM experts of the Linux developers or the user. I even think that this is a special use case on its own with many options. With all the best C. Stroetmann [1] ST-RAM www.ontonics.com/innovation/pipeline.htm#st-ram ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 19:58 ` Christian Stroetmann @ 2012-05-19 22:19 ` Christian Stroetmann 0 siblings, 0 replies; 27+ messages in thread From: Christian Stroetmann @ 2012-05-19 22:19 UTC (permalink / raw) To: Christian Stroetmann; +Cc: linux-kernel, linux-fsdevel On We, May 16, 2012 at 21:58, Christian Stroetmann wrote: > On We, May 16, 2012 at 19:35, Matthew Wilcox wrote: >> On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote: >>> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: >>>> There are a number of interesting non-volatile memory (NVM) >>>> technologies >>>> being developed. Some of them promise DRAM-comparable latencies and >>>> bandwidths. At Intel, we've been thinking about various ways to >>>> present >>>> those to software. This is a first draft of an API that supports the >>>> operations we see as necessary. Patches can follow easily enough once >>>> we've settled on an API. >>> If we start from first principles, does this mean it's usable as DRAM? >>> Meaning do we even need a non-memory API for it? The only difference >>> would be that some pieces of our RAM become non-volatile. >> I'm not talking about a specific piece of technology, I'm assuming that >> one of the competing storage technologies will eventually make it to >> widespread production usage. Let's assume what we have is DRAM with a >> giant battery on it. > Our ST-RAM (see [1] for the original source of its description) is a > concept based on the combination of a writable volatile Random-Access > Memory (RAM) chip and a capacitor. [...] > Boaz asked: "What is the difference from say a PCIE DRAM card with > battery"? It sits in the RAM slot. > > >> >> So, while we can use it just as DRAM, we're not taking advantage of the >> persistent aspect of it if we don't have an API that lets us find the >> data we wrote before the last reboot. And that sounds like a filesystem >> to me. > > No and yes. > 1. In the first place it is just a normal DRAM. > 2. But due to its nature it has also many aspects of a flash memory. > So the use case is for point > 1. as a normal RAM module, > and for point > 2. as a file system, > which again can be used > 2.1 directly by the kernel as a normal file system, > 2.2 directly by the kernel by the PRAMFS > 2.3 by the proposed NVMFS, maybe as a shortcut for optimization, > and > 2.4 from the userspace, most potentially by using the standard VFS. > Maybe this version 2.4 is the same as point 2.2. > >>> Or is there some impediment (like durability, or degradation on >>> rewrite) >>> which makes this unsuitable as a complete DRAM replacement? >> The idea behind using a different filesystem for different NVM types is >> that we can hide those kinds of impediments in the filesystem. By the >> way, did you know DRAM degrades on every write? I think it's on the >> order of 10^20 writes (and CPU caches hide many writes to heavily-used >> cache lines), so it's a long way away from MLC or even SLC rates, but >> it does exist. > > As I said before, a filesystem for the different NVM types would not > be enough. These things are more complex due the possibility that they > can be used very flexbily. > >> >>> Alternatively, if it's not really DRAM, I think the UNIX file >>> abstraction makes sense (it's a piece of memory presented as something >>> like a filehandle with open, close, seek, read, write and mmap), but >>> it's less clear that it should be an actual file system. 
The reason is >>> that to present a VFS interface, you have to already have fixed the >>> format of the actual filesystem on the memory because we can't nest >>> filesystems (well, not without doing artificial loopbacks). Again, >>> this >>> might make sense if there's some architectural reason why the flash >>> region has to have a specific layout, but your post doesn't shed any >>> light on this. >> We can certainly present a block interface to allow using unmodified >> standard filesystems on top of chunks of this NVM. That's probably not >> the optimum way for a filesystem to use it though; there's really no >> point in constructing a bio to carry data down to a layer that's simply >> going to do a memcpy(). >> -- > > I also saw the use cases by Boaz that are > Journals of other FS, which could be done on top of the NVMFS for > example, but is not really what I have in mind, and > Execute in place, for which an Elf loader feature is needed. > Obviously, this use case was envisioned by me as well. > > For direct rebooting the checkpointing of standard RAM is also a > needed function. The decision what is trashed and what is marked as > persistent RAM content has to be made by the RAM experts of the Linux > developers or the user. I even think that this is a special use case > on its own with many options. > Because it is now about a year since I played around with the conceptual hardware aspects of an Uninterruptible Power RAM (UPRAM) like the ST-RAM, I looked in more detail at the software side yesterday and today. So let me please add the first use case that I had in mind last year and have now coined: Hybrid Hibernation (HyHi), or alternatively Suspend-to-NVM, which is similar to hybrid sleep and hibernation but also differs a little due to the uninterruptible power feature. But as can easily be seen here again, even with this one use case there are two paths to handle the NVM, namely 1. RAM and 2. FS, which leads once more to the discussion of whether hibernation should be a kernel or a user space function (see [1] and [2] for more information on the discussion about uswsusp (userspace software suspend) and suspend2, and [3] for uswsusp and [4] for TuxOnIce). Eventually, there may be an interest in reusing some functions or code. Have fun in the sun C. Stroetmann > [1] ST-RAM www.ontonics.com/innovation/pipeline.htm#st-ram > [1] LKML: Pavel Machek: RE: suspend2 merge lkml.org/lkml/2007/4/24/405 [2] KernelTrap: Linux: Reviewing Suspend2 kerneltrap.org/node/6766 [3] suspend.sourceforge.net [4] tuxonice.net ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-16 17:35 ` Matthew Wilcox 2012-05-16 19:58 ` Christian Stroetmann @ 2012-05-17 9:54 ` James Bottomley 2012-05-17 18:59 ` Matthew Wilcox 1 sibling, 1 reply; 27+ messages in thread From: James Bottomley @ 2012-05-17 9:54 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote: > On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote: > > On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote: > > > There are a number of interesting non-volatile memory (NVM) technologies > > > being developed. Some of them promise DRAM-comparable latencies and > > > bandwidths. At Intel, we've been thinking about various ways to present > > > those to software. This is a first draft of an API that supports the > > > operations we see as necessary. Patches can follow easily enough once > > > we've settled on an API. > > > > If we start from first principles, does this mean it's usable as DRAM? > > Meaning do we even need a non-memory API for it? The only difference > > would be that some pieces of our RAM become non-volatile. > > I'm not talking about a specific piece of technology, I'm assuming that > one of the competing storage technologies will eventually make it to > widespread production usage. Let's assume what we have is DRAM with a > giant battery on it. > > So, while we can use it just as DRAM, we're not taking advantage of the > persistent aspect of it if we don't have an API that lets us find the > data we wrote before the last reboot. And that sounds like a filesystem > to me. Well, it sounds like a unix file to me rather than a filesystem (it's a flat region with a beginning and end and no structure in between). However, I'm not precluding doing this, I'm merely asking that if it looks and smells like DRAM with the only additional property being persistency, shouldn't we begin with the memory APIs and see if we can add persistency to them? Imposing a VFS API looks slightly wrong to me because it's effectively a flat region, not a hierarchical tree structure, like a FS. If all the use cases are hierarchical trees, that might be appropriate, but there hasn't really been any discussion of use cases. > > Or is there some impediment (like durability, or degradation on rewrite) > > which makes this unsuitable as a complete DRAM replacement? > > The idea behind using a different filesystem for different NVM types is > that we can hide those kinds of impediments in the filesystem. By the > way, did you know DRAM degrades on every write? I think it's on the > order of 10^20 writes (and CPU caches hide many writes to heavily-used > cache lines), so it's a long way away from MLC or even SLC rates, but > it does exist. So are you saying does or doesn't have an impediment to being used like DRAM? > > Alternatively, if it's not really DRAM, I think the UNIX file > > abstraction makes sense (it's a piece of memory presented as something > > like a filehandle with open, close, seek, read, write and mmap), but > > it's less clear that it should be an actual file system. The reason is > > that to present a VFS interface, you have to already have fixed the > > format of the actual filesystem on the memory because we can't nest > > filesystems (well, not without doing artificial loopbacks). Again, this > > might make sense if there's some architectural reason why the flash > > region has to have a specific layout, but your post doesn't shed any > > light on this. 
> > We can certainly present a block interface to allow using unmodified > standard filesystems on top of chunks of this NVM. That's probably not > the optimum way for a filesystem to use it though; there's really no > point in constructing a bio to carry data down to a layer that's simply > going to do a memcpy(). I think we might be talking at cross purposes. If you use the memory APIs, this looks something like an anonymous region of memory with a get and put API; something like SYSV shm if you like except that it's persistent. No filesystem semantics at all. Only if you want FS semantics (or want to impose some order on the region for unplugging and replugging), do you put an FS on the memory region using loopback techniques. Again, this depends on use case. The SYSV shm API has a global flat keyspace. Perhaps your envisaged use requires a hierarchical key space and therefore a FS interface looks more natural with the leaves being divided memory regions? James ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-17 9:54 ` James Bottomley @ 2012-05-17 18:59 ` Matthew Wilcox 2012-05-18 9:03 ` James Bottomley 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2012-05-17 18:59 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, linux-kernel On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote: > On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote: > > I'm not talking about a specific piece of technology, I'm assuming that > > one of the competing storage technologies will eventually make it to > > widespread production usage. Let's assume what we have is DRAM with a > > giant battery on it. > > > > So, while we can use it just as DRAM, we're not taking advantage of the > > persistent aspect of it if we don't have an API that lets us find the > > data we wrote before the last reboot. And that sounds like a filesystem > > to me. > > Well, it sounds like a unix file to me rather than a filesystem (it's a > flat region with a beginning and end and no structure in between). That's true, but I think we want to put a structure on top of it. Presumably there will be multiple independent users, and each will want only a fraction of it. > However, I'm not precluding doing this, I'm merely asking that if it > looks and smells like DRAM with the only additional property being > persistency, shouldn't we begin with the memory APIs and see if we can > add persistency to them? I don't think so. It feels harder to add useful persistent properties to the memory APIs than it does to add memory-like properties to our file APIs, at least partially because for userspace we already have memory properties for our file APIs (ie mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap). > Imposing a VFS API looks slightly wrong to me > because it's effectively a flat region, not a hierarchical tree > structure, like a FS. If all the use cases are hierarchical trees, that > might be appropriate, but there hasn't really been any discussion of use > cases. Discussion of use cases is exactly what I want! I think that a non-hierarchical attempt at naming chunks of memory quickly expands into cases where we learn we really do want a hierarchy after all. > > > Or is there some impediment (like durability, or degradation on rewrite) > > > which makes this unsuitable as a complete DRAM replacement? > > > > The idea behind using a different filesystem for different NVM types is > > that we can hide those kinds of impediments in the filesystem. By the > > way, did you know DRAM degrades on every write? I think it's on the > > order of 10^20 writes (and CPU caches hide many writes to heavily-used > > cache lines), so it's a long way away from MLC or even SLC rates, but > > it does exist. > > So are you saying does or doesn't have an impediment to being used like > DRAM? >From the consumers point of view, it doesn't. If the underlying physical technology does (some of the ones we've looked at have worse problems than others), then it's up to the driver to disguise that. > > > Alternatively, if it's not really DRAM, I think the UNIX file > > > abstraction makes sense (it's a piece of memory presented as something > > > like a filehandle with open, close, seek, read, write and mmap), but > > > it's less clear that it should be an actual file system. 
The reason is > > > that to present a VFS interface, you have to already have fixed the > > > format of the actual filesystem on the memory because we can't nest > > > filesystems (well, not without doing artificial loopbacks). Again, this > > > might make sense if there's some architectural reason why the flash > > > region has to have a specific layout, but your post doesn't shed any > > > light on this. > > > > We can certainly present a block interface to allow using unmodified > > standard filesystems on top of chunks of this NVM. That's probably not > > the optimum way for a filesystem to use it though; there's really no > > point in constructing a bio to carry data down to a layer that's simply > > going to do a memcpy(). > > I think we might be talking at cross purposes. If you use the memory > APIs, this looks something like an anonymous region of memory with a get > and put API; something like SYSV shm if you like except that it's > persistent. No filesystem semantics at all. Only if you want FS > semantics (or want to impose some order on the region for unplugging and > replugging), do you put an FS on the memory region using loopback > techniques. > > Again, this depends on use case. The SYSV shm API has a global flat > keyspace. Perhaps your envisaged use requires a hierarchical key space > and therefore a FS interface looks more natural with the leaves being > divided memory regions? I've really never heard anybody hold up the SYSV shm API as something to be desired before. Indeed, POSIX shared memory is much closer to the filesystem API; the only difference being use of shm_open() and shm_unlink() instead of open() and unlink() [see shm_overview(7)]. And I don't really see the point in creating specialised nvm_open() and nvm_unlink() functions ... ^ permalink raw reply [flat|nested] 27+ messages in thread
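For comparison, the POSIX shared memory interface referred to above already has exactly the shape of the file API, which is the argument against inventing separate nvm_open()/nvm_unlink() calls. A minimal userspace example (the region name "/my_region" is arbitrary; link with -lrt on older glibc):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define REGION_LEN (1UL << 20)

int shm_example(void)
{
	/* "/my_region" is an arbitrary name in the shm namespace */
	int fd = shm_open("/my_region", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, REGION_LEN) < 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* ... use p like ordinary memory ... */

	munmap(p, REGION_LEN);
	close(fd);
	shm_unlink("/my_region");	/* removes the name, exactly like unlink() */
	return 0;
}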
* Re: NVM Mapping API 2012-05-17 18:59 ` Matthew Wilcox @ 2012-05-18 9:03 ` James Bottomley 2012-05-18 10:13 ` Boaz Harrosh 2012-05-18 14:49 ` Matthew Wilcox 0 siblings, 2 replies; 27+ messages in thread From: James Bottomley @ 2012-05-18 9:03 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote: > On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote: > > On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote: > > > I'm not talking about a specific piece of technology, I'm assuming that > > > one of the competing storage technologies will eventually make it to > > > widespread production usage. Let's assume what we have is DRAM with a > > > giant battery on it. > > > > > > So, while we can use it just as DRAM, we're not taking advantage of the > > > persistent aspect of it if we don't have an API that lets us find the > > > data we wrote before the last reboot. And that sounds like a filesystem > > > to me. > > > > Well, it sounds like a unix file to me rather than a filesystem (it's a > > flat region with a beginning and end and no structure in between). > > That's true, but I think we want to put a structure on top of it. > Presumably there will be multiple independent users, and each will want > only a fraction of it. > > > However, I'm not precluding doing this, I'm merely asking that if it > > looks and smells like DRAM with the only additional property being > > persistency, shouldn't we begin with the memory APIs and see if we can > > add persistency to them? > > I don't think so. It feels harder to add useful persistent > properties to the memory APIs than it does to add memory-like > properties to our file APIs, at least partially because for > userspace we already have memory properties for our file APIs (ie > mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap). This is what I don't quite get. At the OS level, it's all memory; we just have to flag one region as persistent. This is easy, I'd do it in the physical memory map. once this is done, we need either to tell the allocators only use volatile, only use persistent, or don't care (I presume the latter would only be if you needed the extra ram). The missing thing is persistent key management of the memory space (so if a user or kernel wants 10Mb of persistent space, they get the same 10Mb back again across boots). The reason a memory API looks better to me is because a memory API can be used within the kernel. For instance, I want a persistent /var/tmp on tmpfs, I just tell tmpfs to allocate it in persistent memory and it survives reboots. Likewise, if I want an area to dump panics, I just use it ... in fact, I'd probably always place the dmesg buffer in persistent memory. If you start off with a vfs API, it becomes far harder to use it easily from within the kernel. The question, really is all about space management: how many persistent spaces would there be. I think, given the use cases above it would be a small number (it's basically one for every kernel use and one for ever user use ... a filesystem mount counting as one use), so a flat key to space management mapping (probably using u32 keys) makes sense, and that's similar to our current shared memory API. > > Imposing a VFS API looks slightly wrong to me > > because it's effectively a flat region, not a hierarchical tree > > structure, like a FS. 
If all the use cases are hierarchical trees, that > > might be appropriate, but there hasn't really been any discussion of use > > cases. > > Discussion of use cases is exactly what I want! I think that a > non-hierarchical attempt at naming chunks of memory quickly expands > into cases where we learn we really do want a hierarchy after all. OK, so enumerate the uses. I can be persuaded the namespace has to be hierarchical if there are orders of magnitude more users than I think there will be. > > > > Or is there some impediment (like durability, or degradation on rewrite) > > > > which makes this unsuitable as a complete DRAM replacement? > > > > > > The idea behind using a different filesystem for different NVM types is > > > that we can hide those kinds of impediments in the filesystem. By the > > > way, did you know DRAM degrades on every write? I think it's on the > > > order of 10^20 writes (and CPU caches hide many writes to heavily-used > > > cache lines), so it's a long way away from MLC or even SLC rates, but > > > it does exist. > > > > So are you saying does or doesn't have an impediment to being used like > > DRAM? > > >From the consumers point of view, it doesn't. If the underlying physical > technology does (some of the ones we've looked at have worse problems > than others), then it's up to the driver to disguise that. OK, so in a pinch it can be used as normal DRAM, that's great. > > > > Alternatively, if it's not really DRAM, I think the UNIX file > > > > abstraction makes sense (it's a piece of memory presented as something > > > > like a filehandle with open, close, seek, read, write and mmap), but > > > > it's less clear that it should be an actual file system. The reason is > > > > that to present a VFS interface, you have to already have fixed the > > > > format of the actual filesystem on the memory because we can't nest > > > > filesystems (well, not without doing artificial loopbacks). Again, this > > > > might make sense if there's some architectural reason why the flash > > > > region has to have a specific layout, but your post doesn't shed any > > > > light on this. > > > > > > We can certainly present a block interface to allow using unmodified > > > standard filesystems on top of chunks of this NVM. That's probably not > > > the optimum way for a filesystem to use it though; there's really no > > > point in constructing a bio to carry data down to a layer that's simply > > > going to do a memcpy(). > > > > I think we might be talking at cross purposes. If you use the memory > > APIs, this looks something like an anonymous region of memory with a get > > and put API; something like SYSV shm if you like except that it's > > persistent. No filesystem semantics at all. Only if you want FS > > semantics (or want to impose some order on the region for unplugging and > > replugging), do you put an FS on the memory region using loopback > > techniques. > > > > Again, this depends on use case. The SYSV shm API has a global flat > > keyspace. Perhaps your envisaged use requires a hierarchical key space > > and therefore a FS interface looks more natural with the leaves being > > divided memory regions? > > I've really never heard anybody hold up the SYSV shm API as something > to be desired before. Indeed, POSIX shared memory is much closer to > the filesystem API; I'm not really ... I was just thinking this needs key -> region mapping and SYSV shm does that. The POSIX anonymous memory API needs you to map /dev/zero and then pass file descriptors around for sharing. 
It's not clear how you manage a persistent key space with that. > the only difference being use of shm_open() and > shm_unlink() instead of open() and unlink() [see shm_overview(7)]. > And I don't really see the point in creating specialised nvm_open() > and nvm_unlink() functions ... The internal kernel API addition is simply a key -> region mapping. Once that's done, you need an allocation API for userspace and you're done. I bet most userspace uses will be either give me xGB and put a tmpfs on it or give me xGB and put a something filesystem on it, but if the user wants an xGB mmap'd region, you can give them that as well. For a vfs interface, you have to do all of this as well, but in a much more complex way because the file name becomes the key and the metadata becomes the mapping. James ^ permalink raw reply [flat|nested] 27+ messages in thread
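A sketch of the flat key -> region interface argued for above. Everything below is hypothetical (no nvm_region_get()/nvm_region_put() exists in the kernel), and the u32 key follows the suggestion in the message; it only illustrates how an in-kernel user would get the same bytes back across boots:

	#include <linux/err.h>
	#include <linux/init.h>
	#include <linux/types.h>

	struct nvm_region {
		u32	key;	/* caller-chosen, persistent across boots */
		size_t	len;
		void	*kaddr;	/* kernel mapping of the persistent bytes */
	};

	/* Return the existing region for 'key', or allocate 'len' bytes for it. */
	struct nvm_region *nvm_region_get(u32 key, size_t len);
	void nvm_region_put(struct nvm_region *r);

	static void *panic_area;

	static int __init panic_area_init(void)
	{
		struct nvm_region *r = nvm_region_get(0x50414e49 /* "PANI" */, 64 * 1024);

		if (IS_ERR(r))
			return PTR_ERR(r);
		panic_area = r->kaddr;	/* same bytes back after every reboot */
		return 0;
	}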
* Re: NVM Mapping API 2012-05-18 9:03 ` James Bottomley @ 2012-05-18 10:13 ` Boaz Harrosh 2012-05-18 14:49 ` Matthew Wilcox 1 sibling, 0 replies; 27+ messages in thread From: Boaz Harrosh @ 2012-05-18 10:13 UTC (permalink / raw) To: James Bottomley; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel On 05/18/2012 12:03 PM, James Bottomley wrote: > On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote: >> On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote: >>> On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote: >>>> I'm not talking about a specific piece of technology, I'm assuming that >>>> one of the competing storage technologies will eventually make it to >>>> widespread production usage. Let's assume what we have is DRAM with a >>>> giant battery on it. >>>> >>>> So, while we can use it just as DRAM, we're not taking advantage of the >>>> persistent aspect of it if we don't have an API that lets us find the >>>> data we wrote before the last reboot. And that sounds like a filesystem >>>> to me. >>> >>> Well, it sounds like a unix file to me rather than a filesystem (it's a >>> flat region with a beginning and end and no structure in between). >> >> That's true, but I think we want to put a structure on top of it. >> Presumably there will be multiple independent users, and each will want >> only a fraction of it. >> >>> However, I'm not precluding doing this, I'm merely asking that if it >>> looks and smells like DRAM with the only additional property being >>> persistency, shouldn't we begin with the memory APIs and see if we can >>> add persistency to them? >> >> I don't think so. It feels harder to add useful persistent >> properties to the memory APIs than it does to add memory-like >> properties to our file APIs, at least partially because for >> userspace we already have memory properties for our file APIs (ie >> mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap). > > This is what I don't quite get. At the OS level, it's all memory; we > just have to flag one region as persistent. This is easy, I'd do it in > the physical memory map. once this is done, we need either to tell the > allocators only use volatile, only use persistent, or don't care (I > presume the latter would only be if you needed the extra ram). > > The missing thing is persistent key management of the memory space (so > if a user or kernel wants 10Mb of persistent space, they get the same > 10Mb back again across boots). > > The reason a memory API looks better to me is because a memory API can > be used within the kernel. For instance, I want a persistent /var/tmp > on tmpfs, I just tell tmpfs to allocate it in persistent memory and it > survives reboots. Likewise, if I want an area to dump panics, I just > use it ... in fact, I'd probably always place the dmesg buffer in > persistent memory. > > If you start off with a vfs API, it becomes far harder to use it easily > from within the kernel. > > The question, really is all about space management: how many persistent > spaces would there be. I think, given the use cases above it would be a > small number (it's basically one for every kernel use and one for ever > user use ... a filesystem mount counting as one use), so a flat key to > space management mapping (probably using u32 keys) makes sense, and > that's similar to our current shared memory API. > >>> Imposing a VFS API looks slightly wrong to me >>> because it's effectively a flat region, not a hierarchical tree >>> structure, like a FS. 
If all the use cases are hierarchical trees, that >>> might be appropriate, but there hasn't really been any discussion of use >>> cases. >> >> Discussion of use cases is exactly what I want! I think that a >> non-hierarchical attempt at naming chunks of memory quickly expands >> into cases where we learn we really do want a hierarchy after all. > > OK, so enumerate the uses. I can be persuaded the namespace has to be > hierarchical if there are orders of magnitude more users than I think > there will be. > >>>>> Or is there some impediment (like durability, or degradation on rewrite) >>>>> which makes this unsuitable as a complete DRAM replacement? >>>> >>>> The idea behind using a different filesystem for different NVM types is >>>> that we can hide those kinds of impediments in the filesystem. By the >>>> way, did you know DRAM degrades on every write? I think it's on the >>>> order of 10^20 writes (and CPU caches hide many writes to heavily-used >>>> cache lines), so it's a long way away from MLC or even SLC rates, but >>>> it does exist. >>> >>> So are you saying does or doesn't have an impediment to being used like >>> DRAM? >> >> >From the consumers point of view, it doesn't. If the underlying physical >> technology does (some of the ones we've looked at have worse problems >> than others), then it's up to the driver to disguise that. > > OK, so in a pinch it can be used as normal DRAM, that's great. > >>>>> Alternatively, if it's not really DRAM, I think the UNIX file >>>>> abstraction makes sense (it's a piece of memory presented as something >>>>> like a filehandle with open, close, seek, read, write and mmap), but >>>>> it's less clear that it should be an actual file system. The reason is >>>>> that to present a VFS interface, you have to already have fixed the >>>>> format of the actual filesystem on the memory because we can't nest >>>>> filesystems (well, not without doing artificial loopbacks). Again, this >>>>> might make sense if there's some architectural reason why the flash >>>>> region has to have a specific layout, but your post doesn't shed any >>>>> light on this. >>>> >>>> We can certainly present a block interface to allow using unmodified >>>> standard filesystems on top of chunks of this NVM. That's probably not >>>> the optimum way for a filesystem to use it though; there's really no >>>> point in constructing a bio to carry data down to a layer that's simply >>>> going to do a memcpy(). >>> >>> I think we might be talking at cross purposes. If you use the memory >>> APIs, this looks something like an anonymous region of memory with a get >>> and put API; something like SYSV shm if you like except that it's >>> persistent. No filesystem semantics at all. Only if you want FS >>> semantics (or want to impose some order on the region for unplugging and >>> replugging), do you put an FS on the memory region using loopback >>> techniques. >>> >>> Again, this depends on use case. The SYSV shm API has a global flat >>> keyspace. Perhaps your envisaged use requires a hierarchical key space >>> and therefore a FS interface looks more natural with the leaves being >>> divided memory regions? >> >> I've really never heard anybody hold up the SYSV shm API as something >> to be desired before. Indeed, POSIX shared memory is much closer to >> the filesystem API; > > I'm not really ... I was just thinking this needs key -> region mapping > and SYSV shm does that. The POSIX anonymous memory API needs you to > map /dev/zero and then pass file descriptors around for sharing. 
It's > not clear how you manage a persistent key space with that. > >> the only difference being use of shm_open() and >> shm_unlink() instead of open() and unlink() [see shm_overview(7)]. >> And I don't really see the point in creating specialised nvm_open() >> and nvm_unlink() functions ... > > The internal kernel API addition is simply a key -> region mapping. > Once that's done, you need an allocation API for userspace and you're > done. I bet most userspace uses will be either give me xGB and put a > tmpfs on it or give me xGB and put a something filesystem on it, but if > the user wants an xGB mmap'd region, you can give them that as well. > > For a vfs interface, you have to do all of this as well, but in a much > more complex way because the file name becomes the key and the metadata > becomes the mapping. > Matthew is making very good points, and so does James. For one, the very strong point is "why not use NVM in an OOM situation, as a NUMA slower node?" I think the best approach is both, and layered. 0. An NVM driver. 1. Properly define, and marry, the notion of "persistent memory" into the memory model. Layers, speeds, and everything. Now you have one or more flat regions of NVM. So this is just one or more NVM memory zones, persistent being a property of a zone. 2. Define a new NvmFS, which is like the RamFS we have today, that uses page_cache semantics and is in bed with the page allocators. This layer gives you the key-to-buffer management as well as a transparent POSIX API to existing applications. Layers 1, 2 can be generic, if Layer 0 is well parametrized. There might be a layer 2.5, where, similar to a partition, you have a flat UUIDed sub-region for the likes of kernel subsystems. The NvmFS layer is mounted on an allocated UUIDed region, but so could be a SWAP space, a journal, or whatever hybrid idea anyone has. > James > Because, you see, I like and completely agree with what Matthew said, and I want it. But I also want all of what James said. void *nvm_kalloc(struct uuid *uuid, size_t size, gfp_t gfp); (A new uuid gets a fresh region created, but an existing one returns it. And we might want to open exclusive/shared and stuff.) Just my $0.017 Boaz ^ permalink raw reply [flat|nested] 27+ messages in thread
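A rough sketch of that nvm_kalloc() idea, keeping the create-or-reattach semantics described above. Neither nvm_kalloc() nor a kernel-wide struct uuid actually exists, so treat every name here as hypothetical:

	#include <linux/gfp.h>
	#include <linux/types.h>

	struct uuid { u8 b[16]; };	/* stand-in for whatever UUID type is used */

	/* First caller with a given uuid gets a fresh region; later callers
	 * (and later boots) get the same region back. */
	void *nvm_kalloc(struct uuid *uuid, size_t size, gfp_t gfp);
	void nvm_kfree(struct uuid *uuid);

	static struct uuid myfs_journal_id = {
		.b = { 0x6a, 0x6f, 0x75, 0x72, 0x6e, 0x61, 0x6c, 0x2d,
		       0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77 },
	};

	static void *myfs_get_journal(void)
	{
		return nvm_kalloc(&myfs_journal_id, 1 << 20, GFP_KERNEL);
	}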
* Re: NVM Mapping API 2012-05-18 9:03 ` James Bottomley 2012-05-18 10:13 ` Boaz Harrosh @ 2012-05-18 14:49 ` Matthew Wilcox 2012-05-18 15:08 ` Alan Cox 2012-05-18 15:31 ` James Bottomley 1 sibling, 2 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-18 14:49 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, linux-kernel On Fri, May 18, 2012 at 10:03:53AM +0100, James Bottomley wrote: > On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote: > > On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote: > > > On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote: > > > > I'm not talking about a specific piece of technology, I'm assuming that > > > > one of the competing storage technologies will eventually make it to > > > > widespread production usage. Let's assume what we have is DRAM with a > > > > giant battery on it. > > > > > > > > So, while we can use it just as DRAM, we're not taking advantage of the > > > > persistent aspect of it if we don't have an API that lets us find the > > > > data we wrote before the last reboot. And that sounds like a filesystem > > > > to me. > > > > > > Well, it sounds like a unix file to me rather than a filesystem (it's a > > > flat region with a beginning and end and no structure in between). > > > > That's true, but I think we want to put a structure on top of it. > > Presumably there will be multiple independent users, and each will want > > only a fraction of it. > > > > > However, I'm not precluding doing this, I'm merely asking that if it > > > looks and smells like DRAM with the only additional property being > > > persistency, shouldn't we begin with the memory APIs and see if we can > > > add persistency to them? > > > > I don't think so. It feels harder to add useful persistent > > properties to the memory APIs than it does to add memory-like > > properties to our file APIs, at least partially because for > > userspace we already have memory properties for our file APIs (ie > > mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap). > > This is what I don't quite get. At the OS level, it's all memory; we > just have to flag one region as persistent. This is easy, I'd do it in > the physical memory map. once this is done, we need either to tell the > allocators only use volatile, only use persistent, or don't care (I > presume the latter would only be if you needed the extra ram). > > The missing thing is persistent key management of the memory space (so > if a user or kernel wants 10Mb of persistent space, they get the same > 10Mb back again across boots). > > The reason a memory API looks better to me is because a memory API can > be used within the kernel. For instance, I want a persistent /var/tmp > on tmpfs, I just tell tmpfs to allocate it in persistent memory and it > survives reboots. Likewise, if I want an area to dump panics, I just > use it ... in fact, I'd probably always place the dmesg buffer in > persistent memory. > > If you start off with a vfs API, it becomes far harder to use it easily > from within the kernel. > > The question, really is all about space management: how many persistent > spaces would there be. I think, given the use cases above it would be a > small number (it's basically one for every kernel use and one for ever > user use ... a filesystem mount counting as one use), so a flat key to > space management mapping (probably using u32 keys) makes sense, and > that's similar to our current shared memory API. So who manages the key space? 
If we do it based on names, it's easy; all kernel uses are ".kernel/..." and we manage our own sub-hierarchy within the namespace. If there's only a u32, somebody has to lay down the rules about which numbers are used for what things. This isn't quite as ugly as the initial proposal somebody made to me "We just use the physical address as the key", and I told them all about how a.out libraries worked. Nevertheless, I'm not interested in being the Mitch DSouza of NVM. > > Discussion of use cases is exactly what I want! I think that a > > non-hierarchical attempt at naming chunks of memory quickly expands > > into cases where we learn we really do want a hierarchy after all. > > OK, so enumerate the uses. I can be persuaded the namespace has to be > hierarchical if there are orders of magnitude more users than I think > there will be. I don't know what the potential use cases might be. I just don't think the use cases are all that bounded. > > > Again, this depends on use case. The SYSV shm API has a global flat > > > keyspace. Perhaps your envisaged use requires a hierarchical key space > > > and therefore a FS interface looks more natural with the leaves being > > > divided memory regions? > > > > I've really never heard anybody hold up the SYSV shm API as something > > to be desired before. Indeed, POSIX shared memory is much closer to > > the filesystem API; > > I'm not really ... I was just thinking this needs key -> region mapping > and SYSV shm does that. The POSIX anonymous memory API needs you to > map /dev/zero and then pass file descriptors around for sharing. It's > not clear how you manage a persistent key space with that. I didn't say "POSIX anonymous memory". I said "POSIX shared memory". I even pointed you at the right manpage to read if you haven't heard of it before. The POSIX committee took a look at SYSV shm and said "This is too ugly". So they invented their own API. > > the only difference being use of shm_open() and > > shm_unlink() instead of open() and unlink() [see shm_overview(7)]. > > The internal kernel API addition is simply a key -> region mapping. > Once that's done, you need an allocation API for userspace and you're > done. I bet most userspace uses will be either give me xGB and put a > tmpfs on it or give me xGB and put a something filesystem on it, but if > the user wants an xGB mmap'd region, you can give them that as well. > > For a vfs interface, you have to do all of this as well, but in a much > more complex way because the file name becomes the key and the metadata > becomes the mapping. You're downplaying the complexity of your own solution while overstating the complexity of mine. Let's compare, using your suggestion of the dmesg buffer. Mine: struct file *filp = filp_open(".kernel/dmesg", O_RDWR, 0); if (!IS_ERR(filp)) log_buf = nvm_map(filp, 0, __LOG_BUF_LEN, PAGE_KERNEL); Yours: log_buf = nvm_attach(492, NULL, 0); /* Hope nobody else used 492! */ Hm. Doesn't look all that different, does it? I've modelled nvm_attach() after shmat(). Of course, this ignores the need to be able to sync, which may vary between different NVM technologies, and the (desired by some users) ability to change portions of the mapped NVM between read-only and read-write. 
If the extra parameters and extra lines of code hinder adoption, I have no problems with adding a helper for the simple use cases: void *nvm_attach(const char *name, int perms) { void *mem; struct file *filp = filp_open(name, perms, 0); if (IS_ERR(filp)) return NULL; mem = nvm_map(filp, 0, filp->f_dentry->d_inode->i_size, PAGE_KERNEL); fput(filp); return mem; } I do think that using numbers to refer to regions of NVM is a complete non-starter. This was one of the big mistakes of SYSV; one so big that even POSIX couldn't stomach it. ^ permalink raw reply [flat|nested] 27+ messages in thread
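With a helper like that, the persistent-dmesg example earlier in this message becomes a one-liner on the VFS side too. A sketch reusing the proposed nvm_attach()/nvm_map() names (not existing kernel functions); note the helper drops the struct file, so a caller that later needs nvm_sync() would keep the longer filp-based form:

	#include <linux/errno.h>
	#include <linux/fcntl.h>
	#include <linux/init.h>

	static int __init persistent_dmesg_init(void)
	{
		void *buf = nvm_attach(".kernel/dmesg", O_RDWR);

		if (!buf)
			return -ENODEV;	/* fall back to an ordinary DRAM log_buf */
		/* log_buf = buf; */
		return 0;
	}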
* Re: NVM Mapping API 2012-05-18 14:49 ` Matthew Wilcox @ 2012-05-18 15:08 ` Alan Cox 2012-05-18 15:31 ` James Bottomley 1 sibling, 0 replies; 27+ messages in thread From: Alan Cox @ 2012-05-18 15:08 UTC (permalink / raw) To: Matthew Wilcox; +Cc: James Bottomley, linux-fsdevel, linux-kernel > I do think that using numbers to refer to regions of NVM is a complete > non-starter. This was one of the big mistakes of SYSV; one so big that > even POSIX couldn't stomach it. That basically degenerates to using UUIDs. Even then it's not a useful solution because you need to be able to list the UUIDs in use and their sizes which turns into a file system. I would prefer we use names. Alan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-18 14:49 ` Matthew Wilcox 2012-05-18 15:08 ` Alan Cox @ 2012-05-18 15:31 ` James Bottomley 2012-05-18 17:19 ` Matthew Wilcox 1 sibling, 1 reply; 27+ messages in thread From: James Bottomley @ 2012-05-18 15:31 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On Fri, 2012-05-18 at 10:49 -0400, Matthew Wilcox wrote: > You're downplaying the complexity of your own solution while overstating > the complexity of mine. Let's compare, using your suggestion of the > dmesg buffer. I'll give you that one when you tell me how you use your vfs interface simply from within the kernel. Both are always about the same complexity in user space ... To be honest, I'm not hugely concerned whether the key management API is u32 or a string. What bothers me the most is that there will be in-kernel users for whom trying to mmap a file through the vfs will be hugely more complex than a simple give me a pointer to this persistent region. What all this tells me is that the key lookup API has to be exposed both to the kernel and userspace. VFS may make the best sense for user space, but the infrastructure needs to be non-VFS for the in kernel users. So what you want is a base region manager with allocation and key lookup, which you expose to the kernel and on which you can build a filesystem for userspace. Is everyone happy now? James ^ permalink raw reply [flat|nested] 27+ messages in thread
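One way to picture that layering, with entirely made-up types (nothing below is an existing kernel interface): a low-level region manager that in-kernel users call directly, plus a thin filesystem that implements its file operations in terms of the same manager for userspace:

	#include <linux/types.h>

	struct nvm_mgr;

	struct nvm_mgr_ops {
		/* key lookup and allocation, callable straight from kernel code */
		void *(*find)(struct nvm_mgr *mgr, const char *key, size_t *len);
		void *(*create)(struct nvm_mgr *mgr, const char *key, size_t len);
		void (*sync)(struct nvm_mgr *mgr, void *addr, size_t len);
	};

	/* In-kernel user: no VFS, no struct file, just a pointer. */
	static void *get_dump_area(struct nvm_mgr *mgr, const struct nvm_mgr_ops *ops)
	{
		size_t len;
		void *p = ops->find(mgr, "kernel/panic-dump", &len);

		return p ? p : ops->create(mgr, "kernel/panic-dump", 64 * 1024);
	}

	/* The userspace path would be a small nvmfs implementing ->lookup/->mmap
	 * by calling the same find()/create(), so one namespace serves both sides. */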
* Re: NVM Mapping API 2012-05-18 15:31 ` James Bottomley @ 2012-05-18 17:19 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-18 17:19 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, linux-kernel On Fri, May 18, 2012 at 04:31:08PM +0100, James Bottomley wrote: > On Fri, 2012-05-18 at 10:49 -0400, Matthew Wilcox wrote: > > You're downplaying the complexity of your own solution while overstating > > the complexity of mine. Let's compare, using your suggestion of the > > dmesg buffer. > > I'll give you that one when you tell me how you use your vfs interface > simply from within the kernel. Both are always about the same > complexity in user space ... > > To be honest, I'm not hugely concerned whether the key management API is > u32 or a string. What bothers me the most is that there will be > in-kernel users for whom trying to mmap a file through the vfs will be > hugely more complex than a simple give me a pointer to this persistent > region. Huh? You snipped the example where I showed exactly that. The user calls nvm_map() and gets back a pointer to a kernel mapping for the persistent region. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox ` (3 preceding siblings ...) 2012-05-16 9:52 ` James Bottomley @ 2012-05-16 13:04 ` Boaz Harrosh 2012-05-16 18:33 ` Matthew Wilcox 2012-05-18 9:33 ` Arnd Bergmann 5 siblings, 1 reply; 27+ messages in thread From: Boaz Harrosh @ 2012-05-16 13:04 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel On 05/15/2012 04:34 PM, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. This is a first draft of an API that supports the > operations we see as necessary. Patches can follow easily enough once > we've settled on an API. > > We think the appropriate way to present directly addressable NVM to > in-kernel users is through a filesystem. Different technologies may want > to use different filesystems, or maybe some forms of directly addressable > NVM will want to use the same filesystem as each other. > > For mapping regions of NVM into the kernel address space, we think we need > map, unmap, protect and sync operations; see kerneldoc for them below. > We also think we need read and write operations (to copy to/from DRAM). > The kernel_read() function already exists, and I don't think it would > be unreasonable to add its kernel_write() counterpart. > > We aren't yet proposing a mechanism for carving up the NVM into regions. > vfs_truncate() seems like a reasonable API for resizing an NVM region. > filp_open() also seems reasonable for turning a name into a file pointer. > > What we'd really like is for people to think about how they might use > fast NVM inside the kernel. There's likely to be a lot of it (at least in > servers); all the technologies are promising cheaper per-bit prices than > DRAM, so it's likely to be sold in larger capacities than DRAM is today. > > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or > something else), but I bet there are more radical things we can do > with it. > What if we stored the inode cache in it? Would booting with > a hot inode cache improve boot times? How about storing the tree of > 'struct devices' in it so we don't have to rescan the busses at startup? > No for fast boots, just use it as an hibernation space. The rest is already implemented. If you also want protection from crashes and HW failures. Or power fail with no UPS, you can have a system checkpoint every once in a while that saves an hibernation and continues. If you always want a very fast boot to a clean system. checkpoint at entry state and always resume from that hibernation. Other uses: * Journals, Journals, Journals. of other FSs. So one file system has it's jurnal as a file in proposed above NVMFS. Create an easy API for Kernel subsystems for allocating them. * Execute in place. Perhaps the elf loader can sense that the executable is on an NVMFS and execute it in place instead of copy to DRAM. Or that happens automatically with your below nvm_map() > > /** > * @nvm_filp: The NVM file pointer > * @start: The starting offset within the NVM region to be mapped > * @length: The number of bytes to map > * @protection: Protection bits > * @return Pointer to virtual mapping or PTR_ERR on failure > * > * This call maps a file to a virtual memory address. The start and length > * should be page aligned. > * > * Errors: > * EINVAL if start and length are not page aligned. 
> * ENODEV if the file pointer does not point to a mappable file > */ > void *nvm_map(struct file *nvm_filp, off_t start, size_t length, > pgprot_t protection); > The returned void * here is that a cooked up TLB that points to real memory bus cycles HW. So is there a real physical memory region this sits in? What is the difference from say a PCIE DRAM card with battery. Could I just use some kind of RAM-FS with this? > /** > * @addr: The address returned by nvm_map() > * > * Unmaps a region previously mapped by nvm_map. > */ > void nvm_unmap(const void *addr); > > /** > * @addr: The first byte to affect > * @length: The number of bytes to affect > * @protection: The new protection to use > * > * Updates the protection bits for the corresponding pages. > * The start and length must be page aligned, but need not be the entirety > * of the mapping. > */ > void nvm_protect(const void *addr, size_t length, pgprot_t protection); > > /** > * @nvm_filp: The kernel file pointer > * @addr: The first byte to sync > * @length: The number of bytes to sync > * @returns Zero on success, -errno on failure > * > * Flushes changes made to the in-core copy of a mapped file back to NVM. > */ > int nvm_sync(struct file *nvm_filp, void *addr, size_t length); This I do not understand. Is that an on card memory cache flush, or is it a system memory DMAed to NVM? Thanks Boaz ^ permalink raw reply [flat|nested] 27+ messages in thread
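The journal-as-a-file idea earlier in this message maps quite directly onto the proposed interface. A sketch, assuming the nvm_map() call quoted above and an invented "/nvm/myfs.journal" path on such an NVMFS:

	#include <linux/err.h>
	#include <linux/fs.h>
	#include <asm/pgtable.h>

	static void *myfs_map_journal(struct file **filpp, size_t len)
	{
		struct file *filp = filp_open("/nvm/myfs.journal",
					      O_RDWR | O_CREAT, 0600);
		void *jbuf;

		if (IS_ERR(filp))
			return NULL;
		/* size the region here, e.g. with vfs_truncate() as the original post suggests */
		jbuf = nvm_map(filp, 0, len, PAGE_KERNEL);
		if (IS_ERR(jbuf)) {
			filp_close(filp, NULL);
			return NULL;
		}
		*filpp = filp;	/* kept around for later nvm_sync() calls */
		return jbuf;
	}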
* Re: NVM Mapping API 2012-05-16 13:04 ` Boaz Harrosh @ 2012-05-16 18:33 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2012-05-16 18:33 UTC (permalink / raw) To: Boaz Harrosh; +Cc: linux-fsdevel, linux-kernel On Wed, May 16, 2012 at 04:04:05PM +0300, Boaz Harrosh wrote: > No for fast boots, just use it as an hibernation space. The rest is > already implemented. If you also want protection from crashes and > HW failures. Or power fail with no UPS, you can have a system checkpoint > every once in a while that saves an hibernation and continues. If you > always want a very fast boot to a clean system. checkpoint at entry state > and always resume from that hibernation. Yes, checkpointing to it is definitely a good idea. I was thinking more along the lines of suspend rather than hibernate. We trash a lot of clean pages as part of the hibernation process, when it'd be better to copy them to NVM and restore them. > Other uses: > > * Journals, Journals, Journals. of other FSs. So one file system has > it's jurnal as a file in proposed above NVMFS. > Create an easy API for Kernel subsystems for allocating them. That's a great idea. I could see us having a specific journal API. > * Execute in place. > Perhaps the elf loader can sense that the executable is on an NVMFS > and execute it in place instead of copy to DRAM. Or that happens > automatically with your below nvm_map() If there's an executable on the NVMFS, it's going to get mapped into userspace, so as long as the NVMFS implements the ->mmap method, that will get called. It'll be up to the individual NVMFS whether it uses the page cache to buffer a read-only mmap or whether it points directly to the NVM. > > void *nvm_map(struct file *nvm_filp, off_t start, size_t length, > > pgprot_t protection); > > The returned void * here is that a cooked up TLB that points > to real memory bus cycles HW. So is there a real physical > memory region this sits in? What is the difference from > say a PCIE DRAM card with battery. The concept we're currently playing with would have the NVM appear as part of the CPU address space, yes. > Could I just use some kind of RAM-FS with this? For prototyping, sure. > > /** > > * @nvm_filp: The kernel file pointer > > * @addr: The first byte to sync > > * @length: The number of bytes to sync > > * @returns Zero on success, -errno on failure > > * > > * Flushes changes made to the in-core copy of a mapped file back to NVM. > > */ > > int nvm_sync(struct file *nvm_filp, void *addr, size_t length); > > This I do not understand. Is that an on card memory cache flush, or is it > a system memory DMAed to NVM? Up to the implementation; if it works out best to have a CPU with write-through caches pointing directly to the address space of the NVM, then it can be a no-op. If the CPU is using a writeback cache for the NVM, then it'll flush the CPU cache. If the nvmfs has staged the writes in DRAM, this will copy from DRAM to NVM. If the NVM card needs some magic to flush an internal buffer, that will happen here. Just as with mmaping a file in userspace today, there's no guarantee that a store gets to stable storage until after a sync. ^ permalink raw reply [flat|nested] 27+ messages in thread
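As one concrete possibility for the write-back-cached, directly mapped case described here, an x86-only nvm_sync() could reduce to a cache-line flush over the range. This is a sketch under those assumptions (no DRAM staging, no device-side flush needed); other implementations could be a no-op or a copy out of a DRAM staging buffer, exactly as described above:

	#include <linux/fs.h>
	#include <asm/cacheflush.h>	/* clflush_cache_range() */

	int nvm_sync(struct file *nvm_filp, void *addr, size_t length)
	{
		clflush_cache_range(addr, length);
		mb();	/* belt and braces; harmless if the helper already fences */
		return 0;
	}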
* Re: NVM Mapping API 2012-05-15 13:34 NVM Mapping API Matthew Wilcox ` (4 preceding siblings ...) 2012-05-16 13:04 ` Boaz Harrosh @ 2012-05-18 9:33 ` Arnd Bergmann 5 siblings, 0 replies; 27+ messages in thread From: Arnd Bergmann @ 2012-05-18 9:33 UTC (permalink / raw) To: Matthew Wilcox, Carsten Otte; +Cc: linux-fsdevel, linux-kernel On Tuesday 15 May 2012, Matthew Wilcox wrote: > > There are a number of interesting non-volatile memory (NVM) technologies > being developed. Some of them promise DRAM-comparable latencies and > bandwidths. At Intel, we've been thinking about various ways to present > those to software. This is a first draft of an API that supports the > operations we see as necessary. Patches can follow easily enough once > we've settled on an API. > > We think the appropriate way to present directly addressable NVM to > in-kernel users is through a filesystem. Different technologies may want > to use different filesystems, or maybe some forms of directly addressable > NVM will want to use the same filesystem as each other. ext2 actually supports some of this already with mm/filemap_xip.c, Carsten Otte introduced it initially to support drivers/s390/block/dcssblk.c with execute-in-place, so you don't have to copy around the data when your block device is mapped into the physical address space already. I guess this could be implemented in modern file systems (ext4, btrfs) as well, or you could have a new simple fs on top of the same base API. (ext2+xip was originally a new file system but then merged into ext2). Also note that you could easily implement non-volatile memory in other virtual machines doing the same thing that dcssblk does: E.g. in KVM you would only need to map a host file into the guess address space and let the guest take advantage of a similar feature set that you get from the new memory technologies in real hardware. Arnd ^ permalink raw reply [flat|nested] 27+ messages in thread
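For context, the XIP path being described rests on two hooks that existed at the time: the block driver's ->direct_access() and the filesystem's ->get_xip_mem() address_space operation. The sketch below is illustrative only; it is not ext2's actual code, the signatures are quoted from memory and should be treated as approximate, and the pgoff-to-sector translation ignores the real block mapping:

	#include <linux/blkdev.h>
	#include <linux/fs.h>
	#include <linux/genhd.h>

	static int example_get_xip_mem(struct address_space *mapping, pgoff_t pgoff,
				       int create, void **kmem, unsigned long *pfn)
	{
		struct block_device *bdev = mapping->host->i_sb->s_bdev;
		sector_t sector = (sector_t)pgoff << (PAGE_SHIFT - 9);

		/* A memory-backed driver such as dcssblk hands back a kernel
		 * address and a pfn directly: no bio, no copy through the
		 * page cache. */
		return bdev->bd_disk->fops->direct_access(bdev, sector, kmem, pfn);
	}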
end of thread, other threads: [~2012-05-31 17:53 UTC | newest]

Thread overview: 27+ messages
2012-05-15 13:34 NVM Mapping API Matthew Wilcox
2012-05-15 17:46 ` Greg KH
2012-05-16 15:57 ` Matthew Wilcox
2012-05-18 12:07 ` Marco Stornelli
2012-05-15 23:02 ` Andy Lutomirski
2012-05-16 16:02 ` Matthew Wilcox
2012-05-31 17:53 ` Andy Lutomirski
2012-05-16 6:24 ` Vyacheslav Dubeyko
2012-05-16 16:10 ` Matthew Wilcox
2012-05-17 9:06 ` Vyacheslav Dubeyko
2012-05-16 21:58 ` Benjamin LaHaise
2012-05-17 19:06 ` Matthew Wilcox
2012-05-16 9:52 ` James Bottomley
2012-05-16 17:35 ` Matthew Wilcox
2012-05-16 19:58 ` Christian Stroetmann
2012-05-19 22:19 ` Christian Stroetmann
2012-05-17 9:54 ` James Bottomley
2012-05-17 18:59 ` Matthew Wilcox
2012-05-18 9:03 ` James Bottomley
2012-05-18 10:13 ` Boaz Harrosh
2012-05-18 14:49 ` Matthew Wilcox
2012-05-18 15:08 ` Alan Cox
2012-05-18 15:31 ` James Bottomley
2012-05-18 17:19 ` Matthew Wilcox
2012-05-16 13:04 ` Boaz Harrosh
2012-05-16 18:33 ` Matthew Wilcox
2012-05-18 9:33 ` Arnd Bergmann