From: Christian Stroetmann <stroetmann@ontolinux.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Matthew Wilcox <matthew.r.wilcox@intel.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, willy@linux.intel.com
Subject: Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
Date: Sun, 31 Aug 2014 01:11:03 +0200 [thread overview]
Message-ID: <54025A07.9070109@ontolinux.com> (raw)
In-Reply-To: <20140828071706.GD26465@dastard>
On the 28th of August 2014 at 09:17, Dave Chinner wrote:
> On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote:
>> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter<cl@linux.com> wrote:
>>
>>>> Some explanation of why one would use ext4 instead of, say,
>>>> suitably-modified ramfs/tmpfs/rd/etc?
>>> The NVDIMM contents survive reboot and therefore ramfs and friends won't
>>> work with it.
>> See "suitably modified". Presumably this type of memory would need to
>> come from a particular page allocator zone. ramfs would be unwieldy
>> due to its use of dentry/inode caches, but rd/etc should be feasible.
> <sigh>
Hello Dave and all,
Thank you very much for your patience and for the summary that follows.
> That's where we started about two years ago with that horrible
> pramfs trainwreck.
>
> To start with: brd is a block device, not a filesystem. We still
> need the filesystem on top of a persistent ram disk to make it
> useful to applications. We can do this with ext4/XFS right now, and
> that is the fundamental basis on which DAX is built.
>
> For sake of the discussion, however, let's walk through what is
> required to make an "existing" ramfs persistent. Persistence means we
> can't just wipe it and start again if it gets corrupted, and
> rebooting is not a fix for problems. Hence we need to be able to
> identify it, check it, repair it, ensure metadata operations are
> persistent across machine crashes, etc, so there is all sorts of
> management tools required by a persistent ramfs.
>
> But most important of all: the persistent storage format needs to be
> forwards and backwards compatible across kernel versions. Hence we
> can't encode any structure the kernel uses internally into the
> persistent storage because they aren't stable structures. That
> means we need to marshall objects between the persistence domain and
> the volatile domain in an orderly fashion.
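To make the marshalling point concrete to myself, here is a toy sketch (illustrative only; the record name, fields, and layout are made up, not ext4's or XFS's): the on-disk format has explicit widths and endianness, independent of whatever the kernel's in-memory structures happen to look like in a given version.

```python
import struct

# Hypothetical fixed on-disk record: explicit little-endian layout,
# stable across kernel versions, unlike the in-memory structures.
DISK_INODE = struct.Struct("<QQII")  # ino, size, uid, gid

def marshal(ino, size, uid, gid):
    """In-memory values -> stable on-disk bytes (persistence domain)."""
    return DISK_INODE.pack(ino, size, uid, gid)

def unmarshal(buf):
    """On-disk bytes -> in-memory values (volatile domain)."""
    return DISK_INODE.unpack(buf)
```

The point being that every crossing between the volatile and persistence domains goes through such an explicit conversion.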
Two little questions:
1. If, purely hypothetically, we dropped the requirement for compatibility
across kernel versions, would it make sense at all to encode a structure
that the kernel uses internally into persistent storage, and what
advantages could be gained that way?
2. Have the structures in question actually changed that many times?
> We can avoid using the dentry/inode *caches* by freeing those
> volatile objects the moment reference counts drop to zero rather than
> putting them on LRUs. However, we can't store them in persistent
> storage and we can't avoid using them to interface with the VFS, so
> it makes little sense to burn CPU continually marshalling such
> structures in and out of volatile memory if we have free RAM to do
> so. So even with a "persistent ramfs" caching the working set of
> volatile VFS objects makes sense from a performance point of view.
I am sorry to say so, but I am confused again and do not understand this
argument, because we are already talking about NVDIMMs here. If those
volatile VFS objects already live in the NVDIMMs, so to speak, then we
have them in persistent storage and in DRAM at the same time.
>
> Then you've got crash recovery management: NVDIMMs are not
> synchronous: they can still lose data while it is being written on
> power loss. And we can't update persistent memory piecemeal as the
> VFS code modifies metadata - there needs to be synchronisation
> points, otherwise we will always have inconsistent metadata state in
> persistent memory.
>
> Persistent memory also can't do atomic writes across multiple,
> disjoint CPU cachelines or NVDIMMs, and this is what is needed for
> synchronisation points for multi-object metadata modification
> operations to be consistent after a crash. There is some work in
> the nvme working groups to define this, but so far there hasn't been
> any useful outcome, and then we will have to wait for CPUs to
> implement those interfaces.
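If I understand the synchronisation-point idea correctly, it can be sketched like this (illustrative only; the `PMem` class and the commit protocol are hypothetical stand-ins for real flush and fence instructions): the multi-cacheline payload is flushed first, and only then is a single word, which *can* be updated atomically, flipped to mark the update as committed.

```python
import struct

class PMem:
    """Toy persistent memory: writes land in a volatile buffer until
    explicitly flushed; crash() discards everything not yet flushed."""
    def __init__(self, size):
        self.volatile = bytearray(size)
        self.durable = bytearray(size)
    def write(self, off, data):
        self.volatile[off:off + len(data)] = data
    def flush(self, off, n):           # models clflush + fence on a range
        self.durable[off:off + n] = self.volatile[off:off + n]
    def crash(self):                   # power loss: keep only durable state
        self.volatile = bytearray(self.durable)

COMMIT = struct.Struct("<Q")           # the one atomically-updated word

def commit_update(pm, payload):
    pm.write(8, payload)               # 1. write the multi-cacheline payload
    pm.flush(8, len(payload))          # 2. synchronisation point: flush it
    pm.write(0, COMMIT.pack(1))        # 3. flip the single commit word
    pm.flush(0, 8)                     # 4. flush the word itself

def committed(pm):
    return COMMIT.unpack_from(pm.durable, 0)[0] == 1
```

A crash between steps 2 and 3 leaves the commit word clear, so recovery simply ignores the half-written payload instead of seeing inconsistent metadata.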
>
> Hence the metadata that indexes the persistent RAM needs to use COW
> techniques, use a log structure or use WAL (journalling). Hence
> that "persistent ramfs" is now looking much more like a database or
> traditional filesystem.
>
> Further, it's going to need to scale to very large amounts of
> storage. We're talking about machines with *tens of TB* of NVDIMM
> capacity in the immediate future and so free space management and
> concurrency of allocation and freeing of used space is going to be
> fundamental to the performance of the persistent NVRAM filesystem.
> So, you end up with block/allocation groups to subdivide the space.
> Looking a lot like ext4 or XFS at this point.
>
> And now you have to scale to indexing tens of millions of
> everything. At least tens of millions - hundreds of millions to
> billions is more likely, because storing tens of terabytes of small
> files is going to require indexing billions of files. And because
> there is no performance penalty for doing this, people will use the
> filesystem as a great big database. So now you have to have
> scalable, POSIX-compatible directory structures, scalable freespace
> indexation, dynamic, scalable inode allocation, freeing, etc. Oh,
> and it also needs to be highly concurrent to handle machines with
> hundreds of CPU cores.
>
> Funnily enough, we already have a couple of persistent storage
> implementations that solve these problems to varying degrees. ext4
> is one of them, if you ignore the scalability and concurrency
> requirements. XFS is the other. And both will run unmodified on
> a persistent ram block device, which we *already have*.
Yeah! :D
>
> And so back to DAX. What users actually want from their high speed
> persistent RAM storage is direct, CPU-addressable access to that
> persistent storage. They don't want to have to care about how to
> find an object in the persistent storage - that's what filesystems
> are for - they just want to be able to read and write to it
> directly. That's what DAX does - it provides existing filesystems
> a method for exposing direct access to the persistent RAM to
> applications in a manner that application developers are already
> familiar with. It's a win-win situation all round.
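That programming model really is just the one application developers already know; the sketch below runs against any ordinary file, and on a DAX filesystem the very same stores would go straight to the persistent RAM with no page cache in between (illustrative only).

```python
import mmap, os, tempfile

def mmap_write(path, offset, data):
    """Map a file and store into it directly, the way a DAX application
    would; m.flush() (msync) is the persistence point."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            m[offset:offset + len(data)] = data   # a plain memory store
            m.flush()                             # make it durable
```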
>
> IOWs, ext4/XFS + DAX gets us to a place that is good enough for most
> users and the hardware capabilities we expect to see in the next 5
> years. And hopefully that will be long enough to bring a purpose
> built, next generation persistent memory filesystem to production
> quality that can take full advantage of the technology...
Please, if possible, could you be so kind as to give a similarly good
summary, or at least a sketch, of the future development path and system
architecture?
What would this purpose-built, next-generation persistent memory
filesystem look like?
How would it differ from the DAX + FS approach, and which advantages
would it offer?
Would it be some kind of object storage system, possibly using the
kernel-internal structures mentioned above (see the two little questions
again)?
Do we have to keep the term "file" for everything?
>
> Cheers,
>
> Dave.
All the best
Christian Stroetmann