linux-nvdimm.lists.01.org archive mirror
From: Jerome Glisse <j.glisse@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>, Rik van Riel <riel@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	John Stoffel <john@stoffel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Boaz Harrosh <boaz@plexistor.com>, Jan Kara <jack@suse.cz>,
	Mike Snitzer <snitzer@redhat.com>, Neil Brown <neilb@suse.de>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Chris Mason <clm@fb.com>, Paul Mackerras <paulus@samba.org>,
	"H. Peter Anvin" <hpa@zytor.com>, Christoph Hellwig <hch@lst.de>,
	Alasdair Kergon <agk@redhat.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	Mel Gorman <mgorman@suse.de>,
	Matthew Wilcox <willy@linux.intel.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Theodore Ts'o <tytso@mit.edu>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Julia Lawall <Julia.Lawall@lip6.fr>, Tejun Heo <tj@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: "Directly mapped persistent memory page cache"
Date: Tue, 12 May 2015 10:47:54 -0400
Message-ID: <20150512144752.GA4003@gmail.com>
In-Reply-To: <20150512005347.GQ4327@dastard>

On Tue, May 12, 2015 at 10:53:47AM +1000, Dave Chinner wrote:
> On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
> 
> > > And, of course, different platforms have different page sizes, so 
> > > designing page array structures to be optimal for x86-64 is just a 
> > > wee bit premature.
> > 
> > 4K is the smallest one on x86 and ARM, and it's also a IMHO pretty 
> > sane default from a human workflow point of view.
> > 
> > But oddball configs with larger page sizes could also be supported at 
> > device creation time (via a simple superblock structure).
> 
> Ok, so now I know it's volatile, why do we need a persistent
> superblock? Why is *anything* persistent required?  And why would
> page size matter if the reserved area is volatile?
> 
> And if it is volatile, then the kernel is effectively doing dynamic
> allocation and initialisation of the struct pages, so why wouldn't
> we just do dynamic allocation out of a slab cache in RAM and free
> them when the last reference to the page goes away? Applications
> aren't going to be able to reference every page in persistent
> memory at the same time...
> 
> Keep in mind we need to design for tens of TB of PRAM at minimum
> (400GB NVDIMMS and tens of them in a single machine are not that far
> away), so static arrays of structures that index 4k blocks is not a
> design that scales to these sizes - it's like using 1980s filesystem
> algorithms for a new filesystem designed for tens of terabytes of
> storage - it can be made to work, but it's just not efficient or
> scalable in the long term.

On having an easy pfn<->struct page relation I would agree with Ingo. I
think it is important. For instance, in my case, when migrating system
memory to device memory I store a pfn in a special swap entry. While
right now I use my own ad-hoc structure, I would rather directly use
a struct page that I can easily find back from the pfn.
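To make the "pfn in a special swap entry" idea concrete, here is a minimal
userspace sketch of such an encoding: a tag in the low bits marks the entry
as a device-pfn entry, and the remaining bits carry the pfn. The names,
bit widths, and tag value are all illustrative, not the kernel's actual
swp_entry_t layout:

```c
#include <stdint.h>

/* Hypothetical encoding, loosely modeled on the kernel's swap entries:
 * low TYPE_BITS tag the entry kind, the rest hold the pfn itself. */
#define DEV_PFN_TYPE   0x1ULL      /* tag: this entry holds a device pfn */
#define TYPE_BITS      2

static inline uint64_t make_dev_entry(uint64_t pfn)
{
    return (pfn << TYPE_BITS) | DEV_PFN_TYPE;
}

static inline int is_dev_entry(uint64_t entry)
{
    return (entry & ((1ULL << TYPE_BITS) - 1)) == DEV_PFN_TYPE;
}

static inline uint64_t dev_entry_to_pfn(uint64_t entry)
{
    return entry >> TYPE_BITS;
}
```

The point of the round trip is exactly what the mail asks for: given the
entry alone, you can find your way back to the pfn (and from there, to a
struct page) without any side lookup structure.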

In the scheme I proposed you only need to allocate the PUD & PMD directories
and map a huge zero page read-only over the whole array at boot time.
When you need a struct page for a given pfn you allocate 2 pages, one for
the PMD directory and one for the struct page array for the given range of
pfns. Once the struct page is no longer needed you free both pages and
point back to the huge zero page.

So you get dynamic allocation and keep the nice pfn<->struct page mapping
working.
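The scheme above can be sketched in userspace C. Here the directory array
stands in for the PMD level: every slot initially points at one shared
zeroed chunk, and a real chunk (standing in for the page holding that range's
struct page array) is only allocated when some pfn in its range is actually
needed. All names and sizes are illustrative stand-ins, not kernel API:

```c
#include <stdlib.h>

#define PFNS_PER_CHUNK 512          /* pfns covered by one chunk */
#define NCHUNKS        1024

struct page_stub { unsigned long flags; };   /* stand-in for struct page */

static struct page_stub zero_chunk[PFNS_PER_CHUNK];  /* shared "zero page" */
static struct page_stub *dir[NCHUNKS];               /* "PMD" directory */

static void dir_init(void)
{
    /* boot time: everything points at the shared zero chunk */
    for (int i = 0; i < NCHUNKS; i++)
        dir[i] = zero_chunk;
}

/* Look up the struct page for a pfn, allocating its chunk on demand. */
static struct page_stub *pfn_to_page_stub(unsigned long pfn)
{
    unsigned long idx = pfn / PFNS_PER_CHUNK;

    if (dir[idx] == zero_chunk)
        dir[idx] = calloc(PFNS_PER_CHUNK, sizeof(struct page_stub));
    return &dir[idx][pfn % PFNS_PER_CHUNK];
}

/* Drop the chunk once no struct page in its range is needed any more. */
static void release_chunk(unsigned long pfn)
{
    unsigned long idx = pfn / PFNS_PER_CHUNK;

    if (dir[idx] != zero_chunk) {
        free(dir[idx]);
        dir[idx] = zero_chunk;      /* back to the zero mapping */
    }
}
```

The lookup stays a pure index computation from the pfn, which is what keeps
the pfn<->struct page relation cheap in both directions.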

> 
> As an example, look at the current problems with scaling the
> initialisation for struct pages for large memory machines - 16TB
> machines are taking 10 minutes just to initialise the struct page
> arrays on startup. That's the scale of overhead that static page
> arrays will have for PRAM, whether they are lazily initialised or
> not. IOWs, static page arrays are not scalable, and hence aren't a
> viable long term solution to the PRAM problem.

With the solution I describe above all you need to initialize are the PUD
& PMD directories pointing to a huge zero page. I would think this should
be fast enough even for 1TB: 2^(40 - 12 - 9 - 9) = 2^10, so you need 1024
PUD and 512K PMD (4M of PUD and 256M of PMD). You can even directly
share the PMD and dynamically allocate 3 pages (1 for the PMD level,
1 for the PTE level, 1 for the struct page array), effectively reducing the
static allocation to 4M for all PUD. The rest is dynamically allocated/freed
upon usage.
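The arithmetic above can be reproduced with the mail's own constants (4K
pages, i.e. 12 bits, and 512 entries, i.e. 9 bits, per table level), reading
"1024 PUD and 512K PMD" as entry counts for a 2^40-byte (1TB) range. That
reading is my interpretation of the figures, not something the mail spells
out:

```c
#include <stdint.h>

#define PAGE_SHIFT  12   /* 4K pages */
#define LEVEL_SHIFT 9    /* 512 entries per table level */

/* One PMD entry covers 2MB (4K page x 512 PTEs). */
static uint64_t pmd_entries(unsigned size_bits)
{
    return 1ULL << (size_bits - PAGE_SHIFT - LEVEL_SHIFT);
}

/* One PUD entry covers 1GB (2MB x 512 PMD entries). */
static uint64_t pud_entries(unsigned size_bits)
{
    return 1ULL << (size_bits - PAGE_SHIFT - 2 * LEVEL_SHIFT);
}
```

For size_bits = 40 this gives 2^10 = 1024 PUD entries and 2^19 = 512K PMD
entries, matching the 2^(40 - 12 - 9 - 9) computation in the text.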

> IMO, we need to be designing around the concept that the filesytem
> manages the pmem space, and the MM subsystem simply uses the block
> mapping information provided to it from the filesystem to decide how
> it references and maps the regions into the user's address space or
> for DMA. The mm subsystem does not manage the pmem space, it's
> alignment or how it is allocated to user files. Hence page mappings
> can only be - at best - reactive to what the filesystem does with
> it's free space. The mm subsystem already has to query the block
> layer to get mappings on page faults, so it's only a small stretch
> to enhance the DAX mapping request to ask for a large page mapping
> rather than a 4k mapping.  If the fs can't do a large page mapping,
> you'll get a 4k aligned mapping back.
> 
> What I'm trying to say is that the mapping behaviour needs to be
> designed with the way filesystems and the mm subsystem interact in
> mind, not from a pre-formed "direct Io is bad, we must use the page
> cache" point of view. The filesystem and the mm subsystem must
> co-operate to allow things like large page mappings to be made and
> hence looking at the problem purely from a mm<->pmem device
> perspective as you are ignores an important chunk of the system:
> the part that actually manages the pmem space...

I am all for letting the filesystem manage pmem, but I think having
struct page exposed to the mm allows the mm side to stay ignorant of what
is really behind it. Also, if I could share more code with others I would
be happier :)

Cheers,
Jérôme


Thread overview: 70+ messages
2015-05-06 20:04 [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t Dan Williams
2015-05-06 20:04 ` [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o Dan Williams
2015-05-07 14:55   ` Stephen Rothwell
2015-05-08  0:21     ` Dan Williams
2015-05-06 20:05 ` [PATCH v2 02/10] block: add helpers for accessing a bio_vec page Dan Williams
2015-05-08 15:59   ` Dan Williams
2015-05-06 20:05 ` [PATCH v2 03/10] block: convert .bv_page to .bv_pfn bio_vec Dan Williams
2015-05-06 20:05 ` [PATCH v2 04/10] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
2015-05-06 20:05 ` [PATCH v2 05/10] scatterlist: use sg_phys() Dan Williams
2015-05-06 20:05 ` [PATCH v2 06/10] scatterlist: support "page-less" (__pfn_t only) entries Dan Williams
2015-05-06 20:05 ` [PATCH v2 07/10] x86: support dma_map_pfn() Dan Williams
2015-05-06 20:05 ` [PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory Dan Williams
2015-05-06 20:20   ` [Linux-nvdimm] " Dan Williams
2015-05-06 20:05 ` [PATCH v2 09/10] dax: convert to __pfn_t Dan Williams
2015-05-06 20:05 ` [PATCH v2 10/10] block: base support for pfn i/o Dan Williams
2015-05-06 22:10 ` [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t Linus Torvalds
2015-05-06 23:47   ` Dan Williams
2015-05-07  0:19     ` Linus Torvalds
2015-05-07  2:36       ` Dan Williams
2015-05-07  9:02         ` Ingo Molnar
2015-05-07 14:42           ` Ingo Molnar
2015-05-07 15:52             ` Dan Williams
2015-05-07 17:52               ` Ingo Molnar
2015-05-07 15:00         ` Linus Torvalds
2015-05-07 15:40           ` Dan Williams
2015-05-07 15:58             ` Linus Torvalds
2015-05-07 16:03               ` Dan Williams
2015-05-07 17:36                 ` Ingo Molnar
2015-05-07 17:42                   ` Dan Williams
2015-05-07 17:56                     ` Dave Hansen
2015-05-07 19:11                       ` Ingo Molnar
2015-05-07 19:36                         ` Jerome Glisse
2015-05-07 19:48                           ` Ingo Molnar
2015-05-07 19:53                             ` Ingo Molnar
2015-05-07 20:18                               ` Jerome Glisse
2015-05-08  5:37                                 ` Ingo Molnar
2015-05-08  9:20                                   ` Al Viro
2015-05-08  9:26                                     ` Ingo Molnar
2015-05-08 10:00                                       ` Al Viro
2015-05-08 13:45                         ` Rik van Riel
2015-05-08 14:05                           ` Ingo Molnar
2015-05-08 14:54                             ` Rik van Riel
     [not found]                             ` <21836.51957.715473.780762@quad.stoffel.home>
2015-05-08 15:54                               ` Linus Torvalds
2015-05-08 16:28                                 ` Al Viro
2015-05-08 16:59                                 ` Rik van Riel
2015-05-09  1:14                                   ` Linus Torvalds
2015-05-09  3:02                                     ` Rik van Riel
2015-05-09  3:52                                       ` Linus Torvalds
2015-05-09 21:56                                       ` Dave Chinner
2015-05-09  8:45                                   ` "Directly mapped persistent memory page cache" Ingo Molnar
2015-05-09 18:24                                     ` Dan Williams
2015-05-10  9:46                                       ` Ingo Molnar
2015-05-10 17:29                                         ` Dan Williams
     [not found]                                     ` <87r3qpyciy.fsf@x220.int.ebiederm.org>
2015-05-10 10:07                                       ` Ingo Molnar
2015-05-11  8:25                                     ` Dave Chinner
2015-05-11  9:18                                       ` Ingo Molnar
2015-05-11 10:12                                         ` Zuckerman, Boris
2015-05-11 10:38                                           ` Ingo Molnar
2015-05-12  0:53                                         ` Dave Chinner
2015-05-12 14:47                                           ` Jerome Glisse [this message]
2015-06-05  5:43                                             ` Dan Williams
2015-05-11 14:31                                     ` Matthew Wilcox
2015-05-11 20:01                                       ` Jerome Glisse
2015-05-07 17:43                 ` [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t Linus Torvalds
2015-05-07 20:06                   ` Dan Williams
2015-05-07 16:18       ` Christoph Hellwig
2015-05-07 16:41         ` Dan Williams
2015-05-07 18:40           ` Ingo Molnar
2015-05-07 19:44             ` Dan Williams
2015-05-07 17:30         ` Jerome Glisse
