From: Matthew Wilcox <willy@linux.intel.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	John Stoffel <john@stoffel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Boaz Harrosh <boaz@plexistor.com>, Jan Kara <jack@suse.cz>,
	Mike Snitzer <snitzer@redhat.com>, Neil Brown <neilb@suse.de>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Chris Mason <clm@fb.com>, Paul Mackerras <paulus@samba.org>,
	"H. Peter Anvin" <hpa@zytor.com>, Christoph Hellwig <hch@lst.de>,
	Alasdair Kergon <agk@redhat.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	Mel Gorman <mgorman@suse.de>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Theodore Ts'o <tytso@mit.edu>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Julia Lawall <Julia.Lawall@lip6.fr>, Tejun Heo <tj@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: "Directly mapped persistent memory page cache"
Date: Mon, 11 May 2015 10:31:14 -0400
Message-ID: <20150511143114.GP4003@linux.intel.com>
In-Reply-To: <20150509084510.GA10587@gmail.com>

On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> If we 'think big', we can create something very exciting IMHO, that 
> also gets rid of most of the complications with DIO, DAX, etc:
> 
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
> 
>   - The pmem filesystem is mapped directly in all cases, it has device 
>     side struct page arrays, and its struct pages are directly in the 
>     page cache, write-through cached. (See further below about how we 
>     can do this.)
> 
>     Note that this is radically different from the current approach 
>     that tries to use DIO and DAX to provide specialized "direct
>     access" APIs.
> 
>     With the 'directly mapped' approach we have numerous advantages:
> 
>        - no double buffering to main RAM: the device pages represent 
>          file content.
> 
>        - no bdflush, no VM pressure, no writeback pressure, no
>          swapping: this is a very simple VM model where the device is
>          RAM and we don't have much dirty state. The primary kernel 
>          cache is the dcache and the directly mapped page cache, which 
>          is not a writeback cache in this case but essentially a 
>          logical->physical index cache of filesystem indexing 
>          metadata.
> 
>        - every binary mmap()ed would be XIP mapped in essence
> 
>        - every read() would be equivalent to a DIO read, without the
>          complexity of DIO.
> 
>        - every read() or write() done into a data mmap() area would
>          allow device-to-device zero copy DMA.
> 
>        - main RAM caching would still be available and would work in 
>          many cases by default: as most apps use file processing 
>          buffers in anonymous memory into which they read() data.

I admire your big vision, but I think there are problems that it doesn't
solve.

1. The difference in lifetimes between filesystem blocks and page cache
pages that represent them.  Existing filesystems have their own block
allocators which have their own notions of when blocks are available for
reallocation which may differ from when a page in the page cache can be
reused for caching another block.

Concrete example: A mapped page of a file is used as the source or target
of a direct I/O.  That file is simultaneously truncated, which in our
current paths calls the filesystem to free the block, while leaving the
page cache page in place in order to be the source or destination of
the I/O.  Once the I/O completes, the page's reference count drops to
zero and the page can be freed.

If we do not modify the filesystem, that page/block may end up referring
to a block in a different file, with the usual security & integrity
problems.

2. Some of the media which currently exist (not exactly supported
well by the current DAX framework either) have great read properties,
but abysmal write properties.  For example, they may have only a small
number of write cycles, or they may take milliseconds to absorb a write.
These media might work well for mapping some read-mostly files directly,
but be poor choices for putting things like struct page in, which contains
cachelines which are frequently modified.
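
To make the lifetime mismatch in point 1 concrete, here is a toy model of
the truncate-during-direct-I/O case.  It is only an illustrative sketch, not
kernel code: ToyBlockAllocator, ToyPage and simulate_truncate_during_dio are
hypothetical names invented for this example, and the model assumes the
filesystem frees and reallocates the block without consulting the page's
reference count.

```python
# Toy model (not kernel code) of the block/page lifetime mismatch.
# All names here are hypothetical and exist only for illustration.

class ToyBlockAllocator:
    """Tracks which file owns each block; None means the block is free."""

    def __init__(self, nblocks):
        self.owner = [None] * nblocks

    def alloc(self, file_name):
        # Hand out the first free block, as a simple allocator might.
        for blk, owner in enumerate(self.owner):
            if owner is None:
                self.owner[blk] = file_name
                return blk
        raise RuntimeError("no free blocks")

    def free(self, blk):
        self.owner[blk] = None


class ToyPage:
    """A directly mapped page: the page is the block, not a copy of it."""

    def __init__(self, blk):
        self.blk = blk
        self.refcount = 0


def simulate_truncate_during_dio():
    alloc = ToyBlockAllocator(nblocks=1)
    blk = alloc.alloc("file_a")
    page = ToyPage(blk)

    page.refcount += 1               # direct I/O pins the page
    alloc.free(blk)                  # truncate frees the block
    new_blk = alloc.alloc("file_b")  # the block is reused by another file

    # The in-flight I/O still targets page.blk, which file_b now owns.
    aliased = page.refcount > 0 and alloc.owner[page.blk] != "file_a"

    page.refcount -= 1               # I/O completes, too late to help
    return new_blk == blk, aliased
```

In this model the block is handed to file_b while the I/O against file_a's
page is still in flight, which is the aliasing described above.  Avoiding it
would require the allocator to know about the page's reference count, i.e.
modifying the filesystem.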

> We can achieve this by statically allocating all page structs on the 
> device, in the following way:
> 
>   - For every 128MB of pmem data we allocate 2MB of struct-page
>     descriptors, 64 bytes each, that describes that 128MB data range 
>     in a 4K granular way. We never have to allocate page structs as 
>     they are always there.
> 
>   - Filesystems don't directly see the preallocated page arrays, they
>     still get a 'logical block space' presented that to them looks
>     like a continuous block device (which is 1.5% smaller than the 
>     true size of the device): this allows arbitrary filesystems to be 
>     put into such pmem devices, fsck will just work, etc.
> 
>     I.e. no special pmem filesystem: the full range of existing block 
>     device based Linux filesystems can be used.

I think the goal of "use any Linux filesystem" is laudable, but
impractical.  Since we're modifying filesystems anyway, is there an
advantage to doing this in the block device instead of just allocating the
struct pages in a special file in the filesystem (like modern filesystems
do for various structures)?
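
For reference, the overhead figures in the quoted proposal can be checked
with a few lines of arithmetic.  This is only a sketch using the numbers
from the quote (4K pages, 64-byte descriptors, 128MB of data); the constant
and function names are made up for the example.

```python
# Check the struct page overhead quoted in the proposal.
# Constant and function names are hypothetical; the values come from the quote.

PAGE_SIZE = 4 * 1024              # 4K granularity
STRUCT_PAGE_SIZE = 64             # bytes per struct page descriptor
DATA_CHUNK = 128 * 1024 * 1024    # 128MB of pmem data


def struct_page_overhead(data_bytes):
    """Bytes of descriptors needed to describe data_bytes of pmem."""
    pages = data_bytes // PAGE_SIZE
    return pages * STRUCT_PAGE_SIZE


overhead = struct_page_overhead(DATA_CHUNK)

# Overhead relative to the data it describes, and relative to the whole
# device (data plus descriptors).
ratio_of_data = overhead / DATA_CHUNK
ratio_of_device = overhead / (DATA_CHUNK + overhead)
```

128MB at 4K granularity is 32768 pages, so the descriptors take 2MB, which
is 1/64 of the data and 1/65 (about 1.5%) of the whole device, in line with
the figure in the quote.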

