Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer

linux-fsdevel.vger.kernel.org archive mirror

From: Boaz Harrosh <boaz@plexistor.com>
To: Matthew Wilcox <willy@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-arch@vger.kernel.org, axboe@kernel.dk, riel@redhat.com,
	hch@infradead.org, linux-nvdimm@ml01.01.org,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	mgorman@suse.de, linux-fsdevel@vger.kernel.org
Subject: Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
Date: Thu, 19 Mar 2015 17:54:15 +0200
Message-ID: <550AF127.9010504@plexistor.com>
In-Reply-To: <20150319134313.GF4003@linux.intel.com>

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
<>
> 
> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
>    DRAM and PMEM
> 
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure.  I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure.  I think Dan's preferred method of replacing struct
> pages with PFNs is actually less intrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages).  Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> 
> What's your preference?  I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit as well. I have a small, very surgical
"hack" for allocating pages on demand before I/O. It involves adding a
new MEMORY_MODEL policy, derived from SPARSEMEM, that lets you allocate
individual pages on demand, plus a new type of page; call it
GP_emulated_page.
(Tell me if you find this interesting. It is 1/117 the size of either
 #2 or #3.)

In any case, please reconsider a configurable #1a for people who do
not mind sacrificing 1.2% of their pmem for real pages.

Even at 6G of page structs for 400G of pmem, people would love some of
the things this gives them today. Just a few examples: direct_access
from within a VM to host-defined pmem is trivial, with no extra code,
given my two simple #1a patches. RDMA memory-brick targets, network
shared-memory filesystems and so on: the list will always be longer
than for any of #1b, #2 or #3. That is, for people who are willing to
pay the extra cost.
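
As a rough check of the overhead figures above, here is a minimal
sketch. It assumes 4 KiB pages and a 64-byte struct page; neither
number is stated in this mail, and the function name is purely
illustrative. A different struct page size or page size changes the
ratio accordingly.

```python
# Back-of-the-envelope cost of option #1a: one struct page per pmem page.
# ASSUMPTIONS (not from the mail): 4 KiB pages, 64-byte struct page.
PAGE_SIZE = 4096          # bytes per page
STRUCT_PAGE_SIZE = 64     # bytes per struct page
GIB = 1 << 30


def page_struct_overhead(pmem_bytes, page_size=PAGE_SIZE,
                         struct_size=STRUCT_PAGE_SIZE):
    """Return the bytes of struct page needed to cover pmem_bytes."""
    n_pages = -(-pmem_bytes // page_size)  # ceiling division
    return n_pages * struct_size


if __name__ == "__main__":
    # Device sizes mentioned in the thread: 8G, 400G and 512G.
    for size_gib in (8, 400, 512):
        overhead = page_struct_overhead(size_gib * GIB)
        percent = 100 * overhead / (size_gib * GIB)
        print(f"{size_gib:4d} GiB pmem: {overhead / GIB:.3f} GiB "
              f"of struct page ({percent:.2f}%)")
```

Under these assumptions a 400G device needs a little over 6G of page
structs, in line with the figure quoted above, and the ratio works out
to roughly 1.6% rather than 1.2%.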

In the kernel it has always been about choice and diversity. And what
does it cost us? Nothing: two small, simple patches and a Kconfig option.
Note that I made it in such a way that if pmem is configured without
the use of pages, the mm code is *not* configured in automatically.
We can even add a runtime option so that, even with #1a enabled, a
particular pmem device can opt out of having pages allocated, making
the choice at runtime rather than at compile time.

I think this will only further our cause and let people advance
their research and development with great new ideas about the use of
pmem. Then, once there is great demand for #1a and those large 512G
devices come out, we can go the #1b or #3 route and save them the extra
1.2% of memory, but only once they have the appetite for it. (And
Andrew's question becomes clear.)

Our two ways need not be "either-or"; they can be "have both". I think
choice is a good thing for us here. Even with #3 available, #1a still
has merit in some configurations, and they can coexist perfectly.

Please think about it?

Thanks
Boaz

Thread overview: 42+ messages
2015-03-16 20:25 [RFC PATCH 0/7] evacuate struct page from the block layer Dan Williams
2015-03-16 20:25 ` [RFC PATCH 1/7] block: add helpers for accessing a bio_vec page Dan Williams
2015-03-16 20:25 ` [RFC PATCH 2/7] block: convert bio_vec.bv_page to bv_pfn Dan Williams
2015-03-16 23:05   ` Al Viro
2015-03-17 13:02     ` Matthew Wilcox
2015-03-17 15:53       ` Dan Williams
2015-03-16 20:25 ` [RFC PATCH 3/7] dma-mapping: allow archs to optionally specify a ->map_pfn() operation Dan Williams
2015-03-18 11:21   ` [Linux-nvdimm] " Boaz Harrosh
2015-03-16 20:25 ` [RFC PATCH 4/7] scatterlist: use sg_phys() Dan Williams
2015-03-16 20:25 ` [RFC PATCH 5/7] scatterlist: support "page-less" (__pfn_t only) entries Dan Williams
2015-03-16 20:25 ` [RFC PATCH 6/7] x86: support dma_map_pfn() Dan Williams
2015-03-16 20:26 ` [RFC PATCH 7/7] block: base support for pfn i/o Dan Williams
2015-03-18 10:47 ` [RFC PATCH 0/7] evacuate struct page from the block layer Boaz Harrosh
2015-03-18 13:06   ` Matthew Wilcox
2015-03-18 14:38     ` [Linux-nvdimm] " Boaz Harrosh
2015-03-20 15:56       ` Rik van Riel
2015-03-22 11:53         ` Boaz Harrosh
2015-03-18 15:35   ` Dan Williams
2015-03-18 20:26 ` Andrew Morton
2015-03-19 13:43   ` Matthew Wilcox
2015-03-19 15:54     ` Boaz Harrosh [this message]
2015-03-19 19:59       ` [Linux-nvdimm] " Andrew Morton
2015-03-19 20:59         ` Dan Williams
2015-03-22 17:22           ` Boaz Harrosh
2015-03-20 17:32         ` Wols Lists
2015-03-22 10:30         ` Boaz Harrosh
2015-03-19 18:17     ` Christoph Hellwig
2015-03-19 19:31       ` Matthew Wilcox
2015-03-22 16:46       ` Boaz Harrosh
2015-03-20 16:21     ` Rik van Riel
2015-03-20 20:31       ` Matthew Wilcox
2015-03-20 21:08         ` Rik van Riel
2015-03-22 17:06           ` Boaz Harrosh
2015-03-22 17:22             ` Dan Williams
2015-03-22 17:39               ` Boaz Harrosh
2015-03-20 21:17         ` Wols Lists
2015-03-22 16:24         ` Boaz Harrosh
2015-03-22 15:51       ` Boaz Harrosh
2015-03-23 15:19         ` Rik van Riel
2015-03-23 19:30           ` Christoph Hellwig
2015-03-24  9:41           ` Boaz Harrosh
2015-03-24 16:57             ` Rik van Riel
