From: Ingo Molnar <mingo@kernel.org>
To: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Boaz Harrosh <boaz@plexistor.com>, Jan Kara <jack@suse.cz>,
	Mike Snitzer <snitzer@redhat.com>, Neil Brown <neilb@suse.de>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Chris Mason <clm@fb.com>, Paul Mackerras <paulus@samba.org>,
	"H. Peter Anvin" <hpa@zytor.com>, Christoph Hellwig <hch@lst.de>,
	Alasdair Kergon <agk@redhat.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Mel Gorman <mgorman@suse.de>,
	Matthew Wilcox <willy@linux.intel.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Rik van Riel <riel@redhat.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Theodore Ts'o <tytso@mit.edu>,
	"Martin K. Petersen" <martin.petersen@oracl
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
Date: Thu, 7 May 2015 21:11:07 +0200
Message-ID: <20150507191107.GB22952@gmail.com>
In-Reply-To: <554BA748.9030804@linux.intel.com>


* Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 05/07/2015 10:42 AM, Dan Williams wrote:
> > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> * Dan Williams <dan.j.williams@intel.com> wrote:
> >>
> >> So is there anything fundamentally wrong about creating struct 
> >> page backing at mmap() time (and making sure aliased mmaps share 
> >> struct page arrays)?
> > 
> > Something like "get_user_pages() triggers memory hotplug for 
> > persistent memory", so they are actual real struct pages?  Can we 
> > do memory hotplug at that granularity?
> 
> We've traditionally limited them to SECTION_SIZE granularity, which 
> is 128MB IIRC.  There are also assumptions in places that you can do 
> page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

I really don't think that's very practical: memory hotplug is slow, 
it's really not on the same abstraction level as mmap(), and the zone 
data structures are also fundamentally very coarse: not just because 
RAM ranges are huge, but also so that the pfn->page transformation 
stays relatively simple and fast.

> But, in all practicality, a lot of those places are in code like the 
> buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're 
> not ever expecting these fake 'struct page's to hit these code 
> paths, it probably doesn't matter.
> 
> You can probably get away with just allocating PAGE_SIZE worth of 
> 'struct page' (which is 64) and mapping it in to vmemmap[].  The 
> worst case is that you'll eat 1 page of space for each outstanding 
> page of I/O.  That's a lot better than 2MB of temporary 'struct 
> page' space per page of I/O that it would take with a traditional 
> hotplug operation.

So I think the main value of struct page is if everyone on the system 
sees the same struct page for the same pfn - not just the temporary IO 
instance.

The idea of having very temporary struct page arrays misses the point, 
I think: if struct page is used as essentially an IO sglist then most 
of the synchronization properties are lost, and in that case we might 
as well use pfns directly and avoid the dynamic allocation overhead.

Stable, global page-struct descriptors are a given for real RAM, where 
we allocate a struct page for every page in nice, large, mostly linear 
arrays.

We'd really need that for pmem too, to get the full power of struct 
page: and that means allocating them in nice, large, predictable 
places - such as on the device itself ...

It might even be 'scattered' across the device: with a 64-byte struct 
page we can pack 64 descriptors into a single 4K page, so every 65th 
page could be a page-struct page.

Finding a pmem page's struct page would thus involve rounding its 
page index down to a multiple of 65 and reading that page.
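
A minimal sketch of that lookup, assuming 4K pages, a 64-byte 
descriptor, and that the page-struct page sits at the start of each 
65-page group (the mail does not fix its position). All names below 
are hypothetical and are not existing kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 4 KiB pages and a 64-byte struct page. */
#define PMEM_PAGE_SIZE      4096u
#define PMEM_DESC_SIZE      64u
#define PMEM_DESCS_PER_PAGE (PMEM_PAGE_SIZE / PMEM_DESC_SIZE) /* 64 */
#define PMEM_GROUP_PAGES    (PMEM_DESCS_PER_PAGE + 1u)        /* 65 */

/* Hypothetical result type: where a device page's descriptor lives. */
struct pmem_desc_loc {
	uint64_t desc_page;   /* device page index of the page-struct page */
	uint32_t desc_offset; /* byte offset of the descriptor in that page */
};

/*
 * Assumed layout: every group of 65 device pages starts with one
 * page-struct page, followed by the 64 data pages it describes.
 * dev_page must be a data page, not a page-struct page.
 */
static struct pmem_desc_loc pmem_desc_lookup(uint64_t dev_page)
{
	uint64_t slot = dev_page % PMEM_GROUP_PAGES;
	struct pmem_desc_loc loc;

	assert(slot != 0); /* slot 0 holds the descriptors themselves */

	loc.desc_page = dev_page - slot; /* round down to the group start */
	loc.desc_offset = (uint32_t)(slot - 1) * PMEM_DESC_SIZE;
	return loc;
}
```

With this layout the lookup is pure arithmetic on the device page 
index, which is what keeps it comparable in cost to the pfn->page 
transformation for real RAM.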

The problem with that is fourfold:

 - that we now turn a very kernel-internal API and data structure into 
   an ABI. If struct page grows beyond 64 bytes it's a problem.

 - on bootup (or device discovery time) we'd have to initialize all 
   the page structs. We could probably do this in a hierarchical way, 
   by dividing contiguous pmem ranges into power-of-two groups of 
   blocks, and organizing them like the buddy allocator does.

 - 1.5% of storage space lost (1 page in every 65).

 - will wear-leveling properly migrate these 'hot' pages around?
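
To put a number on the space overhead: the sketch below counts the 
pages consumed by descriptors under the same assumed layout (one 
page-struct page per started group of 65 pages). The function name is 
hypothetical:

```c
#include <stdint.h>

/*
 * Number of device pages consumed by page-struct pages, assuming one
 * descriptor page per (possibly partial) group of 65 device pages.
 */
static uint64_t pmem_desc_pages(uint64_t total_pages)
{
	return (total_pages + 64) / 65; /* round up to whole groups */
}
```

For a 1 TiB device with 4K pages (268435456 pages) this gives 4129777 
descriptor pages, roughly 15.75 GiB, which is the 1/65 fraction 
mentioned above.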

The alternative would be some global interval-rbtree of struct page 
backed pmem ranges.

Beyond the synchronization problems of such a data structure (which 
looks like a nightmare) I don't think it's even feasible: especially 
if there's a filesystem on the pmem device, the block allocations 
could be physically fragmented (and there's no fundamental reason why 
they couldn't be), so a contiguous mmap() of a file on it will yield 
wildly fragmented device-pfn ranges, exploding the rbtree. Think of a 
1 million node interval-rbtree with an average depth of 20: cachemiss 
country for even simple lookups - not to mention the complexity of 
freeing/recycling unused struct pages to keep it from growing too 
large.
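
The depth figure can be checked with a small sketch. It computes the 
minimum number of levels any binary tree needs for n nodes, 
ceil(log2(n + 1)); a red-black tree can be up to about twice as deep 
in the worst case. The function name is hypothetical:

```c
#include <stdint.h>

/*
 * Minimum number of levels of a binary tree holding n nodes, which
 * equals ceil(log2(n + 1)), i.e. the number of significant bits in n.
 * Each level visited in a lookup is a potential cache miss.
 */
static unsigned int min_tree_levels(uint64_t n)
{
	unsigned int levels = 0;

	while (n > 0) {
		levels++;
		n >>= 1;
	}
	return levels;
}
```

For n = 1000000 this returns 20, so even a perfectly balanced tree 
needs around 20 pointer dereferences to reach its deepest nodes.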

I might be wrong though about all this :)

Thanks,

	Ingo

