From: Ingo Molnar <mingo@kernel.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
John Stoffel <john@stoffel.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Boaz Harrosh <boaz@plexistor.com>, Jan Kara <jack@suse.cz>,
Mike Snitzer <snitzer@redhat.com>, Neil Brown <neilb@suse.de>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Chris Mason <clm@fb.com>, Paul Mackerras <paulus@samba.org>,
"H. Peter Anvin" <hpa@zytor.com>, Christoph Hellwig <hch@lst.de>,
Alasdair Kergon <agk@redhat.com>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
Mel Gorman <mgorman@suse.de>,
Matthew Wilcox <willy@linux.intel.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Jens Axboe <axboe@kernel.dk>
Subject: Re: "Directly mapped persistent memory page cache"
Date: Sun, 10 May 2015 12:07:26 +0200 [thread overview]
Message-ID: <20150510100725.GB15198@gmail.com> (raw)
In-Reply-To: <87r3qpyciy.fsf@x220.int.ebiederm.org>
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> > What do you think?
>
> > The tricky bit is what happens when you reboot and run a different
> > version of the kernel, especially a kernel with debugging features
> > like kmemcheck that increase the size of struct page.
Yes - but I think that's relatively easy to handle, as most 'weird'
page struct usages can be cordoned off:
I.e. we could define a 64-byte "core" struct page, denote it with a
single PG_ flag and stick with it: the only ABI is essentially its
size, as we (lazily) re-initialize it after every bootup.
The 'extended' (often debug) part of a struct page, such as
page->shadow on kmemcheck, can simply be handled in a special way
based on the PG_ flag:
- for example in the kmemcheck case no page->shadow means no leak
tracking: that's perfectly fine as these pages aren't part of the
buddy allocator and kmalloc() anyway.
- or NUMA_BALANCING's page->_last_cpupid can be 0 as well, as these
pages aren't (normally) NUMA-migrated.
The extended fields would have to be accessed via small wrappers,
which return 0 if the extended part is not present, but that's pretty
much all.
> I don't think we could have persistent struct page entries, as the
> exact contents of the struct page entries is too volatile and too
> different between architectures. [...]
Especially with the 2MB (and 1GB) granular lazy initialization
approach persisting them across reboots does not seem necessary
either.
Even main RAM is already doing lazy initialization: Mel's patches that
do that just went into -mm.
> [...] Especially architecture changes that a pmem store is likely to
> see such as switching between a 32bit and a 64bit kernel.
We'd not want to ABI-restrict the layout of struct page. But to say
that there's a core 64-byte descriptor per 4K page is not an overly
strict promise to keep.
> Further I think where in the persistent memory the struct page
> arrays live is something we could leave up to the filesystem. We
> could have some reasonable constraints to make it fast but I think
> whoever decides where things live on the persistent memory can make
> that choice.
So the beauty of the scheme is that in its initial incarnation it's
filesystem independent: you can create any filesystem on top of it
seamlessly, the filesystem simply sees a linear block device that is
1.5% smaller than the underlying storage. It won't even (normally)
have access to the struct page areas. This kind of data space
separation also protects against filesystem originated data
corruption.
Now in theory a filesystem might be aware of it, but I think it's far
more important to keep this scheme simple, robust, fast and
predictable.
> For small persistent memories it probably makes sense to allocate the
> struct page array describing them out of ordinary ram. For small
> memories I don't think we are talking enough memory to worry about.
> For TB+ persistent memories where you need 16GiB per TiB it makes
> sense to allocate one or several regions to store your struct page
> arrays, as you can't count on ordinary ram having enough capacity,
> and you may not even be talking about a system that actually has
> ordinary ram at that point.
Correct - if there's no ordinary page cache in main DRAM then for many
appliances ordinary RAM could be something like SRAM: really fast and
not wasted on dirty state and IO caches - a huge, directly mapped L4
or L5 CPU cache in essence.
Thanks,
Ingo