From: Wanpeng Li <liwanp@linux.vnet.ibm.com>
To: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Al Viro <viro@zeniv.linux.org.uk>,
Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
Mel Gorman <mgorman@suse.de>,
linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
Matthew Wilcox <matthew.r.wilcox@intel.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH, RFC 00/16] Transparent huge page cache
Date: Fri, 5 Apr 2013 09:42:08 +0800
Message-ID: <20130405014208.GC362@hacker.(null)>
In-Reply-To: <alpine.LNX.2.00.1301301619040.24861@eggly.anvils>
On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote:
>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>> Hugh Dickins wrote:
>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> > >
>> > > Here's first steps towards huge pages in page cache.
>> > >
>> > > The intent of the work is to get the code ready to enable transparent
>> > > huge page cache for the simplest fs -- ramfs.
>> > >
>> > > It's nowhere near feature-complete yet. It only provides basic infrastructure.
>> > > At the moment we can read, write and truncate files on ramfs with huge pages
>> > > in the page cache. The most interesting part, mmap(), is not there yet. For
>> > > now we split the huge page on any mmap() attempt.
>> > >
>> > > I can't say that I see the whole picture. I'm not sure if I understand the
>> > > locking model around split_huge_page(). Probably not.
>> > > Andrea, could you check if it looks correct?
>> > >
>> > > Next steps (not necessary in this order):
>> > > - mmap();
>> > > - migration (?);
>> > > - collapse;
>> > > - stats, knobs, etc.;
>> > > - tmpfs/shmem enabling;
>> > > - ...
>> > >
>> > > Kirill A. Shutemov (16):
>> > > block: implement add_bdi_stat()
>> > > mm: implement zero_huge_user_segment and friends
>> > > mm: drop actor argument of do_generic_file_read()
>> > > radix-tree: implement preload for multiple contiguous elements
>> > > thp, mm: basic defines for transparent huge page cache
>> > > thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>> > > thp, mm: rewrite delete_from_page_cache() to support huge pages
>> > > thp, mm: locking tail page is a bug
>> > > thp, mm: handle tail pages in page_cache_get_speculative()
>> > > thp, mm: implement grab_cache_huge_page_write_begin()
>> > > thp, mm: naive support of thp in generic read/write routines
>> > > thp, libfs: initial support of thp in
>> > > simple_read/write_begin/write_end
>> > > thp: handle file pages in split_huge_page()
>> > > thp, mm: truncate support for transparent huge page cache
>> > > thp, mm: split huge page on mmap file page
>> > > ramfs: enable transparent huge page cache
>> > >
>> > > fs/libfs.c | 54 +++++++++---
>> > > fs/ramfs/inode.c | 6 +-
>> > > include/linux/backing-dev.h | 10 +++
>> > > include/linux/huge_mm.h | 8 ++
>> > > include/linux/mm.h | 15 ++++
>> > > include/linux/pagemap.h | 14 ++-
>> > > include/linux/radix-tree.h | 3 +
>> > > lib/radix-tree.c | 32 +++++--
>> > > mm/filemap.c | 204 +++++++++++++++++++++++++++++++++++--------
>> > > mm/huge_memory.c | 62 +++++++++++--
>> > > mm/memory.c | 22 +++++
>> > > mm/truncate.c | 12 +++
>> > > 12 files changed, 375 insertions(+), 67 deletions(-)
>> >
>> > Interesting.
>> >
>> > I was starting to think about Transparent Huge Pagecache a few
>> > months ago, but then got washed away by incoming waves as usual.
>> >
>> > Certainly I don't have a line of code to show for it; but my first
>> > impression of your patches is that we have very different ideas of
>> > where to start.
>
>A second impression confirms that we have very different ideas of
>where to start. I don't want to be dismissive, and please don't let
>me discourage you, but I just don't find what you have very interesting.
>
>I'm sure you'll agree that the interesting part, and the difficult part,
>comes with mmap(); and there's no point whatever to THPages without mmap()
>(of course, I'm including exec and brk and shm when I say mmap there).
>
>(There may be performance benefits in working with larger page cache
>size, which Christoph Lameter explored a few years back, but that's a
>different topic: I think 2MB - if I may be x86_64-centric - would not be
>the unit of choice for that, unless SSD erase block were to dominate.)
>
>I'm interested to get to the point of prototyping something that does
>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
>a lot about my misconceptions, and have to rework for a while (or give
>up!); but I don't see much point in posting anything without that.
>I don't know if we have 5 or 50 places which "know" that a THPage
>must be Anon: some I'll spot in advance, some I sadly won't.
>
>It's not clear to me that the infrastructural changes you make in this
>series will be needed or not, if I pursue my approach: some perhaps as
>optimizations on top of the poorly performing base that may emerge from
>going about it my way. But for me it's too soon to think about those.
>
>Something I notice that we do agree upon: the radix_tree holding the
>4k subpages, at least for now. When I first started thinking towards
>THPageCache, I was fascinated by how we could manage the hugepages in
>the radix_tree, cutting out unnecessary levels etc; but after a while
>I realized that although there's probably nice scope for cleverness
>there (significantly constrained by RCU expectations), it would only
>be about optimization. Let's be simple and stupid about radix_tree
>for now, the problems that need to be worked out lie elsewhere.
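
To make that concrete, here is a minimal, untested sketch (not taken from
the series) of the "simple and stupid" layout: a 2MB page occupies
HPAGE_PMD_NR consecutive slots in the mapping's radix tree, one per 4k
subpage, so small-page lookups keep working unchanged. Preloading, page
flags and error unwinding are left out -- needing preloaded nodes for all
those inserts is exactly what patch 04 of the series is about.

#include <linux/pagemap.h>
#include <linux/radix-tree.h>
#include <linux/huge_mm.h>

/* Untested sketch: insert all subpages of a huge page at contiguous indices. */
static int add_huge_page_to_cache(struct address_space *mapping,
                                  struct page *head, pgoff_t index)
{
        int i, err = 0;

        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                /* Slot index + i holds the i-th 4k subpage of the compound page. */
                err = radix_tree_insert(&mapping->page_tree,
                                        index + i, head + i);
                if (err)
                        break;  /* real code must unwind the earlier inserts */
        }
        spin_unlock_irq(&mapping->tree_lock);
        return err;
}
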
>
>> >
>> > Perhaps that's good complementarity, or perhaps I'll disagree with
>> > your approach. I'll be taking a look at yours in the coming days,
>> > and trying to summon back up my own ideas to summarize them for you.
>>
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>
>> > Perhaps I was naive to imagine it, but I did intend to start out
>> > generically, independent of filesystem; but content to narrow down
>> > on tmpfs alone where it gets hard to support the others (writeback
>> > springs to mind). khugepaged would be migrating little pages into
>> > huge pages, where it saw that the mmaps of the file would benefit
>> > (and for testing I would hack mmap alignment choice to favour it).
>>
>> I don't think all filesystems at once would fly, but it would be
>> wonderful if I'm wrong :)
>
>You are imagining the filesystem putting huge pages into its cache.
>Whereas I'm imagining khugepaged looking around at mmaped file areas,
>seeing which would benefit from huge pagecache (let's assume offset 0
>belongs on hugepage boundary - maybe one day someone will want to tune
>some files or parts differently, but that's low priority), migrating 4k
>pages over to 2MB page (wouldn't have to be done all in one pass), then
>finally slotting in the pmds for that.
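
A very rough outline of that flow (nothing like this exists in the series,
and the two helpers at the end are purely hypothetical names): build the
2MB page from the existing 4k pagecache pages, then map it with pmds.
All locking, dirty/writeback handling, refcounting details and error
paths are ignored here.

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/huge_mm.h>
#include <linux/gfp.h>

/* Hypothetical helpers, named only for illustration -- they do not exist: */
void replace_pagecache_range_with_huge_page(struct address_space *mapping,
                                            pgoff_t start, struct page *huge);
void install_huge_pmds(struct address_space *mapping, pgoff_t start,
                       struct page *huge);

/* Untested sketch of a khugepaged-driven collapse of one 2MB-aligned range. */
static int collapse_file_range(struct address_space *mapping, pgoff_t start)
{
        struct page *huge, *small;
        int i;

        huge = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP, HPAGE_PMD_ORDER);
        if (!huge)
                return -ENOMEM;

        /* Doesn't have to be done all in one pass, as noted above. */
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                small = find_get_page(mapping, start + i);
                if (!small)
                        continue;       /* hole: real code would zero-fill */
                copy_highpage(huge + i, small);
                page_cache_release(small);
        }

        /* Swap the radix-tree entries over, then slot in the pmds. */
        replace_pagecache_range_with_huge_page(mapping, start, huge);
        install_huge_pmds(mapping, start, huge);
        return 0;
}
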
>
>But going this way, I expect we'd have to split at page_mkwrite():
>we probably don't want a single touch to dirty 2MB at a time,
>unless tmpfs or ramfs.
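
A minimal sketch of what such a page_mkwrite() hook might look like --
the function name is made up, it is untested, and it assumes
split_huge_page() has learnt to handle file pages, as patch 13 of the
series teaches it to:

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pagemap.h>

/* Untested sketch: split before a small write dirties the whole 2MB. */
static int thp_file_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
{
        struct page *page = vmf->page;

        if (PageTransCompound(page))
                split_huge_page(compound_head(page));   /* error handling omitted */

        /* Carry on with the normal 4k dirtying path. */
        return filemap_page_mkwrite(vma, vmf);
}
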
>
>>
>> > I had arrived at a conviction that the first thing to change was
>> > the way that tail pages of a THP are refcounted, that it had been a
>> > mistake to use the compound page method of holding the THP together.
>> > But I'll have to enter a trance now to recall the arguments ;)
>>
>> THP refcounting looks reasonable to me, if you take split_huge_page()
>> into account.
>
>I'm not claiming that the THP refcounting is wrong in what it's doing
>at present; but that I suspect we'll want to rework it for THPageCache.
>
>Something I take for granted, I think you do too but I'm not certain:
>a file with transparent huge pages in its page cache can also have small
>pages in other extents of its page cache; and can be mapped hugely (2MB
>extents) into one address space at the same time as individual 4k pages
>from those extents are mapped into another (or the same) address space.
>
>One can certainly imagine sacrificing that principle, splitting whenever
>there's such a "conflict"; but it then becomes uninteresting to me, too
>much like hugetlbfs. Splitting an anonymous hugepage in all address
>spaces that hold it when one of them needs it split, that has been a
>pragmatic strategy: it's not a common case for forks to diverge like
>that; but files are expected to be more widely shared.
>
>At present THP is using compound pages, with mapcount of tail pages
>reused to track their contribution to head page count; but I think we
>shall want to be able to use the mapcount, and the count, of TH tail
>pages for their original purpose if huge mappings can coexist with tiny.
>Not fully thought out, but that's my feeling.
>
>The use of compound pages, in particular the redirection of tail page
>count to head page count, was important in hugetlbfs: a get_user_pages
>reference on a subpage must prevent the containing hugepage from being
>freed, because hugetlbfs has its own separate pool of hugepages to
>which freeing returns them.
>
>But for transparent huge pages? It should not matter so much if the
>subpages are freed independently. So I'd like to devise another glue
>to hold them together more loosely (for prototyping I can certainly
>pretend we have infinite pageflag and pagefield space if that helps):
>I may find in practice that they're forever falling apart, and I run
>crying back to compound pages; but at present I'm hoping not.
>
>This mail might suggest that I'm about to start coding: I wish that
>were true, but in reality there's always a lot of unrelated things
>I have to look at, which dilute my focus. So if I've said anything
>that sparks ideas for you, go with them.
That seems like a good idea, Hugh. I will start coding this. ;-)
Regards,
Wanpeng Li
>
>Hugh
>