From: Mel Gorman <mgorman@suse.de>
To: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Hugh Dickins <hughd@google.com>,
Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
linux-mm@kvack.org, Matthew Wilcox <willy@linux.intel.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
Hillf Danton <dhillf@gmail.com>, Dave Hansen <dave@sr71.net>,
Ning Qu <quning@google.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
Date: Tue, 1 Oct 2013 09:38:28 +0100
Message-ID: <20131001083828.GA8093@suse.de>
In-Reply-To: <20130930185106.GD2125@tassilo.jf.intel.com>
On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
> > AFAIK, this is not a problem in the vast majority of modern CPUs
>
> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
> That's around 2MB. There's more and more code whose footprint exceeds
> that.
>
With the expectation that it is read-mostly data, replicated between the
caches accessing it, and that TLB refills take very little time. This is
not universally true and there are exceptions, but even recent papers on
TLB behaviour have tended to dismiss iTLB refill overhead as a negligible
portion of the overall workload of interest.
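
For reference, the back-of-the-envelope arithmetic behind the quoted
figure, plus the hypothetical upper bound if every entry could map a
huge page:

    512 iTLB entries * 4 KiB/page = 2 MiB of code reach
    512 iTLB entries * 2 MiB/page = 1 GiB of code reach

In practice CPUs have far fewer huge-page iTLB entries than 4K entries,
so the second line overstates the achievable benefit.
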
> Besides iTLB is not the only target. It is also useful for
> data of course.
>
True, but how useful? I have not seen an example of a workload where dTLB
pressure on file-backed data was a major component of the runtime. I would
expect SysV shared memory to be an exception, but does that require
generic support for all filesystems, or can tmpfs be special-cased when
it is used for shared memory?
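
The series already introduces a mapping_can_have_hugepages() predicate
(patch 05/22). If tmpfs were the only consumer, it could collapse to
something like the sketch below. This is hypothetical, not the patch's
actual implementation, and shmem_mapping() is an assumed helper that
tests whether an address_space belongs to shmem/tmpfs:

	/*
	 * Hypothetical tmpfs-only variant: restrict huge page cache
	 * to shmem mappings instead of wiring support through every
	 * filesystem.
	 */
	static inline bool mapping_can_have_hugepages(struct address_space *mapping)
	{
		/* shmem_mapping() is assumed to exist for this sketch */
		return shmem_mapping(mapping);
	}
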
For normal data, if it is read-only then there would be some benefit to
using huge pages once the data is in the page cache. How common are
workloads that mmap() large amounts of read-only data? Possibly some
databases, depending on the workload, although there I would expect the
data to be placed in shared memory.
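
For concreteness, this is the access pattern in question, as an
illustrative userspace snippet (not from the series; error handling
omitted and map_dataset is just a name for the example):

	#include <stddef.h>
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>

	static void *map_dataset(const char *path, size_t *len)
	{
		struct stat st;
		int fd = open(path, O_RDONLY);

		fstat(fd, &st);
		*len = st.st_size;
		/*
		 * A huge page cache would only help if read-only
		 * mappings like this one dominate the TLB footprint.
		 */
		return mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
	}
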
If the mmap()ed data is being written then the cost of IO is likely to
dominate, not TLB pressure. For write-mostly workloads there are greater
concerns, because dirty tracking can only be done at huge page
granularity, potentially leading to greater amounts of IO and degraded
performance overall.
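
To make the write amplification concern concrete, assume dirty tracking
at 2 MiB granularity and a workload that dirties a single byte:

    4 KiB dirty granularity: touch 1 byte -> write back 4 KiB
    2 MiB dirty granularity: touch 1 byte -> write back 2 MiB

That is a 512x worst-case increase in writeback for the same
application behaviour.
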
I could be completely wrong here but these were the concerns I had when
I first glanced through the patches. The changelogs had no information
to convince me otherwise so I never dedicated the time to reviewing the
patches in detail. I raised my concerns and then dropped it.
> > > and I found it very hard to be motivated to review the series as a result.
> > > I suspected that in many cases that the cost of IO would continue to dominate
> > > performance instead of TLB pressure
>
> The trend is to larger and larger memories, keeping things in memory.
>
Yes, but using huge pages is not *necessarily* the answer. For fault
scalability it would probably be much easier to batch-handle faults when
readahead indicates accesses are sequential. Background zeroing of pages
could be revisited for fault-intensive workloads. A potential alternative,
sketched below, is to allocate a contiguous page, zero it as one lump,
split it and put the pages onto a local per-task list, although the
details get messy. Reclaim scanning could be heavily modified to use
collections of pages instead of single pages (although I'm not aware of
a proper design for such a thing).
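
A very rough sketch of the allocate/zero/split idea, assuming a
hypothetical per-task list (current->zeroed_pages does not exist) and
ignoring highmem, accounting and error paths:

	struct page *page = alloc_pages(GFP_KERNEL, HPAGE_PMD_ORDER);
	int i;

	if (page) {
		/* one large clear instead of HPAGE_PMD_NR small ones */
		memset(page_address(page), 0, HPAGE_PMD_SIZE);

		/* break the high-order page back into order-0 pages */
		split_page(page, HPAGE_PMD_ORDER);

		for (i = 0; i < HPAGE_PMD_NR; i++)
			list_add(&page[i].lru, &current->zeroed_pages);
	}

Whether one big clear actually beats per-fault zeroing would need
measuring, which comes back to the profiling point below.
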
Again, this could be completely off the mark, but if it were me working
on this problem, I would have profile data from some workloads to make
sure the part I was optimising was a noticeable percentage of the
workload, and I would include that in the patch leader. I would hope the
data was compelling enough to convince reviewers to pay close attention
to the series, as the complexity would then be justified. Based on how
complex THP was for anonymous pages, I would be tempted to treat THP for
file-backed data as a last resort.
> In fact there's a good argument that memory sizes are growing faster
> than TLB capacities. And without large TLBs we're even further off
> the curve.
>
I'll admit this is also true. It was considered to be true in the 90's
when huge pages were first being thrown around as a possible solution to
the problem. One recent paper suggested using segmentation for large
memory segments, but the workloads it examined looked like they would be
dominated by anonymous accesses, not file-backed data, with one exception
where the workload frequently accessed compile-time constants.
--
Mel Gorman
SUSE Labs