public inbox for linux-kernel@vger.kernel.org
From: David Chinner <dgc@sgi.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: David Chinner <dgc@sgi.com>, Dave Hansen <haveblue@us.ibm.com>,
	linux-kernel@vger.kernel.org, David Kleikamp <shaggy@us.ibm.com>
Subject: Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
Date: Mon, 16 Jul 2007 10:27:07 +1000
Message-ID: <20070716002707.GR31489@sgi.com>
In-Reply-To: <20070713143109.GC2571@v2.random>

On Fri, Jul 13, 2007 at 04:31:09PM +0200, Andrea Arcangeli wrote:
> On Fri, Jul 13, 2007 at 05:13:08PM +1000, David Chinner wrote:
> > Sure. Fundamentally, though, I think it is the wrong approach to
> > take - it's a workaround for a big negative side effect of
> > increasing page size. It introduces lots of complexity and
> > difficult-to-test corner cases; judging by the tail packing problems
> > reiser3 has had over the years, it has the potential to be a
> > never-ending source of data corruption bugs.
> > 
> > I think that fine granularity and aggregation for efficiency of
> > scale is a better model to use than increasing the base page size.
> > With PPC, you can handle different page sizes in the hardware (like
> > MIPS) and the use of 64k base page size is an obvious workaround to
> > the problem of not being able to use multiple page sizes within the
> > OS.
> 
> I think you're being too fs centric. Moving only the pagecache to a
> large order is enough to you but it isn't enough to me, I'd like all
> allocations to be faster, and I'd like to reduce the page fault
> rate.

Right, and that is done on other operating systems by supporting
multiple hardware page sizes and telling the relevant applications to
use larger pages (e.g. via cpuset configuration).

> The CONFIG_PAGE_SHIFT isn't just about I/O. It's just that
> CONFIG_PAGE_SHIFT will give you the I/O side for free too.

It's not for free, and that's one of the points I've been trying
to make.

> Also keep in mind mixing multiple page sizes for different inodes has
> the potential to screw the aging algorithms in the reclaim code. Just
> to make an example during real random I/O over all bits of hot cache
> in pagecache, a 64k page has 16 times more probability of being marked
> young than a 4k page.

Sure, but if a page is being hit repeatedly - regardless of its
size - then you want to keep it around....
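
To put a number on the 16x figure quoted above, here is a minimal
sketch. It assumes accesses are uniformly random at 4k granularity
over the whole page cache, which is the scenario the quoted text
describes. The function names and the 1 GiB cache size are
illustrative assumptions, not kernel interfaces.

```python
# Sketch only: assumes uniformly random accesses over the whole page
# cache. All names here are illustrative, not kernel APIs.

SMALL_PAGE = 4 * 1024
LARGE_PAGE = 64 * 1024


def hit_probability(page_size: int, cache_size: int) -> float:
    """Chance that one random access lands in a given page of page_size bytes."""
    return page_size / cache_size


def touched_at_least_once(page_size: int, cache_size: int, accesses: int) -> float:
    """Chance that the page is referenced at least once in `accesses` random accesses."""
    miss = 1.0 - hit_probability(page_size, cache_size)
    return 1.0 - miss ** accesses


cache = 1 << 30  # hypothetical 1 GiB of page cache
ratio = hit_probability(LARGE_PAGE, cache) / hit_probability(SMALL_PAGE, cache)
print(ratio)  # per-access ratio equals LARGE_PAGE / SMALL_PAGE
```

Per access, the ratio is simply the ratio of the page sizes. Over a
longer window the gap narrows slightly, because the chance of being
touched at least once saturates faster for the larger page.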

> The tail packing of pagecache could very well be worth it. It should
> cost nothing for the large files.

As I've said before - I'm not just concerned with large files - I'm
also concerned about large numbers of files (hundreds of millions to
billions in a filesystem) and the scalability issues involved with
them. IOWs, I'm looking at metadata scalability as much as data
scalability.

It's flexibility that I need from the VM, not pure VM efficiency.
Shifting the base page size is not an efficient solution to the
different aspects of filesystem scalability. We've got to deal with
both ends of the spectrum simultaneously on the one machine in the
same filesystem and it's only going to get worse in the future.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
