public inbox for linux-kernel@vger.kernel.org
* Page Colouring (was: 2.6.0 Huge pages not working as expected)
       [not found] <176UD-6vl-3@gated-at.bofh.it>
@ 2003-12-26 21:48 ` Anton Ertl
  2003-12-26 23:28   ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Anton Ertl @ 2003-12-26 21:48 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:
>And the thing is, using huge pages will mean that the pages are 1:1
>mapped, and thus get "perfectly" cache-coloured, while the anonymous mmap 
>will give you random placement.
>
>And what you are seeing is likely the fact that random placement is 
>guaranteed to not have any worst-case behaviour.

You probably just put the "not" in the wrong place, but just in case
you meant it: Random placement does not give such a guarantee.  You
can get the same worst-case behaviour as with page colouring, since
you can get the same mapping.  It's just unlikely.

>In particular, using a pure power-of-two stride means that you are
>limiting your cache to a certain subset of the full result with the
>perfect coloring.
>
>This, btw, is why I don't like page coloring: it does give nicely
>reproducible results, but it does not necessarily improve performance.  

Well, even if, on average, it has no performance impact,
reproducibility is a good reason to like it.  Is it good enough to
implement it?  I'll leave that to you.

However, the main question I want to look at here is: Does it improve
performance, on average?  I think it does, because of spatial
locality.

I.e., you more frequently access stuff spatially close to a recent
access (where page colouring has zero chance of conflicting, whereas
random mapping has a non-zero chance) than stuff that is exactly a
multiple of the cache size away (which is the worst case for page
colouring).  Fortunately, the set-associative caches in the machines I
use most of the time reduce the impact of the missing page colouring
in Linux.
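Anton's colour arithmetic can be sketched in a toy model (illustrative
Python, not kernel code; the cache geometry and the identity mapping
below are assumptions chosen for the example):

```python
# Toy model: how a direct-mapped cache maps physical addresses to sets,
# and why a stride of exactly the cache size is the worst case for a
# colour-preserving mapping, while neighbouring pages never conflict.
PAGE_SIZE = 4096
CACHE_SIZE = 1 << 20      # assumed: 1 MiB direct-mapped cache
LINE_SIZE = 64
NUM_SETS = CACHE_SIZE // LINE_SIZE

def cache_set(paddr):
    """Set index of a physical address in the direct-mapped cache."""
    return (paddr // LINE_SIZE) % NUM_SETS

def colour_preserving(vaddr):
    """Page colouring keeps virtual and physical colour equal; model
    the simplest such mapping, the identity."""
    return vaddr

# Worst case for colouring: addresses exactly CACHE_SIZE apart conflict.
assert cache_set(colour_preserving(0)) == cache_set(colour_preserving(CACHE_SIZE))
# The common, spatially-local case: neighbouring pages cannot conflict.
assert cache_set(colour_preserving(0)) != cache_set(colour_preserving(PAGE_SIZE))
```

A random mapping would make the first case unlikely but the second
case merely probable, which is the trade-off under discussion.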

The most frequent case where random mapping gives better performance
than page colouring is having several sequential passes over a block
that is larger than the cache; but that's just a case where caches
perform badly on principle, and cache designs that are usually
considered better (higher associativity, LRU replacement) perform
worse in this case.
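The sequential-pass pathology is easy to reproduce in a toy model (a
fully associative LRU cache in illustrative Python; the sizes are
arbitrary assumptions):

```python
from collections import OrderedDict

def lru_hits(cache_lines, trace):
    """Count hits for an access trace in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits

# Two sequential passes over 150 lines through a 100-line cache:
# LRU evicts every line just before it is reused, so the second pass
# gets no hits at all, even though most of the block fits.
trace = list(range(150)) * 2
assert lru_hits(100, trace) == 0
```

With a cache large enough for the whole block (200 lines here), the
second pass hits on every line, which is the contrast being made.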

OTOH, for cases where the block barely fits in the cache, page
colouring performs quite a bit better.  This particular access pattern
can be more frequent than one might expect from other statistics, due
to software optimizations like cache blocking.

One additional way in which page colouring can help performance
is by providing a predictable and understandable performance model to
programmers.  Caches are hard enough to analyse as it is; one need not
complicate the issue with the unpredictable effects of random
virtual-to-physical translation.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-26 21:48 ` Anton Ertl
@ 2003-12-26 23:28   ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-26 23:28 UTC (permalink / raw)
  To: Anton Ertl; +Cc: linux-kernel



On Fri, 26 Dec 2003, Anton Ertl wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> >
> >And what you are seeing is likely the fact that random placement is 
> >guaranteed to not have any worst-case behaviour.
> 
> You probably just put the "not" in the wrong place, but just in case
> you meant it: Random placement does not give such a guarantee.

No, I meant what I said.

Random placement is the _only_ algorithm guaranteed to have no 
pathological worst-case behaviour.

>							  You
> can get the same worst-case behaviour as with page colouring, since
> you can get the same mapping.  It's just unlikely.

"pathological worst-case" is something that is repeatable. For example, 
the test-case above is a pathological worst-case scenario for a 
direct-mapped cache. 
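The distinction can be put in probabilistic terms (a sketch with
assumed numbers, not a claim about any particular machine): under a
colour-preserving mapping a conflicting pair of pages conflicts on
every run, while under random placement any given pair conflicts with
probability 1/colours, and a different pair each run.

```python
import random

NUM_COLOURS = 64   # assumed: cache_size / (associativity * page_size)

def colour(pfn):
    return pfn % NUM_COLOURS

# Colour-preserving mapping: pages a multiple of NUM_COLOURS apart
# conflict deterministically, on every single run.
assert colour(0) == colour(NUM_COLOURS) == colour(7 * NUM_COLOURS)

# Random placement: the same pair conflicts only with probability
# 1/NUM_COLOURS, so no fixed access pattern is slow on every run.
random.seed(0)
trials = 10000
conflicts = sum(random.randrange(NUM_COLOURS) == random.randrange(NUM_COLOURS)
                for _ in range(trials))
assert 0 < conflicts < trials // 10    # roughly trials/64 expected
```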

> Well, even if, on average, it has no performance impact,
> reproducibility is a good reason to like it.  Is it good enough to
> implement it?  I'll leave that to you.

Well, since random (or, more accurately in this case, "pseudo-random") has 
a number of things going for it, and is a lot faster and cheaper to 
implement, I don't see the point of cache coloring.

That's doubly true since any competent CPU will have at least four-way 
associativity these days.

> However, the main question I want to look at here is: Does it improve
> performance, on average?  I think it does, because of spatial
> locality.

Hey, the discussion in this case showed how it _deproves_ performance (at 
least if my theory was correct - and it should be easily testable and I 
bet it is).

Also, the work has been done to test things, and cache coloring definitely
makes performance _worse_. It does so exactly because it artificially
limits your page choices, causing problems at multiple levels (not just at
the cache, like this example, but also in page allocators and freeing).

So basically, cache coloring results in:
 - some nice benchmarks (mainly the kind that walk memory very 
   predictably, notably FP kernels)
 - mostly worse performance in "real life"
 - more complex code
 - much worse memory pressure

My strong opinion is that it is worthless except possibly as a performance
tuning tool, but even there the repeatability is a false advantage: if you
do performance tuning using cache coloring, there is nothing that
guarantees that your tuning was _correct_ for the real world case.

In short, you may be doing your performance tuning such that it tunes for
or against one of the (known) pathological cases of the layout, nothing
more.

But hey, some people disagree with me. That's their right. It's not 
unconstitutional to be wrong ;)

			Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
       [not found] ` <179IS-1VD-13@gated-at.bofh.it>
@ 2003-12-27 20:21   ` Anton Ertl
  2003-12-27 20:56     ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Anton Ertl @ 2003-12-27 20:21 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:
>
>
>On Fri, 26 Dec 2003, Anton Ertl wrote:
>>							  You
>> can get the same worst-case behaviour as with page colouring, since
>> you can get the same mapping.  It's just unlikely.
>
>"pathological worst-case" is something that is repeatable.

And you probably mean "repeatable every time".  Ok, then a random
scheme has, by your definition, no pathological worst case.  I am not
sure that this is a consolation when I happen upon one of its
unpredictable and unrepeatable worst cases.

>> Well, even if, on average, it has no performance impact,
>> reproducibility is a good reason to like it.  Is it good enough to
>> implement it?  I'll leave that to you.
>
>Well, since random (or, more accurately in this case, "pseudo-random") has 
>a number of things going for it, and is a lot faster and cheaper to 
>implement, I don't see the point of cache coloring.

The points are:

- repeatability
- predictability
- better average performance (you dispute that).

>Hey, the discussion in this case showed how it _deproves_ performance (at 
>least if my theory was correct - and it should be easily testable and I 
>bet it is).

I don't think that discussing this special case answers the question
about "on average" performance, but here we go:

For the Coppermine results I see that the performance of the malloc()
case is only better with spans 2048 and 4096, and not by much.  For the
Willamette 16MB results I see very little difference, except in the
span=4096 case, where the difference is large.  For the Willamette 4MB
case I see slightly better performance for hugetlbfs with spans 256,
512, and 1024, and slightly worse performance with spans 2048 and 4096.

Yes, mapping policy could be part of the explanation for these
results: With the smaller spans, you get no cache hits with either
mapping policy.  With larger spans, random mapping might return to
some of the lines before evicting them.

However, this is probably not the whole picture, because with that
explanation we would expect that the times for larger spans with
random mapping should be better than for smaller spans, but they are
not.  There is something else at work that makes the times larger with
larger spans (maybe DRAM row switching?).

I see no easy way to test your theory (at least until I can measure
cache *and* TLB misses again on a machine I have access to).

Anyway, back to the performance effects of page colouring: Yes, there
are cases where it is not beneficial, and the huge-2^n-stride cases in
examples like the one above are one of them, but I don't think that
this is the kind of "real life" application that you mention
elsewhere, or is it?

>Also, the work has been done to test things, and cache coloring definitely
>makes performance _worse_. It does so exactly because it artificially
>limits your page choices, causing problems at multiple levels (not just at
>the cache, like this example, but also in page allocators and freeing).

Sorry, I am not aware of the work you are referring to.  Where can I
read more about it?  Are you sure that these are fundamental problems
and not just artifacts of particular implementations?

>So basically, cache coloring results in:
> - some nice benchmarks (mainly the kind that walk memory very 
>   predictably, notably FP kernels)

Predictable accesses are not important, spatial locality is.

> - mostly worse performance in "real life"

Like the code above?-)

Hmm, maybe the pathological large-2^n-stride stuff is more frequent
than I would expect.  But I think it's possible to have a repeatable
and mostly understandable/predictable mapping policy that does not
have this pathological worst case (of course, being repeatable, it
will have a different one:-), and can provide better average
performance than random mapping by exploiting spatial locality.

> - much worse memory pressure

That sounds like an implementation artifact.

>My strong opinion is that it is worthless except possibly as a performance
>tuning tool, but even there the repeatability is a false advantage: if you
>do performance tuning using cache coloring, there is nothing that
>guarantees that your tuning was _correct_ for the real world case.

How does _correct_ness come into play?

As for performance, I guess there are three cases:

- Changes that have little to do with the memory hierarchy.  These are
probably easier to evaluate in a repeatable environment, and any
performance improvements should transfer nicely into a random-mapping
environment.

- Changes that address the pathological case for the repeatable
environment, e.g., (in the context of page colouring) eliminating
large 2^n strides; this particular optimization will have less effect
in a random-mapping environment, but typically still a positive one
(random mapping also suffers from strides that are multiples of the
page size).

- Changes that tune particularly for specific cache sizes, e.g., cache
blocking.  The results may be suboptimal for the random-mapping case;
probably better than just picking the parameter at random, but in most
runs worse than some other parameter.  I wonder if you get any better
results if you make just one run for a number of parameter values in a
random-mapping environment and pick the parameter that gave the best
result (which may have more to do with the mapping in this run than
with the parameter).

In conclusion, I think that tuning in a page colouring environment
will transfer into a random-mapping environment well in most cases.

- anton


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-27 20:21   ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Anton Ertl
@ 2003-12-27 20:56     ` Linus Torvalds
  2003-12-27 23:31       ` Eric W. Biederman
       [not found]       ` <17tHK-3K6-21@gated-at.bofh.it>
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-27 20:56 UTC (permalink / raw)
  To: Anton Ertl; +Cc: linux-kernel



On Sat, 27 Dec 2003, Anton Ertl wrote:
> 
> And you probably mean "repeatable every time".  Ok, then a random
> scheme has, by your definition, no pathological worst case.  I am not
> sure that this is a consolation when I happen upon one of its
> unpredictable and unrepeatable worst cases.

Those "unpredictable" cases are so exceedingly rare that they aren't worth
worrying about.

> The points are:
> 
> - repeatability
> - predictability
> - better average performance (you dispute that).

I absolutely dispute that.

And you should realize that I do not dispute it because the applications
themselves would run slower with cache coloring. Most applications don't
much care, they either fit in the cache, or the cache misses have random
enough access patterns that cache layout doesn't much matter.

The test code in question is an anomaly, and doesn't matter. It's exactly
the same kind of strange case that sometimes shows cache coloring making a
huge difference.

The real degradation comes in just the fact that cache coloring itself is
often expensive to implement and causes nasty side effects like bad memory 
allocation patterns, and nasty special cases that you have to worry about 
(ie special fallback code on non-colored pages when required).

That expense is both a run-time expense _and_ a conceptual one (a
conceptual expense is something that complicates the internal workings of
the allocator so much that it becomes harder to think about and more
bug-prone). So far nobody has shown a reasonable way to do it without
either of the two.

> Anyway, back to the performance effects of page colouring: Yes, there
> are cases where it is not beneficial, and the huge-2^n-stride cases in
> examples like the one above are one of them, but I don't think that
> this is the kind of "real life" application that you mention
> elsewhere, or is it?

The "real life" application is something like running a normal server or
desktop, and having the cache coloring code _itself_ be the performance
problem.

It's not that "apache" minds very much. Or "mozilla". They simply don't 
care. The problem is the algorithm itself.

> >Also, the work has been done to test things, and cache coloring definitely
> >makes performance _worse_. It does so exactly because it artificially
> >limits your page choices, causing problems at multiple levels (not just at
> >the cache, like this example, but also in page allocators and freeing).
> 
> Sorry, I am not aware of the work you are referring to.  Where can I
> read more about it?  Are you sure that these are fundamental problems
> and not just artifacts of particular implementations?

Hey, there have been at least four different major cache coloring trials 
for the kernel over the years. This discussion has been going on since the 
early nineties. And _none_ of them have worked well in practice.

In other words, "artifacts of the particular implementation" is certainly 
right, but the point I have is that the only thing that _matters_ is 
implementation. You can argue about theory all you like, I won't care 
until you show me an implementation that works and is robustly better.

And it has to be better on average on _everything_ that Linux supports,
not just one particular braindamaged piece of hardware. I'm totally not
interested in something that makes performance on most machines go down,
if it then improves one or two braindead setups with direct-mapped caches.

Basically: prove me wrong. People have tried before. They have failed. 
Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-27 20:56     ` Linus Torvalds
@ 2003-12-27 23:31       ` Eric W. Biederman
  2003-12-27 23:50         ` William Lee Irwin III
                           ` (2 more replies)
       [not found]       ` <17tHK-3K6-21@gated-at.bofh.it>
  1 sibling, 3 replies; 28+ messages in thread
From: Eric W. Biederman @ 2003-12-27 23:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anton Ertl, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Sat, 27 Dec 2003, Anton Ertl wrote:
> > 
> > And you probably mean "repeatable every time".  Ok, then a random
> > scheme has, by your definition, no pathological worst case.  I am not
> > sure that this is a consolation when I happen upon one of its
> > unpredictable and unrepeatable worst cases.
> 
> Those "unpredictable" cases are so exceedingly rare that they aren't worth
> worrying about.

They show up a lot in benchmarks, which makes them something
to worry about, even if real world applications don't show
the same behavior.  Of course it is stupid to tune machines
to the benchmarks but...

> Basically: prove me wrong. People have tried before. They have failed. 
> Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.

For anyone taking you up on this I'd like to suggest two possible
directions.

1) Increasing PAGE_SIZE in the kernel.
2) Creating zones for the different colors.  Zones were not
   implemented the last time this was tried.

Both of those should have minimal impact on the complexity
of the current kernel.

I don't know where we will wind up, but the performance variations
caused by cache conflicts in today's applications are real, and easily
measurable.  Given the growing difference in performance
between CPUs and memory, Amdahl's Law shows this will only grow,
so I think this is worth looking at.

Eric


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-27 23:31       ` Eric W. Biederman
@ 2003-12-27 23:50         ` William Lee Irwin III
  2003-12-28  1:09         ` David S. Miller
  2003-12-28  4:53         ` Linus Torvalds
  2 siblings, 0 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-27 23:50 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, Anton Ertl, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:
>> Basically: prove me wrong. People have tried before. They have failed. 
>> Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.

On Sat, Dec 27, 2003 at 04:31:22PM -0700, Eric W. Biederman wrote:
> For anyone taking you up on this I'd like to suggest two possible
> directions.
> 1) Increasing PAGE_SIZE in the kernel.
> 2) Creating zones for the different colors.  Zones were not
>    implemented the last time this was tried.
> Both of those should have minimal impact on the complexity
> of the current kernel.
> I don't know where we will wind up, but the performance variations
> caused by cache conflicts in today's applications are real, and easily
> measurable.  Given the growing difference in performance
> between CPUs and memory, Amdahl's Law shows this will only grow,
> so I think this is worth looking at.

Increasing PAGE_SIZE in the kernel either (a) breaks ABI or (b) is
nontrivial. I suppose I should try some of the page coloring benchmarks
on pgcl (which preserves ABI).

-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-27 23:31       ` Eric W. Biederman
  2003-12-27 23:50         ` William Lee Irwin III
@ 2003-12-28  1:09         ` David S. Miller
  2003-12-28  4:53         ` Linus Torvalds
  2 siblings, 0 replies; 28+ messages in thread
From: David S. Miller @ 2003-12-28  1:09 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: torvalds, anton, linux-kernel

On 27 Dec 2003 16:31:22 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 2) Creating zones for the different colors.  Zones were not
>    implemented the last time this was tried.

While this idea might sound promising, it would not work,
because by definition pages of a particular color cannot
be coalesced into order-1 or larger buddies.
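The arithmetic behind this objection is simple enough to sketch
(illustrative Python with an assumed colour count, not kernel code):
an order-1 buddy pair consists of frames pfn and pfn^1, and
consecutive frame numbers always differ in colour.

```python
NUM_COLOURS = 64   # assumed power-of-two colour count > 1

def colour(pfn):
    """Cache colour of a physical frame number."""
    return pfn % NUM_COLOURS

# An order-1 buddy pair is (pfn, pfn ^ 1) with pfn even.  The two
# frames are consecutive, so their colours always differ: a zone that
# holds only one colour can never coalesce even an order-1 buddy.
for pfn in range(0, 4096, 2):
    assert colour(pfn) != colour(pfn ^ 1)
```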


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-27 23:31       ` Eric W. Biederman
  2003-12-27 23:50         ` William Lee Irwin III
  2003-12-28  1:09         ` David S. Miller
@ 2003-12-28  4:53         ` Linus Torvalds
  2003-12-28 16:39           ` William Lee Irwin III
  2003-12-29 21:11           ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Eric W. Biederman
  2 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-28  4:53 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Anton Ertl, linux-kernel



On Sat, 27 Dec 2003, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> >
> > Basically: prove me wrong. People have tried before. They have failed. 
> > Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.
> 
> For anyone taking you up on this I'd like to suggest two possible
> directions.
> 
> 1) Increasing PAGE_SIZE in the kernel.

Yes. This is something I actually want to do anyway for 2.7.x. Dan 
Phillips had some patches for this six months ago.

You have to be careful, since you have to be able to mmap "partial pages", 
which is what makes it less than trivial, but there are tons of reasons to 
want to do this, and cache coloring is actually very much a secondary 
concern.

> 2) Creating zones for the different colors.  Zones were not
>    implemented the last time this was tried.

Hey, I can tell you that you _will_ fail.

Zones are actually a wonderful example of the kinds of problems you get
into when you have pages of different types aka "colors". We've had
nothing but trouble trying to balance different zones against each other,
and those problems were in fact _the_ reason for 99% of all the VM
problems in 2.4.x.

Trying to use them for cache colors would be "interesting". 

Not to mention that it's impossible to coalesce pages across zones.

> Both of those should have minimal impact on the complexity
> of the current kernel.

Minimal? I don't think so. Zones are basically impossible, and page size 
changes will hopefully happen during 2.7.x, but not due to page coloring.

> I don't know where we will wind up, but the performance variations
> caused by cache conflicts in today's applications are real, and easily
> measurable.  Given the growing difference in performance
> between CPUs and memory, Amdahl's Law shows this will only grow,
> so I think this is worth looking at.

Absolutely wrong.

Why? Because the fact is that as memory gets further and further away
from CPUs, caches have gotten further and further away from being direct
mapped.

Cache coloring is already a very questionable win for four-way 
set-associative caches. I doubt you can even _see_ it for eight-way or 
higher associativity caches.

In other words: the pressures you mention clearly do exist, but they are 
all driving direct-mapped caches out of the market, and thus making page
coloring _less_ interesting rather than more.
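The associativity argument can be made concrete with a sketch
(illustrative Python; the cache geometries below are assumed
examples): the number of distinct page colours is the way size divided
by the page size, and it collapses as associativity rises.

```python
def num_colours(cache_size, associativity, page_size=4096):
    """Distinct page colours: pages per cache way (at least one)."""
    way_size = cache_size // associativity
    return max(way_size // page_size, 1)

# Assumed example geometries:
assert num_colours(1 << 20, 1) == 256   # 1 MiB direct-mapped: 256 colours
assert num_colours(1 << 20, 8) == 32    # same cache, 8-way: far fewer
assert num_colours(1 << 15, 8) == 1     # 32 KiB 8-way L1: colouring is moot
```

Once the way size shrinks to a single page, every page is the same
colour and colouring has nothing left to do.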

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-28  4:53         ` Linus Torvalds
@ 2003-12-28 16:39           ` William Lee Irwin III
  2003-12-29  0:36             ` Mike Fedyk
  2003-12-29 21:11           ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Eric W. Biederman
  1 sibling, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-28 16:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, Anton Ertl, linux-kernel

On Sat, 27 Dec 2003, Eric W. Biederman wrote:
>> For anyone taking you up on this I'd like to suggest two possible
>> directions.
>> 1) Increasing PAGE_SIZE in the kernel.

On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
> Yes. This is something I actually want to do anyway for 2.7.x. Dan 
> Phillips had some patches for this six months ago.
> You have to be careful, since you have to be able to mmap "partial pages", 
> which is what makes it less than trivial, but there are tons of reasons to 
> want to do this, and cache coloring is actually very much a secondary 
> concern.

I've not seen Dan Phillips' code for this. I've been hacking on
something doing this since late last December.


-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
       [not found]       ` <17tHK-3K6-21@gated-at.bofh.it>
@ 2003-12-28 17:17         ` Anton Ertl
  0 siblings, 0 replies; 28+ messages in thread
From: Anton Ertl @ 2003-12-28 17:17 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:
>And you should realize that I do not dispute it because the applications
>themselves would run slower with cache coloring.

Ok, I guess I misunderstood that until now.

> Most applications don't
>much care,

Yes, at least as long as the associativity is high enough.

> they either fit in the cache, or the cache misses have random
>enough access patterns that cache layout doesn't much matter.

Random mapping hurts those applications most that do fit in the cache
in principle, but that have enough hot memory (whether accessed
regularly or randomly) that random mapping usually introduces cache
conflicts (this will happen for many applications with direct-mapped
caches, but hardly ever with high-associativity caches).
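This is essentially the birthday problem (a sketch with assumed
numbers): even when the hot pages could all be placed conflict-free
under colouring, random placement makes some collision nearly certain
once the working set uses a fair fraction of the colours.

```python
def p_all_distinct(pages, colours):
    """Probability that `pages` randomly placed pages all receive
    distinct colours (the birthday problem)."""
    p = 1.0
    for i in range(pages):
        p *= (colours - i) / colours
    return p

# Assumed numbers: 32 hot pages, 64 colours.  Under colouring the
# pages can all be made conflict-free in a direct-mapped cache; under
# random placement a conflict is nearly certain.
assert p_all_distinct(32, 64) < 0.01
assert p_all_distinct(2, 64) > 0.98   # a single pair rarely conflicts
```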

>And it has to be better on average on _everything_ that Linux supports,
>not just one particular braindamaged piece of hardware. I'm totally not
>interested in something that makes performance on most machines go down,
>if it then improves one or two braindead setups with direct-mapped caches.

As has been discussed in another thread, direct-mapped caches seem to
be pretty standard for off-chip caches, and this is not just a
braindamage issue: Higher associativity requires more wires to the
tags, and also to the data if you want to access the data in parallel
with the tags for lower latency.  Running a lot of wires off-chip is a
problem.  So the choices are:

- Small on-chip cache with high associativity.

- Medium cache with off-chip data, on-chip tags, high associativity
and high latency.

- Large cache with off-chip data and off-chip tags, and low
associativity.

However, over time off-chip caches seem to be becoming less common, so
we may eventually get rid of low associativity for L2/L3 caches.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-28 16:39           ` William Lee Irwin III
@ 2003-12-29  0:36             ` Mike Fedyk
  2003-12-29  2:55               ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Fedyk @ 2003-12-29  0:36 UTC (permalink / raw)
  To: William Lee Irwin III, Linus Torvalds, Eric W. Biederman,
	Anton Ertl, linux-kernel

On Sun, Dec 28, 2003 at 08:39:52AM -0800, William Lee Irwin III wrote:
> On Sat, 27 Dec 2003, Eric W. Biederman wrote:
> >> For anyone taking you up on this I'd like to suggest two possible
> >> directions.
> >> 1) Increasing PAGE_SIZE in the kernel.
> 
> On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
> > Yes. This is something I actually want to do anyway for 2.7.x. Dan 
> > Phillips had some patches for this six months ago.
> > You have to be careful, since you have to be able to mmap "partial pages", 
> > which is what makes it less than trivial, but there are tons of reasons to 
> > want to do this, and cache coloring is actually very much a secondary 
> > concern.
> 
> I've not seen Dan Phillips' code for this. I've been hacking on
> something doing this since late last December.

I remember his work on pagetable sharing, but haven't heard anything about
changing the page size from him.

Could this be what Linus is remembering?


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  0:36             ` Mike Fedyk
@ 2003-12-29  2:55               ` William Lee Irwin III
  2003-12-29  4:09                 ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29  2:55 UTC (permalink / raw)
  To: mfedyk, Linus Torvalds, Eric W. Biederman, Anton Ertl,
	linux-kernel

On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
>>> Yes. This is something I actually want to do anyway for 2.7.x. Dan 
>>> Phillips had some patches for this six months ago.
>>> You have to be careful, since you have to be able to mmap "partial pages", 
>>> which is what makes it less than trivial, but there are tons of reasons to 
>>> want to do this, and cache coloring is actually very much a secondary 
>>> concern.

On Sun, Dec 28, 2003 at 08:39:52AM -0800, William Lee Irwin III wrote:
>> I've not seen Dan Phillips' code for this. I've been hacking on
>> something doing this since late last December.

On Sun, Dec 28, 2003 at 04:36:31PM -0800, Mike Fedyk wrote:
> I remember his work on pagetable sharing, but haven't heard anything about
> changing the page size from him.
> Could this be what Linus is remembering?

Doubtful. I suspect he may be referring to pgcl (sometimes called
"subpages"), though Dan Phillips hasn't been involved in it. I guess
we'll have to wait for Linus to respond to know for sure.


-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  2:55               ` William Lee Irwin III
@ 2003-12-29  4:09                 ` Linus Torvalds
  2003-12-29  6:52                   ` William Lee Irwin III
  2003-12-29 20:02                   ` Subpages (was: Page Colouring) Daniel Phillips
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-29  4:09 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips



On Sun, 28 Dec 2003, William Lee Irwin III wrote:
> 
> Doubtful. I suspect he may be referring to pgcl (sometimes called
> "subpages"), though Dan Phillips hasn't been involved in it. I guess
> we'll have to wait for Linus to respond to know for sure.

I didn't see the patch itself, but I spent some time talking to Daniel
after your talk at the kernel summit. At least I _think_ it was him I was 
talking to - my memory for names and faces is basically zero.

Daniel claimed to have it working back then, and that it actually shrank
the kernel source code. The basic approach is to just make PAGE_SIZE
larger, and handle temporary needs for smaller subpages by just
dynamically allocating "struct page" entries for them. The size reduction
came from getting rid of the "struct buffer_head", because it ends up
being just another "small page".

Daniel, details?

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  4:09                 ` Linus Torvalds
@ 2003-12-29  6:52                   ` William Lee Irwin III
  2003-12-29  9:14                     ` Linus Torvalds
       [not found]                     ` <20031229084304.GA31630@elte.hu>
  2003-12-29 20:02                   ` Subpages (was: Page Colouring) Daniel Phillips
  1 sibling, 2 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29  6:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips

On Sun, Dec 28, 2003 at 08:09:17PM -0800, Linus Torvalds wrote:
> I didn't see the patch itself, but I spent some time talking to Daniel
> after your talk at the kernel summit. At least I _think_ it was him I was 
> talking to - my memory for names and faces is basically zero.
> Daniel claimed to have it working back then, and that it actually shrank
> the kernel source code. The basic approach is to just make PAGE_SIZE
> larger, and handle temporary needs for smaller subpages by just
> dynamically allocating "struct page" entries for them. The size reduction
> came from getting rid of the "struct buffer_head", because it ends up
> being just another "small page".
> Daniel, details?

I also heard something about this from daniel. The description I was
given implied rather different functionality, and raised rather serious
questions about the implementation he didn't have adequate answers for.
I also never saw code, despite months of occasional discussions about it.

I did get a positive reaction from you at KS, and I've also been
slaving away at keeping this thing current and improving it when I can
for a year. Would you mind telling me what the Hell is going on here?

I guess I already know I'm screwed beyond all hope of recovery, but I
might as well get official confirmation.

-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  6:52                   ` William Lee Irwin III
@ 2003-12-29  9:14                     ` Linus Torvalds
  2003-12-29  9:22                       ` William Lee Irwin III
       [not found]                     ` <20031229084304.GA31630@elte.hu>
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2003-12-29  9:14 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips



On Sun, 28 Dec 2003, William Lee Irwin III wrote:
> 
> I did get a positive reaction from you at KS, and I've also been
> slaving away at keeping this thing current and improving it when I can
> for a year. Would you mind telling me what the Hell is going on here?
> 
> I guess I already know I'm screwed beyond all hope of recovery, but I
> might as well get official confirmation.

No, I haven't even _looked_ at any 2.7.x timeframe patches, and I'm not 
even going to for the next few months. 

I don't care what does it, I want a bigger PAGE_CACHE_SIZE, and working 
patches are the only thing that matters. But for now, I have my 2.6.x 
blinders on.

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  9:14                     ` Linus Torvalds
@ 2003-12-29  9:22                       ` William Lee Irwin III
  2003-12-29  9:33                         ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29  9:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips

On Sun, 28 Dec 2003, William Lee Irwin III wrote:
>> I did get a positive reaction from you at KS, and I've also been
>> slaving away at keeping this thing current and improving it when I can
>> for a year. Would you mind telling me what the Hell is going on here?
>> I guess I already know I'm screwed beyond all hope of recovery, but I
>> might as well get official confirmation.

On Mon, Dec 29, 2003 at 01:14:03AM -0800, Linus Torvalds wrote:
> No, I haven't even _looked_ at any 2.7.x timeframe patches, and I'm not 
> even going to for the next few months. 
> I don't care what does it, I want a bigger PAGE_CACHE_SIZE, and working 
> patches are the only thing that matters. But for now, I have my 2.6.x 
> blinders on.

I can't say I'm particularly encouraged by what I've heard thus far,
but I suppose that means what I've been doing for the past 6 months
hasn't been entirely meaningless.

Thanks.


-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  9:22                       ` William Lee Irwin III
@ 2003-12-29  9:33                         ` Linus Torvalds
  2003-12-29 10:23                           ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2003-12-29  9:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips



On Mon, 29 Dec 2003, William Lee Irwin III wrote:
> 
> I can't say I'm particularly encouraged by what I've heard thus far,

Well, I don't even know what your approach is - mind giving an overview? 

My original plan (and you can see some of it in the fact that 
PAGE_CACHE_SIZE is separate from PAGE_SIZE), was to just have the page 
cache be able to use bigger pages than the "normal" pages, and the 
normal pages would continue to be the hardware page size.

However, especially with mem_map[] becoming something of a problem, and 
all the problems we'd have if PAGE_SIZE and PAGE_CACHE_SIZE were 
different, I suspect I'd just be happier with increasing PAGE_SIZE 
altogether (and PAGE_CACHE_SIZE with it), and then just teaching the VM 
mapping about "fractional pages".

What's your approach?

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29  9:33                         ` Linus Torvalds
@ 2003-12-29 10:23                           ` William Lee Irwin III
  2003-12-29 10:59                             ` Mike Fedyk
  2003-12-30  2:00                             ` Rusty Russell
  0 siblings, 2 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29 10:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List,
	phillips

On Mon, 29 Dec 2003, William Lee Irwin III wrote:
>> I can't say I'm particularly encouraged by what I've heard thus far,

On Mon, Dec 29, 2003 at 01:33:53AM -0800, Linus Torvalds wrote:
> Well, I don't even know what your approach is - mind giving an overview? 
> My original plan (and you can see some of it in the fact that 
> PAGE_CACHE_SIZE is separate from PAGE_SIZE), was to just have the page 
> cache be able to use bigger pages than the "normal" pages, and the 
> normal pages would continue to be the hardware page size.
> However, especially with mem_map[] becoming something of a problem, and 
> all the problems we'd have if PAGE_SIZE and PAGE_CACHE_SIZE were 
> different, I suspect I'd just be happier with increasing PAGE_SIZE 
> altogether (and PAGE_CACHE_SIZE with it), and then just teaching the VM 
> mapping about "fractional pages".
> What's your approach?

Hmm, I presented on this at KS. Basically, it's identical to Hugh
Dickins' approach from 2000. The only difference is really that it had
to be forward ported (or unfortunately in too many cases reimplemented)
to mix with current code and features.

Basically, elevate PAGE_SIZE, introduce MMUPAGE_SIZE to be a nice macro
representing the hardware pagesize, and the fault handling is done with
some relatively localized complexity. Numerous s/PAGE_SIZE/MMUPAGE_SIZE/
bits are sprinkled around, along with a few more involved changes because
a large number of distributed changes are required to handle oddities
that occur when PAGE_SIZE changes from 4KB. The more involved changes
are often for things such as the only reason it uses PAGE_SIZE is
really that it just expects 4KB and says PAGE_SIZE, or that it wants
some fixed (even across compiles) size and needs updating for more
general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be
what a pte maps when this is now represented by MMUPAGE_SIZE. I have a
bad feeling the diligence of the original code audit could be bearing
against me (and though I'm trying to be equally diligent, I'm not hugh).

The fact that merely elevating PAGE_SIZE breaks numerous things makes me
rather suspicious of claims that minimalistic patches can do likewise.

The only new infrastructures introduced are the MMUPAGE_SIZE and a couple
of related macros (defining numbers, not structures or code) and the
fault handler implementations. The diff size is not small. The memory
footprint is, and demonstrably so (c.f. March 27 2003).

My 2.6 code has been heavily leveraging the pfn abstraction in its favor
to represent physical addresses measured in units of the hardware
pagesize. Generally, my maintenance approach has been incrementally
advancing the state of the thing while keeping it working on as broad a
cross section of i386 systems as I can test or get testers on. It has
been verified to run userspace on Thinkpad T21's and 16x/32GB and
32x/64GB NUMA-Q's at every point release it's been ported to, which
since 2.5.68 or so has been every point release coming out of kernel.org.


-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29 10:23                           ` William Lee Irwin III
@ 2003-12-29 10:59                             ` Mike Fedyk
  2003-12-29 11:14                               ` William Lee Irwin III
  2003-12-30  2:00                             ` Rusty Russell
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Fedyk @ 2003-12-29 10:59 UTC (permalink / raw)
  To: William Lee Irwin III, Kernel Mailing List

On Mon, Dec 29, 2003 at 02:23:19AM -0800, William Lee Irwin III wrote:
> bits are sprinkled around, along with a few more involved changes because
> a large number of distributed changes are required to handle oddities
> that occur when PAGE_SIZE changes from 4KB. The more involved changes
> are often for things such as the only reason it uses PAGE_SIZE is
> really that it just expects 4KB and says PAGE_SIZE, or that it wants
> some fixed (even across compiles) size and needs updating for more
> general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be
> what a pte maps when this is now represented by MMUPAGE_SIZE.

Any chance some of these changes are self contained, and could be split out
and possibly merged into -mm?


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29 10:59                             ` Mike Fedyk
@ 2003-12-29 11:14                               ` William Lee Irwin III
  0 siblings, 0 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29 11:14 UTC (permalink / raw)
  To: mfedyk, Kernel Mailing List

On Mon, Dec 29, 2003 at 02:23:19AM -0800, William Lee Irwin III wrote:
>> bits are sprinkled around, along with a few more involved changes because
>> a large number of distributed changes are required to handle oddities
>> that occur when PAGE_SIZE changes from 4KB. The more involved changes
>> are often for things such as the only reason it uses PAGE_SIZE is
>> really that it just expects 4KB and says PAGE_SIZE, or that it wants
>> some fixed (even across compiles) size and needs updating for more
>> general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be
>> what a pte maps when this is now represented by MMUPAGE_SIZE.

On Mon, Dec 29, 2003 at 02:59:18AM -0800, Mike Fedyk wrote:
> Any chance some of these changes are self contained, and could be split out
> and possibly merged into -mm?

I talked about this for a little while. Basically, there is only one
concept in the entire patch, despite its large size. The vast bulk of
the "distributed changes" are s/PAGE_SIZE/MMUPAGE_SIZE/.

At some point I was told to keep the whole shebang rolling out of tree,
or was otherwise not answered by akpm and/or Linus, after I sent in a
split-up version (the patch is actually very easy to split up
file-by-file) of just some of the totally trivial arch/i386/ changes.
The nontrivial changes are stupid in nature, but touch "fragile" or
otherwise "scary to touch" code, which sort of relegates them to 2.7.
This is not entirely unjustified, as changes of a similar code impact
wrt. the GDT appear to have affected some APM systems' suspend ability
(I know for a fact my changes do not have impacts on APM suspend, but
other, analogous support issues could arise after broader testing.)

Basically, the MMUPAGE_SIZE introductions didn't interest anyone a while
ago, and I suspect people probably just want them all at once, since it's
unlikely people want to repeat the pain analogous to PAGE_CACHE_SIZE (I
should clarify later how this is different) where the incremental
introduction never culminated in the introduction of functionality.


-- wli


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
       [not found]                     ` <20031229084304.GA31630@elte.hu>
@ 2003-12-29 12:09                       ` Ingo Molnar
  2003-12-29 12:49                         ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2003-12-29 12:09 UTC (permalink / raw)
  To: William Lee Irwin III, Linus Torvalds, mfedyk, Eric W. Biederman,
	Anton Ertl, Kernel Mailing List, phillips


* William Lee Irwin III <wli@holomorphy.com> wrote:

> I also heard something about this from daniel. The description I was
> given implied rather different functionality, and raised rather
> serious questions about the implementation he didn't have adequate
> answers for. I also never saw code, despite months of occasional
> discussions about it.
> 
> I did get a positive reaction from you at KS, and I've also been
> slaving away at keeping this thing current and improving it when I can
> for a year. Would you mind telling me what the Hell is going on here?
> 
> I guess I already know I'm screwed beyond all hope of recovery, but I
> might as well get official confirmation.

i've been following your code (pgcl) and it looks pretty good. (it needs
finishing touches as always, but that's fine.) I tried to backport it to
2.4 before doing 4G/4G, but the maintenance overhead skyrocketed, so it's
not practical for 2.4-based distribution purposes - but it would be
the perfect kind of thing to start 2.7.0 with. I've not seen any other
code but yours in this area.

i believe the right approach to the 'tons of RAM' problem is to simplify
it as much as possible, ie. go for larger pages (and wrap the MMU format
in the most trivial way) and to deal with 4K pages as a filesystem (and
ELF format) compatibility thing only. Your patch does precisely this.
How much we'll have to 'mix' the two page sizes, only practice will
tell, but the less mixing, the easier it will get. Filesystems on such
systems will match the pagesize anyway.

i'd even suggest to not overdo the fractured-page logic too much - ie.
just 'waste' a full page on a misaligned or single 4K-sized vma -
concentrate on the common case: linearly mapped files and anonymous
mappings. Prefault both of them at PAGE_SIZE granularity and 'waste' the
final partial page. The VM swapout logic should only deal with full
pages. Same for the pagecache: just fill in full pages and dont worry
about granularity.

Your patch already does more than this. But i think if someone does 4K
vmas on a pgcl system or runs it on a 128 MB box and expects perfect
swapping, then it's his damn fault.

	Ingo


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29 12:09                       ` Ingo Molnar
@ 2003-12-29 12:49                         ` William Lee Irwin III
  0 siblings, 0 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-29 12:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, mfedyk, Eric W. Biederman, Anton Ertl,
	Kernel Mailing List, phillips

* William Lee Irwin III <wli@holomorphy.com> wrote:
> > I did get a positive reaction from you at KS, and I've also been
>> slaving away at keeping this thing current and improving it when I can
>> for a year. Would you mind telling me what the Hell is going on here?
>> I guess I already know I'm screwed beyond all hope of recovery, but I
>> might as well get official confirmation.

On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote:
> i've been following your code (pgcl) and it looks pretty good. (it needs
> finishing touches as always, but that's fine.) I tried to backport it to
> 2.4 before doing 4G/4G, but the maintenance overhead skyrocketed, so it's
> not practical for 2.4-based distribution purposes - but it would be
> the perfect kind of thing to start 2.7.0 with. I've not seen any other
> code but yours in this area.

That's a rather kind assessment; I suppose I hold flaws not critical at
the design level as fatal where those who look primarily at design don't.


On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote:
> i believe the right approach to the 'tons of RAM' problem is to simplify
> it as much as possible, ie. go for larger pages (and wrap the MMU format
> in the most trivial way) and to deal with 4K pages as a filesystem (and
> ELF format) compatibility thing only. Your patch does precisely this.
> How much we'll have to 'mix' the two page sizes, only practice will
> tell, but the less mixing, the easier it will get. Filesystems on such
> systems will match the pagesize anyway.

Well, that's more or less consistent with what I'm doing. In
actuality it's Hugh's design and original implementation, but I'm going
to have to claim _some_ credit for the work I've put into this at some
point, though it be grunt work after a fashion.

The nontrivial point is largely ABI compatibility. A tremendous amount
of diff could be eliminated without ABI compatibility; however, the
concern is rather critical as long as legacy binaries are involved.


On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote:
> i'd even suggest to not overdo the fractured-page logic too much - ie.
> just 'waste' a full page on a misaligned or single 4K-sized vma -
> concentrate on the common case: linearly mapped files and anonymous
> mappings. Prefault both of them at PAGE_SIZE granularity and 'waste' the
> final partial page. The VM swapout logic should only deal with full
> pages. Same for the pagecache: just fill in full pages and dont worry
> about granularity.
> Your patch already does more than this. But i think if someone does 4K
> vmas on a pgcl system or runs it on a 128 MB box and expects perfect
> swapping, then it's his damn fault.

My reasoning here has actually been dominated by performance. Exchanging
the logic for this task is actually a difficult enough operation with
respect to programming that very few a priori concerns can be allowed
any influence at all.

The algorithm now used for fault handling, recently ported by brute
force from Hugh's rather ancient sources, effectively does as you say
(though there is a lot of latitude in the criterion you've stated).
One risk I've taken is updating some API's to return pfn's instead of
pages. In the case of get_user_pages() this is likely essential. But
kmap_atomic_to_page() (to_pfn() in my sources) and some others might
be able to be avoided entirely with some moderately traumatic rework
(traumatic as far as work I have to do is concerned; in all honesty,
the issue is stupid, but as a problem it makes up for the lack of
difficulty owing to quality with that owed to vast quantities of
debugging and intolerance to dumb C mistakes.) The methods you're
suggesting would remove these changes in exchange for some
potential inefficiencies in virtualspace consumption, though these
are not entirely out of the question, as ia32 is effectively deprecated.


-- wli


* Subpages (was: Page Colouring)
  2003-12-29  4:09                 ` Linus Torvalds
  2003-12-29  6:52                   ` William Lee Irwin III
@ 2003-12-29 20:02                   ` Daniel Phillips
  2003-12-29 20:15                     ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Phillips @ 2003-12-29 20:02 UTC (permalink / raw)
  To: Linus Torvalds, William Lee Irwin III
  Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List

On Sunday 28 December 2003 23:09, Linus Torvalds wrote:
> On Sun, 28 Dec 2003, William Lee Irwin III wrote:
> > Doubtful. I suspect he may be referring to pgcl (sometimes called
> > "subpages"), though Dan Phillips hasn't been involved in it. I guess
> > we'll have to wait for Linus to respond to know for sure.
>
> I didn't see the patch itself, but I spent some time talking to Daniel
> after your talk at the kernel summit. At least I _think_ it was him I was
> talking to - my memory for names and faces is basically zero.
>
> Daniel claimed to have it working back then, and that it actually shrank
> the kernel source code. The basic approach is to just make PAGE_SIZE
> larger, and handle temporary needs for smaller subpages by just
> dynamically allocating "struct page" entries for them. The size reduction
> came from getting rid of the "struct buffer_head", because it ends up
> being just another "small page".
>
> Daniel, details?

Hi Linus,

Your description is accurate.  Another reason for code size shrinkage is 
getting rid of the loops across buffers in the block IO library, e.g., 
block_read_full_page.

Subpages only make sense for file-backed memory, which conveniently lets the 
page cache keep track of subpages.  Each address_space has pages of all the 
same size, which may be smaller, larger or the same as PAGE_CACHE_SIZE.  The 
first case, "subpages", is the interesting one.

An address_space with subpages has base pages of PAGE_CACHE_SIZE for its 
"even" entries and up to N-1 dynamically allocated struct pages for the "odd"  
entries where N is PAGE_CACHE_SIZE divided by the subpage size.  Base pages 
are normal members of mem_map.  Subpages are not referenced by mem_map, but 
only by the page cache.  They are created by operations such as 
find_or_create_page, which first creates a base page if necessary.  A counter 
field in the page flags of the base page keeps track of how many subpages 
share a base page's physical memory; when this field goes to zero the base 
page may be removed from the page cache.

Subpages always have a ->virtual field regardless of whether mem_map pages do.  
This is used for virt_to_phys and to locate the base page when a subpage is 
freed.

Page fault handling doesn't change much if at all, since the faulting address 
is rounded down to a physical page, which will be a base page.

Most of the changes for subpages are in the buffer.c page cache operations and 
are largely transparent to the VMM, though PAGE_CACHE_SHIFT becomes 
mapping->page_shift, which touches a lot of files.  As you noted, buffer_head 
functionality can be taken over by struct page and buffers become expendable.  
However it is not necessary to cross that bridge immediately; page buffer 
lists continue to work though the buffer list is never longer than one.

With a little more work, subpages can be used to shrink mem_map: implement a 
larger PAGE_CACHE_SIZE, then use subpages to handle ABI problems.  In this 
case faults on subpages are possible and the fault path probably needs to 
know something about it.  With a larger-than-physical PAGE_CACHE_SIZE we can 
finally have large buffers, though the kernel would have to be compiled for 
it.  Some more work to allow mapping->page_shift to be larger than 
PAGE_CACHE_SHIFT would complete the process of generalizing the page size.  My 
impression is that this isn't too messy; most of the impact is on faulting.  Bill 
and others are already familiar with this, I think.  The work should dovetail.

I took a stab at implementing subpages some time ago in 2.4 and got it mostly 
working but not quite bootable.  I did find out roughly how invasive the 
patch is, which is: not very, unless I've overlooked something major.  I'll 
get busy on a 2.6 prototype, and of course I'll listen attentively for 
reasons why this plan won't work.

Regards,

Daniel



* Re: Subpages (was: Page Colouring)
  2003-12-29 20:02                   ` Subpages (was: Page Colouring) Daniel Phillips
@ 2003-12-29 20:15                     ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-29 20:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: William Lee Irwin III, mfedyk, Eric W. Biederman, Anton Ertl,
	Kernel Mailing List



On Mon, 29 Dec 2003, Daniel Phillips wrote:
> 
> I took a stab at implementing subpages some time ago in 2.4 and got it mostly 
> working but not quite bootable.  I did find out roughly how invasive the 
> patch is, which is: not very, unless I've overlooked something major.  I'll 
> get busy on a 2.6 prototype, and of course I'll listen attentively for 
> reasons why this plan won't work.

Ah, ok. I thought it was further along than that.

If so, let's consider that possibility a more long-range plan - it is 
independent of just making PAGE_CACHE_SIZE be bigger.

		Linus


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-28  4:53         ` Linus Torvalds
  2003-12-28 16:39           ` William Lee Irwin III
@ 2003-12-29 21:11           ` Eric W. Biederman
  2003-12-29 21:35             ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2003-12-29 21:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anton Ertl, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Sat, 27 Dec 2003, Eric W. Biederman wrote:
> > Linus Torvalds <torvalds@osdl.org> writes:
> > >
> > > Basically: prove me wrong. People have tried before. They have failed. 
> > > Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.
> > 
> > For anyone taking you up on this I'd like to suggest two possible
> > directions.
> > 
> > 1) Increasing PAGE_SIZE in the kernel.
> 
> Yes. This is something I actually want to do anyway for 2.7.x. Dan 
> Phillips had some patches for this six months ago.
> 
> You have to be careful, since you have to be able to mmap "partial pages", 
> which is what makes it less than trivial, but there are tons of reasons to 
> want to do this, and cache coloring is actually very much a secondary 
> concern.
> 
> > 2) Creating zones for the different colors.  Zones were not
> >    implemented last time, this was tried.
> 
> Hey, I can tell you that you _will_ fail.

Given the need for > order-0 pages, it looks to be a long shot at this point.

> > Both of those should be minimal impact to the complexity
> > of the current kernel. 
> 
> Minimal? I don't think so. Zones are basically impossible, and page size 
> changes will hopefully happen during 2.7.x, but not due to page coloring.

I didn't say easy, just simple enough that not everyone in the
kernel would need to know or care.


> > caused by cache conflicts in today's applications are real, and easily
> > measurable.  Giving the growing increase in performance difference
> > between CPUs and memory Amdahl's Law shows this will only grow
> > so I think this is worth looking at.
> 
> Absolutely wrong.

I don't mean to focus exclusively on cache coloring but on anything
that will increase memory performance.  Right now we are at a point
where even the DRAM page size is larger than 4K. 
 
> Why? Because the fact is, that as memory gets further and further away 
> from CPU's, caches have gotten further and further away from being direct 
> mapped. 

Except for L1 caches.  The hit of an associative lookup there is inherently
costly.

> Cache coloring is already a very questionable win for four-way 
> set-associative caches. I doubt you can even _see_ it for eight-way or 
> higher associativity caches.

If I can ever get something that approaches a reliable result out of a
memory bandwidth benchmark, I would love it.  I typically get something
like a 25%+ variance.  The most reliable way I have found to show how
variable these things get is to run updatedb and run streams or a
similar benchmark at the same time.  The numbers jump all over the place,
and a concurrent updatedb as frequently improves as it degrades performance.

When I stop seeing a measurable difference I will stop worrying.

Of course it does not help that the current generation of compilers
can only get 50% of the actual performance the memory can provide.

In recent times the variations have been getting worse if anything.

> In other words: the pressures you mention clearly do exist, but they are 
> all driving direct-mapped caches out of the market, and thus making page
> coloring _less_ interesting rather than more.

The other spin is that everything in current architectures is based
around the concept of locality, while page tables mess up locality.
So everything we can do to preserve locality is a good thing.  Cache
coloring happens naturally as a result of better page locality.

A variation on the idea of using larger page sizes is to allocate/free/swap
a batch of pages at once.  We do some of that now.  A batch of pages moves
the normal allocation up to order 3 or 4 if we can get them, while
still working with smaller-order pages if we can't.

Eric


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29 21:11           ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Eric W. Biederman
@ 2003-12-29 21:35             ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2003-12-29 21:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Anton Ertl, linux-kernel



On Mon, 29 Dec 2003, Eric W. Biederman wrote:
>
> Linus Torvalds <torvalds@osdl.org> writes:
> >  
> > Why? Because the fact is, that as memory gets further and further away 
> > from CPU's, caches have gotten further and further away from being direct 
> > mapped. 
> 
> Except for L1 caches.  The hit of an associate lookup there is inherently
> costly.

Having worked for a hardware company, and talked to hardware engineers, I 
can say that it generally isn't all that true.

The reason is that you can start the lookup before you even do the TLB 
lookup, and in fact you _want_ highly associative L1 caches to do that.

For example, if you have a 16kB L1 cache, and a 4kB page size, and you
want your memory accesses to go fast, you definitely want to index the L1
by the virtual access, which means that you can only use the low 12 bits
for indexing.

So what you do is you make your L1 be 4-way set-associative, so that by 
the time the TLB lookup is done, you've already looked up the index, and 
you only have to compare the TAG with one of the four possible ways.

In short: you actually _want_ your L1 to be associative, because it's the 
best way to avoid having nasty alias issues.

The only people who have a direct-mapped L1 are one of:
 - crazy and/or stupid
 - really cheap (mainly embedded space)
 - not high-performance anyway (ie their L1 is really small)
 - really sorry, and are fixing it.
 - really _really_ sorry, and have a virtually indexed cache. In which 
   case page coloring doesn't matter anyway.

Notice how high performance is _not_ on the list. Because you simply can't 
_get_ high performance with a direct-mapped L1. Those days are long gone.

There is another reason why L1's have long since moved away from
direct-mapped: the miss ratio goes up quite a bit for the same size cache.  
And things like OoO are pretty good at hiding one cycle of latency (OoO is
_not_ good at hiding memory latency, but one or two cycles are usually
ok), so even if having a larger L1 (and thus inherently more complex - not
only in associativity) means that you end up having an extra cycle access,
it's likely a win.

This is, for example, what alpha did between 21164 and the 21264: when
they went out-of-order, they did all the simulation to prove that it was
much more efficient to have a larger L1 with a higher hit ratio, even if
the latency was one cycle higher than the 21164 which was strictly
in-order.

In short, I'll bet you a dollar that you won't see a single direct-mapped 
L1 _anywhere_ where it matters. They are already pretty much gone. Can you 
name one that doesn't fit the five criteria above?

		Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-29 10:23                           ` William Lee Irwin III
  2003-12-29 10:59                             ` Mike Fedyk
@ 2003-12-30  2:00                             ` Rusty Russell
  2003-12-30  4:59                               ` William Lee Irwin III
  1 sibling, 1 reply; 28+ messages in thread
From: Rusty Russell @ 2003-12-30  2:00 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: torvalds, mfedyk, ebiederm, anton, linux-kernel, phillips

On Mon, 29 Dec 2003 02:23:19 -0800
William Lee Irwin III <wli@holomorphy.com> wrote:

> The fact that merely elevating PAGE_SIZE breaks numerous things makes me
> rather suspicious of claims that minimalistic patches can do likewise.

Can you give an example?

	One approach is to simply present a larger page size to userspace w/
getpagesize().  This does break ELF programs which have been laid out assuming
the old page size (presumably they try to mprotect the read-only sections).
On PPC, the ELF ABI already insists on a 64k boundary between such sections,
and maybe for others you could simply round appropriately and pray, or do
fine-grained protections (ie. on real pagesize) for that one case.

Rusty.
-- 
   there are those who do and those who hang on and you don't see too
   many doers quoting their contemporaries.  -- Larry McVoy


* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
  2003-12-30  2:00                             ` Rusty Russell
@ 2003-12-30  4:59                               ` William Lee Irwin III
  0 siblings, 0 replies; 28+ messages in thread
From: William Lee Irwin III @ 2003-12-30  4:59 UTC (permalink / raw)
  To: Rusty Russell; +Cc: torvalds, mfedyk, ebiederm, anton, linux-kernel, phillips

On Mon, 29 Dec 2003 02:23:19 -0800 William Lee Irwin III wrote:
>> The fact that merely elevating PAGE_SIZE breaks numerous things makes me
>> rather suspicious of claims that minimalistic patches can do likewise.

On Tue, Dec 30, 2003 at 01:00:29PM +1100, Rusty Russell wrote:
> Can you give an example?
> 	One approach is to simply present a larger page size to userspace w/
> getpagesize().  This does break ELF programs which have been laid out assuming
> the old page size (presumably they try to mprotect the read-only sections).
> On PPC, the ELF ABI already insists on a 64k boundary between such sections,
> and maybe for others you could simply round appropriately and pray, or do
> fine-grained protections (ie. on real pagesize) for that one case.

Apps must, of course, be relinked for that, but that's userspace. This
ABI change is largely out of the picture due to legacy binaries, user
virtualspace fragmentation (most likely an issue for 32-bit threading),
and so on. The choice of PAGE_SIZE in such schemes is also restricted
to no larger than whatever choice was used for userspace linking, which
is a relatively ugly dependency. There's also a question of "smooth
transition": the only way to "incrementally deploy" it on a mixture of
"ready" and "unready" userspace is to turn it off. I suppose
it has the minor advantage of being trivial to program.

I had in mind pure kernel internal issues, not ABI.

The issues from raising PAGE_SIZE alone are things like:
 - interpreting hardware descriptions in arch code
 - some shifts underflowing for things like hashtables
 - certain drivers doing ioremap() and the like either filling up
   vmallocspace or getting their math wrong
 - other drivers doing calculations on physical addresses getting them
   wrong, or using PAGE_SIZE to represent some 4KB or other fixed-size
   memory area interpreted by hardware
 - filesystems that assume blocksize == PAGE_SIZE, or assume PAGE_SIZE is
   less than some particular value (e.g. short offsets into pages, worst
   of all being signed shorts)
 - tripping BUG()'s in ll_rw_blk.c when 512*q->max_sectors < PAGE_SIZE

These issues are the bulk of the work needing to be done for the driver
and fs sweeps. Actual concerns about MMUPAGE_SIZE in drivers/ and fs/
are rather limited in scope, though drivers/char/drm/ was somewhat
painful to get going (Zwane actually did most of this for me, as I have
no DRM/DRI-capable graphics cards at my disposal).


-- wli


end of thread, other threads:[~2003-12-30  4:59 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <179fV-1iK-23@gated-at.bofh.it>
     [not found] ` <179IS-1VD-13@gated-at.bofh.it>
2003-12-27 20:21   ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Anton Ertl
2003-12-27 20:56     ` Linus Torvalds
2003-12-27 23:31       ` Eric W. Biederman
2003-12-27 23:50         ` William Lee Irwin III
2003-12-28  1:09         ` David S. Miller
2003-12-28  4:53         ` Linus Torvalds
2003-12-28 16:39           ` William Lee Irwin III
2003-12-29  0:36             ` Mike Fedyk
2003-12-29  2:55               ` William Lee Irwin III
2003-12-29  4:09                 ` Linus Torvalds
2003-12-29  6:52                   ` William Lee Irwin III
2003-12-29  9:14                     ` Linus Torvalds
2003-12-29  9:22                       ` William Lee Irwin III
2003-12-29  9:33                         ` Linus Torvalds
2003-12-29 10:23                           ` William Lee Irwin III
2003-12-29 10:59                             ` Mike Fedyk
2003-12-29 11:14                               ` William Lee Irwin III
2003-12-30  2:00                             ` Rusty Russell
2003-12-30  4:59                               ` William Lee Irwin III
     [not found]                     ` <20031229084304.GA31630@elte.hu>
2003-12-29 12:09                       ` Ingo Molnar
2003-12-29 12:49                         ` William Lee Irwin III
2003-12-29 20:02                   ` Subpages (was: Page Colouring) Daniel Phillips
2003-12-29 20:15                     ` Linus Torvalds
2003-12-29 21:11           ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Eric W. Biederman
2003-12-29 21:35             ` Linus Torvalds
     [not found]       ` <17tHK-3K6-21@gated-at.bofh.it>
2003-12-28 17:17         ` Anton Ertl
     [not found] <176UD-6vl-3@gated-at.bofh.it>
2003-12-26 21:48 ` Anton Ertl
2003-12-26 23:28   ` Linus Torvalds
