* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Anton Ertl @ 2003-12-27 20:21 UTC (permalink / raw)
To: linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 26 Dec 2003, Anton Ertl wrote:
>> You can get the same worst-case behaviour as with page colouring, since you can get the same mapping. It's just unlikely.
>
> "pathological worst-case" is something that is repeatable.

And you probably mean "repeatable every time". Ok, then a random scheme has, by your definition, no pathological worst case. I am not sure that this is a consolation when I happen upon one of its unpredictable and unrepeatable worst cases.

>> Well, even if, on average, it has no performance impact, reproducibility is a good reason to like it. Is it good enough to implement it? I'll leave that to you.
>
> Well, since random (or, more accurately in this case, "pseudo-random") has a number of things going for it, and is a lot faster and cheaper to implement, I don't see the point of cache coloring.

The points are:

- repeatability
- predictability
- better average performance (you dispute that).

> Hey, the discussion in this case showed how it _deproves_ performance (at least if my theory was correct - and it should be easily testable and I bet it is).

I don't think that discussing this special case answers the question about "on average" performance, but here we go:

For the Coppermine results I see that the performance of the malloc() case is only better with spans 2048 and 4096, and not by much.

For the Willamette 16MB results I see very little difference, except for the span=4096 case, by a lot.

For the Willamette 4MB case I see slightly better performance for hugetlbfs for spans 256, 512, and 1024, and a little worse performance for spans 2048 and 4096.
Yes, mapping policy could be part of the explanation for these results: With the smaller spans, you get no cache hits with either mapping policy. With larger spans, random mapping might return to some of the lines before evicting them.

However, this is probably not the whole picture, because with that explanation we would expect that the times for larger spans with random mapping should be better than for smaller spans, but they are not. There is something else at work that makes the times larger with larger spans (maybe DRAM row switching?).

I see no easy way to test your theory (at least until I can measure cache *and* TLB misses again on a machine I have access to).

Anyway, back to the performance effects of page colouring: Yes, there are cases where it is not beneficial, and the huge-2^n-stride cases in examples like the one above are one of them, but I don't think that this is the kind of "real life" application that you mention elsewhere, or is it?

> Also, the work has been done to test things, and cache coloring definitely makes performance _worse_. It does so exactly because it artificially limits your page choices, causing problems at multiple levels (not just at the cache, like this example, but also in page allocators and freeing).

Sorry, I am not aware of the work you are referring to. Where can I read more about it? Are you sure that these are fundamental problems and not just artifacts of particular implementations?

> So basically, cache coloring results in:
>  - some nice benchmarks (mainly the kind that walk memory very predictably, notably FP kernels)

Predictable accesses are not important, spatial locality is.

>  - mostly worse performance in "real life"

Like the code above?-) Hmm, maybe the pathological large-2^n-stride stuff is more frequent than I would expect.
But I think it's possible to have a repeatable and mostly understandable/predictable mapping policy that does not have this pathological worst case (of course, being repeatable, it will have a different one:-), and can provide better average performance than random mapping by exploiting spatial locality.

> - much worse memory pressure

That sounds like an implementation artifact.

> My strong opinion is that it is worthless except possibly as a performance tuning tool, but even there the repeatability is a false advantage: if you do performance tuning using cache coloring, there is nothing that guarantees that your tuning was _correct_ for the real world case.

How does _correct_ness come into play?

As for performance, I guess there are three cases:

- Changes that have little to do with the memory hierarchy. These are probably easier to evaluate in a repeatable environment, and any performance improvements should transfer nicely into a random-mapping environment.

- Changes that address the pathological case for the repeatable environment, e.g., (in the context of page colouring) eliminating large 2^n strides; this particular optimization will have less effect in a random-mapping environment, but typically still a positive one (random mapping also suffers from strides that are multiples of the page size).

- Changes that tune particularly for specific cache sizes, e.g., cache blocking. The results may be suboptimal for the random-mapping case; probably better than just picking the parameter at random, but in most runs worse than some other parameter. I wonder if you get any better results if you make just one run for a number of parameter values in a random-mapping environment and pick the parameter that gave the best result (which may have more to do with the mapping in this run than with the parameter).

In conclusion, I think that tuning in a page colouring environment will transfer into a random-mapping environment well in most cases.
- anton

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds @ 2003-12-27 20:56 UTC (permalink / raw)
To: Anton Ertl; +Cc: linux-kernel

On Sat, 27 Dec 2003, Anton Ertl wrote:
>
> And you probably mean "repeatable every time". Ok, then a random scheme has, by your definition, no pathological worst case. I am not sure that this is a consolation when I happen upon one of its unpredictable and unrepeatable worst cases.

Those "unpredictable" cases are so exceedingly rare that they aren't worth worrying about.

> The points are:
>
> - repeatability
> - predictability
> - better average performance (you dispute that).

I absolutely dispute that. And you should realize that I do not dispute it because the applications themselves would run slower with cache coloring. Most applications don't much care: they either fit in the cache, or the cache misses have random enough access patterns that cache layout doesn't much matter.

The test code in question is an anomaly, and doesn't matter. It's exactly the same kind of strange case that sometimes shows cache coloring making a huge difference.

The real degradation comes in just the fact that cache coloring itself is often expensive to implement and causes nasty side effects like bad memory allocation patterns, and nasty special cases that you have to worry about (ie special fallback code on non-colored pages when required). That expense is both a run-time expense _and_ a conceptual one (a conceptual expense is something that complicates the internal workings of the allocator so much that it becomes harder to think about and more bug-prone). So far nobody has shown a reasonable way to do it without either of the two.
> Anyway, back to the performance effects of page colouring: Yes, there are cases where it is not beneficial, and the huge-2^n-stride cases in examples like the one above are one of them, but I don't think that this is the kind of "real life" application that you mention elsewhere, or is it?

The "real life" application is something like running a normal server or desktop, and having the cache coloring code _itself_ be the performance problem. It's not that "apache" minds very much. Or "mozilla". They simply don't care. The problem is the algorithm itself.

> > Also, the work has been done to test things, and cache coloring definitely makes performance _worse_. It does so exactly because it artificially limits your page choices, causing problems at multiple levels (not just at the cache, like this example, but also in page allocators and freeing).
>
> Sorry, I am not aware of the work you are referring to. Where can I read more about it? Are you sure that these are fundamental problems and not just artifacts of particular implementations?

Hey, there have been at least four different major cache coloring trials for the kernel over the years. This discussion has been going on since the early nineties. And _none_ of them have worked well in practice.

In other words, "artifacts of the particular implementation" is certainly right, but the point I have is that the only thing that _matters_ is implementation. You can argue about theory all you like; I won't care until you show me an implementation that works and is robustly better.

And it has to be better on average on _everything_ that Linux supports, not just one particular braindamaged piece of hardware. I'm totally not interested in something that makes performance on most machines go down, if it then improves one or two braindead setups with direct-mapped caches.

Basically: prove me wrong. People have tried before. They have failed. Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.
Linus
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Eric W. Biederman @ 2003-12-27 23:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Anton Ertl, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Sat, 27 Dec 2003, Anton Ertl wrote:
> >
> > And you probably mean "repeatable every time". Ok, then a random scheme has, by your definition, no pathological worst case. I am not sure that this is a consolation when I happen upon one of its unpredictable and unrepeatable worst cases.
>
> Those "unpredictable" cases are so exceedingly rare that they aren't worth worrying about.

They show up a lot in benchmarks, which makes them something to worry about, even if real world applications don't show the same behavior. Of course it is stupid to tune machines to the benchmarks but...

> Basically: prove me wrong. People have tried before. They have failed. Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.

For anyone taking you up on this I'd like to suggest two possible directions.

1) Increasing PAGE_SIZE in the kernel.

2) Creating zones for the different colors. Zones were not implemented the last time this was tried.

Both of those should have minimal impact on the complexity of the current kernel.

I don't know where we will wind up, but the performance variations caused by cache conflicts in today's applications are real, and easily measurable. Given the growing performance difference between CPUs and memory, Amdahl's Law shows this will only grow, so I think this is worth looking at.

Eric
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-27 23:50 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linus Torvalds, Anton Ertl, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:
>> Basically: prove me wrong. People have tried before. They have failed. Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.

On Sat, Dec 27, 2003 at 04:31:22PM -0700, Eric W. Biederman wrote:
> For anyone taking you up on this I'd like to suggest two possible directions.
> 1) Increasing PAGE_SIZE in the kernel.
> 2) Creating zones for the different colors. Zones were not implemented the last time this was tried.
> Both of those should have minimal impact on the complexity of the current kernel.
> I don't know where we will wind up, but the performance variations caused by cache conflicts in today's applications are real, and easily measurable. Given the growing performance difference between CPUs and memory, Amdahl's Law shows this will only grow, so I think this is worth looking at.

Increasing PAGE_SIZE in the kernel either (a) breaks ABI or (b) is nontrivial. I suppose I should try some of the page coloring benchmarks on pgcl (which preserves ABI).

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: David S. Miller @ 2003-12-28 1:09 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: torvalds, anton, linux-kernel

On 27 Dec 2003 16:31:22 -0700 ebiederm@xmission.com (Eric W. Biederman) wrote:

> 2) Creating zones for the different colors. Zones were not implemented the last time this was tried.

While this idea might sound promising, it would not work, because by definition all pages of a particular color cannot be coalesced into order-1 or larger buddies.
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds @ 2003-12-28 4:53 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Anton Ertl, linux-kernel

On Sat, 27 Dec 2003, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> >
> > Basically: prove me wrong. People have tried before. They have failed. Maybe you'll succeed. I doubt it, but hey, I'm not stopping you.
>
> For anyone taking you up on this I'd like to suggest two possible directions.
>
> 1) Increasing PAGE_SIZE in the kernel.

Yes. This is something I actually want to do anyway for 2.7.x. Dan Phillips had some patches for this six months ago.

You have to be careful, since you have to be able to mmap "partial pages", which is what makes it less than trivial, but there are tons of reasons to want to do this, and cache coloring is actually very much a secondary concern.

> 2) Creating zones for the different colors. Zones were not implemented the last time this was tried.

Hey, I can tell you that you _will_ fail. Zones are actually a wonderful example of the kinds of problems you get into when you have pages of different types aka "colors". We've had nothing but trouble trying to balance different zones against each other, and those problems were in fact _the_ reason for 99% of all the VM problems in 2.4.x. Trying to use them for cache colors would be "interesting". Not to mention that it's impossible to coalesce pages across zones.

> Both of those should have minimal impact on the complexity of the current kernel.

Minimal? I don't think so.
Zones are basically impossible, and page size changes will hopefully happen during 2.7.x, but not due to page coloring.

> I don't know where we will wind up, but the performance variations caused by cache conflicts in today's applications are real, and easily measurable. Given the growing performance difference between CPUs and memory, Amdahl's Law shows this will only grow, so I think this is worth looking at.

Absolutely wrong. Why? Because the fact is that as memory gets further and further away from CPUs, caches have gotten further and further away from being direct-mapped. Cache coloring is already a very questionable win for four-way set-associative caches. I doubt you can even _see_ it for eight-way or higher associativity caches.

In other words: the pressures you mention clearly do exist, but they are all driving direct-mapped caches out of the market, and thus making page coloring _less_ interesting rather than more.

Linus
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-28 16:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Eric W. Biederman, Anton Ertl, linux-kernel

On Sat, 27 Dec 2003, Eric W. Biederman wrote:
>> For anyone taking you up on this I'd like to suggest two possible directions.
>> 1) Increasing PAGE_SIZE in the kernel.

On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
> Yes. This is something I actually want to do anyway for 2.7.x. Dan Phillips had some patches for this six months ago.
> You have to be careful, since you have to be able to mmap "partial pages", which is what makes it less than trivial, but there are tons of reasons to want to do this, and cache coloring is actually very much a secondary concern.

I've not seen Dan Phillips' code for this. I've been hacking on something doing this since late last December.

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Mike Fedyk @ 2003-12-29 0:36 UTC (permalink / raw)
To: William Lee Irwin III, Linus Torvalds, Eric W. Biederman, Anton Ertl, linux-kernel

On Sun, Dec 28, 2003 at 08:39:52AM -0800, William Lee Irwin III wrote:
> On Sat, 27 Dec 2003, Eric W. Biederman wrote:
> >> For anyone taking you up on this I'd like to suggest two possible directions.
> >> 1) Increasing PAGE_SIZE in the kernel.
>
> On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
> > Yes. This is something I actually want to do anyway for 2.7.x. Dan Phillips had some patches for this six months ago.
> > You have to be careful, since you have to be able to mmap "partial pages", which is what makes it less than trivial, but there are tons of reasons to want to do this, and cache coloring is actually very much a secondary concern.
>
> I've not seen Dan Phillips' code for this. I've been hacking on something doing this since late last December.

I remember his work on pagetable sharing, but haven't heard anything about changing the page size from him.

Could this be what Linus is remembering?
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-29 2:55 UTC (permalink / raw)
To: mfedyk, Linus Torvalds, Eric W. Biederman, Anton Ertl, linux-kernel

On Sat, Dec 27, 2003 at 08:53:30PM -0800, Linus Torvalds wrote:
>>> Yes. This is something I actually want to do anyway for 2.7.x. Dan Phillips had some patches for this six months ago.
>>> You have to be careful, since you have to be able to mmap "partial pages", which is what makes it less than trivial, but there are tons of reasons to want to do this, and cache coloring is actually very much a secondary concern.

On Sun, Dec 28, 2003 at 08:39:52AM -0800, William Lee Irwin III wrote:
>> I've not seen Dan Phillips' code for this. I've been hacking on something doing this since late last December.

On Sun, Dec 28, 2003 at 04:36:31PM -0800, Mike Fedyk wrote:
> I remember his work on pagetable sharing, but haven't heard anything about changing the page size from him.
> Could this be what Linus is remembering?

Doubtful. I suspect he may be referring to pgcl (sometimes called "subpages"), though Dan Phillips hasn't been involved in it. I guess we'll have to wait for Linus to respond to know for sure.

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds @ 2003-12-29 4:09 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Sun, 28 Dec 2003, William Lee Irwin III wrote:
>
> Doubtful. I suspect he may be referring to pgcl (sometimes called "subpages"), though Dan Phillips hasn't been involved in it. I guess we'll have to wait for Linus to respond to know for sure.

I didn't see the patch itself, but I spent some time talking to Daniel after your talk at the kernel summit. At least I _think_ it was him I was talking to - my memory for names and faces is basically zero.

Daniel claimed to have it working back then, and that it actually shrank the kernel source code. The basic approach is to just make PAGE_SIZE larger, and handle temporary needs for smaller subpages by just dynamically allocating "struct page" entries for them. The size reduction came from getting rid of the "struct buffer_head", because it ends up being just another "small page".

Daniel, details?

Linus
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-29 6:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Sun, Dec 28, 2003 at 08:09:17PM -0800, Linus Torvalds wrote:
> I didn't see the patch itself, but I spent some time talking to Daniel after your talk at the kernel summit. At least I _think_ it was him I was talking to - my memory for names and faces is basically zero.
> Daniel claimed to have it working back then, and that it actually shrank the kernel source code. The basic approach is to just make PAGE_SIZE larger, and handle temporary needs for smaller subpages by just dynamically allocating "struct page" entries for them. The size reduction came from getting rid of the "struct buffer_head", because it ends up being just another "small page".
> Daniel, details?

I also heard something about this from Daniel. The description I was given implied rather different functionality, and raised rather serious questions about the implementation that he didn't have adequate answers for. I also never saw code, despite months of occasional discussions about it.

I did get a positive reaction from you at KS, and I've also been slaving away at keeping this thing current and improving it when I can for a year. Would you mind telling me what the Hell is going on here?

I guess I already know I'm screwed beyond all hope of recovery, but I might as well get official confirmation.

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds @ 2003-12-29 9:14 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Sun, 28 Dec 2003, William Lee Irwin III wrote:
>
> I did get a positive reaction from you at KS, and I've also been slaving away at keeping this thing current and improving it when I can for a year. Would you mind telling me what the Hell is going on here?
>
> I guess I already know I'm screwed beyond all hope of recovery, but I might as well get official confirmation.

No, I haven't even _looked_ at any 2.7.x timeframe patches, and I'm not even going to for the next few months.

I don't care what does it; I want a bigger PAGE_CACHE_SIZE, and working patches are the only thing that matters. But for now, I have my 2.6.x blinders on.

Linus
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-29 9:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Sun, 28 Dec 2003, William Lee Irwin III wrote:
>> I did get a positive reaction from you at KS, and I've also been slaving away at keeping this thing current and improving it when I can for a year. Would you mind telling me what the Hell is going on here?
>> I guess I already know I'm screwed beyond all hope of recovery, but I might as well get official confirmation.

On Mon, Dec 29, 2003 at 01:14:03AM -0800, Linus Torvalds wrote:
> No, I haven't even _looked_ at any 2.7.x timeframe patches, and I'm not even going to for the next few months.
> I don't care what does it; I want a bigger PAGE_CACHE_SIZE, and working patches are the only thing that matters. But for now, I have my 2.6.x blinders on.

I can't say I'm particularly encouraged by what I've heard thus far, but I suppose that means what I've been doing for the past 6 months hasn't been entirely meaningless. Thanks.

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds @ 2003-12-29 9:33 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Mon, 29 Dec 2003, William Lee Irwin III wrote:
>
> I can't say I'm particularly encouraged by what I've heard thus far,

Well, I don't even know what your approach is - mind giving an overview?

My original plan (and you can see some of it in the fact that PAGE_CACHE_SIZE is separate from PAGE_SIZE) was to just have the page cache be able to use bigger pages than the "normal" pages, and the normal pages would continue to be the hardware page size.

However, especially with mem_map[] becoming something of a problem, and all the problems we'd have if PAGE_SIZE and PAGE_CACHE_SIZE were different, I suspect I'd just be happier with increasing PAGE_SIZE altogether (and PAGE_CACHE_SIZE with it), and then just teaching the VM mapping about "fractional pages".

What's your approach?

Linus
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: William Lee Irwin III @ 2003-12-29 10:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips

On Mon, 29 Dec 2003, William Lee Irwin III wrote:
>> I can't say I'm particularly encouraged by what I've heard thus far,

On Mon, Dec 29, 2003 at 01:33:53AM -0800, Linus Torvalds wrote:
> Well, I don't even know what your approach is - mind giving an overview?
> My original plan (and you can see some of it in the fact that PAGE_CACHE_SIZE is separate from PAGE_SIZE) was to just have the page cache be able to use bigger pages than the "normal" pages, and the normal pages would continue to be the hardware page size.
> However, especially with mem_map[] becoming something of a problem, and all the problems we'd have if PAGE_SIZE and PAGE_CACHE_SIZE were different, I suspect I'd just be happier with increasing PAGE_SIZE altogether (and PAGE_CACHE_SIZE with it), and then just teaching the VM mapping about "fractional pages".
> What's your approach?

Hmm, I presented on this at KS. Basically, it's identical to Hugh Dickins' approach from 2000. The only difference is really that it had to be forward-ported (or, unfortunately, in too many cases reimplemented) to mix with current code and features.

Basically: elevate PAGE_SIZE, introduce MMUPAGE_SIZE as a macro representing the hardware pagesize, and do the fault handling with some relatively localized complexity. Numerous s/PAGE_SIZE/MMUPAGE_SIZE/ bits are sprinkled around, along with a few more involved changes, because a large number of distributed changes are required to handle oddities that occur when PAGE_SIZE changes from 4KB.
The more involved changes are often for things such as: code whose only reason for using PAGE_SIZE is that it expects 4KB and says PAGE_SIZE; code that wants some fixed (even across compiles) size and needs updating for more general PAGE_SIZE numbers; or code that expects PAGE_SIZE to be what a pte maps, which is now represented by MMUPAGE_SIZE.

I have a bad feeling the diligence of the original code audit could be bearing against me (and though I'm trying to be equally diligent, I'm not Hugh). The fact that merely elevating PAGE_SIZE breaks numerous things makes me rather suspicious of claims that minimalistic patches can do likewise.

The only new infrastructures introduced are MMUPAGE_SIZE and a couple of related macros (defining numbers, not structures or code) and the fault handler implementations. The diff size is not small. The memory footprint is, and demonstrably so (cf. March 27, 2003). My 2.6 code has been heavily leveraging the pfn abstraction in its favor to represent physical addresses measured in units of the hardware pagesize.

Generally, my maintenance approach has been incrementally advancing the state of the thing while keeping it working on as broad a cross-section of i386 systems as I can test or get testers on. It has been verified to run userspace on Thinkpad T21's and 16x/32GB and 32x/64GB NUMA-Qs at every point release it's been ported to, which since 2.5.68 or so has been every point release coming out of kernel.org.

-- wli
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Mike Fedyk @ 2003-12-29 10:59 UTC (permalink / raw)
To: William Lee Irwin III, Kernel Mailing List

On Mon, Dec 29, 2003 at 02:23:19AM -0800, William Lee Irwin III wrote:
> bits are sprinkled around, along with a few more involved changes because a large number of distributed changes are required to handle oddities that occur when PAGE_SIZE changes from 4KB. The more involved changes are often for things such as the only reason it uses PAGE_SIZE is really that it just expects 4KB and says PAGE_SIZE, or that it wants some fixed (even across compiles) size and needs updating for more general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be what a pte maps when this is now represented by MMUPAGE_SIZE.

Any chance some of these changes are self-contained, and could be split out and possibly merged into -mm?
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-29 10:59 ` Mike Fedyk @ 2003-12-29 11:14 ` William Lee Irwin III 0 siblings, 0 replies; 28+ messages in thread From: William Lee Irwin III @ 2003-12-29 11:14 UTC (permalink / raw) To: mfedyk, Kernel Mailing List On Mon, Dec 29, 2003 at 02:23:19AM -0800, William Lee Irwin III wrote: >> bits are sprinkled around, along with a few more involved changes because >> a large number of distributed changes are required to handle oddities >> that occur when PAGE_SIZE changes from 4KB. The more involved changes >> are often for things such as the only reason it uses PAGE_SIZE is >> really that it just expects 4KB and says PAGE_SIZE, or that it wants >> some fixed (even across compiles) size and needs updating for more >> general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be >> what a pte maps when this is now represented by MMUPAGE_SIZE. On Mon, Dec 29, 2003 at 02:59:18AM -0800, Mike Fedyk wrote: > Any chance some of these changes are self contained, and could be split out > and possibly merged into -mm? I talked about this for a little while. Basically, there is only one concept in the entire patch, despite its large size. The vast bulk of the "distributed changes" are s/PAGE_SIZE/MMUPAGE_SIZE/. At some point I was told to keep the whole shebang rolling out of tree or otherwise not answered by akpm and/or Linus, after I sent in a split-up version (this is actually very easy to split up file-by-file) showing what just some of the totally trivial arch/i386/ changes would look like. The nontrivial changes are stupid in nature, but touch "fragile" or otherwise "scary to touch" code, and so are sort of relegated to 2.7. This is not entirely unjustified, as changes of a similar code impact wrt.
the GDT appear to have affected some APM systems' suspend ability (I know for a fact my changes do not have impacts on APM suspend, but other, analogous support issues could arise after broader testing.) Basically, the MMUPAGE_SIZE introductions didn't interest anyone a while ago, and I suspect people probably just want them all at once, since it's unlikely people want to repeat the pain analogous to PAGE_CACHE_SIZE (I should clarify later how this is different) where the incremental introduction never culminated in the introduction of functionality. -- wli ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-29 10:23 ` William Lee Irwin III 2003-12-29 10:59 ` Mike Fedyk @ 2003-12-30 2:00 ` Rusty Russell 2003-12-30 4:59 ` William Lee Irwin III 1 sibling, 1 reply; 28+ messages in thread From: Rusty Russell @ 2003-12-30 2:00 UTC (permalink / raw) To: William Lee Irwin III Cc: torvalds, mfedyk, ebiederm, anton, linux-kernel, phillips On Mon, 29 Dec 2003 02:23:19 -0800 William Lee Irwin III <wli@holomorphy.com> wrote: > The fact merely elevating PAGE_SIZE breaks numerous things makes me > rather suspicious of claims that minimalistic patches can do likewise. Can you give an example? One approach is to simply present a larger page size to userspace w/ getpagesize(). This does break ELF programs which have been laid out assuming the old page size (presumably they try to mprotect the read-only sections). On PPC, the ELF ABI already insists on a 64k boundary between such sections, and maybe for others you could simply round appropriately and pray, or do fine-grained protections (ie. on real pagesize) for that one case. Rusty. -- there are those who do and those who hang on and you don't see too many doers quoting their contemporaries. -- Larry McVoy ^ permalink raw reply [flat|nested] 28+ messages in thread
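Rusty's "round appropriately and pray" suggestion amounts to aligning protection boundaries out to the page size advertised to userspace. A minimal sketch, with hypothetical helper names rather than real kernel code:

```c
#include <assert.h>

/* Sketch of the rounding idea above: if userspace is told a larger
 * page size via getpagesize(), section protection boundaries in an
 * ELF image have to be rounded out to that size.  Hypothetical
 * helpers, not kernel code. */
static unsigned long round_down_page(unsigned long addr, unsigned long pagesize)
{
        return addr & ~(pagesize - 1);       /* pagesize must be a power of two */
}

static unsigned long round_up_page(unsigned long addr, unsigned long pagesize)
{
        return (addr + pagesize - 1) & ~(pagesize - 1);
}
```

For example, a read-only section ending at 0x10345 gets its protection boundary rounded up to 0x20000 with a 64k page, which is why the PPC ELF ABI's mandated 64k gap between sections makes the rounding safe there.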
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-30 2:00 ` Rusty Russell @ 2003-12-30 4:59 ` William Lee Irwin III 0 siblings, 0 replies; 28+ messages in thread From: William Lee Irwin III @ 2003-12-30 4:59 UTC (permalink / raw) To: Rusty Russell; +Cc: torvalds, mfedyk, ebiederm, anton, linux-kernel, phillips On Mon, 29 Dec 2003 02:23:19 -0800 William Lee Irwin III wrote: >> The fact merely elevating PAGE_SIZE breaks numerous things makes me >> rather suspicious of claims that minimalistic patches can do likewise. On Tue, Dec 30, 2003 at 01:00:29PM +1100, Rusty Russell wrote: > Can you give an example? > One approach is to simply present a larger page size to userspace w/ > getpagesize(). This does break ELF programs which have been laid out assuming > the old page size (presumably they try to mprotect the read-only sections). > On PPC, the ELF ABI already insists on a 64k boundary between such sections, > and maybe for others you could simply round appropriately and pray, or do > fine-grained protections (ie. on real pagesize) for that one case. Apps must, of course, be relinked for that, but that's userspace. This ABI change is largely out of the picture due to legacy binaries, user virtualspace fragmentation (most likely an issue for 32-bit threading), and so on. The choice of PAGE_SIZE in such schemes is also restricted to no larger than whatever choice is used for userspace linking, which is a relatively ugly dependency. There's also a question of "smooth transition": the only way to "incrementally deploy" it on a mixture of "ready" and "unready" userspace is to turn it off. I suppose it has the minor advantage of being trivial to program. I had in mind pure kernel internal issues, not ABI.
The issues from raising PAGE_SIZE alone are things like interpreting hardware descriptions in arch code, some shifts underflowing for things like hashtables, certain drivers doing ioremap() and the like either filling up vmallocspace or getting their math wrong, and some other drivers doing calculations on physical addresses getting them wrong, or using PAGE_SIZE to represent some 4KB or other fixed-size memory area interpreted by hardware, and filesystems that assume blocksize == PAGE_SIZE or assume PAGE_SIZE is less than some particular value (e.g. short offsets into pages, worst of all being signed shorts), and tripping BUG()'s in ll_rw_blk.c when 512*q->max_sectors < PAGE_SIZE. These issues are the bulk of the work needing to be done for the driver and fs sweeps. Actual concerns about MMUPAGE_SIZE in drivers/ and fs/ are rather limited in scope, though drivers/char/drm/ was somewhat painful to get going (Zwane actually did most of this for me, as I have no DRM/DRI-capable graphics cards at my disposal). -- wli ^ permalink raw reply [flat|nested] 28+ messages in thread
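The ll_rw_blk.c failure mode mentioned above can be illustrated with a small sketch. The struct and function names here are illustrative (loosely modelled on the 2.6-era request_queue), not the real kernel check:

```c
#include <assert.h>

/* Illustration of the BUG() condition described above: max_sectors is
 * counted in 512-byte sectors, so if 512 * q->max_sectors is smaller
 * than PAGE_SIZE, the queue cannot accept even a single-page request.
 * Hypothetical names; not the actual ll_rw_blk.c code. */
struct request_queue_sketch {
        unsigned int max_sectors;   /* largest request, in 512-byte sectors */
};

static int queue_can_take_one_page(const struct request_queue_sketch *q,
                                   unsigned long page_size)
{
        return 512UL * q->max_sectors >= page_size;
}
```

A queue limited to 8 sectors (4KB) is fine with 4KB pages but breaks the moment PAGE_SIZE is raised to 16KB, which is why such limits have to be audited in a PAGE_SIZE sweep.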
[parent not found: <20031229084304.GA31630@elte.hu>]
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) [not found] ` <20031229084304.GA31630@elte.hu> @ 2003-12-29 12:09 ` Ingo Molnar 2003-12-29 12:49 ` William Lee Irwin III 0 siblings, 1 reply; 28+ messages in thread From: Ingo Molnar @ 2003-12-29 12:09 UTC (permalink / raw) To: William Lee Irwin III, Linus Torvalds, mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips * William Lee Irwin III <wli@holomorphy.com> wrote: > I also heard something about this from daniel. The description I was > given implied rather different functionality, and raised rather > serious questions about the implementation he didn't have adequate > answers for. I also never saw code, despite months of occasional > discussions about it. > > I did get a positive reaction from you at KS, and I've also been > slaving away at keeping this thing current and improving it when I can > for a year. Would you mind telling me what the Hell is going on here? > > I guess I already know I'm screwed beyond all hope of recovery, but I > might as well get official confirmation. i've been following your code (pgcl) and it looks pretty good. (it needs finishing touches as always, but that's fine.) I tried to backport it to 2.4 before doing 4G/4G but the maintenance overhead skyrocketed, so it's not practical for 2.4-based distribution purposes - but it would be the perfect kind of thing to start 2.7.0 with. I've not seen any other code but yours in this area. i believe the right approach to the 'tons of RAM' problem is to simplify it as much as possible, ie. go for larger pages (and wrap the MMU format in the most trivial way) and to deal with 4K pages as a filesystem (and ELF format) compatibility thing only. Your patch does precisely this. How much we'll have to 'mix' the two page sizes, only practice will tell, but the less mixing, the easier it will get. Filesystems on such systems will match the pagesize anyway.
i'd even suggest to not overdo the fractured-page logic too much - ie. just 'waste' a full page on a misaligned or single 4K-sized vma - concentrate on the common case: linearly mapped files and anonymous mappings. Prefault both of them at PAGE_SIZE granularity and 'waste' the final partial page. The VM swapout logic should only deal with full pages. Same for the pagecache: just fill in full pages and don't worry about granularity. Your patch already does more than this. But i think if someone does 4K vmas on a pgcl system or runs it on a 128 MB box and expects perfect swapping, then it's his damn fault. Ingo ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-29 12:09 ` Ingo Molnar @ 2003-12-29 12:49 ` William Lee Irwin III 0 siblings, 0 replies; 28+ messages in thread From: William Lee Irwin III @ 2003-12-29 12:49 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List, phillips * William Lee Irwin III <wli@holomorphy.com> wrote: > > I did get a positive reaction from you at KS, and I've also been >> slaving away at keeping this thing current and improving it when I can >> for a year. Would you mind telling me what the Hell is going on here? >> I guess I already know I'm screwed beyond all hope of recovery, but I >> might as well get official confirmation. On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote: > i've been following your code (pgcl) and it looks pretty good. (it needs > finishing touches as always, but that's fine.) I tried to backport it to > 2.4 before doing 4G/4G but the maintenance overhead skyrocketed, so > it's not practical for 2.4-based distribution purposes - but it would be > the perfect kind of thing to start 2.7.0 with. I've not seen any other > code but yours in this area. That's a rather kind assessment; I suppose I hold flaws not critical at the design level as fatal where those who look primarily at design don't. On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote: > i believe the right approach to the 'tons of RAM' problem is to simplify > it as much as possible, ie. go for larger pages (and wrap the MMU format > in the most trivial way) and to deal with 4K pages as a filesystem (and > ELF format) compatibility thing only. Your patch does precisely this. > How much we'll have to 'mix' the two page sizes, only practice will > tell, but the less mixing, the easier it will get. Filesystems on such > systems will match the pagesize anyway. Well, that's more or less consistent with what I'm doing.
In actuality it's Hugh's design and original implementation, but I'm going to have to claim _some_ credit for the work I've put into this at some point, though it be grunt work after a fashion. The nontrivial point is largely ABI compatibility. A tremendous amount of diff could be eliminated without ABI compatibility; however, the concern is rather critical as long as legacy binaries are involved. On Mon, Dec 29, 2003 at 01:09:30PM +0100, Ingo Molnar wrote: > i'd even suggest to not overdo the fractured-page logic too much - ie. > just 'waste' a full page on a misaligned or single 4K-sized vma - > concentrate on the common case: linearly mapped files and anonymous > mappings. Prefault both of them at PAGE_SIZE granularity and 'waste' the > final partial page. The VM swapout logic should only deal with full > pages. Same for the pagecache: just fill in full pages and dont worry > about granularity. > Your patch already does more than this. But i think if someone does 4K > vmas on a pgcl system or runs it on a 128 MB box and expects perfect > swapping, then it's his damn fault. My reasoning here has actually been dominated by performance. Exchanging the logic for this task is actually a difficult enough operation with respect to programming that very few a priori concerns can be allowed any influence at all. The algorithm now used for fault handling, recently ported by brute force from Hugh's rather ancient sources, effectively does as you say (though there is a lot of latitude in the criterion you've stated). One risk I've taken is updating some API's to return pfn's instead of pages. In the case of get_user_pages() this is likely essential. 
But kmap_atomic_to_page() (to_pfn() in my sources) and some others might be avoidable entirely with some moderately traumatic rework (traumatic as far as work I have to do is concerned; in all honesty, the issue is stupid, but as a problem it makes up for the lack of difficulty owing to quality with that owed to vast quantities of debugging and intolerance to dumb C mistakes.) The methods you're suggesting would remove these changes in exchange for some potential inefficiencies with virtualspace consumption, though these are not entirely out of the question, as ia32 is effectively deprecated. -- wli ^ permalink raw reply [flat|nested] 28+ messages in thread
* Subpages (was: Page Colouring) 2003-12-29 4:09 ` Linus Torvalds 2003-12-29 6:52 ` William Lee Irwin III @ 2003-12-29 20:02 ` Daniel Phillips 2003-12-29 20:15 ` Linus Torvalds 1 sibling, 1 reply; 28+ messages in thread From: Daniel Phillips @ 2003-12-29 20:02 UTC (permalink / raw) To: Linus Torvalds, William Lee Irwin III Cc: mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List On Sunday 28 December 2003 23:09, Linus Torvalds wrote: > On Sun, 28 Dec 2003, William Lee Irwin III wrote: > > Doubtful. I suspect he may be referring to pgcl (sometimes called > > "subpages"), though Dan Phillips hasn't been involved in it. I guess > > we'll have to wait for Linus to respond to know for sure. > > I didn't see the patch itself, but I spent some time talking to Daniel > after your talk at the kernel summit. At least I _think_ it was him I was > talking to - my memory for names and faces is basically zero. > > Daniel claimed to have it working back then, and that it actually shrank > the kernel source code. The basic approach is to just make PAGE_SIZE > larger, and handle temporary needs for smaller subpages by just > dynamically allocating "struct page" entries for them. The size reduction > came from getting rid of the "struct buffer_head", because it ends up > being just another "small page". > > Daniel, details? Hi Linus, Your description is accurate. Another reason for code size shrinkage is getting rid of the loops across buffers in the block IO library, e.g., block_read_full_page. Subpages only make sense for file-backed memory, which conveniently lets the page cache keep track of subpages. Each address_space has pages of all the same size, which may be smaller, larger or the same as PAGE_CACHE_SIZE. The first case, "subpages", is the interesting one. An address_space with subpages has base pages of PAGE_CACHE_SIZE for its "even" entries and up to N-1 dynamically allocated struct pages for the "odd" entries where N is PAGE_CACHE_SIZE divided by the subpage size. 
Base pages are normal members of mem_map. Subpages are not referenced by mem_map, but only by the page cache. They are created by operations such as find_or_create_page, which first creates a base page if necessary. A counter field in the page flags of the base page keeps track of how many subpages share a base page's physical memory; when this field goes to zero the base page may be removed from the page cache. Subpages always have a ->virtual field regardless of whether mem_map pages do. This is used for virt_to_phys and to locate the base page when a subpage is freed. Page fault handling doesn't change much if at all, since the faulting address is rounded down to a physical page, which will be a base page. Most of the changes for subpages are in the buffer.c page cache operations and are largely transparent to the VMM, though PAGE_CACHE_SHIFT becomes mapping->page_shift, which touches a lot of files. As you noted, buffer_head functionality can be taken over by struct page and buffers become expendable. However, it is not necessary to cross that bridge immediately; page buffer lists continue to work, though the buffer list is never longer than one. With a little more work, subpages can be used to shrink mem_map: implement a larger PAGE_CACHE_SIZE, then use subpages to handle ABI problems. In this case faults on subpages are possible and the fault path probably needs to know something about it. With a larger-than-physical PAGE_CACHE_SIZE we can finally have large buffers, though the kernel would have to be compiled for it. Some more work to allow mapping->page_shift to be larger than PAGE_CACHE_SHIFT would complete the process of generalizing the page size. My impression is that this isn't too messy; most of the impact is on faulting. Bill and others are already familiar with this, I think. The work should dovetail. I took a stab at implementing subpages some time ago in 2.4 and got it mostly working but not quite bootable.
I did find out roughly how invasive the patch is, which is: not very, unless I've overlooked something major. I'll get busy on a 2.6 prototype, and of course I'll listen attentively for reasons why this plan won't work. Regards, Daniel ^ permalink raw reply [flat|nested] 28+ messages in thread
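The even/odd page-cache layout Daniel describes can be sketched with a little index arithmetic, assuming N = 4 subpages per base page. The names here are illustrative, not taken from any actual patch:

```c
#include <assert.h>

/* Sketch of the even/odd page-cache layout described above, with
 * N = PAGE_CACHE_SIZE / subpage size.  Illustrative names only. */
enum { SUBPAGES_PER_BASE = 4 };

/* Indices that are multiples of N hold normal mem_map base pages;
 * the other N - 1 slots hold dynamically allocated subpage structs. */
static int index_is_base_page(unsigned long index)
{
        return index % SUBPAGES_PER_BASE == 0;
}

/* The page-cache index of the base page backing a given subpage,
 * as needed when a subpage is freed and its base page's share
 * counter has to be decremented. */
static unsigned long base_index(unsigned long index)
{
        return index - index % SUBPAGES_PER_BASE;
}
```

Looking up a base page this way is what lets the page cache, rather than mem_map, track subpages: only the entry at the rounded-down index is a real mem_map page.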
* Re: Subpages (was: Page Colouring) 2003-12-29 20:02 ` Subpages (was: Page Colouring) Daniel Phillips @ 2003-12-29 20:15 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2003-12-29 20:15 UTC (permalink / raw) To: Daniel Phillips Cc: William Lee Irwin III, mfedyk, Eric W. Biederman, Anton Ertl, Kernel Mailing List On Mon, 29 Dec 2003, Daniel Phillips wrote: > > I took a stab at implementing subpages some time ago in 2.4 and got it mostly > working but not quite bootable. I did find out roughly how invasive the > patch is, which is: not very, unless I've overlooked something major. I'll > get busy on a 2.6 prototype, and of course I'll listen attentively for > reasons why this plan won't work. Ah, ok. I thought it was further along than that. If so, let's consider that possibility a more long-range plan - it is independent of just making PAGE_CACHE_SIZE be bigger. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-28 4:53 ` Linus Torvalds 2003-12-28 16:39 ` William Lee Irwin III @ 2003-12-29 21:11 ` Eric W. Biederman 2003-12-29 21:35 ` Linus Torvalds 1 sibling, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2003-12-29 21:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: Anton Ertl, linux-kernel Linus Torvalds <torvalds@osdl.org> writes: > On Sat, 27 Dec 2003, Eric W. Biederman wrote: > > Linus Torvalds <torvalds@osdl.org> writes: > > > > > > Basically: prove me wrong. People have tried before. They have failed. > > > Maybe you'll succeed. I doubt it, but hey, I'm not stopping you. > > > > For anyone taking you up on this I'd like to suggest two possible > > directions. > > > > 1) Increasing PAGE_SIZE in the kernel. > > Yes. This is something I actually want to do anyway for 2.7.x. Dan > Phillips had some patches for this six months ago. > > You have to be careful, since you have to be able to mmap "partial pages", > which is what makes it less than trivial, but there are tons of reasons to > want to do this, and cache coloring is actually very much a secondary > concern. > > > 2) Creating zones for the different colors. Zones were not > > implemented last time this was tried. > > Hey, I can tell you that you _will_ fail. Given the >order-0 pages it looks to be a long shot at this point. > > Both of those should be minimal impact to the complexity > > of the current kernel. > > Minimal? I don't think so. Zones are basically impossible, and page size > changes will hopefully happen during 2.7.x, but not due to page coloring. I didn't say easy, just simple enough that not everyone in the kernel would need to know or care. > > caused by cache conflicts in today's applications are real, and easily > measurable. Given the growing increase in performance difference > between CPUs and memory Amdahl's Law shows this will only grow > so I think this is worth looking at.
> Absolutely wrong. I don't mean to focus exclusively on cache coloring but on anything that will increase memory performance. Right now we are at a point where even the DRAM page size is larger than 4K. > Why? Because the fact is, that as memory gets further and further away > from CPU's, caches have gotten further and further away from being direct > mapped. Except for L1 caches. The hit of an associative lookup there is inherently costly. > Cache coloring is already a very questionable win for four-way > set-associative caches. I doubt you can even _see_ it for eight-way or > higher associativity caches. If I can ever get something that approaches a reliable result out of something like a memory bandwidth benchmark, I would love it. I typically get something like a 25%+ variance. The most reliable way I have found to show how variable these things get is to run updatedb, and run streams or a similar benchmark at the same time. The numbers jump all over the place and a concurrent updatedb as frequently improves as it degrades performance. When I stop seeing a measurable difference, I will stop worrying. Of course it does not help that the current generation of compilers can only get 50% of the actual performance the memory can provide. In recent times the variations have been getting worse if anything. > In other words: the pressures you mention clearly do exist, but they are > all driving direct-mapped caches out of the market, and thus making page > coloring _less_ interesting rather than more. The other spin is that everything in current architectures is based around the concept of locality, while page tables mess up locality. So everything we can do to preserve locality is a good thing. Cache coloring happens naturally as a result of better page locality. A variation on the idea of using larger page sizes is to allocate/free/swap a batch of pages at once. We do some of that now. A batch of pages moves the normal allocation up to order 3 or 4 if we can get them.
While still working with smaller order pages if we can't. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-29 21:11 ` Page Colouring (was: 2.6.0 Huge pages not working as expected) Eric W. Biederman @ 2003-12-29 21:35 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2003-12-29 21:35 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Anton Ertl, linux-kernel On Mon, 29 Dec 2003, Eric W. Biederman wrote: > > Linus Torvalds <torvalds@osdl.org> writes: > > > > Why? Because the fact is, that as memory gets further and further away > > from CPU's, caches have gotten further and further away from being direct > > mapped. > > Except for L1 caches. The hit of an associate lookup there is inherently > costly. Having worked for a hardware company, and talked to hardware engineers, I can say that it generally isn't all that true. The reason is that you can start the lookup before you even do the TLB lookup, and in fact you _want_ highly associative L1 caches to do that. For example, if you have a 16kB L1 cache, and a 4kB page size, and you want your memory accesses to go fast, you definitely want to index the L1 by the virtual access, which means that you can only use the low 12 bits for indexing. So what you do is you make your L1 be 4-way set-associative, so that by the time the TLB lookup is done, you've already looked up the index, and you only have to compare the TAG with one of the four possible ways. In short: you actually _want_ your L1 to be associative, because it's the best way to avoid having nasty alias issues. The only people who have a direct-mapped L1 are one of: - crazy and/or stupid - really cheap (mainly embedded space) - not high-performance anyway (ie their L1 is really small) - really sorry, and are fixing it. - really _really_ sorry, and have a virtually indexed cache. In which case page coloring doesn't matter anyway. Notice how high performance is _not_ on the list. Because you simply can't _get_ high performance with a direct-mapped L1. 
Those days are long gone. There is another reason why L1's have long since moved away from direct-mapped: the miss ratio goes up quite a bit for the same size cache. And things like OoO are pretty good at hiding one cycle of latency (OoO is _not_ good at hiding memory latency, but one or two cycles are usually ok), so even if having a larger L1 (and thus inherently more complex - not only in associativity) means that you end up having an extra cycle access, it's likely a win. This is, for example, what alpha did between the 21164 and the 21264: when they went out-of-order, they did all the simulation to prove that it was much more efficient to have a larger L1 with a higher hit ratio, even if the latency was one cycle higher than the 21164, which was strictly in-order. In short, I'll bet you a dollar that you won't see a single direct-mapped L1 _anywhere_ where it matters. They are already pretty much gone. Can you name one that doesn't fit the four criteria above? Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
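The arithmetic behind the 16KB/4-way example above is simple enough to write down; this is a sketch of the condition, not anyone's actual hardware model:

```c
#include <assert.h>

/* A cache's set index comes from the address bits below
 * log2(cache_size / ways).  If each way is no larger than a page,
 * those bits lie entirely inside the page offset, so the virtual and
 * physical address select the same set and the cache lookup can start
 * before the TLB has answered. */
static int index_fits_in_page_offset(unsigned long cache_size,
                                     unsigned int ways,
                                     unsigned long page_size)
{
        return cache_size / ways <= page_size;
}
```

With 16KB, 4 ways, and 4KB pages the condition holds; a direct-mapped 16KB cache would need two index bits from the physical page number, which is exactly where the aliasing trouble Linus alludes to begins.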
[parent not found: <17tHK-3K6-21@gated-at.bofh.it>]
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) [not found] ` <17tHK-3K6-21@gated-at.bofh.it> @ 2003-12-28 17:17 ` Anton Ertl 0 siblings, 0 replies; 28+ messages in thread From: Anton Ertl @ 2003-12-28 17:17 UTC (permalink / raw) To: linux-kernel Linus Torvalds <torvalds@osdl.org> writes: >And you should realize that I do not dispute it because the applications >themselves would run slower with cache coloring. Ok, I guess I misunderstood that until now. > Most applications don't >much care, Yes, at least as long as the associativity is high enough. > they either fit in the cache, or the cache misses have random >enough access patterns that cache layout doesn't much matter. Random mapping hurts most those applications that do fit in the cache in principle, but that have enough hot memory (whether accessed regularly or randomly) that random mapping usually introduces cache conflicts (this will happen for many applications with direct-mapped caches, but hardly ever with high-associativity caches). >And it has to be better on average on _everything_ that Linux supports, >not just one particular braindamaged piece of hardware. I'm totally not >interested in something that makes performance on most machines go down, >if it then improves one or two braindead setups with direct-mapped caches. As has been discussed in another thread, direct-mapped caches seem to be pretty standard for off-chip caches, and this is not just a braindamage issue: Higher associativity requires more wires to the tags, and also to the data, if you want to access the data in parallel with the tags for lower latency. Running a lot of wires off-chip is a problem. So the choices are: - Small on-chip cache with high associativity. - Medium cache with off-chip data, on-chip tags, high associativity and high latency. - Large cache with off-chip data and off-chip tags, and low associativity.
However, over time off-chip caches seem to become less commonplace, so we may get rid of low associativity for L2/L3 caches eventually. - anton -- M. Anton Ertl Some things have to be seen to be believed anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <176UD-6vl-3@gated-at.bofh.it>]
* Page Colouring (was: 2.6.0 Huge pages not working as expected) [not found] <176UD-6vl-3@gated-at.bofh.it> @ 2003-12-26 21:48 ` Anton Ertl 2003-12-26 23:28 ` Linus Torvalds 0 siblings, 1 reply; 28+ messages in thread From: Anton Ertl @ 2003-12-26 21:48 UTC (permalink / raw) To: linux-kernel Linus Torvalds <torvalds@osdl.org> writes: >And the thing is, using huge pages will mean that the pages are 1:1 >mapped, and thus get "perfectly" cache-coloured, while the anonymous mmap >will give you random placement. > >And what you are seeing is likely the fact that random placement is >guaranteed to not have any worst-case behaviour. You probably just put the "not" in the wrong place, but just in case you meant it: Random replacement does not give such a guarantee. You can get the same worst-case behaviour as with page colouring, since you can get the same mapping. It's just unlikely. >In particular, using a pure power-of-two stride means that you are >limiting your cache to a certain subset of the full result with the >perfect coloring. > >This, btw, is why I don't like page coloring: it does give nicely >reproducible results, but it does not necessarily improve performance. Well, even if, on average, it has no performance impact, reproducibility is a good reason to like it. Is it good enough to implement it? I'll leave that to you. However, the main question I want to look at here is: Does it improve performance, on average? I think it does, because of spatial locality. I.e., it is more frequent that you access stuff spatially close to a recent access (where page colouring has a 0 chance of conflicting, whereas random mapping has a non-zero chance of conflicting), than to access stuff that is exactly a multiple of the cache-size away (which is the worst case for page colouring). Fortunately, set-associative caches in the machines I use most of the time reduce the impact of the missing page colouring in Linux. 
The most frequent case where random mapping gives better performance than page colouring is having several sequential passes over a block that is larger than the cache; but that's just a case where caches perform badly in principle, and cache designs that are usually considered better (higher associativity, LRU replacement) perform worse in this case. OTOH, for cases where the block only barely fits in the cache, page colouring performs quite a bit better. This particular access pattern can be more frequent than one might expect from other statistics, due to software optimizations like cache blocking. One additional mechanism in which page colouring can help performance is by providing a predictable and understandable performance model to programmers. Caches are bad enough to analyse; one need not complicate the issue with unpredictable effects of random virtual-to-physical translation. - anton -- M. Anton Ertl Some things have to be seen to be believed anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html ^ permalink raw reply [flat|nested] 28+ messages in thread
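The colouring model Anton argues from can be sketched for a direct-mapped cache; the sizes here are illustrative, not tied to any particular machine in the thread:

```c
#include <assert.h>

/* A page's colour is which slice of a direct-mapped cache it lands in.
 * Page colouring makes virtual colour equal physical colour, so pages
 * adjacent in virtual space never conflict, while addresses a multiple
 * of the cache size apart (the worst case for colouring) always share
 * a colour.  Illustrative sizes. */
#define PAGE_SHIFT 12                  /* 4KB pages */
#define CACHE_SIZE (1UL << 20)         /* 1MB direct-mapped cache */
#define NCOLOURS   (CACHE_SIZE >> PAGE_SHIFT)

static unsigned long page_colour(unsigned long addr)
{
        return (addr >> PAGE_SHIFT) % NCOLOURS;
}
```

This is the spatial-locality argument in miniature: under colouring, neighbouring pages are guaranteed distinct colours, whereas a random physical placement gives each neighbour a 1-in-NCOLOURS chance of conflicting.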
* Re: Page Colouring (was: 2.6.0 Huge pages not working as expected) 2003-12-26 21:48 ` Anton Ertl @ 2003-12-26 23:28 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2003-12-26 23:28 UTC (permalink / raw) To: Anton Ertl; +Cc: linux-kernel On Fri, 26 Dec 2003, Anton Ertl wrote: > Linus Torvalds <torvalds@osdl.org> writes: > > > >And what you are seeing is likely the fact that random placement is > >guaranteed to not have any worst-case behaviour. > > You probably just put the "not" in the wrong place, but just in case > you meant it: Random replacement does not give such a guarantee. No, I meant what I said. Random placement is the _only_ algorithm guaranteed to have no pathological worst-case behaviour. > You > can get the same worst-case behaviour as with page colouring, since > you can get the same mapping. It's just unlikely. "pathological worst-case" is something that is repeatable. For example, the test-case above is a pathological worst-case scenario for a direct-mapped cache. > Well, even if, on average, it has no performance impact, > reproducibility is a good reason to like it. Is it good enough to > implement it? I'll leave that to you. Well, since random (or, more accurately in this case, "pseudo-random") has a number of things going for it, and is a lot faster and cheaper to implement, I don't see the point of cache coloring. That's doubly true since any competent CPU will have at least four-way associativity these days. > However, the main question I want to look at here is: Does it improve > performance, on average? I think it does, because of spatial > locality. Hey, the discussion in this case showed how it _deproves_ performance (at least if my theory was correct - and it should be easily testable and I bet it is). Also, the work has been done to test things, and cache coloring definitely makes performance _worse_.
It does so exactly because it artificially limits your page choices,
causing problems at multiple levels (not just at the cache, like this
example, but also in page allocators and freeing).

So basically, cache coloring results in:
 - some nice benchmarks (mainly the kind that walk memory very
   predictably, notably FP kernels)
 - mostly worse performance in "real life"
 - more complex code
 - much worse memory pressure

My strong opinion is that it is worthless, except possibly as a
performance-tuning tool, but even there the repeatability is a false
advantage: if you do performance tuning using cache coloring, there is
nothing that guarantees that your tuning was _correct_ for the
real-world case. In short, you may be doing your performance tuning such
that it tunes for or against one of the (known) pathological cases of
the layout, nothing more.

But hey, some people disagree with me. That's their right. It's not
unconstitutional to be wrong ;)

		Linus
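[Editor's note: the memory-pressure argument above can be made concrete
with a toy allocator. This sketch is my own addition, not from the
thread; the class, colour count, and page numbers are invented. It shows
how segregating free pages by cache colour means a request for one
colour can fail even while plenty of pages of other colours remain
free.]

```python
from collections import defaultdict

NUM_COLOURS = 16   # assumed: cache size / page size

class ColouredAllocator:
    """Toy allocator: free pages are segregated into per-colour lists,
    where a page's colour is its physical page number mod NUM_COLOURS."""
    def __init__(self, free_pages):
        self.free = defaultdict(list)
        for p in free_pages:
            self.free[p % NUM_COLOURS].append(p)

    def alloc(self, colour):
        """Return a free page of the requested colour, or None if that
        colour's list is empty - even if other colours have pages."""
        lst = self.free[colour % NUM_COLOURS]
        return lst.pop() if lst else None

    def total_free(self):
        return sum(len(lst) for lst in self.free.values())

# 64 free pages, but an unlucky mix: none of them has colour 0.
pages = [p for p in range(1, 100) if p % NUM_COLOURS != 0][:64]
alloc = ColouredAllocator(pages)
print(alloc.total_free())   # 64 pages free...
print(alloc.alloc(0))       # ...yet a colour-0 request fails: None
print(alloc.alloc(1))       # while other colours succeed immediately
```

A colour-blind allocator would satisfy every request here; the coloured
one must either fail, fall back to a "wrong" colour, or start reclaiming
memory early - which is the extra memory pressure and complexity being
objected to.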