* please revert bogus patch to vmscan.c
@ 2001-10-29 23:08 Benjamin LaHaise
2001-10-29 23:14 ` David S. Miller
2001-10-29 23:51 ` Paul Mackerras
0 siblings, 2 replies; 54+ messages in thread
From: Benjamin LaHaise @ 2001-10-29 23:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
The following:
@@ -50,7 +50,6 @@
/* Don't look at this pte if it's been accessed recently. */
if (ptep_test_and_clear_young(page_table)) {
- flush_tlb_page(vma, address);
mark_page_accessed(page);
return 0;
}
is completely bogus. Without the tlb flush, the system may never update
the accessed bit on a page that is heavily being used.
-ben
--
Fish.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: please revert bogus patch to vmscan.c 2001-10-29 23:08 please revert bogus patch to vmscan.c Benjamin LaHaise @ 2001-10-29 23:14 ` David S. Miller 2001-10-29 23:15 ` Benjamin LaHaise 2001-10-29 23:22 ` Linus Torvalds 2001-10-29 23:51 ` Paul Mackerras 1 sibling, 2 replies; 54+ messages in thread From: David S. Miller @ 2001-10-29 23:14 UTC (permalink / raw) To: bcrl; +Cc: torvalds, linux-kernel From: Benjamin LaHaise <bcrl@redhat.com> Date: Mon, 29 Oct 2001 18:08:37 -0500 is completely bogus. Without the tlb flush, the system may never update the accessed bit on a page that is heavily being used. It's intentional Ben, think about the high cost of the SMP invalidate when kswapd is just scanning page tables. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:14 ` David S. Miller @ 2001-10-29 23:15 ` Benjamin LaHaise 2001-10-29 23:25 ` Linus Torvalds 2001-10-29 23:22 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:15 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-kernel On Mon, Oct 29, 2001 at 03:14:22PM -0800, David S. Miller wrote: > From: Benjamin LaHaise <bcrl@redhat.com> > Date: Mon, 29 Oct 2001 18:08:37 -0500 > > is completely bogus. Without the tlb flush, the system may never update > the accessed bit on a page that is heavily being used. > > It's intentional Ben, think about the high cost of the SMP invalidate > when kswapd is just scanning page tables. I think it's far more expensive to pull a page back in from disk. -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:15 ` Benjamin LaHaise @ 2001-10-29 23:25 ` Linus Torvalds 2001-10-29 23:33 ` Benjamin LaHaise 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2001-10-29 23:25 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David S. Miller, linux-kernel On Mon, 29 Oct 2001, Benjamin LaHaise wrote: > > I think it's far more expensive to pull a page back in from disk. Note that _if_ the page is so often accessed that it stays in the TLB, it won't ever hit the disk. We _will_ do a TLB invalidate when we unmap it, and it will get re-mapped long before it's written out. So it's more likely to result in extra soft-faults, but considering that if the system is paging stuff out we'll almost certainly be doing enough context switches that the thing doesn't matter. And avoiding the extra TLB flush (for _every_ page scanned) is noticeable according to Andrea. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:25 ` Linus Torvalds @ 2001-10-29 23:33 ` Benjamin LaHaise 2001-10-29 23:36 ` David S. Miller 0 siblings, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-kernel On Mon, Oct 29, 2001 at 03:25:36PM -0800, Linus Torvalds wrote: > And avoiding the extra TLB flush (for _every_ page scanned) is noticeable > according to Andrea. Sure. But do it correctly and perform a tlb flush higher up in the page table walking code. Just deleting it entirely is bogus. Introducing hard to track down bugs is just stupid. -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:33 ` Benjamin LaHaise @ 2001-10-29 23:36 ` David S. Miller 2001-10-29 23:39 ` Benjamin LaHaise 0 siblings, 1 reply; 54+ messages in thread From: David S. Miller @ 2001-10-29 23:36 UTC (permalink / raw) To: bcrl; +Cc: torvalds, linux-kernel From: Benjamin LaHaise <bcrl@redhat.com> Date: Mon, 29 Oct 2001 18:33:16 -0500 Sure. But do it correctly and perform a tlb flush higher up in the page table walking code. Just deleting it entirely is bogus. Introducing hard to track down bugs is just stupid. It isn't a bug, the referenced bit is a heuristic. The referenced bit being wrong cannot result in corrupted user memory. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:36 ` David S. Miller @ 2001-10-29 23:39 ` Benjamin LaHaise 2001-10-29 23:41 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:39 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-kernel On Mon, Oct 29, 2001 at 03:36:03PM -0800, David S. Miller wrote: > It isn't a bug, the referenced bit is a heuristic. The referenced bit > being wrong cannot result in corrupted user memory. We might as well choose what pages to swap out according to a random number generator then. -ben -- Fish. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:39 ` Benjamin LaHaise @ 2001-10-29 23:41 ` Linus Torvalds 2001-10-29 23:48 ` Benjamin LaHaise 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2001-10-29 23:41 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David S. Miller, linux-kernel On Mon, 29 Oct 2001, Benjamin LaHaise wrote: > On Mon, Oct 29, 2001 at 03:36:03PM -0800, David S. Miller wrote: > > It isn't a bug, the referenced bit is a heuristic. The referenced bit > > being wrong cannot result in corrupted user memory. > > We might as well choose what pages to swap out according to a random number > generator then. Show me a number. Andrea showed his performance. And I claim that our "random number generator" is really close to reality. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:41 ` Linus Torvalds @ 2001-10-29 23:48 ` Benjamin LaHaise 2001-10-29 23:50 ` David S. Miller 0 siblings, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-kernel On Mon, Oct 29, 2001 at 03:41:58PM -0800, Linus Torvalds wrote: > Show me a number. Andrea showed his performance. And I claim that our > "random number generator" is really close to reality. Please enlighten me why I should go off to do a couple of days work to come up with a comparison against a correct version that does the tlb flush higher up in the page table walker? Sorry, but for now I'll stick with -ac vm that isn't changing things randomly every release. -ben -- Fish. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:48 ` Benjamin LaHaise @ 2001-10-29 23:50 ` David S. Miller 2001-10-29 23:51 ` Benjamin LaHaise ` (2 more replies) 0 siblings, 3 replies; 54+ messages in thread From: David S. Miller @ 2001-10-29 23:50 UTC (permalink / raw) To: bcrl; +Cc: torvalds, linux-kernel Numbers talk, bullshit walks. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:50 ` David S. Miller @ 2001-10-29 23:51 ` Benjamin LaHaise 2001-10-29 23:55 ` David S. Miller 2001-10-30 1:31 ` Rik van Riel 2001-10-30 9:03 ` Alan Cox 2 siblings, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:51 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-kernel On Mon, Oct 29, 2001 at 03:50:56PM -0800, David S. Miller wrote: > > Numbers talk, bullshit walks. There is a correct way to do this optimization. If you're enough of an asshole to not care about doing it that way, great! -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:51 ` Benjamin LaHaise @ 2001-10-29 23:55 ` David S. Miller 2001-10-29 23:57 ` Benjamin LaHaise 2001-10-29 23:58 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: David S. Miller @ 2001-10-29 23:55 UTC (permalink / raw) To: bcrl; +Cc: torvalds, linux-kernel From: Benjamin LaHaise <bcrl@redhat.com> Date: Mon, 29 Oct 2001 18:51:58 -0500 On Mon, Oct 29, 2001 at 03:50:56PM -0800, David S. Miller wrote: > > Numbers talk, bullshit walks. There is a correct way to do this optimization. If you're enough of an asshole to not care about doing it that way, great! Doing range flushes is not the answer. It is going to be about the same cost as doing per-page flushes. The cost is doing the cross calls at all, not the granularity in which we do them. You're refusing to do any work to prove whether your case matters at all in real life, and you're calling other people assholes for asking that you do so. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:55 ` David S. Miller @ 2001-10-29 23:57 ` Benjamin LaHaise 2001-10-30 0:01 ` David S. Miller 2001-10-29 23:58 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:57 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-kernel On Mon, Oct 29, 2001 at 03:55:59PM -0800, David S. Miller wrote: > Doing range flushes is not the answer. It is going to be about > the same cost as doing per-page flushes. > > The cost is doing the cross calls at all, not the granularity in which > we do them. > > You're refusing to do any work to prove whether your case matters > at all in real life, and you're calling other people assholes for > asking that you do so. See Paul's message. ia64 does the same thing with hardware walked hashed page tables. Now, do you want to pay for the 2 days of time you want me to commit to investigating something which is obvious to me? I don't think so. -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:57 ` Benjamin LaHaise @ 2001-10-30 0:01 ` David S. Miller 2001-10-30 0:05 ` Benjamin LaHaise 0 siblings, 1 reply; 54+ messages in thread From: David S. Miller @ 2001-10-30 0:01 UTC (permalink / raw) To: bcrl; +Cc: torvalds, linux-kernel From: Benjamin LaHaise <bcrl@redhat.com> Date: Mon, 29 Oct 2001 18:57:44 -0500 See Paul's message. ia64 does the same thing with hardware walked hashed page tables. Now, do you want to pay for the 2 days of time you want me to commit to investigating something which is obvious to me? I don't think so. Why would it take you two days to put together a test case for something so "trivial"? Don't be rediculious. I would be more than happy to pay for the 2 days it's going to take for you to possibly admit "yeah, maybe it doesn't matter in real life, sorry". :-) Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 0:01 ` David S. Miller @ 2001-10-30 0:05 ` Benjamin LaHaise 0 siblings, 0 replies; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-30 0:05 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-kernel On Mon, Oct 29, 2001 at 04:01:23PM -0800, David S. Miller wrote: > Why would it take you two days to put together a test case for > something so "trivial"? Don't be rediculious. Well, maybe you like under testing things, but personally I don't. What do you want a microoptimization? Get real, you don't accept microoptimizations to networking code without numbers from the bigger picture. So that means running real tests on real applications like Oracle or RHDB. I think you'll find that the cross cpu case will be quite happy with a range flush. Heck, it's the kind of IPI that can even legally be asynchronous. Why don't you go do something useful like implement that instead of making me spend my time to correct other people's obvious mistakes? -ben -- Fish. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:55 ` David S. Miller 2001-10-29 23:57 ` Benjamin LaHaise @ 2001-10-29 23:58 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-29 23:58 UTC (permalink / raw) To: David S. Miller; +Cc: bcrl, linux-kernel On Mon, 29 Oct 2001, David S. Miller wrote: > > Doing range flushes is not the answer. It is going to be about > the same cost as doing per-page flushes. No, doing a range flush might be fine - we'd just do it _once_ per invocation of swap_out(), and that would probably be fine. The problem with the flush at the low level is that it's done once for every page in the whole VM space, which is easily millions of times. Cutting it down to once every MM would definitely be worth it. It won't be "exact" either, but it would mean that at least the lifetime of an optimistic TLB entry is bounded. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:50 ` David S. Miller 2001-10-29 23:51 ` Benjamin LaHaise @ 2001-10-30 1:31 ` Rik van Riel 2001-10-30 1:34 ` David S. Miller 2001-10-30 9:03 ` Alan Cox 2 siblings, 1 reply; 54+ messages in thread From: Rik van Riel @ 2001-10-30 1:31 UTC (permalink / raw) To: David S. Miller; +Cc: bcrl, torvalds, linux-kernel On Mon, 29 Oct 2001, David S. Miller wrote: > Numbers talk, bullshit walks. Lies, damn lies and statistics. Your numbers are nice for one thing and break for something else, don't you learn anything from our experiments in the past ? cheers, Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 1:31 ` Rik van Riel @ 2001-10-30 1:34 ` David S. Miller 2001-10-30 1:42 ` Linus Torvalds 2001-10-30 1:49 ` Benjamin LaHaise 0 siblings, 2 replies; 54+ messages in thread From: David S. Miller @ 2001-10-30 1:34 UTC (permalink / raw) To: riel; +Cc: bcrl, torvalds, linux-kernel From: Rik van Riel <riel@conectiva.com.br> Date: Mon, 29 Oct 2001 23:31:31 -0200 (BRST) Your numbers are nice for one thing and break for something else, don't you learn anything from our experiments in the past ? I'm asking him to show the case that "breaks for something else". Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 1:34 ` David S. Miller @ 2001-10-30 1:42 ` Linus Torvalds 2001-10-30 1:46 ` David S. Miller 2001-10-30 2:25 ` Benjamin LaHaise 2001-10-30 1:49 ` Benjamin LaHaise 1 sibling, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-30 1:42 UTC (permalink / raw) To: David S. Miller; +Cc: riel, bcrl, linux-kernel On Mon, 29 Oct 2001, David S. Miller wrote: > > I'm asking him to show the case that "breaks for something > else". Guys, guys, calm down. I removed the tlb invalidate that ended up being called millions of times, but I don't really have anything fundamental against either invalidating each mm as it comes up in swap_out_mm(), or maybe just doing a full TLB invalidate for each swap_out(). As Ben points out, the invalidate doesn't even have to be synchronous - the only thing we really care about is that there is some upper bound for how long we can cache TLB entries witht eh wrong accessed bit. One reasonable (?), yet rare, upper bound might be something like "swap_out() wrapped around the MM list". This is particularly true since we won't actually _care_ about the accessed bit until the second time around when it is clear, so the "wrapped around the VM list" thing is (a) often enough to matter and (b) obviously seldom enough that it shouldn't be a performance issue even if it implies a cross-call to everybody. The difference in call frequency would, on large machines, probably be on the order of several magnitudes, which will certainly cut the overhead down to the noise while satisfying people who have architectures that can cache things for a long time. Agreed? (Yeah, maybe you think that's _too_ long. Civil arguments, please). Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 1:42 ` Linus Torvalds @ 2001-10-30 1:46 ` David S. Miller 2001-10-30 2:25 ` Benjamin LaHaise 1 sibling, 0 replies; 54+ messages in thread From: David S. Miller @ 2001-10-30 1:46 UTC (permalink / raw) To: torvalds; +Cc: riel, bcrl, linux-kernel From: Linus Torvalds <torvalds@transmeta.com> Date: Mon, 29 Oct 2001 17:42:07 -0800 (PST) Agreed? I wouldn't be against a per-mm flush. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 1:42 ` Linus Torvalds 2001-10-30 1:46 ` David S. Miller @ 2001-10-30 2:25 ` Benjamin LaHaise 2001-10-30 15:20 ` Andrea Arcangeli 1 sibling, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-30 2:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, riel, linux-kernel On Mon, Oct 29, 2001 at 05:42:07PM -0800, Linus Torvalds wrote: > > On Mon, 29 Oct 2001, David S. Miller wrote: > > > > I'm asking him to show the case that "breaks for something > > else". > > Guys, guys, calm down. I'm trying to be. Dave just rubs me the wrong way with his inability to think about the problem for a minute instead of insisting on proof that incorrect behaviour can happen. Damnit, it's supposed to be a kernel on its way towards being more stable, not less. This is especially insulting when you can bother to go through the work of it as a thought experiment in less than 5 minutes to realise what could happen. Dave: please read the above paragraph again. Now, can you see why I'm arguing for doing the optimization in the *correct* way first? The microbenchmark will always "prove" that avoiding the tlb flush is a win. The test to prove that the failure case can happen is non-trivial, and proving the real world win/loss is lengthy task involving lots of benchmarks and effort. Please just sit down and think for 5 minutes and acknowledge that it is a possibility! > The difference in call frequency would, on large machines, probably be on > the order of several magnitudes, which will certainly cut the overhead > down to the noise while satisfying people who have architectures that can > cache things for a long time. > > Agreed? Which is exactly what was in the back of my mind in the first place. I didn't write the patch as the distraction of going off on yet another tangent when I'm this > < close to being done with the battle I'm currently in just doesn't make sense. I'm sorry, but I'm not the best person for doing a brain dump when in the middle of something, plus I assume that people can think about the how and why of things for themselves. You'll note that I never denied that the microoptimization in question is a win; I fully well expect it to be. However, from the point of view of stability we *want* to be conservative and correct. If Al had to demonstrate with 'sploits that a possible vfs race could occur every time he found one, wouldn't he be wasting time that could be better spent finding and fixing other problems??? Dave, please accept that other people's opinions occasionally hold value and reconsider reacting negatively without thinking first. -ben -- Fish. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 2:25 ` Benjamin LaHaise @ 2001-10-30 15:20 ` Andrea Arcangeli 2001-10-30 15:34 ` Rik van Riel 0 siblings, 1 reply; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-30 15:20 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, David S. Miller, riel, linux-kernel On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote: > I fully well expect it to be. However, from the point of view of stability > we *want* to be conservative and correct. If Al had to demonstrate with Dave just told you what this change has to do with stability, not sure why you keep reiterating about stability and correctness. But of course going from page flush to the mm flush is fine from my part too. As Linus noted a few days ago during swapout we're going to block and reschedule all the time, so the range flush is going to be a noop in real life (the whole thing is an heuristic), and this is why it wasn't implemented right now. But I agree it shouldn't hurt either and it looks nicer. Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 15:20 ` Andrea Arcangeli @ 2001-10-30 15:34 ` Rik van Riel 2001-10-30 15:51 ` Andrea Arcangeli ` (2 more replies) 0 siblings, 3 replies; 54+ messages in thread From: Rik van Riel @ 2001-10-30 15:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin LaHaise, Linus Torvalds, David S. Miller, linux-kernel On Tue, 30 Oct 2001, Andrea Arcangeli wrote: > On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote: > > I fully well expect it to be. However, from the point of view of stability > > we *want* to be conservative and correct. If Al had to demonstrate with > > Dave just told you what this change has to do with stability, not sure > why you keep reiterating about stability and correctness. > > But of course going from page flush to the mm flush is fine from my part > too. As Linus noted a few days ago during swapout we're going to block > and reschedule all the time, so the range flush is going to be a noop in Only on architectures where the TLB (or equivalent) is small and only capable of holding entries for one address space at a time. It's simply not true on eg PPC. Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 15:34 ` Rik van Riel @ 2001-10-30 15:51 ` Andrea Arcangeli 2001-10-30 16:34 ` Benjamin LaHaise 2001-10-30 16:13 ` Giuliano Pochini 2001-10-30 16:38 ` Linus Torvalds 2 siblings, 1 reply; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-30 15:51 UTC (permalink / raw) To: Rik van Riel Cc: Benjamin LaHaise, Linus Torvalds, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 01:34:50PM -0200, Rik van Riel wrote: > On Tue, 30 Oct 2001, Andrea Arcangeli wrote: > > On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote: > > > I fully well expect it to be. However, from the point of view of stability > > > we *want* to be conservative and correct. If Al had to demonstrate with > > > > Dave just told you what this change has to do with stability, not sure > > why you keep reiterating about stability and correctness. > > > > But of course going from page flush to the mm flush is fine from my part > > too. As Linus noted a few days ago during swapout we're going to block > > and reschedule all the time, so the range flush is going to be a noop in > > Only on architectures where the TLB (or equivalent) is > small and only capable of holding entries for one address > space at a time. > > It's simply not true on eg PPC. I thought at alpha of course but alpha doesn't provide an hardware accessed bit in first place :). my view was too x86-64/x86/alpha centric (incidentally the only archs I test personally :), I really didn't considered the case of tagged tlb + accessed bit both provided by the cpu while making that change, so we hurted ia64 and ppc, but it will be trivial to fix now that Ben promptly noticed the problem, thanks Ben, good catch, that's really appreciated! Anwyays this have _nothing_ to do at the very least with stability unlike the above subliminal messages are implying, see above, it can only potentially be less responsive under very heavy swap on ia64 and ppc (dunno sparc64?), period. mentioning real life workloads like Oracle and RHDB in relation to the tlb flush for the accessed bit is further subliminal bullshit, Oracle definitely isn't supposed to swap heavily during the benchmarks, and I'm sure it's not the case for mainline postrgres either (dunno about RHDB). Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 15:51 ` Andrea Arcangeli @ 2001-10-30 16:34 ` Benjamin LaHaise 2001-10-30 17:00 ` Andrea Arcangeli 0 siblings, 1 reply; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-30 16:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Linus Torvalds, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 04:51:19PM +0100, Andrea Arcangeli wrote: > Anwyays this have _nothing_ to do at the very least with stability > unlike the above subliminal messages are implying, see above, it can > only potentially be less responsive under very heavy swap on ia64 and > ppc (dunno sparc64?), period. mentioning real life workloads like Oracle > and RHDB in relation to the tlb flush for the accessed bit is further > subliminal bullshit, Oracle definitely isn't supposed to swap heavily > during the benchmarks, and I'm sure it's not the case for mainline > postrgres either (dunno about RHDB). It has nothing to do with subliminal effects, but rather what kind of effect this microoptimization is going to have in the Big Picture. What I'm contending is that the Real World difference between the correct version of the optimization will have no significant performance effects compared to the incorrect version that you and davem are so gleefully advocating. This means not running through "bullshit" benchmarks that test one and only one thing, but running apps that actually put memory pressure on the system (Oracle does so quite nicely using a filesystem without O_DIRECT) which in turn causes page scanning (aka the clearing of the referenced bit which is *THE* code that is being contested) but should not cause swap out. To me, this is just part of the methodology of being thorough in testing the effects of changes to the VM subsystem. Frankly, I'm suitable unimpressed with the thoroughness of consideration and testing to the changes currently being pushed into the base tree. Again, this is why I'm not bothering to run base kernels. -ben -- Fish. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:34 ` Benjamin LaHaise @ 2001-10-30 17:00 ` Andrea Arcangeli 2001-10-30 17:07 ` Rik van Riel 0 siblings, 1 reply; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-30 17:00 UTC (permalink / raw) To: Benjamin LaHaise Cc: Rik van Riel, Linus Torvalds, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 11:34:10AM -0500, Benjamin LaHaise wrote: > I'm contending is that the Real World difference between the correct > version of the optimization will have no significant performance effects > compared to the incorrect version that you and davem are so gleefully > advocating. This means not running through "bullshit" benchmarks that > test one and only one thing, but running apps that actually put memory > pressure on the system (Oracle does so quite nicely using a filesystem > without O_DIRECT) which in turn causes page scanning (aka the clearing > of the referenced bit which is *THE* code that is being contested) but > should not cause swap out. To me, this is just part of the methodology > of being thorough in testing the effects of changes to the VM subsystem. There should be no relevant pagetable scanning during those tests, and the few bits that can get unmapped have lots of time to get mapped in again from the cache with a minor fault, IMHO there's no way such tlb flush removal can make a difference in a DBMS workload on a sanely setup machine, I'm amazed you think it can make a difference. Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:00 ` Andrea Arcangeli @ 2001-10-30 17:07 ` Rik van Riel 0 siblings, 0 replies; 54+ messages in thread From: Rik van Riel @ 2001-10-30 17:07 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin LaHaise, Linus Torvalds, David S. Miller, linux-kernel On Tue, 30 Oct 2001, Andrea Arcangeli wrote: > again from the cache with a minor fault, IMHO there's no way such tlb > flush removal can make a difference in a DBMS workload on a sanely > setup machine, I'm amazed you think it can make a difference. You seem to be hammering on the DBMS scenario because Ben chose a bad example, without taking a step back and realising that his point is valid, though mostly with other examples. Linus came up with a more efficient way to make sure we get to see (and clear) the accessed bits, lets use that one and get on with life. regards, Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 15:34 ` Rik van Riel 2001-10-30 15:51 ` Andrea Arcangeli @ 2001-10-30 16:13 ` Giuliano Pochini 2001-10-30 16:54 ` Andrea Arcangeli 2001-10-30 16:38 ` Linus Torvalds 2 siblings, 1 reply; 54+ messages in thread From: Giuliano Pochini @ 2001-10-30 16:13 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, David S. Miller, Linus Torvalds, Benjamin LaHaise, Andrea Arcangeli >> But of course going from page flush to the mm flush is fine from my part >> too. As Linus noted a few days ago during swapout we're going to block >> and reschedule all the time, so the range flush is going to be a noop in > > Only on architectures where the TLB (or equivalent) is > small and only capable of holding entries for one address > space at a time. > > It's simply not true on eg PPC. #ifdef ? Bye. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:13 ` Giuliano Pochini @ 2001-10-30 16:54 ` Andrea Arcangeli 2001-10-30 17:23 ` Giuliano Pochini 0 siblings, 1 reply; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-30 16:54 UTC (permalink / raw) To: Giuliano Pochini Cc: Rik van Riel, linux-kernel, David S. Miller, Linus Torvalds, Benjamin LaHaise On Tue, Oct 30, 2001 at 05:13:53PM +0100, Giuliano Pochini wrote: > > >> But of course going from page flush to the mm flush is fine from my part > >> too. As Linus noted a few days ago during swapout we're going to block > >> and reschedule all the time, so the range flush is going to be a noop in > > > > Only on architectures where the TLB (or equivalent) is > > small and only capable of holding entries for one address > > space at a time. > > > > It's simply not true on eg PPC. > > #ifdef ? yes, but not for ppc, for alpha and all other archs without accessed bit provided in hardware (and cached in the tlb). the flush_mm proposed by Ben looks fine for x86 too, it's a waste only for archs without accessed bit. I think an #ifndef HAVE_NO_ACCESS_BIT_IN_TLB or something like that, then define that in asm-alpha/ and the other archs without accessed bit. OTOH, it probably doesn't make much difference so maybe it doesn't worth the effort. Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:54 ` Andrea Arcangeli @ 2001-10-30 17:23 ` Giuliano Pochini 2001-10-30 17:30 ` Andrea Arcangeli 0 siblings, 1 reply; 54+ messages in thread From: Giuliano Pochini @ 2001-10-30 17:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin LaHaise, Linus Torvalds, David S. Miller, linux-kernel, Rik van Riel >> #ifdef ? > > yes, but not for ppc, for alpha and all other archs without accessed bit > provided in hardware (and cached in the tlb). PPCs have that bits. I'm not sure if they are cached in the tbl. Bye. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:23 ` Giuliano Pochini @ 2001-10-30 17:30 ` Andrea Arcangeli 2001-10-31 0:38 ` Paul Mackerras [not found] ` <15327.8495.767553.389519@cargo.ozlabs.ibm.com> 0 siblings, 2 replies; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-30 17:30 UTC (permalink / raw) To: Giuliano Pochini Cc: Benjamin LaHaise, Linus Torvalds, David S. Miller, linux-kernel, Rik van Riel On Tue, Oct 30, 2001 at 06:23:26PM +0100, Giuliano Pochini wrote: > > >> #ifdef ? > > > > yes, but not for ppc, for alpha and all other archs without accessed bit > > provided in hardware (and cached in the tlb). > > PPCs have that bits. I'm not sure if they are cached in the tbl. yes, this is why it won't need to define the HAVE_NO_ACCESS_BIT_IN_TLB, only alpha will. but again I'm not sure if it worth given we just reduced the flush frequency of an order of magnitude. Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:30 ` Andrea Arcangeli @ 2001-10-31 0:38 ` Paul Mackerras [not found] ` <15327.8495.767553.389519@cargo.ozlabs.ibm.com> 1 sibling, 0 replies; 54+ messages in thread From: Paul Mackerras @ 2001-10-31 0:38 UTC (permalink / raw) To: linux-kernel Andrea Arcangeli writes: > On Tue, Oct 30, 2001 at 06:23:26PM +0100, Giuliano Pochini wrote: > > > > >> #ifdef ? > > > > > > yes, but not for ppc, for alpha and all other archs without accessed bit > > > provided in hardware (and cached in the tlb). > > > > PPCs have that bits. I'm not sure if they are cached in the tbl. > > yes, this is why it won't need to define the HAVE_NO_ACCESS_BIT_IN_TLB, > only alpha will. but again I'm not sure if it worth given we just > reduced the flush frequency of an order of magnitude. I don't think it has to do with whether the accessed bit is maintained by hardware or by software. On PPC, we actually maintain the accessed and dirty bits in software - we don't use the hardware-maintained R (referenced) and C (changed) bits, because they are in the HPTEs (the PTEs in the hash table), and it is hard to get back from a linux pte value to the corresponding HPTE. The performance impact of maintaining the accessed and dirty bits in software is negligible according to the benchmarks I have done, and it is much much simpler than trying to use the hardware R and C bits. Paul. ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <15327.8495.767553.389519@cargo.ozlabs.ibm.com>]
* Re: please revert bogus patch to vmscan.c [not found] ` <15327.8495.767553.389519@cargo.ozlabs.ibm.com> @ 2001-10-31 1:57 ` Andrea Arcangeli 0 siblings, 0 replies; 54+ messages in thread From: Andrea Arcangeli @ 2001-10-31 1:57 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux-kernel Hello Paul, On Wed, Oct 31, 2001 at 08:52:47AM +1100, Paul Mackerras wrote: > pte value to the corresponding HPTE. The performance impact of > maintaining the accessed and dirty bits in software is negligible The accessed bit is worthless to be maintained in software. Infact if it is true that ppc doesn't have an hardware maintained accessed bit (like alpha) this means ppc _doesn't_ want the tlb flush, just like alpha, unlike said previously today in other emails. the whole ptep_test_and_clear_young section + tlb flush should be #ifndeffed out for those archs. The waste is to care about this non existent accessed bit in first place for those archs. Those archs enterely depend on the minor faults to activate the cache and so we can only unmap unconditionally the inactive cache on those archs. Infact on those archs we should probably deactivate all the cache, not only the inactive one, so that it has a better change to keep the working set in active cache. with regard to the dirty bit while fixing x86 smp races, Ben did benchmarks that showed the performance hit isn't negligible if maintained in software, you know it generates the doube of page faults with the "right" benchmark, but maybe this is the only sane approch for ppc. Andrea ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 15:34 ` Rik van Riel 2001-10-30 15:51 ` Andrea Arcangeli 2001-10-30 16:13 ` Giuliano Pochini @ 2001-10-30 16:38 ` Linus Torvalds 2001-10-30 16:47 ` Benjamin LaHaise ` (2 more replies) 2 siblings, 3 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-30 16:38 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel On Tue, 30 Oct 2001, Rik van Riel wrote: > > Only on architectures where the TLB (or equivalent) is > small and only capable of holding entries for one address > space at a time. > > It's simply not true on eg PPC. Now, it's not true on _all_ PPC's. The sane PPC setups actually have a regular soft-filled TLB, and last I saw that actually performed _better_ than the stupid architected hash- chains. And for the broken OS's (ie AIX) that wants the hash-chains, you can always make the soft-fill TLB do the stupid thing.. (Yeah, yeah, I'm sure you can find code where the hash-chains are faster, especially big Fortran programs that have basically no tear-down and build-up overhead. Which was why those things were designed that way, of course. But it _looks_ like at least parts of IBM may finally be wising up to the fact that hashed TLB's are a stupid idea). Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:38 ` Linus Torvalds @ 2001-10-30 16:47 ` Benjamin LaHaise 2001-10-30 16:57 ` Victor Yodaiken 2001-10-30 17:23 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-30 16:47 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote: > Now, it's not true on _all_ PPC's. > > The sane PPC setups actually have a regular soft-filled TLB, and last I > saw that actually performed _better_ than the stupid architected hash- > chains. And for the broken OS's (ie AIX) that wants the hash-chains, you > can always make the soft-fill TLB do the stupid thing.. User Mode Linux has effectively an infinitely sized TLB. -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:38 ` Linus Torvalds 2001-10-30 16:47 ` Benjamin LaHaise @ 2001-10-30 16:57 ` Victor Yodaiken 2001-10-30 17:16 ` Troy Benjegerdes ` (2 more replies) 2001-10-30 17:23 ` Benjamin Herrenschmidt 2 siblings, 3 replies; 54+ messages in thread From: Victor Yodaiken @ 2001-10-30 16:57 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote: > > It's simply not true on eg PPC. > > Now, it's not true on _all_ PPC's. > > The sane PPC setups actually have a regular soft-filled TLB, and last I > saw that actually performed _better_ than the stupid architected hash- > chains. And for the broken OS's (ie AIX) that wants the hash-chains, you > can always make the soft-fill TLB do the stupid thing.. You can't turn off hardware hash-chains on anything past 603, sadly enough. So all Macs, many embedded boards, ... ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:57 ` Victor Yodaiken @ 2001-10-30 17:16 ` Troy Benjegerdes 2001-10-30 17:17 ` Linus Torvalds 2001-10-30 21:39 ` Paul Mackerras 2 siblings, 0 replies; 54+ messages in thread From: Troy Benjegerdes @ 2001-10-30 17:16 UTC (permalink / raw) To: Victor Yodaiken Cc: Linus Torvalds, Rik van Riel, Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 09:57:57AM -0700, Victor Yodaiken wrote: > On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote: > > > It's simply not true on eg PPC. > > > > Now, it's not true on _all_ PPC's. > > > > The sane PPC setups actually have a regular soft-filled TLB, and last I > > saw that actually performed _better_ than the stupid architected hash- > > chains. And for the broken OS's (ie AIX) that wants the hash-chains, you > > can always make the soft-fill TLB do the stupid thing.. > > You can't turn off hardware hash-chains on anything past 603, sadly enough. > So all Macs, many embedded boards, ... Actually, the 7450 (V'ger) from Mot supports software tablewalks. (This is what is in the > 800mhz dual mac G4's). The issue then becomes supporting both hardware and software TLB walks on SMP systems, since the 7450 is the first SMP capable PPC that can do software tlbwalks. -- Troy Benjegerdes | master of mispeeling | 'da hozer' | hozer@drgw.net -----"If this message isn't misspelled, I didn't write it" -- Me ----- "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shulz ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:57 ` Victor Yodaiken 2001-10-30 17:16 ` Troy Benjegerdes @ 2001-10-30 17:17 ` Linus Torvalds 2001-10-30 17:51 ` Victor Yodaiken 2001-10-30 18:01 ` Cort Dougan 2001-10-30 21:39 ` Paul Mackerras 2 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-30 17:17 UTC (permalink / raw) To: Victor Yodaiken Cc: Rik van Riel, Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel On Tue, 30 Oct 2001, Victor Yodaiken wrote: > > You can't turn off hardware hash-chains on anything past 603, sadly enough. Gods, I hope they have reconsidered that in their 64-bit chips. The 32-bit hash chains may be ugly, but the architected 32/64-bit MMU stuff is just so incredibly baroque that it makes any other MMU look positively beautiful ("Segments? Segments shmegments. Big deal"). I still have the occasional nightmares about the IBM block diagrams "explaining" the PowerPC MMU in their technical documentation. There's probably a perfectly valid explanation for them, though (*). Linus (*) Probably along the lines of the designers being so high on LSD that they thought it was a really cool idea. That would certainly explain it in a very logical fashion. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:17 ` Linus Torvalds @ 2001-10-30 17:51 ` Victor Yodaiken 2001-10-30 18:01 ` Cort Dougan 1 sibling, 0 replies; 54+ messages in thread From: Victor Yodaiken @ 2001-10-30 17:51 UTC (permalink / raw) To: Linus Torvalds Cc: Victor Yodaiken, Rik van Riel, Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel On Tue, Oct 30, 2001 at 09:17:31AM -0800, Linus Torvalds wrote: > I still have the occasional nightmares about the IBM block diagrams > "explaining" the PowerPC MMU in their technical documentation. > > There's probably a perfectly valid explanation for them, though (*). > > Linus > > (*) Probably along the lines of the designers being so high on LSD that > they thought it was a really cool idea. That would certainly explain it in > a very logical fashion. All the studies I saw were back from the days when cache-speed/expensive-memory-speed was close to 1. In this case, the effect of randomizing memory fetches is no big deal. The rest of standard PPC mmu architecture is pretty nice, but, if the Alpha architects could decide to use the PC cmos clock as their only prgrammable timer, and the Itanium guys could decide to put in a single shift-mask path, why shouldn't the IBM designers get to destroy cache by wasting a bunch of CPU area logic? ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:17 ` Linus Torvalds 2001-10-30 17:51 ` Victor Yodaiken @ 2001-10-30 18:01 ` Cort Dougan 1 sibling, 0 replies; 54+ messages in thread From: Cort Dougan @ 2001-10-30 18:01 UTC (permalink / raw) To: Linus Torvalds Cc: Victor Yodaiken, Rik van Riel, Andrea Arcangeli, Benjamin LaHaise, David S. Miller, linux-kernel Actually, they swung the pendelum the other way for the 64-bit chips. The VSID's (MM contexts) are indirectly accessed via a hash-table (with an on chip TLB-style cache called a SLB). The speedup from using the software table-walk actually came from emulating x86 instead of using the native hash tables. Pretty slick that emulating a 30-year old MMU and improves performance on the PowerPC, eh? There was an April Fools Microprocessor reports describing a processor that had gone 64-bit and had a "twisted gothic nightmare of twisted logic" based MMU that involved XOR-ing addresses with random numbers. They were unwittingly predicting the future of the PPC MMU. The nightmares and shakes have never ended for me, either. Sorry about that, man. } Gods, I hope they have reconsidered that in their 64-bit chips. The 32-bit } hash chains may be ugly, but the architected 32/64-bit MMU stuff is just } so incredibly baroque that it makes any other MMU look positively } beautiful ("Segments? Segments shmegments. Big deal"). } } I still have the occasional nightmares about the IBM block diagrams } "explaining" the PowerPC MMU in their technical documentation. } } There's probably a perfectly valid explanation for them, though (*). } } Linus } } (*) Probably along the lines of the designers being so high on LSD that } they thought it was a really cool idea. That would certainly explain it in } a very logical fashion. } } - } To unsubscribe from this list: send the line "unsubscribe linux-kernel" in } the body of a message to majordomo@vger.kernel.org } More majordomo info at http://vger.kernel.org/majordomo-info.html } Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:57 ` Victor Yodaiken 2001-10-30 17:16 ` Troy Benjegerdes 2001-10-30 17:17 ` Linus Torvalds @ 2001-10-30 21:39 ` Paul Mackerras 2001-10-30 22:36 ` Victor Yodaiken 2 siblings, 1 reply; 54+ messages in thread From: Paul Mackerras @ 2001-10-30 21:39 UTC (permalink / raw) To: linux-kernel Victor Yodaiken writes: > You can't turn off hardware hash-chains on anything past 603, sadly enough. > So all Macs, many embedded boards, ... All the 4xx, 8xx, and 82xx embedded PPC cpus use software TLB loading. The 7450 has a mode bit to say whether to use a hash table or software TLB loading. And the new "Book E" specification also mandates software-loaded TLBs. That still leaves almost all of the current Macs and RS/6000s using a hash table, of course. It does sound like providing at least the option to use software TLB loading is becoming common in new PPC designs. (And BTW they are not hash *chains*, there is no chaining involved. There is a primary bucket and a secondary bucket for any given address, each of which can hold 8 ptes.) Paul. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 21:39 ` Paul Mackerras @ 2001-10-30 22:36 ` Victor Yodaiken 0 siblings, 0 replies; 54+ messages in thread From: Victor Yodaiken @ 2001-10-30 22:36 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux-kernel On Wed, Oct 31, 2001 at 08:39:48AM +1100, Paul Mackerras wrote: > Victor Yodaiken writes: > > > You can't turn off hardware hash-chains on anything past 603, sadly enough. > > So all Macs, many embedded boards, ... > > All the 4xx, 8xx, and 82xx embedded PPC cpus use software TLB > loading. The 7450 has a mode bit to say whether to use a hash table > or software TLB loading. And the new "Book E" specification also > mandates software-loaded TLBs. > > That still leaves almost all of the current Macs and RS/6000s using a > hash table, of course. It does sound like providing at least the > option to use software TLB loading is becoming common in new PPC > designs. But, as I understand it, not in PPC64 or in the mainline PC series. What's really cool about the hashtable design is that each bucket will have at most one useful piece of information but still fill an entire cache line. > (And BTW they are not hash *chains*, there is no chaining involved. > There is a primary bucket and a secondary bucket for any given > address, each of which can hold 8 ptes.) Which are almost certain to be unrelated, so two consecutive TLB misses will almost certainly require two fetches from main memory. And, reusing a context requires a scrub of the hash table .. Once you start down this slope, all sorts ofother disadvantages appear for free. > > Paul. > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 16:38 ` Linus Torvalds 2001-10-30 16:47 ` Benjamin LaHaise 2001-10-30 16:57 ` Victor Yodaiken @ 2001-10-30 17:23 ` Benjamin Herrenschmidt 2001-10-30 17:36 ` Benjamin Herrenschmidt 2001-10-30 17:41 ` Linus Torvalds 2 siblings, 2 replies; 54+ messages in thread From: Benjamin Herrenschmidt @ 2001-10-30 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, David S. Miller, Andrea Arcangeli, Benjamin LaHaise, paulus >Now, it's not true on _all_ PPC's. > >The sane PPC setups actually have a regular soft-filled TLB, and last I >saw that actually performed _better_ than the stupid architected hash- >chains. And for the broken OS's (ie AIX) that wants the hash-chains, you >can always make the soft-fill TLB do the stupid thing.. > >(Yeah, yeah, I'm sure you can find code where the hash-chains are faster, >especially big Fortran programs that have basically no tear-down and >build-up overhead. Which was why those things were designed that way, of >course. But it _looks_ like at least parts of IBM may finally be wising up >to the fact that hashed TLB's are a stupid idea). Well, I lack experience to state which scheme is better, but there is definitely overhead intruduced by our hash management in linux on ppc, since we have to evicts pages from the hash as soon as we test&clear PAGE_ACCESSED or PAGE_DIRTY. That means we keep flushing pages out of the hash table, which seems to be defeat the purpose of that big hash table supposed to hold the PTEs for everybody out there more/less permanently. However, I was thinking about a possible solution, please tell me if I'm completely off here or if it makes sense. Since we can't (AFAIK) have linux use larger PTEs (in which case we could store a pointer to the hash PTE in the linux PTE), We could instead layout an array of pointer (one for each page) that would hold these. That way, we have a fast way to grab the real accessed and dirty bits, which means we no longer need to flush hash pages when getting those bits and have a more realistic PAGE_ACCESSED (currently, any page in the hash has PAGE_ACCESSED). Comments ? Ben. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:23 ` Benjamin Herrenschmidt @ 2001-10-30 17:36 ` Benjamin Herrenschmidt 2001-10-30 17:41 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Benjamin Herrenschmidt @ 2001-10-30 17:36 UTC (permalink / raw) To: Benjamin Herrenschmidt, Linus Torvalds, linux-kernel >Since we can't (AFAIK) have linux use larger PTEs (in which case we >could store a pointer to the hash PTE in the linux PTE), We could >instead layout an array of pointer (one for each page) that would >hold these. > >That way, we have a fast way to grab the real accessed and dirty >bits, which means we no longer need to flush hash pages when getting >those bits and have a more realistic PAGE_ACCESSED (currently, any page >in the hash has PAGE_ACCESSED). Obviously, an array of pointer would be horrible bloat, maybe the layout of a parallel page table (ARM does this I think) would make more sense. I don't know if it would be really an improvement however. ben. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 17:23 ` Benjamin Herrenschmidt 2001-10-30 17:36 ` Benjamin Herrenschmidt @ 2001-10-30 17:41 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-30 17:41 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: linux-kernel, David S. Miller, Andrea Arcangeli, Benjamin LaHaise, paulus On Tue, 30 Oct 2001, Benjamin Herrenschmidt wrote: > > Well, I lack experience to state which scheme is better, but there is > definitely overhead intruduced by our hash management in linux on ppc, > since we have to evicts pages from the hash as soon as we test&clear > PAGE_ACCESSED or PAGE_DIRTY. > That means we keep flushing pages out of the hash table, which seems > to be defeat the purpose of that big hash table supposed to hold the > PTEs for everybody out there more/less permanently. Exactly. The problem with the hash chains is that they were designed for _large_ jobs. For physicists that run a few copies of the same _huge_ load over and over, where the jobs take minutes, hours or even days to complete. These jobs also have a very noticeable memory footprint (and thus TLB component), and benchmarks like this sell systems. And hash chains work _wonderfully_ for them - they are basically an almost infinitely sized TLB, and the fact that cache locality is crap for them doesn't matter because you use a LOT of entries and you have a LOT of cache. However, most people don't use their machines that way. Unix is pretty much built up around running small, quick processes, and in _most_ normal usage patterns (sorry, physicists ;), the cost of tearing down and building up mappings is quite noticeable - often more so than the TLB misses themselves. In fact, I bet that for many apps, the number of TLB misses and the number of faults to populate the VM space are not different by more than an order of magnitude or so. They come, they eat, they go. And in that kind of schenario, where mappings don't have lifetimes on the order of minutes and hours at a time, hash chains suck. They make for horrible cache behaviour, and building them up and tearing them down more than makes up for any dubious win they had at TLB miss time. And I stress _dubious_. A tree-based TLB lookup has a lot better cache behaviour, and you can do a tree-based lookup in hardware quite efficiently. Yeah, if you didn't guess already, I despise hash chains. I'll take a bigger on-chip secondary TLB and proper address space identifiers any day over stupid hash chains. I think the Athlon does the former, but obviously you can't do the latter with Intel-compatibility. > Since we can't (AFAIK) have linux use larger PTEs (in which case we > could store a pointer to the hash PTE in the linux PTE), We could > instead layout an array of pointer (one for each page) that would > hold these. You can make Linux use any size PTE you wish - the Linux VM (including the page tables) is entirely software-driven (with the current limitation of having three levels of lookup - we'll probably change that when 42 bits of virtual user space gets tight and there are enough machines out there to matter, which is about another five years, I suspect). So storing a pointer to the hardware hash table if you want to is certainly possible. It has memory overhead, though, and decreased cache compactness etc, so... Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 1:34 ` David S. Miller 2001-10-30 1:42 ` Linus Torvalds @ 2001-10-30 1:49 ` Benjamin LaHaise 1 sibling, 0 replies; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-30 1:49 UTC (permalink / raw) To: David S. Miller; +Cc: riel, torvalds, linux-kernel On Mon, Oct 29, 2001 at 05:34:00PM -0800, David S. Miller wrote: > I'm asking him to show the case that "breaks for something > else". Why don't you try lifting your fingers to defend your position? We know that: 1. the change introduces state data to tlb with no limits on the duration of said stale data 2. some architectures have large tlbs. For the sake of (2), hashed tlbs count as excessively large tlbs. See the UML, ia64 and powerpcs for this kind of setup. Now, if you want to live in microbenchmark land, then, yes, the "optimized" but invalid behaviour will *always* win. Back in the real world, sure, I can construct a microbenchmark that will sit on 2 pages of memory spinning until the pages get swapped out. What does that buy us? Nothing other than placating you. I've already pointed out that the optimization can be done in a valid way. If you can't go through the effort of completing the above thought experiment on your own, then I really have no reason to care about your opinion any further. You've already stated you don't care about mine, and I've stated that I don't feel like going to the effort to placate you. Where does that leave us? Nowhere, I guess. -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:50 ` David S. Miller 2001-10-29 23:51 ` Benjamin LaHaise 2001-10-30 1:31 ` Rik van Riel @ 2001-10-30 9:03 ` Alan Cox 2 siblings, 0 replies; 54+ messages in thread From: Alan Cox @ 2001-10-30 9:03 UTC (permalink / raw) To: David S. Miller; +Cc: bcrl, torvalds, linux-kernel > Numbers talk, bullshit walks. So you are walking somewhere today ? Dave I can produce equivalently valid microbenchmarks showing Linux works much better with the scheduler disabled. They are worth about as much as your benchmarks for that optimisation and they likewise ignore a slightly important object known as "the big picture" ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:14 ` David S. Miller 2001-10-29 23:15 ` Benjamin LaHaise @ 2001-10-29 23:22 ` Linus Torvalds 2001-10-29 23:29 ` Benjamin LaHaise 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2001-10-29 23:22 UTC (permalink / raw) To: David S. Miller; +Cc: bcrl, linux-kernel On Mon, 29 Oct 2001, David S. Miller wrote: > Date: Mon, 29 Oct 2001 18:08:37 -0500 > > is completely bogus. Without the tlb flush, the system may never update > the accessed bit on a page that is heavily being used. > > It's intentional Ben, think about the high cost of the SMP invalidate > when kswapd is just scanning page tables. Indeed. I thought it shouldn't mater, but apparently it does.. Does it make the accessed bit less reliable? Sure it does. But basically, either the page is accessed SO much that it stays in the TLB all the time (which is basically not really possible if you page heavily, I suspect), or it will age out of the TLB on its own at which point we get the accessed bit back. In the worst case it does generate more noise in the reference bit logic, but .. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:22 ` Linus Torvalds @ 2001-10-29 23:29 ` Benjamin LaHaise 2001-10-29 23:44 ` Linus Torvalds 2001-10-30 0:02 ` Hugh Dickins 0 siblings, 2 replies; 54+ messages in thread From: Benjamin LaHaise @ 2001-10-29 23:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-kernel On Mon, Oct 29, 2001 at 03:22:28PM -0800, Linus Torvalds wrote: > Does it make the accessed bit less reliable? Sure it does. But basically, > either the page is accessed SO much that it stays in the TLB all the time > (which is basically not really possible if you page heavily, I suspect), > or it will age out of the TLB on its own at which point we get the > accessed bit back. Think of CPUs with tagged tlbs and lots of entries. Or even a system that only runs 1 threaded app. Easily triggerable. If people want to optimise it, great. But go for correctness first, please... -ben ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:29 ` Benjamin LaHaise @ 2001-10-29 23:44 ` Linus Torvalds 2001-10-30 0:02 ` Hugh Dickins 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-29 23:44 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David S. Miller, linux-kernel On Mon, 29 Oct 2001, Benjamin LaHaise wrote: > > Think of CPUs with tagged tlbs and lots of entries. Or even a system that > only runs 1 threaded app. Easily triggerable. If people want to optimise > it, great. But go for correctness first, please... "easily triggerable"? I doubt you'll find _any_ system where you can trigger it. Think about it: in order to get the kind of VM pressure that causes page-outs, your TLB pressure will be a lot higher than _any_ CPU I have ever heard about. 256 TLB entries is considered "a lot". Tagged or not, if the VM pressure is so big that we need to swap, those 256 entries are a grain of sand in sahara. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:29 ` Benjamin LaHaise 2001-10-29 23:44 ` Linus Torvalds @ 2001-10-30 0:02 ` Hugh Dickins 2001-10-30 0:04 ` Linus Torvalds 2001-10-30 0:04 ` David S. Miller 1 sibling, 2 replies; 54+ messages in thread From: Hugh Dickins @ 2001-10-30 0:02 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, David S. Miller, linux-kernel On Mon, 29 Oct 2001, Benjamin LaHaise wrote: > Think of CPUs with tagged tlbs and lots of entries. Or even a system that > only runs 1 threaded app. Easily triggerable. If people want to optimise > it, great. But go for correctness first, please... But was the original flush_tlb_page() fully correct? In i386 it's if (vma->vm_mm == current->active_mm) __flush_tlb_one(addr); without reference to whether vma->vm_mm is active on another cpu. Hugh ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 0:02 ` Hugh Dickins @ 2001-10-30 0:04 ` Linus Torvalds 2001-10-30 0:04 ` David S. Miller 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2001-10-30 0:04 UTC (permalink / raw) To: Hugh Dickins; +Cc: Benjamin LaHaise, David S. Miller, linux-kernel On Tue, 30 Oct 2001, Hugh Dickins wrote: > > But was the original flush_tlb_page() fully correct? In i386 it's > if (vma->vm_mm == current->active_mm) > __flush_tlb_one(addr); > without reference to whether vma->vm_mm is active on another cpu. No, you're looking at the UP inline version - the SMP version is elsewhere. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-30 0:02 ` Hugh Dickins 2001-10-30 0:04 ` Linus Torvalds @ 2001-10-30 0:04 ` David S. Miller 1 sibling, 0 replies; 54+ messages in thread From: David S. Miller @ 2001-10-30 0:04 UTC (permalink / raw) To: hugh; +Cc: bcrl, torvalds, linux-kernel From: Hugh Dickins <hugh@veritas.com> Date: Tue, 30 Oct 2001 00:02:53 +0000 (GMT) But was the original flush_tlb_page() fully correct? In i386 it's if (vma->vm_mm == current->active_mm) __flush_tlb_one(addr); without reference to whether vma->vm_mm is active on another cpu. You're looking at the non-SMP version :-) Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: please revert bogus patch to vmscan.c 2001-10-29 23:08 please revert bogus patch to vmscan.c Benjamin LaHaise 2001-10-29 23:14 ` David S. Miller @ 2001-10-29 23:51 ` Paul Mackerras 1 sibling, 0 replies; 54+ messages in thread From: Paul Mackerras @ 2001-10-29 23:51 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, linux-kernel Benjamin LaHaise writes: > The following: > > @@ -50,7 +50,6 @@ > > /* Don't look at this pte if it's been accessed recently. */ > if (ptep_test_and_clear_young(page_table)) { > - flush_tlb_page(vma, address); > mark_page_accessed(page); > return 0; > } > > is completely bogus. Without the tlb flush, the system may never update > the accessed bit on a page that is heavily being used. On PPC, the page wouldn't even need to be being heavily used. Most PPCs have an MMU hash table that we use as a level-2 cache for the TLB. With this change, we won't see the accessed bit being set again for any page unless there is so much memory in use that we start evicting PTEs from the hash table, and that is very rare in practice. So I'm with Ben on this one. Paul. ^ permalink raw reply [flat|nested] 54+ messages in thread
end of thread, other threads:[~2001-10-31 1:57 UTC | newest]
Thread overview: 54+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-10-29 23:08 please revert bogus patch to vmscan.c Benjamin LaHaise
2001-10-29 23:14 ` David S. Miller
2001-10-29 23:15 ` Benjamin LaHaise
2001-10-29 23:25 ` Linus Torvalds
2001-10-29 23:33 ` Benjamin LaHaise
2001-10-29 23:36 ` David S. Miller
2001-10-29 23:39 ` Benjamin LaHaise
2001-10-29 23:41 ` Linus Torvalds
2001-10-29 23:48 ` Benjamin LaHaise
2001-10-29 23:50 ` David S. Miller
2001-10-29 23:51 ` Benjamin LaHaise
2001-10-29 23:55 ` David S. Miller
2001-10-29 23:57 ` Benjamin LaHaise
2001-10-30 0:01 ` David S. Miller
2001-10-30 0:05 ` Benjamin LaHaise
2001-10-29 23:58 ` Linus Torvalds
2001-10-30 1:31 ` Rik van Riel
2001-10-30 1:34 ` David S. Miller
2001-10-30 1:42 ` Linus Torvalds
2001-10-30 1:46 ` David S. Miller
2001-10-30 2:25 ` Benjamin LaHaise
2001-10-30 15:20 ` Andrea Arcangeli
2001-10-30 15:34 ` Rik van Riel
2001-10-30 15:51 ` Andrea Arcangeli
2001-10-30 16:34 ` Benjamin LaHaise
2001-10-30 17:00 ` Andrea Arcangeli
2001-10-30 17:07 ` Rik van Riel
2001-10-30 16:13 ` Giuliano Pochini
2001-10-30 16:54 ` Andrea Arcangeli
2001-10-30 17:23 ` Giuliano Pochini
2001-10-30 17:30 ` Andrea Arcangeli
2001-10-31 0:38 ` Paul Mackerras
[not found] ` <15327.8495.767553.389519@cargo.ozlabs.ibm.com>
2001-10-31 1:57 ` Andrea Arcangeli
2001-10-30 16:38 ` Linus Torvalds
2001-10-30 16:47 ` Benjamin LaHaise
2001-10-30 16:57 ` Victor Yodaiken
2001-10-30 17:16 ` Troy Benjegerdes
2001-10-30 17:17 ` Linus Torvalds
2001-10-30 17:51 ` Victor Yodaiken
2001-10-30 18:01 ` Cort Dougan
2001-10-30 21:39 ` Paul Mackerras
2001-10-30 22:36 ` Victor Yodaiken
2001-10-30 17:23 ` Benjamin Herrenschmidt
2001-10-30 17:36 ` Benjamin Herrenschmidt
2001-10-30 17:41 ` Linus Torvalds
2001-10-30 1:49 ` Benjamin LaHaise
2001-10-30 9:03 ` Alan Cox
2001-10-29 23:22 ` Linus Torvalds
2001-10-29 23:29 ` Benjamin LaHaise
2001-10-29 23:44 ` Linus Torvalds
2001-10-30 0:02 ` Hugh Dickins
2001-10-30 0:04 ` Linus Torvalds
2001-10-30 0:04 ` David S. Miller
2001-10-29 23:51 ` Paul Mackerras
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox