* clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:15 UTC
To: torvalds; +Cc: linux-arch

During a kernel build, this is what tops the profiling charts for me on sparc64 currently, and it drives me crazy :-) I've optimized the sparc64 page zeroing as much as I possibly could, so that's not worth tinkering with any longer.

The PPC people used to zero out pages in the cpu idle loop, and I'd definitely like to do something along those lines on sparc64 as well; I feel it would be extremely effective.

There is a lot of code path in there for alloc_pages_vma(). I don't think adding arch-overridable stuff is the way to go here; better would be something generic in the per-cpu hot/cold page list handling that the cpu_idle() loop of each architecture could call. Perhaps a page flags bit that says "pre-zeroed" or something. Then my clear_user_page() code on sparc64 could just test that page bit and return if it is set. Page free would need to clear the bit, of course.

I have no real concrete ideas yet, but I know that while I'm looking at some source code in an editor, my cpus could zero out all the free pages in the system in a second or two :-)

Comments?

^ permalink raw reply [flat|nested] 41+ messages in thread
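The "pre-zeroed bit" scheme Dave sketches can be modeled in a few lines of user-space C. Everything below (fake_page, PG_ZEROED, idle_zero, clear_on_fault) is a hypothetical stand-in for the real struct page and page-flags machinery, shown only to illustrate the control flow, not actual kernel API:

```c
#include <string.h>

#define FAKE_PAGE_SIZE 4096
#define PG_ZEROED 0x1UL   /* hypothetical "pre-zeroed" page flag bit */

/* Hypothetical stand-in for struct page. */
struct fake_page {
    unsigned long flags;
    unsigned char data[FAKE_PAGE_SIZE];
};

/* Idle-loop side: zero a dirty free page and mark it. */
static void idle_zero(struct fake_page *p)
{
    if (!(p->flags & PG_ZEROED)) {
        memset(p->data, 0, FAKE_PAGE_SIZE);
        p->flags |= PG_ZEROED;
    }
}

/* clear_user_page() side: skip the clear when the bit is set.
 * Returns 1 if the memset was skipped, 0 if it had to be done. */
static int clear_on_fault(struct fake_page *p)
{
    int skipped = (p->flags & PG_ZEROED) != 0;
    if (!skipped)
        memset(p->data, 0, FAKE_PAGE_SIZE);
    p->flags &= ~PG_ZEROED;   /* reuse must clear the bit */
    return skipped;
}

/* Walk through the scenario: dirty free page, idle pass, then fault. */
static int demo(void)
{
    static struct fake_page page;
    page.flags = 0;
    page.data[7] = 0xAA;      /* page is dirty while on the free list */
    idle_zero(&page);         /* cpu_idle() cleans it up */
    return clear_on_fault(&page) == 1 && page.data[7] == 0;
}
```

The interesting property is visible in clear_on_fault(): the fault path pays only a flag test when the idle loop got there first.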
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-11 23:31 UTC
To: David S. Miller; +Cc: Linus Torvalds, Linux Arch list

> There is a lot of code path in there for alloc_pages_vma().
> I don't think adding arch overridable stuff is the way
> to go here. Something generic in the per-cpu hot/cold
> page list handling that the cpu_idle() loop of each architecture
> could call.

It would be nice, indeed, though we have to be careful not to waste too much time in there looking for pages to clear, especially when there are none. Time spent not putting the CPU into power-managed idle, at least on PPCs, means the CPU getting hotter, consuming more battery, etc., which is definitely a bad thing on laptops.

I already took a significant hit with HZ=1000, btw; I'm considering lowering it back to 100, on ppc32 at least... We really want tickless scheduling for these beasts so we can select how deeply to power-manage the CPU based on how long we expect to stay idle.

Ben.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:55 UTC
To: Benjamin Herrenschmidt; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 09:31:03 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> It would be nice, indeed, though we have to be careful not to waste
> too much time in there looking for pages to clear, especially when
> there are none. Time spent not putting the CPU into power-managed
> idle, at least on PPCs, means the CPU getting hotter, consuming
> more battery, etc., which is definitely a bad thing on laptops.

I totally agree. This is why I believe it should be a per-arch decision at cpu_idle() time whether to do the clears or not.

> I already took a significant hit with HZ=1000, btw; I'm considering
> lowering it back to 100, on ppc32 at least... We really want tickless
> scheduling for these beasts so we can select how deeply to power-manage
> the CPU based on how long we expect to stay idle.

I think dynamic-resolution timers are the way to go here. Rusty was talking about something along these lines at the networking summit.

The reason Rusty had brought it up was an ipv6 problem: there are some timers that need to reach so far into the future that with HZ=1000 the interval isn't representable in the 32-bit timer offsets. We have jiffies_64 and could move the timer struct over to a u64 for the offsets, but that seems like overkill.
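The overflow Rusty was worried about is easy to quantify: a signed 32-bit jiffies offset at HZ=1000 wraps in under 25 days. A quick back-of-the-envelope helper (plain user-space C, nothing kernel-specific):

```c
#include <stdint.h>

/* Longest timeout, in seconds, representable as a signed 32-bit
 * jiffies offset at a given tick rate. */
static int64_t max_timeout_secs(int64_t hz)
{
    return INT32_MAX / hz;
}
```

At HZ=1000 this gives 2147483 seconds, roughly 24.8 days; at HZ=100 it is about 248 days, which is why the problem only surfaced when the tick rate went up.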
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-12 0:03 UTC
To: David S. Miller; +Cc: Linus Torvalds, Linux Arch list

> I think dynamic-resolution timers are the way to go here.
> Rusty was talking about something along these lines at
> the networking summit.

Yup, several people talked about it at KS/OLS, and I think s390 has some implementation already, though I haven't had time to look at it yet; hopefully that will happen sooner or later.

> The reason Rusty had brought it up was an ipv6 problem: [...]

-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Re: clear_user_highpage()
From: William Lee Irwin III @ 2004-08-12 1:18 UTC
To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linus Torvalds, Linux Arch list

At some point in the past, someone wrote:
>> I think dynamic-resolution timers are the way to go here.
>> Rusty was talking about something along these lines at
>> the networking summit.

On Thu, Aug 12, 2004 at 10:03:37AM +1000, Benjamin Herrenschmidt wrote:
> Yup, several people talked about it at KS/OLS and I think s390 has
> some implementation already, though I haven't had time to look at it
> yet; hopefully that will happen sooner or later.

Zwane has a tickless idling patch for i386 already (not sure if it's been posted yet). I'm looking at helping out with it at some point, at least if Zwane stops churning out new functionality long enough for me to get a line in edgewise. =)

-- wli
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 2:11 UTC
To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linus Torvalds, Linux Arch list

On Thu, Aug 12, 2004 at 10:03:37AM +1000, Benjamin Herrenschmidt wrote:
> > I think dynamic-resolution timers are the way to go here.
> > Rusty was talking about something along these lines at
> > the networking summit.
>
> Yup, several people talked about it at KS/OLS and I think s390 has
> some implementation already, though I haven't had time to look at it
> yet; hopefully that will happen sooner or later.

My main issue with the s390 approach is that they actually wanted to get rid of jiffies. That's fine for the long term, but short term it would be a big problem because it would break everything. IMHO the way to do it would be to define jiffies as a function and keep virtual jiffies using the CLOCK_MONOTONIC timer. Then only tick at a low frequency for statistics ticks, with the tick disabled when idle or until an actual event is scheduled.

I didn't have time to actually work on it; the s390 guys are writing actual code, so they win for now :)

-Andi
* Re: clear_user_highpage()
From: Martin Schwidefsky @ 2004-08-12 9:23 UTC
To: Andi Kleen; Cc: Benjamin Herrenschmidt, David S. Miller, Linux Arch list, Linus Torvalds

> My main issue with the s390 approach is that they actually
> wanted to get rid of jiffies. That's fine for the long term,
> but short term it would be a big problem because it would
> break everything. [...]

In the end we indeed want to get rid of jiffies altogether, but it's a long road. What we are currently doing is replacing some of the dependencies on jiffies in the common code. E.g. instead of doing the cpu time accounting on a tick basis, we want to introduce a cputime_t type that is in principle not related to jiffies (though it is in the generic implementation). The process accounting is then done by a new account_cputime function that can be called at any time and can be passed any amount of cputime. On s390 we'll define cputime_t based on a virtual cpu timer with microsecond resolution.

This solves two problems: the first is the fact that running on a virtual processor f**** up your tick-based accounting badly; the second is the accuracy of the numbers in /proc/stat. The second step is to untie the time slices from the jiffies tick.

> I didn't have time to actually work on it; the s390 guys
> are writing actual code, so they win for now :)

We win ?!? Hey, we win ;-))

I posted a first set of cputime patches on lkml last week. I'll probably post them here shortly, in the hope of getting a little more feedback from you arch guys.

blue skies,
  Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-11 23:46 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:
>
> During a kernel build, this is what tops the profiling charts
> for me on sparc64 currently. This drives me crazy :-)

Think of it this way: if that function is your top function, then you're doing really really well. It's a good function to have at the top.

> The PPC people used to zero out pages in the cpu idle loop
> and I'd definitely like to do something along those lines
> on sparc64 as well, I feel it would be extremely effective.

No. It sucks. It sucks so bad it's not funny.

It sucks because it eats CPU and memory bandwidth when they shouldn't be eaten. It's a total disaster on SMP, but it's bad on UP too.

It sucks because it does bad things to cache behaviour. Sure, you'll move the cost away from "clear_user_highpage", but the thing is, you will _not_ move it into the idle time. What you will do is move it into some random time _after_ the idle time, when the idle thing has crapped all over your caches.

The thing is, you make your cache footprint per CPU _much_ bigger, and you spread it out a lot over time too, so you make it even worse.

The clearing will then be totally hidden in the profiles, because you will have turned a nice and well-behaved "this is where the time goes" profile into a mush of "we're taking cache misses at random times, and we don't know why".

That, btw, is a _classic_ mistake in profiling: moving the work around so that it's not as visible any more. In other words, don't do it. It's a mistake. It is optimizing the profile without actually optimizing what you want _done_.

Btw, this is exactly what the totally brain-damaged slab stuff does. It takes away the peaks, but does so by having worse cache access patterns all around.

Look at it this way:

 - it might be worth doing in big batches under some kind of user control, when you really can _control_ that it happens at a good time. I _might_ buy into this argument. Make it a batch thing that really screws the caches, but only does so very seldom, when the user asked for it.

 - but we aren't supposed to have that much memory free _anyway_, and trying to keep it around on a separate list is horrible for fragmentation. So batching huge things up is likely not a good idea either.

 - with caches growing larger, it's actually BETTER to clear the page at usage time, because then the CPU that actually touches the page won't have to bring the page in from memory. We'll blow one page of cache by clearing it, but we will blow it in a "good" way - hopefully with almost no memory traffic at all (ie the clear can be done as pure invalidate cycles, with no read-back into the CPU). And the thing is, the background clearing will just get worse and worse.

In summary: it's a _good_ thing when you see a sharp peak in your profiles, and you can say "I know exactly what that peak is for, and it's doing exactly the work it should be doing and nothing else".

		Linus
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:53 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 16:46:10 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> It sucks because it does bad things to cache behaviour. Sure, you'll move
> the cost away from "clear_user_highpage", but the thing is, you will _not_
> move it into the idle time. What you will do is move it into some
> random time _after_ the idle time, when the idle thing has crapped all
> over your caches.

It won't crap out the caches on sparc64, or on any platform with cache-bypass-on-miss stores. I believe ia64 and Opteron have similar mechanisms. If the store misses the L2 cache, it goes straight out to main memory; it doesn't allocate cache lines anywhere in such cases.

I think ppc/ppc64 has this too... no, sorry, it has the data-cache allocate-line-and-zero instruction, which isn't what you want here.

> The thing is, you make your cache footprint per CPU _much_ bigger, and you
> spread it out a lot over time too, so you make it even worse.
>
> The clearing will then be totally hidden in the profiles, because you will
> have turned a nice and well-behaved "this is where the time goes" profile
> into a mush of "we're taking cache misses at random times, and we don't
> know why".

Therefore, I do not believe any of this is applicable.
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-12 0:00 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:

> It won't crap out the caches on sparc64, or on any platform with
> cache-bypass-on-miss stores. I believe ia64 and Opteron have similar
> mechanisms.

You didn't read my message. If it doesn't crap on the caches when you do the stores, it _will_ crap on the bus, both when you do the stores _and_ when you actually read the page.

In other words, you will have taken _more_ of a hit later on. It's just that it won't be a nice profile hit; it will be a nasty "everything runs slower later".

Caches work best when you have good temporal locality. You are removing that locality, and thus you are making your caches _less_ efficient.

That's a very _fundamental_ argument.

> If the store misses the L2 cache, it goes straight out to main memory;
> it doesn't allocate cache lines anywhere in such cases.
>
> I think ppc/ppc64 has this too... no, sorry, it has the data-cache
> allocate-line-and-zero instruction, which isn't what you want here.

It's exactly what you _do_ want, it's just that you want it in "clear_user_highpage()". Then you have the perfect cache behaviour, assuming your cache is big enough that it will likely get a good hit ratio on the new page.

And let's admit it now: caches _are_ big enough that they get good hit ratios on things with good temporal locality.

Larger caches will happen. My argument will only get more relevant. Your approach will force cache misses and tons of memory bus traffic.

		Linus
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-12 0:06 UTC
To: Linus Torvalds; +Cc: David S. Miller, Linux Arch list

> Caches work best when you have good temporal locality. You are removing
> that locality, and thus you are making your caches _less_ efficient.
>
> That's a very _fundamental_ argument.

Ok, that may be why it was removed from ppc then; I should ask Paul.

Ben.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 0:24 UTC
To: Benjamin Herrenschmidt; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 10:06:37 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> Ok, that may be why it was removed from ppc then; I should ask Paul.

I think it might have more to do with the fact that they got tired of locally patching their tree all the time. It required changes to generic code which they could never get merged upstream.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 0:23 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 17:00:37 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> You didn't read my message. If it doesn't crap on the caches when you do
> the stores, it _will_ crap on the bus, both when you do the stores _and_
> when you actually read the page.

I understand what you're saying.

> Caches work best when you have good temporal locality. You are removing
> that locality, and thus you are making your caches _less_ efficient.
>
> That's a very _fundamental_ argument.

Here is some more data.

If I use the cache-bypassing stores on sparc64 for clear page (which I do and always have), it takes roughly 4400 cycles to clear a page out on a 750MHz cpu, regardless of whether the page is in the L2 cache or not.

Conversely, I played with a version that did not bypass the cache: on a cache hit it was phenomenal, about twice as fast, but in the cache miss case it was very slow, some 20,000 cycles. I played around with trying to prefetch the data into the L2 cache; that didn't help much in the miss case at all.

Also, when the user takes that first write fault on the anonymous page, it typically accesses the first several bytes (it is usually a malloc chunk or similar); it doesn't typically walk the entire page. So to me, bringing the whole thing in seems inefficient. Let the process bring the cache lines in when they're really needed, which (for all the cache lines in that page) is not necessarily when the write fault occurs and we clear the page out. If it happened to be in the L2 cache at clear_user_highpage() time, it'll stay there during the clearing, and that's great too.

Is that logic fundamentally flawed?

> Larger caches will happen. My argument will only get more relevant. Your
> approach will force cache misses and tons of memory bus traffic.

I agree with you. But I believe, given the data above wrt. sparc64, it is a profitable scheme at least on that platform.

You definitely have piqued my interest in some things, though. I'll try out the expensive clear_user_highpage() that always brings the data into the L2 cache, and see if that makes kernel builds faster. Although I think the fact that clear_user_highpage() will be 5 times slower in the L2 miss case might nullify any gains that always bringing the data in for the user might give. We'll see.
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-12 1:46 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:
>
> If I use the cache-bypassing stores on sparc64 for clear page (which I
> do and always have), it takes roughly 4400 cycles to clear a page out
> on a 750MHz cpu, regardless of whether the page is in the L2 cache or
> not.
>
> Conversely, I played with a version that did not bypass the cache: on
> a cache hit it was phenomenal, about twice as fast, but in the cache
> miss case it was very slow, some 20,000 cycles. I played around with
> trying to prefetch the data into the L2 cache; that didn't help much
> in the miss case at all.

Ok. This is exactly why you want to have an "establish cache line" instruction: you _cannot_ make a perfect memset without one.

I'm surprised that even CPUs that have cache control instructions don't have that very fundamental "establish" one. ppc is actually the only one I know of that does. Clearly the ultrasparc doesn't figure out the clear-cache-line case, and makes the regular memset() be a fairly synchronous "read cacheline + writeout". Which will indeed suck.

> So to me, bringing the whole thing in seems inefficient.

Absolutely. What we want from a software perspective is a "get exclusive cacheline without reading it from memory", using a cache line invalidate setup rather than reading it.

> Is that logic fundamentally flawed?

I suspect that the cache-bypass stores might be the right thing until the cache grows big enough that it hurts more than it helps. Is there no "store to cache line, but do not establish" instruction? Sounds like that should be the fastest one for your setup.

> You definitely have piqued my interest in some things, though. I'll try
> out the expensive clear_user_highpage() that always brings the data into
> the L2 cache, and see if that makes kernel builds faster. [...]

Yeah, sounds horrible. I can't imagine that the cost of bringing it into the cache, if it wasn't there already, can ever really help you. Then you might as well wait to bring it in until much later.

		Linus
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 2:51 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 18:46:56 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> Ok. This is exactly why you want to have an "establish cache line"
> instruction: you _cannot_ make a perfect memset without one.

I can prefetch for one or multiple writes, but these only install the cacheline in exclusive state if no other cpu responds to the snoop.

> Clearly the ultrasparc doesn't figure out the clear-cache-line case,
> and makes the regular memset() be a fairly synchronous "read cacheline
> + writeout". Which will indeed suck.

The cache-bypassing block stores store 64 bytes at a time (i.e. a full cache line). So the line either goes directly into the L2 cache from the write-cache (which itself is 2K) or it goes right out to the memory bus as a cacheline write.

> Absolutely. What we want from a software perspective is a "get exclusive
> cacheline without reading it from memory", using a cache line invalidate
> setup rather than reading it.

Yes. For the "hit in L2" case, that is what the cache-bypassing stores on sparc64 effectively do.

> Is there no "store to cache line, but do not establish" instruction?
> Sounds like that should be the fastest one for your setup.

Yes, but it acts that way only on an L2 hit.

> Yeah, sounds horrible. I can't imagine that the cost of bringing it into
> the cache, if it wasn't there already, can ever really help you. Then you
> might as well wait to bring it in until much later.

I'm still undecided. I think there is real value in the issue William and I keep bringing up, which is that the arguments you propose hinge upon the process using some significant portion of the page right after the anonymous page fault, and I concur with William that this is not typically the case.
* Re: clear_user_highpage()
From: Paul Mackerras @ 2004-08-16 1:58 UTC
To: David S. Miller; +Cc: Linus Torvalds, linux-arch

David S. Miller writes:

> If I use the cache-bypassing stores on sparc64 for clear page (which I
> do and always have), it takes roughly 4400 cycles to clear a page out
> on a 750MHz cpu, regardless of whether the page is in the L2 cache or
> not.

Just for fun (and to make Dave jealous :) I instrumented clear_page() on the G5 to measure the number of calls and the total time taken. (Note that clear_user_highpage calls clear_page, which gets inlined.)

The result was that clear_page takes an average of 96ns (192 cycles) per page on my 2-way 2GHz G5. Our pages are 4k, so you would have to double that to get a fair comparison with the sparc, but even then we are still only taking 9% of the cycles. :)

This is using the dcbz (data cache block zero) instruction, which makes a cache line exclusive in the cache and zeroes it without memory traffic (there is some bus traffic on SMP because it has to issue a kill to all the other processors). There will of course be memory traffic later as those cache lines get written back, but that occurs in cacheline-sized bursts, and to the extent that the program writes to the page before the lines get written back, we win.

Regards,
Paul.
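The dcbz-based clear_page Paul describes boils down to a loop like the following sketch. This is illustrative C with ppc inline asm, not the actual kernel implementation; the 128-byte dcbz block size (as on the G5) and 4k page size are assumptions:

```c
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_LINE_SIZE 128   /* dcbz block size assumed, per the G5 */

/* Zero a page without reading it from memory: each dcbz establishes
 * the cache line in exclusive state, already zeroed, so the only bus
 * traffic is the ownership "kill" sent to the other CPUs. */
static void clear_page_dcbz(void *page)
{
    unsigned long off;

    for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_LINE_SIZE)
        __asm__ __volatile__("dcbz 0,%0"
                             : /* no outputs */
                             : "r"((char *)page + off)
                             : "memory");
}
```

This is the "establish cache line" primitive Linus asked for above: the memset disappears entirely, at the cost of the line being written back to memory later if the program never reuses it.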
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 2:08 UTC
To: Linus Torvalds; +Cc: David S. Miller, linux-arch

On Wed, Aug 11, 2004 at 05:00:37PM -0700, Linus Torvalds wrote:
> You didn't read my message. If it doesn't crap on the caches when you do
> the stores, it _will_ crap on the bus, both when you do the stores _and_
> when you actually read the page.

I discovered this the hard way on Opteron too. At some point I was doing clear_page using cache-bypassing write-combining stores. That was done because it was faster in microbenchmarks that just tested the function, but on actual macro benchmarks it was quite bad, because the applications were eating cache misses all the time. Doing it in the idle loop would have the same problem.

Where I could see it making sense is for page table pages, though (especially if you cache in a bitmap which ptes have actually been touched and ignore the rest).

> In other words, you will have taken _more_ of a hit later on. It's just
> that it won't be a nice profile hit; it will be a nasty "everything runs
> slower later".

Yep, it's a bad idea.

-Andi
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 2:45 UTC
To: Andi Kleen; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 04:08:25 +0200
Andi Kleen <ak@suse.de> wrote:

> I discovered this the hard way on Opteron too. At some point I was
> doing clear_page using cache-bypassing write-combining stores. That
> was done because it was faster in microbenchmarks that just tested
> the function, but on actual macro benchmarks it was quite bad,
> because the applications were eating cache misses all the time.

Do these cache-bypassing stores use the L2 cache on a hit?
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 9:09 UTC
To: David S. Miller; +Cc: torvalds, linux-arch

On Wed, 11 Aug 2004 19:45:45 -0700
"David S. Miller" <davem@redhat.com> wrote:

> Do these cache-bypassing stores use the L2 cache on a hit?

No, they invalidate the cache.

-Andi
* Re: clear_user_highpage() 2004-08-12 9:09 ` clear_user_highpage() Andi Kleen @ 2004-08-12 19:50 ` David S. Miller 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox 0 siblings, 2 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 19:50 UTC (permalink / raw) To: Andi Kleen; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 11:09:24 +0200 Andi Kleen <ak@suse.de> wrote: > On Wed, 11 Aug 2004 19:45:45 -0700 > "David S. Miller" <davem@redhat.com> wrote: > > > Do these cache-bypassing stores use the L2 cache on a hit? > > No, they invalidate the cache. That explains, at least partly, why they performed so poorly. Is there any other platform that has the same kind of block stores sparc64 does (basically use L2 cache if line present, else bypass L2 cache for the store and do not allocate L2 cache lines for the data)? I bet ia64 does have something like this. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 19:50 ` clear_user_highpage() David S. Miller @ 2004-08-12 20:00 ` Andi Kleen 2004-08-12 20:30 ` clear_user_highpage() David S. Miller 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox 1 sibling, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-08-12 20:00 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 12:50:59 -0700 "David S. Miller" <davem@redhat.com> wrote: > On Thu, 12 Aug 2004 11:09:24 +0200 > Andi Kleen <ak@suse.de> wrote: > > > On Wed, 11 Aug 2004 19:45:45 -0700 > > "David S. Miller" <davem@redhat.com> wrote: > > > > > Do these cache-bypassing stores use the L2 cache on a hit? > > > > No, they invalidate the cache. > > That explains, at least partly, why they performed so poorly. Well, the writes are usually faster. While they don't use the cache, they use special write-combining buffers in the CPU that hold the data until it can blast out a full cache line. The advantage is that it doesn't have to read anything first. How effective this is depends on the CPU; in general, newer x86s tend to have much larger WC buffers than the previous generation (e.g. Intel just enlarged them again in Prescott). Unlike all other stores on x86, they are also very lazily ordered and need explicit memory barriers. Normally this is used for frame buffers and other hardware mappings, but sometimes it can be useful for a lot of streaming data too. > Is there any other platform that has the same kind of block > stores sparc64 does (basically use L2 cache if line present, > else bypass L2 cache for the store and do not allocate L2 > cache lines for the data)? I bet ia64 does have something > like this. This still has the same problem: in the end the data is out of cache, and when someone else needs it later they eat large penalties. -Andi P.S.: I added a new experimental option to use unordered WC stores for writel(). Haven't benchmarked it much so far, though.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen @ 2004-08-12 20:30 ` David S. Miller 0 siblings, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 20:30 UTC (permalink / raw) To: Andi Kleen; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 22:00:25 +0200 Andi Kleen <ak@suse.de> wrote: > Well, the writes are usually faster. While they don't use the > cache they use special write combining buffers in the CPU > that hold the data until it can blast out a full cache line. Advantage > is that it doesn't have to read anything first. Sure. Sparc64 has this too; in fact it has a full 2K write cache to absorb all of the cpu's write traffic. > How effective this is depends on the CPU, in general newer > x86s tend to have much larger WC buffers than the previous > generation (e.g. Intel just enlarged them again in Prescott) > > Unlike all other stores on x86 they are also very lazily ordered > and need explicit memory barriers. The cache-bypassing 64-byte block stores behave this way on sparc64. > This still has the same problem: in the end the data > is out of cache and when someone else needs it later they eat > large penalties. If it was in the cache to begin with, it will stay there. This is the case the x86_64 bits lose for; they'll kick the lines out. If it is out of cache, no L2 cache lines are allocated. This is how x86_64 will perform. I think the difference in the "hit" case behavior could matter. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 19:50 ` clear_user_highpage() David S. Miller 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen @ 2004-08-12 21:34 ` Matthew Wilcox 2004-08-13 8:16 ` clear_user_highpage() David Mosberger 1 sibling, 1 reply; 41+ messages in thread From: Matthew Wilcox @ 2004-08-12 21:34 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, torvalds, linux-arch On Thu, Aug 12, 2004 at 12:50:59PM -0700, David S. Miller wrote: > Is there any other platform that has the same kind of block > stores sparc64 does (basically use L2 cache if line present, > else bypass L2 cache for the store and do not allocate L2 > cache lines for the data)? I bet ia64 does have something > like this. Yes, almost exactly. You can specify the "nta" hint to stores which means "non-temporal at all levels". If the cache-line is already present in the cache at any level, it will not be demoted, but if it isn't present, it'll bypass the cache entirely. If you want to specifically retain a cache line at a particular level in cache, you can prefetch it into that level, then use .nta and the line won't move. That's all according to the architecture reference anyway. I don't know how much of that processors actually implement and how much they think they know better than the programmer ;-) -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox @ 2004-08-13 8:16 ` David Mosberger 0 siblings, 0 replies; 41+ messages in thread From: David Mosberger @ 2004-08-13 8:16 UTC (permalink / raw) To: Matthew Wilcox; +Cc: David S. Miller, Andi Kleen, torvalds, linux-arch >>>>> On Thu, 12 Aug 2004 22:34:03 +0100, Matthew Wilcox <willy@debian.org> said: Matthew> Yes, almost exactly. You can specify the "nta" hint to Matthew> stores which means "non-temporal at all levels". If the Matthew> cache-line is already present in the cache at any level, it Matthew> will not be demoted, but if it isn't present, it'll bypass Matthew> the cache entirely. The architecture leaves the details to the chip family. For Itanium 2 chips, a store with the ".nta" hint means (see [1]): L1 cache: don't allocate, don't update LRU bits L2 cache: allocate, don't update LRU bits L3 cache: don't allocate, don't update LRU bits The textual description is as follows: .nta: This hint means non-temporal locality in all levels of the cache hierarchy. For the Itanium 2 processor, this hint will cause the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). This line will not be allocated in the L3 cache. If present in any cache, it will not be deallocated from that cache, although sometimes lines are deallocated for coherency reasons. So it's not exactly like the SPARC64 case but it is quite similar in nature. --david [1] http://www.intel.com/design/itanium2/manuals/251110.htm ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds 2004-08-11 23:53 ` clear_user_highpage() David S. Miller @ 2004-08-12 0:00 ` Benjamin Herrenschmidt 2004-08-12 0:21 ` clear_user_highpage() Linus Torvalds 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III 2 siblings, 1 reply; 41+ messages in thread From: Benjamin Herrenschmidt @ 2004-08-12 0:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, Linux Arch list > It sucks because it eats CPU and memory bandwidth when it shouldn't be > eaten. It's a total disaster on SMP, but it's bad on UP too. Ok, agreed about the SMP case. > It sucks because it does bad things to cache behaviour. Sure, you'll move > the cost away from "clear_user_highpage", but the thing is, you will _not_ > move it into the idle time. What you will do is to move it into some > random time _after_ the idle time, when the idle thing has crapped all > over your caches. You can probably code it in such a way that it won't do that, using cache hints. > The thing is, you make your cache footprint per CPU _much_ bigger, and you > spread it out a lot over time too, so you make it even worse. > > The clearing will then be totally hidden in the profiles, because you will > have turned a nice and well-behaved "this is where the time goes" profile > into a mush of "we're taking cache misses at random times, and we don't > know why". > > That, btw, is a _classic_ mistake in profiling. Move the work around so > that it's not as visible any more. > > In other words, don't do it. It's a mistake. It is optimizing the profile > without actually optimizing what you want _done_. > > Btw, this is exactly what the totally brain-damaged slab stuff does. It > takes away the peaks, but does so by having worse cache access patterns > all around. > > Look at it this way: > > - it might be worth doing in big batches under some kind of user control, > when you really can _control_ that it happens at a good time. 
> > I _might_ buy into this argument. Make it a batch thing that really > screws the caches, but only does so very seldom, when the user asked > for it. > > - but we aren't supposed to have that much memory free _anyway_, and > trying to keep it around on a separate list is horrible for > fragmentation. So batching huge things up is likely not a good idea > either. > > - with caches growing larger, it's actually BETTER to clear the page at > usage time, because then the CPU that actually touches the page won't > have to bring the page in from memory. We'll blow one page of cache > by clearing it, but we will blow it in a "good" way - hopefully with > almost no memory traffic at all (ie the clear can be done as pure > invalidate cycles, no read-back into the CPU). Ok, the latter makes sense... especially since we could use the ppc dcbz instruction to "create blank cache lines" (not bothering at all about the previous content of the line), though I would expect any modern write-combining CPU to figure that out based on the access pattern and end up doing the same at the cache level. > And the thing is, the background clearing will just get worse and worse. > > In summary: it's a _good_ thing when you see a sharp peak in your > profiles, and you can say "I know exactly what that peak is for, and it's > doing exactly the work it should be doing and nothing else". > > Linus -- Benjamin Herrenschmidt <benh@kernel.crashing.org> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 0:00 ` clear_user_highpage() Benjamin Herrenschmidt @ 2004-08-12 0:21 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 0:21 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linux Arch list On Thu, 12 Aug 2004, Benjamin Herrenschmidt wrote: > > Ok, the later makes sense... especially since we could use the ppc dcbz > instruction to "create blank cache lines" (not bothering at all about > the previous content of the line) ppc64 definitely already does that according to <asm/page.h> ;) > , though I would expect any modern > write combining CPU to figure that out based on the access pattern and > end up doing the same at the cache level Quite possibly. I certainly hope so, but I suspect especially for the memory clearing case it's just simpler for everybody to just tell the CPU to do it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
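The dcbz trick Ben and Linus discuss amounts to clearing the page one cache line at a time, letting the hardware establish each zeroed line without a read-back from memory. A minimal sketch follows; the 128-byte line size is an assumption (dcbz must be stepped by the CPU's real cache-block size), the non-ppc branch is a portable stand-in, and this is not any architecture's actual clear_page implementation:

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SIZE      4096
#define CACHELINE_SIZE 128   /* assumption: must match the real line size */

/* Clear a page one cache line at a time. On ppc each inner step would
 * be a single dcbz, which installs a zeroed, valid line in the cache
 * WITHOUT first reading the stale contents from memory - the "create
 * blank cache lines" behaviour Ben describes. Elsewhere a plain
 * memset stands in, and a write-combining CPU may detect the pattern
 * and achieve the same effect, as Ben speculates. */
static void clear_page_by_lines(void *page)
{
    char *p = page;
    for (size_t off = 0; off < PAGE_SIZE; off += CACHELINE_SIZE) {
#if defined(__powerpc__)
        __asm__ volatile("dcbz 0,%0" : : "r"(p + off) : "memory");
#else
        memset(p + off, 0, CACHELINE_SIZE);   /* portable stand-in */
#endif
    }
}
```

Note that dcbz only makes sense on cacheable memory; on caching-inhibited mappings it takes an alignment exception, which is one reason the real kernel keeps such code per-arch.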
* Re: clear_user_highpage() 2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds 2004-08-11 23:53 ` clear_user_highpage() David S. Miller 2004-08-12 0:00 ` clear_user_highpage() Benjamin Herrenschmidt @ 2004-08-12 0:46 ` William Lee Irwin III 2004-08-12 1:01 ` clear_user_highpage() David S. Miller 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 2 siblings, 2 replies; 41+ messages in thread From: William Lee Irwin III @ 2004-08-12 0:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: >> The PPC people used to zero out pages in the cpu idle loop >> and I'd definitely like to do something along those lines >> on sparc64 as well, I feel it would be extremely effective. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > No. It sucks. It sucks so bad it's not funny. > It sucks because it eats CPU and memory bandwidth when it shouldn't be > eaten. It's a total disaster on SMP, but it's bad on UP too. Results from prototype prezeroing patches (ca. 2001) showed that dedicating a cpu on a 16x machine to prezeroing userspace pages (doing no other work on that cpu) improved kernel compile (insert sound of projectile vomiting here) "benchmarks". This suggests cache pollution and scheduling latency can be circumvented under some circumstances. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > It sucks because it does bad things to cache behaviour. Sure, you'll move > the cost away from "clear_user_highpage", but the thing is, you will _not_ > move it into the idle time. What you will do is to move it into some > random time _after_ the idle time, when the idle thing has crapped all > over your caches. > The thing is, you make your cache footprint per CPU _much_ bigger, and you > spread it out a lot over time too, so you make it even worse. Uncached zeroing, dedicated cpus, or appropriate cache semantics (e.g. 
not allocating a cacheline, either via some special instruction or by the cache in general not allocating lines on some writes and/or zeroing writes that miss) negate this. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > The clearing will then be totally hidden in the profiles, because you will > have turned a nice and well-behaved "this is where the time goes" profile > into a mush of "we're taking cache misses at random times, and we don't > know why". > That, btw, is a _classic_ mistake in profiling. Move the work around so > that it's not as visible any more. > In other words, don't do it. It's a mistake. It is optimizing the profile > without actually optimizing what you want _done_. > Btw, this is exactly what the totally brain-damaged slab stuff does. It > takes away the peaks, but does so by having worse cache access patterns > all around. I beg to differ; where slab preconstruction has not been effective, that has had to do with the heaviness of the slab allocator itself, and when the slab allocator is circumvented, preconstruction is effective even where the allocator is otherwise too heavyweight. Zeroing pagetables is in fact the poster child for this, where almost all architectures have cached prezeroed pagetables forever. Reinstating caching of i386 pagetables improved SDET performance by a consistent (and hence statistically significant) margin of 1%-1.5%. One of the key aspects of an access pattern that makes preconstruction useful is that very little of the allocated memory is actually touched during typical accesses. Hence, the construction of the object pollutes the cache with numerous cachelines that are rarely touched. Objects as large as pages, e.g. pagetable pages, show this very well. Typical usage of the upper levels is sparse, and for smaller processes the lower levels are also sparsely used. Userspace likewise can't be assumed to reference an entire zeroed page allocated to it. 
Userspace can't be predicted but it is also typical there for only small portions of large data structures to be referenced. e.g. a large, say, PAGE_SIZE buffer is allocated for read() traffic, but all typical read()'s are only a few bytes in length. And in general the "precharging" stalls taking unnecessary misses for the cachelines of the object that are rarely accessed, pollutes the cache with those cachelines of the object that are rarely accessed, and burns a few extra cycles (dwarfed by the misses on the unnecessarily- touched cachelines) doing an unnecessary pass over the object. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > Look at it this way: > - it might be worth doing in big batches under some kind of user control, > when you really can _control_ that it happens at a good time. > I _might_ buy into this argument. Make it a batch thing that really > screws the caches, but only does so very seldom, when the user asked > for it. > - but we aren't supposed to have that much memory free _anyway_, and > trying to keep it around on a separate list is horrible for > fragmentation. So batching huge things up is likely not a good idea > either. > - with caches growing larger, it's actually BETTER to clear the page at > usage time, because then the CPU that actually touches the page won't > have to bring in the page in from memory. We'll blow one page of cache > by clearing it, but we will blow it in a "good" way - hopefully with > almost no memory traffic at all (ie the clear can be done as pure > invalidate cycles, no read-back into the CPU). > And the thing is, the background clearing will just get worse and worse. > In summary: it's a _good_ thing when you see a sharp peak in your > profiles, and you can say "I know exactly what that peak is for, and it's > doing exactly the work it should be doing and nothing else". 
The real flaws I see in background zeroing are fragmentation and scheduling latency (or potential loss of cpus dedicated to the purpose). Preventing cache pollution is already a prerequisite for remotely non-naive implementations. The scheduling latency aspect is due to the fact that many cpus have caching semantics that require extremely slow uncached access to prevent cache pollution, and that page zeroing is a slow enough operation to noticeably stall rescheduling userspace. It's possible that this could be mitigated by incrementally zeroing pages and polling TIF_NEED_RESCHED between blocks of a page, but the background zeroing efforts went in a rather different, useless direction (dedicating cpus). The fragmentation bits are just as you say, an artifact of segregating a pool of pages from the general pool of free pages that can be coalesced. I haven't come up with any methods to address this. In general, I despise background processing and would rather see event-driven methods of accomplishing preconstruction, though I've no idea whatsoever how those would be carried out for userspace memory. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
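wli's suggestion of incrementally zeroing a page while polling TIF_NEED_RESCHED between blocks might look roughly like this. This is a user-space sketch under assumed names: need_resched() stands in for the kernel's test_thread_flag(TIF_NEED_RESCHED), and the chunk size is arbitrary:

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SIZE  4096
#define ZERO_CHUNK 1024   /* arbitrary block size between resched checks */

/* User-space stand-in for the kernel's test_thread_flag(TIF_NEED_RESCHED). */
static volatile bool resched_requested;
static bool need_resched(void) { return resched_requested; }

/* Zero a page in chunks, checking between chunks whether the scheduler
 * wants the CPU back. Returns the offset reached, so a caller could
 * resume (or abandon) a partially zeroed page instead of stalling
 * rescheduling for the full page-zeroing latency. */
static size_t zero_page_incremental(void *page, size_t start)
{
    char *p = page;
    size_t off = start;
    while (off < PAGE_SIZE) {
        memset(p + off, 0, ZERO_CHUNK);
        off += ZERO_CHUNK;
        if (need_resched())
            break;   /* bail out: latency beats throughput here */
    }
    return off;
}
```

The returned offset is the piece that makes resumption possible; a background zeroer would have to record it per page, which is part of the bookkeeping cost being debated.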
* Re: clear_user_highpage() 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III @ 2004-08-12 1:01 ` David S. Miller 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 1 sibling, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 1:01 UTC (permalink / raw) To: William Lee Irwin III; +Cc: torvalds, linux-arch On Wed, 11 Aug 2004 17:46:54 -0700 William Lee Irwin III <wli@holomorphy.com> wrote: > The scheduling latency aspect is due to the fact that many cpus have > caching semantics that require extremely slow uncached accesss to > prevent cache pollution, and that page zeroing is slow enough of an > operation to noticeably stall rescheduling userspace. This wouldn't be an issue on sparc64, as I've previously stated an entire page can be zero'd out, cache bypassed on miss, in 4400 cycles even when the L2 cache misses for the whole page. That would add less to rescheduling latency than that obtained from simply taking a hardware interrupt. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III 2004-08-12 1:01 ` clear_user_highpage() David S. Miller @ 2004-08-12 2:18 ` Linus Torvalds 2004-08-12 2:43 ` clear_user_highpage() David S. Miller ` (3 more replies) 1 sibling, 4 replies; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 2:18 UTC (permalink / raw) To: William Lee Irwin III; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, William Lee Irwin III wrote: > > Results from prototype prezeroing patches (ca. 2001) showed that > dedicating a cpu on a 16x machine to prezeroing userspace pages (doing > no other work on that cpu) improved kernel compile (insert sound of > projectile vomiting here) "benchmarks". This suggests cache pollution > and scheduling latency can be circumvented under some circumstances. Heh. And at what point does it become a problem? Caches are growing, at some point it is going to be a loss to zero memory on another CPU.. I really do believe (but can't back it up with any real numbers) that we want to try to keep pages in cache as long as possible. That means keeping the pages close to the last CPU that used them, btw. It would be interesting to see if we could make the buddy allocator more "per-cpu" friendly, for example - I suspect that would make much _more_ of a difference than pre-zeroing pages. As it is, the pages we allocate have _no_ CPU affinity (unlike kmalloc/slab), and as a result they aren't even very likely to be in the cache even if you have tons of cache on the CPU. And my whole argument against pre-zeroing really falls totally flat if the pages aren't in the cache. So I'd personally be a whole lot more interested in seeing whether we could have per-CPU pages than in pre-zeroing. Fragmentation of memory is the _big_ problem, of course. It comes up almost for _any_ page allocation issue. But it might be interesting to see if we could have a special per-cpu "page pool" for some usage. 
Sized fairly small - on the order of a few times the CPU cache size - and used for anonymous pages that we think might be short-lived. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds @ 2004-08-12 2:43 ` David S. Miller 2004-08-12 4:19 ` clear_user_highpage() Linus Torvalds 2004-08-12 2:57 ` clear_user_highpage() David S. Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 41+ messages in thread From: David S. Miller @ 2004-08-12 2:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > So I'd personally be a whole lot more interested in seeing whether we > could have per-CPU pages than in pre-zeroing. We have that cold/hot page thing in the current 2.6.x tree, or are you talking about something else? I'm talking about the struct per_cpu_pages stuff. It's the first thing buffered_rmqueue() checks when the request order of the page allocation is zero. ^ permalink raw reply [flat|nested] 41+ messages in thread
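The per-cpu front end Dave points at (struct per_cpu_pages, consulted by buffered_rmqueue() for order-0 requests) can be modelled as a small LIFO of recently freed pages checked before falling back to the buddy allocator. The following toy model uses illustrative names and sizing, not the kernel's:

```c
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>

#define PCP_HIGH 8   /* assumption: the kernel sizes these lists per zone */

struct pcp_list {
    void *pages[PCP_HIGH];
    int count;
};

/* Stand-in for the buddy allocator slow path. */
static void *buddy_alloc_page(void) { return malloc(4096); }

static void *alloc_page_hot(struct pcp_list *pcp)
{
    if (pcp->count > 0)
        return pcp->pages[--pcp->count];   /* likely still cache-warm */
    return buddy_alloc_page();
}

static void free_page_hot(struct pcp_list *pcp, void *page)
{
    if (pcp->count < PCP_HIGH)
        pcp->pages[pcp->count++] = page;   /* keep it cpu-local */
    else
        free(page);                        /* overflow back to buddy */
}
```

The LIFO order is the whole point for cache warmth: the most recently freed page (whose lines are most likely still resident) is the first one handed back out.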
* Re: clear_user_highpage() 2004-08-12 2:43 ` clear_user_highpage() David S. Miller @ 2004-08-12 4:19 ` Linus Torvalds 2004-08-12 4:46 ` clear_user_highpage() William Lee Irwin III 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 4:19 UTC (permalink / raw) To: David S. Miller; +Cc: wli, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: > > We have that cold/hot page thing in the current 2.6.x > tree, or are you talking about something else? You're right. It ended up never having problems (or they were worked out in the -mm tree), so I forgot all about it ;) How effective is it? Maybe the numbers that were done in 2001 aren't relevant any more? Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 4:19 ` clear_user_highpage() Linus Torvalds @ 2004-08-12 4:46 ` William Lee Irwin III 2004-08-15 6:22 ` clear_user_highpage() Andrew Morton 0 siblings, 1 reply; 41+ messages in thread From: William Lee Irwin III @ 2004-08-12 4:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: >> We have that cold/hot page thing in the current 2.6.x >> tree, or are you talking about something else? On Wed, Aug 11, 2004 at 09:19:32PM -0700, Linus Torvalds wrote: > You're right. It ended up never having problems (or they were worked out > in the -mm tree), so I forgot all about it ;) > How effective is it? Maybe the numbers that were done in 2001 aren't > relevant any more? For lock amortization it's extremely effective. Its effects on caching have never been properly instrumented that I know of. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 4:46 ` clear_user_highpage() William Lee Irwin III @ 2004-08-15 6:22 ` Andrew Morton 2004-08-15 6:38 ` clear_user_highpage() William Lee Irwin III 0 siblings, 1 reply; 41+ messages in thread From: Andrew Morton @ 2004-08-15 6:22 UTC (permalink / raw) To: William Lee Irwin III; +Cc: torvalds, davem, linux-arch William Lee Irwin III <wli@holomorphy.com> wrote: > > On Wed, 11 Aug 2004, David S. Miller wrote: > >> We have that cold/hot page thing in the current 2.6.x > >> tree, or are you talking about something else? > > On Wed, Aug 11, 2004 at 09:19:32PM -0700, Linus Torvalds wrote: > > You're right. It ended up never having problems (or they were worked out > > in the -mm tree), so I forgot all about it ;) > > How effective is it? Maybe the numbers that were done in 2001 aren't > > relevant any more? > > For lock amortization it's extremely effective. Its effects on caching > have never been properly instrumented that I know of. No, we (me, mbligh) instrumented the crap out of it. It turned out that the cache affinity was of very marginal benefit, if any. I cooked up an artificial benchmark which consisted of writing 32k to a file, then truncating it back to zero, then repeating. Four instances of that, against four separate files on a 4-way, showed a large speedup - 2x or 3x, from memory. But for real-world workloads you really needed to squint to see anything at all. Which is why I dithered without sending it to Linus for a couple of months. Ended up merging it anyway because of some lock contention benefits, and because someone might have a workload which involves repeated write/truncate looping ;) ^ permalink raw reply [flat|nested] 41+ messages in thread
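One pass of the write-32k-then-truncate loop Andrew describes is easy to reconstruct; each pass dirties eight fresh pages of page cache and immediately frees them, which is exactly the allocate/free churn the hot/cold lists reward. A sketch of one iteration (file handling and sizes are illustrative, not Andrew's actual harness):

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <assert.h>

/* One iteration of the artificial benchmark: write 32k to the file,
 * then truncate it back to zero. Run four instances against four
 * separate files to reproduce the 4-way setup described. */
static int write_truncate_pass(int fd, const char *buf, size_t len)
{
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (ftruncate(fd, 0) < 0)
        return -1;
    return 0;
}
```

In a loop, every pass allocates page-cache pages for the write and releases them at the truncate, so the same few just-freed, cache-warm pages get recycled - the pattern behind the 2x-3x artificial speedup.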
* Re: clear_user_highpage() 2004-08-15 6:22 ` clear_user_highpage() Andrew Morton @ 2004-08-15 6:38 ` William Lee Irwin III 0 siblings, 0 replies; 41+ messages in thread From: William Lee Irwin III @ 2004-08-15 6:38 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, davem, linux-arch William Lee Irwin III <wli@holomorphy.com> wrote: >> For lock amortization it's extremely effective. Its effects on caching >> have never been properly instrumented that I know of. On Sat, Aug 14, 2004 at 11:22:23PM -0700, Andrew Morton wrote: > No, we (me, mbligh) instrumented the crap out of it. It turned out that > the cache affinity was of very marginal benefit, if any. > I cooked up an artificial benchmark which consisted of writing 32k to a > file, then truncating it back to zero, then repeating. Four instances of > that, against four separate files on 4-way showed a large speedup - 2x or > 3x, from memory. But for real-world workloads you really needed to squint > to see anything at all. > Which is why I dithered without sending it to Linus for a couple of months. > Ended up merging it anyway because of some lock contention benefits, and > because someone mught have a workload which involves repeated > write/truncate looping ;) I had more in mind that it had never been explained why the cache affinity was ineffective, which would seem to require getting some instrumentation of how often the lists were being turned over, how many remote frees are going on, how "out of order" frees are, etc. etc. What I heard at the time was that none of those were instrumented. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 2004-08-12 2:43 ` clear_user_highpage() David S. Miller @ 2004-08-12 2:57 ` David S. Miller 2004-08-12 3:20 ` clear_user_highpage() William Lee Irwin III 2004-08-13 21:41 ` clear_user_highpage() David S. Miller 3 siblings, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 2:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > I really do believe (but can't back it up with any real numbers) that we > want to try to keep pages in cache as long as possible. That means keeping > the pages close to the last CPU that used them, btw. This reminded me of something. One place where things fall apart is for situations like a fork+exit benchmark such as lmbench's "lat_proc fork". Here is what happens:

    CPU 1                               CPU 2
    parent: alloc local cpu pagetable
    parent: init child page table
    parent: wait on child
                                        child: tlb miss, ref page tables
                                        child: exit_mmap
                                        child: free page tables to local cpu

It is exactly the most sub-optimal sequence of page table usage possible. CPU 1's cache empties constantly, while CPU 2's grows constantly. CPU 2 goes over its limit and starts feeding excess page table per-cpu cache pages into the generic page pool (and actually in 2.6.x into the per-cpu hot/cold page lists). Meanwhile CPU 1 is constantly going to the page allocator for page table pages since the per-cpu pgtable cache is empty. It's amusing, and I just wanted to bring it to light while we're discussing things of this nature. ^ permalink raw reply [flat|nested] 41+ messages in thread
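The workload Dave is describing is the classic lmbench fork+exit latency loop; a minimal reconstruction is below. This is just a sketch of the driving benchmark (iteration count and timing method are mine, not lmbench's), included to make the cross-CPU alloc/free pattern concrete: each iteration builds the child's page tables in the parent and tears them down in the child:

```c
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>
#include <assert.h>

/* Time a fork+exit loop and return the average microseconds per
 * iteration. On SMP, when parent and child land on different CPUs,
 * the page tables are allocated on one CPU and freed on the other -
 * the ping-pong Dave describes. */
static double fork_exit_latency_us(int iters)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);              /* child: exit_mmap frees the tables */
        if (pid > 0)
            waitpid(pid, NULL, 0); /* parent: wait on child */
    }
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / iters;
}
```

Pinning the loop to one CPU versus letting the scheduler spread parent and child is what exposes the difference between the "same per-cpu pagetable cache" case and the pathological one.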
* Re: clear_user_highpage()
  2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
  2004-08-12  2:43   ` clear_user_highpage() David S. Miller
  2004-08-12  2:57   ` clear_user_highpage() David S. Miller
@ 2004-08-12  3:20   ` William Lee Irwin III
  2004-08-13 21:41   ` clear_user_highpage() David S. Miller
  3 siblings, 0 replies; 41+ messages in thread
From: William Lee Irwin III @ 2004-08-12  3:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, linux-arch

On Wed, 11 Aug 2004, William Lee Irwin III wrote:
>> Results from prototype prezeroing patches (ca. 2001) showed that
>> dedicating a cpu on a 16x machine to prezeroing userspace pages (doing
>> no other work on that cpu) improved kernel compile (insert sound of
>> projectile vomiting here) "benchmarks". This suggests cache pollution
>> and scheduling latency can be circumvented under some circumstances.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> Heh.
> And at what point does it become a problem? Caches are growing, at some
> point it is going to be a loss to zero memory on another CPU..

The cache pollution and scheduling latencies would have been
introduced by earlier versions of the prototype prezeroing patch
(they should be inherent to most naive implementations).  The
implementor of those prototypes was unaware of PCD, PAT, and various
other tricks, so I'm rather suspicious of it all, and the result is
vaguely disgusting.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> I really do believe (but can't back it up with any real numbers) that we
> want to try to keep pages in cache as long as possible. That means keeping
> the pages close to the last CPU that used them, btw.
> It would be interesting to see if we could make the buddy allocator more
> "per-cpu" friendly, for example - I suspect that would make much _more_ of
> a difference than pre-zeroing pages.

Per-cpu zoning, perhaps?  The hot/cold pages bits seem to achieve
more in terms of lock amortization than cache warmth, probably due to
the lists being turned over too often.  Page allocation rates are
truly immense, but I've not checked the hot/cold list turnover rates
to see what's going on there, in part because out-of-order frees
spoil the naive accounting methods.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> As it is, the pages we allocate have _no_ CPU affinity (unlike
> kmalloc/slab), and as a result they aren't even very likely to be in the
> cache even if you have tons of cache on the CPU.
> And my whole argument against pre-zeroing really falls totally flat if the
> pages aren't in the cache.
> So I'd personally be a whole lot more interested in seeing whether we
> could have per-CPU pages than in pre-zeroing.

There are a few other points in the design space, e.g. batching, that
haven't been tried yet: e.g. in the fault handler, do write-through
zeroing of ZERO_BATCH_SIZE - 1 pages and a cached zero of the page to
be handed to userspace when some per-cpu pool of pages is empty, or
similar nonsense (maybe via schedule_work(), or queueing pages for
the idle task to process, or something else that sounds like a
plausible way to salvage things).

Truly speculative background zeroing (or "page scrubbing") is just
wrong, as various workloads, e.g. routing, have next to zero
userspace participation and may literally be interested in
eliminating the last userspace process running, or avoiding ever
running userspace altogether on very memory-constrained embedded
systems.  So I think that if there can be a proper prezeroing
implementation, it would only perform prezeroing in response to some
event or when guided by some prediction.  I guess it's a squishier
objection than "implementing it via $FOO got numbers $BAR", but
anyhow.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> Fragmentation of memory is the _big_ problem, of course. It comes up
> almost for _any_ page allocation issue. But it might be interesting to see
> if we could have a special per-cpu "page pool" for some usage. Sized
> fairly small - on the order of a few times the CPU cache size - and used
> for anonymous pages that we think might be short-lived.

Well, regardless of whether zones per se are used, some larger
physically contiguous cpu-affine memory pools than the hot/cold page
lists sound very close to this ideal.  I think the important aspect
of their being physically contiguous is that the contiguity prevents
the things from fragmenting areas outside that physical region.

The flaw in all this is that there's no adequate (not even
approximate, that I know of) method of predicting lifetimes of
userspace pages, and recovering from these mispredictions seems to
typically involve... (cue Darth Vader dirge) ...background processing
things have to wait for.

-- wli
* Re: clear_user_highpage()
  2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
                     ` (2 preceding siblings ...)
  2004-08-12  3:20 ` clear_user_highpage() William Lee Irwin III
@ 2004-08-13 21:41 ` David S. Miller
  2004-08-16 13:00   ` clear_user_highpage() David Mosberger
  3 siblings, 1 reply; 41+ messages in thread
From: David S. Miller @ 2004-08-13 21:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 19:18:18 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> I really do believe (but can't back it up with any real numbers) that we
> want to try to keep pages in cache as long as possible. That means keeping
> the pages close to the last CPU that used them, btw.

So I did some testing.  I changed the cache-bypassing
clear_user_page() into one that uses normal stores and does allocate
in the L2 cache.

I ran the full build tests 3 times for each case, and the numbers
were consistent.  It makes the full build take a full minute longer.

And I truly believe this is because of the argument William and I are
making: a write protection fault does not mean the process is going
to access a majority of the data in that page any time soon at all.

{clear,copy}_user_page() is not some kind of "prefetch the whole page
into the cache" for the user.  It would be if the user were going to
access the entire thing in the near future, but I do not believe that
is the typical access pattern for fresh anonymous pages.
* Re: clear_user_highpage()
  2004-08-13 21:41 ` clear_user_highpage() David S. Miller
@ 2004-08-16 13:00   ` David Mosberger
  2004-08-22 19:51     ` clear_user_highpage() Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: David Mosberger @ 2004-08-16 13:00 UTC (permalink / raw)
To: David S. Miller; +Cc: Linus Torvalds, wli, linux-arch

>>>>> On Fri, 13 Aug 2004 14:41:15 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds
  DaveM> <torvalds@osdl.org> wrote:

  >> I really do believe (but can't back it up with any real numbers)
  >> that we want to try to keep pages in cache as long as
  >> possible. That means keeping the pages close to the last CPU that
  >> used them, btw.

  DaveM> So I did some testing.

  DaveM> I changed the cache-bypassing clear_user_page() into one that
  DaveM> uses normal stores and does allocate in the L2 cache.

  DaveM> I ran the full build tests 3 times for each case, and the
  DaveM> numbers were consistent.  It makes the full build take a full
  DaveM> minute longer.

Very interesting.  I tried something similar on a dual 1.5GHz
Itanium 2.  I tried two versions of clear_page: one with .nta (the
non-temporal hint, which is the default) and one without.  The result
for 5 runs of a fairly minimal kernel-compile (with make -j8):

        with .nta:                     without .nta:

 real   80.0 70.2 71.0 70.4 70.3       79.4 69.5 69.5 69.5 69.3
 sys     4.6  4.6  4.6  4.6  4.5        5.3  5.4  5.3  5.4  5.4

Note that the first run was with a cold page cache, hence the longer
runtime.

So, on average, dropping the .nta costs us 0.78 seconds of
kernel-time but, overall, the kernel-builds complete about 0.94
seconds faster.

Of course, it's just one (more) data-point...

--david
* Re: clear_user_highpage()
  2004-08-16 13:00 ` clear_user_highpage() David Mosberger
@ 2004-08-22 19:51   ` Linus Torvalds
  2005-09-17 19:01     ` clear_user_highpage() Andi Kleen
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2004-08-22 19:51 UTC (permalink / raw)
To: davidm; +Cc: David S. Miller, wli, linux-arch

On Mon, 16 Aug 2004, David Mosberger wrote:
>
> Very interesting. I tried something similar on a dual 1.5GHz Itanium
> 2. I tried two versions of clear_page: one with .nta (no-temporal
> affinity hint, which is the default) and one without. The result for
> 5 runs of a fairly minimal kernel-compile (with make -j8):
>
>         with .nta:                     without .nta:
>
>  real   80.0 70.2 71.0 70.4 70.3       79.4 69.5 69.5 69.5 69.3
>  sys     4.6  4.6  4.6  4.6  4.5        5.3  5.4  5.3  5.4  5.4
>
> Note that the first run was with cold page cache, hence the longer
> runtime.
>
> So, on average, dropping the .nta costs us 0.78 seconds of kernel-time
> but, overall, the kernel-builds complete about 0.94 seconds faster.

I obviously love the result, since it validates my theory that it's
better to have a nice hot "clear_page()" and avoid cache misses
elsewhere.  Score one for WAGging.

That said, I suspect it does so exactly because the Itanium 2 has
largish caches, and it's likely that smaller (or external, slower)
caches or other loads will see different behaviour.

My basic point stands: hotspots in profiling are NOT automatically a
sign of anything bad.  I'd rather have hotspots where you can say
"this is an important function" than try to smush the costs out.  I
personally prefer a profile that clearly shows where the work is
being done to one that has most of the costs in a long tail of random
functions.

		Linus
* Re: clear_user_highpage()
  2004-08-22 19:51 ` clear_user_highpage() Linus Torvalds
@ 2005-09-17 19:01   ` Andi Kleen
  2005-09-17 19:16     ` clear_user_highpage() Andi Kleen
  0 siblings, 1 reply; 41+ messages in thread
From: Andi Kleen @ 2005-09-17 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David S. Miller, wli, linux-arch

On Sun, 2004-08-22 at 12:51 -0700, Linus Torvalds wrote:
> I obviously love the result, since it validates my theory that it's better
> to have a nice hot "clear_page()" and avoid cache misses elsewhere. Score
> one for WAGging.

Experiences on Opteron have been similar - NT stores seem to be a
loss for normal clear/copy_page.  However I liked the recent results
of someone using them for write() only, by defining a special
copy_from_user_uncached().

I suspect that would even be a win in most cases, as long as you
don't do it for pipes, but only for file systems.

-Andi
* Re: clear_user_highpage()
  2005-09-17 19:01 ` clear_user_highpage() Andi Kleen
@ 2005-09-17 19:16   ` Andi Kleen
  0 siblings, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2005-09-17 19:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David S. Miller, wli, linux-arch

Sorry - somehow my mailer got disorganized and I ended up replying to
this really old email.  Please ignore.

[Well, actually the copy_from_user_uncached stuff is still
interesting...]

-Andi
end of thread, other threads: [~2005-09-17 19:16 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-11 23:15 clear_user_highpage() David S. Miller
2004-08-11 23:31 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-11 23:55 ` clear_user_highpage() David S. Miller
2004-08-12  0:03 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  1:18 ` clear_user_highpage() William Lee Irwin III
2004-08-12  2:11 ` clear_user_highpage() Andi Kleen
2004-08-12  9:23 ` clear_user_highpage() Martin Schwidefsky
2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds
2004-08-11 23:53 ` clear_user_highpage() David S. Miller
2004-08-12  0:00 ` clear_user_highpage() Linus Torvalds
2004-08-12  0:06 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  0:24 ` clear_user_highpage() David S. Miller
2004-08-12  0:23 ` clear_user_highpage() David S. Miller
2004-08-12  1:46 ` clear_user_highpage() Linus Torvalds
2004-08-12  2:51 ` clear_user_highpage() David S. Miller
2004-08-16  1:58 ` clear_user_highpage() Paul Mackerras
2004-08-12  2:08 ` clear_user_highpage() Andi Kleen
2004-08-12  2:45 ` clear_user_highpage() David S. Miller
2004-08-12  9:09 ` clear_user_highpage() Andi Kleen
2004-08-12 19:50 ` clear_user_highpage() David S. Miller
2004-08-12 20:00 ` clear_user_highpage() Andi Kleen
2004-08-12 20:30 ` clear_user_highpage() David S. Miller
2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox
2004-08-13  8:16 ` clear_user_highpage() David Mosberger
2004-08-12  0:00 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  0:21 ` clear_user_highpage() Linus Torvalds
2004-08-12  0:46 ` clear_user_highpage() William Lee Irwin III
2004-08-12  1:01 ` clear_user_highpage() David S. Miller
2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
2004-08-12  2:43 ` clear_user_highpage() David S. Miller
2004-08-12  4:19 ` clear_user_highpage() Linus Torvalds
2004-08-12  4:46 ` clear_user_highpage() William Lee Irwin III
2004-08-15  6:22 ` clear_user_highpage() Andrew Morton
2004-08-15  6:38 ` clear_user_highpage() William Lee Irwin III
2004-08-12  2:57 ` clear_user_highpage() David S. Miller
2004-08-12  3:20 ` clear_user_highpage() William Lee Irwin III
2004-08-13 21:41 ` clear_user_highpage() David S. Miller
2004-08-16 13:00 ` clear_user_highpage() David Mosberger
2004-08-22 19:51 ` clear_user_highpage() Linus Torvalds
2005-09-17 19:01 ` clear_user_highpage() Andi Kleen
2005-09-17 19:16 ` clear_user_highpage() Andi Kleen