* clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:15 UTC
To: torvalds; +Cc: linux-arch

During a kernel build, this is what tops the profiling charts for me on sparc64 currently, and it drives me crazy :-) I've optimized the sparc64 page zeroing as much as I possibly could, so that's not worth tinkering with any longer.

The PPC people used to zero out pages in the cpu idle loop, and I'd definitely like to do something along those lines on sparc64 as well; I feel it would be extremely effective.

There is a lot of code path in there for alloc_pages_vma(). I don't think adding arch-overridable stuff is the way to go here; better would be something generic in the per-cpu hot/cold page list handling that the cpu_idle() loop of each architecture could call. Perhaps a page flags bit that says "pre-zeroed" or something. Then my clear_user_page() code on sparc64 could just test that page bit and return if it is set. Page free would need to clear the bit, of course.

I have no real concrete ideas yet, but I know that while I'm looking at some source code in an editor, my cpus could zero out all the free pages in the system in a second or two :-)

Comments?

^ permalink raw reply [flat|nested] 41+ messages in thread
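The "pre-zeroed bit" scheme Dave sketches can be modeled in a few lines of user-space C. Everything below (fake_page, PG_ZEROED, idle_zero, clear_on_fault) is a hypothetical stand-in for the real struct page and page-flags machinery, shown only to illustrate the control flow, not actual kernel API:

```c
#include <string.h>

#define FAKE_PAGE_SIZE 4096
#define PG_ZEROED 0x1UL   /* hypothetical "pre-zeroed" page flag bit */

/* Hypothetical stand-in for struct page. */
struct fake_page {
    unsigned long flags;
    unsigned char data[FAKE_PAGE_SIZE];
};

/* Idle-loop side: zero a dirty free page and mark it. */
static void idle_zero(struct fake_page *p)
{
    if (!(p->flags & PG_ZEROED)) {
        memset(p->data, 0, FAKE_PAGE_SIZE);
        p->flags |= PG_ZEROED;
    }
}

/* clear_user_page() side: skip the clear when the bit is set.
 * Returns 1 if the memset was skipped, 0 if it had to be done. */
static int clear_on_fault(struct fake_page *p)
{
    int skipped = (p->flags & PG_ZEROED) != 0;
    if (!skipped)
        memset(p->data, 0, FAKE_PAGE_SIZE);
    p->flags &= ~PG_ZEROED;   /* reuse must clear the bit */
    return skipped;
}

/* Walk through the scenario: dirty free page, idle pass, then fault. */
static int demo(void)
{
    static struct fake_page page;
    page.flags = 0;
    page.data[7] = 0xAA;      /* page is dirty while on the free list */
    idle_zero(&page);         /* cpu_idle() cleans it up */
    return clear_on_fault(&page) == 1 && page.data[7] == 0;
}
```

The interesting property is visible in clear_on_fault(): the fault path pays only a flag test when the idle loop got there first.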
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-11 23:31 UTC
To: David S. Miller; +Cc: Linus Torvalds, Linux Arch list

> There is a lot of code path in there for alloc_pages_vma().
> I don't think adding arch overridable stuff is the way
> to go here. Something generic in the per-cpu hot/cold
> page list handling that the cpu_idle() loop of each architecture
> could call.

It would be nice, indeed, though we have to be careful not to waste too much time in there looking for pages to clear, especially when there are none. Time spent not putting the CPU into power-managed idle, at least on PPCs, means the CPU getting hotter, consuming more battery, etc., which is definitely a bad thing on laptops.

I already took a significant hit with HZ=1000, btw; I'm considering lowering it back to 100, on ppc32 at least... We really want tickless scheduling for these beasts so we can select how deeply to power-manage the CPU based on how long we expect to stay idle.

Ben.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:55 UTC
To: Benjamin Herrenschmidt; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 09:31:03 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> It would be nice, indeed, though we have to be careful not to waste
> too much time in there looking for pages to clear, especially when
> there are none. Time spent not putting the CPU into power-managed
> idle, at least on PPCs, means the CPU getting hotter, consuming
> more battery, etc., which is definitely a bad thing on laptops.

I totally agree. This is why I believe it should be a per-arch decision at cpu_idle() time whether to do the clears or not.

> I already took a significant hit with HZ=1000, btw; I'm considering
> lowering it back to 100, on ppc32 at least... We really want tickless
> scheduling for these beasts so we can select how deeply to power-manage
> the CPU based on how long we expect to stay idle.

I think dynamic-resolution timers are the way to go here. Rusty was talking about something along these lines at the networking summit.

The reason Rusty had brought it up was an ipv6 problem: there are some timers that need to reach so far into the future that with HZ=1000 the interval isn't representable in the 32-bit timer offsets. We have jiffies_64 and could move the timer struct over to a u64 for the offsets, but that seems like overkill.
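The overflow Rusty was worried about is easy to quantify: a signed 32-bit jiffies offset at HZ=1000 wraps in under 25 days. A quick back-of-the-envelope helper (plain user-space C, nothing kernel-specific):

```c
#include <stdint.h>

/* Longest timeout, in seconds, representable as a signed 32-bit
 * jiffies offset at a given tick rate. */
static int64_t max_timeout_secs(int64_t hz)
{
    return INT32_MAX / hz;
}
```

At HZ=1000 this gives 2147483 seconds, roughly 24.8 days; at HZ=100 it is about 248 days, which is why the problem only surfaced when the tick rate went up.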
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-12 0:03 UTC
To: David S. Miller; +Cc: Linus Torvalds, Linux Arch list

> I think dynamic-resolution timers are the way to go here.
> Rusty was talking about something along these lines at
> the networking summit.

Yup, several people talked about it at KS/OLS, and I think s390 has some implementation already, though I haven't had time to look at it yet; hopefully that will happen sooner or later.

> The reason Rusty had brought it up was an ipv6 problem: [...]

-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Re: clear_user_highpage()
From: William Lee Irwin III @ 2004-08-12 1:18 UTC
To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linus Torvalds, Linux Arch list

At some point in the past, someone wrote:
>> I think dynamic-resolution timers are the way to go here.
>> Rusty was talking about something along these lines at
>> the networking summit.

On Thu, Aug 12, 2004 at 10:03:37AM +1000, Benjamin Herrenschmidt wrote:
> Yup, several people talked about it at KS/OLS and I think s390 has
> some implementation already, though I haven't had time to look at it
> yet; hopefully that will happen sooner or later.

Zwane has a tickless idling patch for i386 already (not sure if it's been posted yet). I'm looking at helping out with it at some point, at least if Zwane stops churning out new functionality long enough for me to get a line in edgewise. =)

-- wli
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 2:11 UTC
To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linus Torvalds, Linux Arch list

On Thu, Aug 12, 2004 at 10:03:37AM +1000, Benjamin Herrenschmidt wrote:
> > I think dynamic-resolution timers are the way to go here.
> > Rusty was talking about something along these lines at
> > the networking summit.
>
> Yup, several people talked about it at KS/OLS and I think s390 has
> some implementation already, though I haven't had time to look at it
> yet; hopefully that will happen sooner or later.

My main issue with the s390 approach is that they actually wanted to get rid of jiffies. That's fine for the long term, but short term it would be a big problem because it would break everything. IMHO the way to do it would be to define jiffies as a function and keep virtual jiffies using the CLOCK_MONOTONIC timer. Then only tick at a low frequency for statistics ticks, with the tick disabled when idle or until an actual event is scheduled.

I didn't have time to actually work on it; the s390 guys are writing actual code, so they win for now :)

-Andi
* Re: clear_user_highpage()
From: Martin Schwidefsky @ 2004-08-12 9:23 UTC
To: Andi Kleen; Cc: Benjamin Herrenschmidt, David S. Miller, Linux Arch list, Linus Torvalds

> My main issue with the s390 approach is that they actually
> wanted to get rid of jiffies. That's fine for the long term,
> but short term it would be a big problem because it would
> break everything. [...]

In the end we indeed want to get rid of jiffies altogether, but it's a long road. What we are currently doing is replacing some of the dependencies on jiffies in the common code. E.g. instead of doing the cpu time accounting on a tick basis, we want to introduce a cputime_t type that is in principle not related to jiffies (though it is in the generic implementation). The process accounting is then done by a new account_cputime function that can be called at any time and can be passed any amount of cputime. On s390 we'll define cputime_t based on a virtual cpu timer with microsecond resolution.

This solves two problems: the first is the fact that running on a virtual processor f**** up your tick-based accounting badly; the second is the accuracy of the numbers in /proc/stat. The second step is to untie the time slices from the jiffies tick.

> I didn't have time to actually work on it; the s390 guys
> are writing actual code, so they win for now :)

We win ?!? Hey, we win ;-))

I posted a first set of cputime patches on lkml last week. I'll probably post them here shortly, in the hope of getting a little more feedback from you arch guys.

blue skies,
  Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-11 23:46 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:
>
> During a kernel build, this is what tops the profiling charts
> for me on sparc64 currently. This drives me crazy :-)

Think of it this way: if that function is your top function, then you're doing really really well. It's a good function to have at the top.

> The PPC people used to zero out pages in the cpu idle loop
> and I'd definitely like to do something along those lines
> on sparc64 as well, I feel it would be extremely effective.

No. It sucks. It sucks so bad it's not funny.

It sucks because it eats CPU and memory bandwidth when they shouldn't be eaten. It's a total disaster on SMP, but it's bad on UP too.

It sucks because it does bad things to cache behaviour. Sure, you'll move the cost away from "clear_user_highpage", but the thing is, you will _not_ move it into the idle time. What you will do is move it into some random time _after_ the idle time, when the idle thing has crapped all over your caches.

The thing is, you make your cache footprint per CPU _much_ bigger, and you spread it out a lot over time too, so you make it even worse.

The clearing will then be totally hidden in the profiles, because you will have turned a nice and well-behaved "this is where the time goes" profile into a mush of "we're taking cache misses at random times, and we don't know why".

That, btw, is a _classic_ mistake in profiling: moving the work around so that it's not as visible any more. In other words, don't do it. It's a mistake. It is optimizing the profile without actually optimizing what you want _done_.

Btw, this is exactly what the totally brain-damaged slab stuff does. It takes away the peaks, but does so by having worse cache access patterns all around.

Look at it this way:

 - it might be worth doing in big batches under some kind of user control, when you really can _control_ that it happens at a good time. I _might_ buy into this argument. Make it a batch thing that really screws the caches, but only does so very seldom, when the user asked for it.

 - but we aren't supposed to have that much memory free _anyway_, and trying to keep it around on a separate list is horrible for fragmentation. So batching huge things up is likely not a good idea either.

 - with caches growing larger, it's actually BETTER to clear the page at usage time, because then the CPU that actually touches the page won't have to bring the page in from memory. We'll blow one page of cache by clearing it, but we will blow it in a "good" way - hopefully with almost no memory traffic at all (ie the clear can be done as pure invalidate cycles, with no read-back into the CPU). And the thing is, the background clearing will just get worse and worse.

In summary: it's a _good_ thing when you see a sharp peak in your profiles, and you can say "I know exactly what that peak is for, and it's doing exactly the work it should be doing and nothing else".

		Linus
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-11 23:53 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 16:46:10 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> It sucks because it does bad things to cache behaviour. Sure, you'll move
> the cost away from "clear_user_highpage", but the thing is, you will _not_
> move it into the idle time. What you will do is move it into some
> random time _after_ the idle time, when the idle thing has crapped all
> over your caches.

It won't crap out the caches on sparc64, or on any platform with cache-bypass-on-miss stores. I believe ia64 and Opteron have similar mechanisms. If the store misses the L2 cache, it goes straight out to main memory; it doesn't allocate cache lines anywhere in such cases.

I think ppc/ppc64 has this too... no, sorry, it has the data-cache allocate-line-and-zero instruction, which isn't what you want here.

> The thing is, you make your cache footprint per CPU _much_ bigger, and you
> spread it out a lot over time too, so you make it even worse.
>
> The clearing will then be totally hidden in the profiles, because you will
> have turned a nice and well-behaved "this is where the time goes" profile
> into a mush of "we're taking cache misses at random times, and we don't
> know why".

Therefore, I do not believe any of this is applicable.
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-12 0:00 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:

> It won't crap out the caches on sparc64, or on any platform with
> cache-bypass-on-miss stores. I believe ia64 and Opteron have similar
> mechanisms.

You didn't read my message. If it doesn't crap on the caches when you do the stores, it _will_ crap on the bus, both when you do the stores _and_ when you actually read the page.

In other words, you will have taken _more_ of a hit later on. It's just that it won't be a nice profile hit; it will be a nasty "everything runs slower later".

Caches work best when you have good temporal locality. You are removing that locality, and thus you are making your caches _less_ efficient.

That's a very _fundamental_ argument.

> If the store misses the L2 cache, it goes straight out to main memory;
> it doesn't allocate cache lines anywhere in such cases.
>
> I think ppc/ppc64 has this too... no, sorry, it has the data-cache
> allocate-line-and-zero instruction, which isn't what you want here.

It's exactly what you _do_ want, it's just that you want it in "clear_user_highpage()". Then you have the perfect cache behaviour, assuming your cache is big enough that it will likely get a good hit ratio on the new page.

And let's admit it now: caches _are_ big enough that they get good hit ratios on things with good temporal locality.

Larger caches will happen. My argument will only get more relevant. Your approach will force cache misses and tons of memory bus traffic.

		Linus
* Re: clear_user_highpage()
From: Benjamin Herrenschmidt @ 2004-08-12 0:06 UTC
To: Linus Torvalds; +Cc: David S. Miller, Linux Arch list

> Caches work best when you have good temporal locality. You are removing
> that locality, and thus you are making your caches _less_ efficient.
>
> That's a very _fundamental_ argument.

Ok, that may be why it was removed from ppc then; I should ask Paul.

Ben.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 0:24 UTC
To: Benjamin Herrenschmidt; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 10:06:37 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> Ok, that may be why it was removed from ppc then; I should ask Paul.

I think it might have more to do with the fact that they got tired of locally patching their tree all the time. It required changes to generic code which they could never get merged upstream.
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 0:23 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 17:00:37 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> You didn't read my message. If it doesn't crap on the caches when you do
> the stores, it _will_ crap on the bus, both when you do the stores _and_
> when you actually read the page.

I understand what you're saying.

> Caches work best when you have good temporal locality. You are removing
> that locality, and thus you are making your caches _less_ efficient.
>
> That's a very _fundamental_ argument.

Here is some more data.

If I use the cache-bypassing stores on sparc64 for clear page (which I do and always have), it takes roughly 4400 cycles to clear a page out on a 750MHz cpu, regardless of whether the page is in the L2 cache or not.

Conversely, I played with a version that did not bypass the cache: on a cache hit it was phenomenal, about twice as fast, but in the cache miss case it was very slow, some 20,000 cycles. I played around with trying to prefetch the data into the L2 cache; that didn't help much in the miss case at all.

Also, when the user takes that first write fault on the anonymous page, it typically accesses the first several bytes (it is usually a malloc chunk or similar); it doesn't typically walk the entire page. So to me, bringing the whole thing in seems inefficient. Let the process bring the cache lines in when they're really needed, which (for all the cache lines in that page) is not necessarily when the write fault occurs and we clear the page out. If it happened to be in the L2 cache at clear_user_highpage() time, it'll stay there during the clearing, and that's great too.

Is that logic fundamentally flawed?

> Larger caches will happen. My argument will only get more relevant. Your
> approach will force cache misses and tons of memory bus traffic.

I agree with you. But I believe, given the data above wrt. sparc64, it is a profitable scheme at least on that platform.

You definitely have piqued my interest in some things, though. I'll try out the expensive clear_user_highpage() that always brings the data into the L2 cache, and see if that makes kernel builds faster. Although I think the fact that clear_user_highpage() will be 5 times slower in the L2 miss case might nullify any gains that always bringing the data in for the user might give. We'll see.
* Re: clear_user_highpage()
From: Linus Torvalds @ 2004-08-12 1:46 UTC
To: David S. Miller; +Cc: linux-arch

On Wed, 11 Aug 2004, David S. Miller wrote:
>
> If I use the cache-bypassing stores on sparc64 for clear page (which I
> do and always have), it takes roughly 4400 cycles to clear a page out
> on a 750MHz cpu, regardless of whether the page is in the L2 cache or
> not.
>
> Conversely, I played with a version that did not bypass the cache: on
> a cache hit it was phenomenal, about twice as fast, but in the cache
> miss case it was very slow, some 20,000 cycles. I played around with
> trying to prefetch the data into the L2 cache; that didn't help much
> in the miss case at all.

Ok. This is exactly why you want to have an "establish cache line" instruction: you _cannot_ make a perfect memset without one.

I'm surprised that even CPUs that have cache control instructions don't have that very fundamental "establish" one. ppc is actually the only one I know of that does. Clearly the ultrasparc doesn't figure out the clear-cache-line case, and makes the regular memset() be a fairly synchronous "read cacheline + writeout". Which will indeed suck.

> So to me, bringing the whole thing in seems inefficient.

Absolutely. What we want from a software perspective is a "get exclusive cacheline without reading it from memory", using a cache line invalidate setup rather than reading it.

> Is that logic fundamentally flawed?

I suspect that the cache-bypass stores might be the right thing until the cache grows big enough that it hurts more than it helps. Is there no "store to cache line, but do not establish" instruction? Sounds like that should be the fastest one for your setup.

> You definitely have piqued my interest in some things, though. I'll try
> out the expensive clear_user_highpage() that always brings the data into
> the L2 cache, and see if that makes kernel builds faster. [...]

Yeah, sounds horrible. I can't imagine that the cost of bringing it into the cache, if it wasn't there already, can ever really help you. Then you might as well wait to bring it in until much later.

		Linus
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 2:51 UTC
To: Linus Torvalds; +Cc: linux-arch

On Wed, 11 Aug 2004 18:46:56 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> Ok. This is exactly why you want to have an "establish cache line"
> instruction: you _cannot_ make a perfect memset without one.

I can prefetch for one or multiple writes, but these only install the cacheline in exclusive state if no other cpu responds to the snoop.

> Clearly the ultrasparc doesn't figure out the clear-cache-line case,
> and makes the regular memset() be a fairly synchronous "read cacheline
> + writeout". Which will indeed suck.

The cache-bypassing block stores store 64 bytes at a time (i.e. a full cache line). So the line either goes directly into the L2 cache from the write-cache (which itself is 2K) or it goes right out to the memory bus as a cacheline write.

> Absolutely. What we want from a software perspective is a "get exclusive
> cacheline without reading it from memory", using a cache line invalidate
> setup rather than reading it.

Yes. For the "hit in L2" case, that is what the cache-bypassing stores on sparc64 effectively do.

> Is there no "store to cache line, but do not establish" instruction?
> Sounds like that should be the fastest one for your setup.

Yes, but it acts that way only on an L2 hit.

> Yeah, sounds horrible. I can't imagine that the cost of bringing it into
> the cache, if it wasn't there already, can ever really help you. Then you
> might as well wait to bring it in until much later.

I'm still undecided. I think there is real value in the issue William and I keep bringing up, which is that the arguments you propose hinge upon the process using some significant portion of the page right after the anonymous page fault, and I concur with William that this is not typically the case.
* Re: clear_user_highpage()
From: Paul Mackerras @ 2004-08-16 1:58 UTC
To: David S. Miller; +Cc: Linus Torvalds, linux-arch

David S. Miller writes:

> If I use the cache-bypassing stores on sparc64 for clear page (which I
> do and always have), it takes roughly 4400 cycles to clear a page out
> on a 750MHz cpu, regardless of whether the page is in the L2 cache or
> not.

Just for fun (and to make Dave jealous :) I instrumented clear_page() on the G5 to measure the number of calls and the total time taken. (Note that clear_user_highpage calls clear_page, which gets inlined.)

The result was that clear_page takes an average of 96ns (192 cycles) per page on my 2-way 2GHz G5. Our pages are 4k, so you would have to double that to get a fair comparison with the sparc, but even then we are still only taking 9% of the cycles. :)

This is using the dcbz (data cache block zero) instruction, which makes a cache line exclusive in the cache and zeroes it without memory traffic (there is some bus traffic on SMP because it has to issue a kill to all the other processors). There will of course be memory traffic later as those cache lines get written back, but that occurs in cacheline-sized bursts, and to the extent that the program writes to the page before the lines get written back, we win.

Regards,
Paul.
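The dcbz-based clear_page Paul describes boils down to a loop like the following sketch. This is illustrative C with ppc inline asm, not the actual kernel implementation; the 128-byte dcbz block size (as on the G5) and 4k page size are assumptions:

```c
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_LINE_SIZE 128   /* dcbz block size assumed, per the G5 */

/* Zero a page without reading it from memory: each dcbz establishes
 * the cache line in exclusive state, already zeroed, so the only bus
 * traffic is the ownership "kill" sent to the other CPUs. */
static void clear_page_dcbz(void *page)
{
    unsigned long off;

    for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_LINE_SIZE)
        __asm__ __volatile__("dcbz 0,%0"
                             : /* no outputs */
                             : "r"((char *)page + off)
                             : "memory");
}
```

This is the "establish cache line" primitive Linus asked for above: the memset disappears entirely, at the cost of the line being written back to memory later if the program never reuses it.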
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 2:08 UTC
To: Linus Torvalds; +Cc: David S. Miller, linux-arch

On Wed, Aug 11, 2004 at 05:00:37PM -0700, Linus Torvalds wrote:
> You didn't read my message. If it doesn't crap on the caches when you do
> the stores, it _will_ crap on the bus, both when you do the stores _and_
> when you actually read the page.

I discovered this the hard way on Opteron too. At some point I was doing clear_page using cache-bypassing write-combining stores. That was done because it was faster in microbenchmarks that just tested the function, but on actual macro benchmarks it was quite bad, because the applications were eating cache misses all the time. Doing it in the idle loop would have the same problem.

Where I could see it making sense is for page table pages, though (especially if you cache in a bitmap which ptes have actually been touched and ignore the rest).

> In other words, you will have taken _more_ of a hit later on. It's just
> that it won't be a nice profile hit; it will be a nasty "everything runs
> slower later".

Yep, it's a bad idea.

-Andi
* Re: clear_user_highpage()
From: David S. Miller @ 2004-08-12 2:45 UTC
To: Andi Kleen; +Cc: torvalds, linux-arch

On Thu, 12 Aug 2004 04:08:25 +0200
Andi Kleen <ak@suse.de> wrote:

> I discovered this the hard way on Opteron too. At some point I was
> doing clear_page using cache-bypassing write-combining stores. That
> was done because it was faster in microbenchmarks that just tested
> the function, but on actual macro benchmarks it was quite bad,
> because the applications were eating cache misses all the time.

Do these cache-bypassing stores use the L2 cache on a hit?
* Re: clear_user_highpage()
From: Andi Kleen @ 2004-08-12 9:09 UTC
To: David S. Miller; +Cc: torvalds, linux-arch

On Wed, 11 Aug 2004 19:45:45 -0700
"David S. Miller" <davem@redhat.com> wrote:

> Do these cache-bypassing stores use the L2 cache on a hit?

No, they invalidate the cache.

-Andi
* Re: clear_user_highpage() 2004-08-12 9:09 ` clear_user_highpage() Andi Kleen @ 2004-08-12 19:50 ` David S. Miller 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox 0 siblings, 2 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 19:50 UTC (permalink / raw) To: Andi Kleen; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 11:09:24 +0200 Andi Kleen <ak@suse.de> wrote: > On Wed, 11 Aug 2004 19:45:45 -0700 > "David S. Miller" <davem@redhat.com> wrote: > > > Do these cache-bypassing stores use the L2 cache on a hit? > > No, they invalidate the cache. That explains, at least partly, why they performed so poorly. Is there any other platform that has the same kind of block stores sparc64 does (basically use L2 cache if line present, else bypass L2 cache for the store and do not allocate L2 cache lines for the data)? I bet ia64 does have something like this. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 19:50 ` clear_user_highpage() David S. Miller @ 2004-08-12 20:00 ` Andi Kleen 2004-08-12 20:30 ` clear_user_highpage() David S. Miller 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox 1 sibling, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-08-12 20:00 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 12:50:59 -0700 "David S. Miller" <davem@redhat.com> wrote: > On Thu, 12 Aug 2004 11:09:24 +0200 > Andi Kleen <ak@suse.de> wrote: > > > On Wed, 11 Aug 2004 19:45:45 -0700 > > "David S. Miller" <davem@redhat.com> wrote: > > > > > Do these cache-bypassing stores use the L2 cache on a hit? > > > > No, they invalidate the cache. > > That explains, at least partly, why they performed so poorly. Well, the writes are usually faster. While they don't use the cache, they use special write-combining buffers in the CPU that hold the data until it can blast out a full cache line. The advantage is that it doesn't have to read anything first. How effective this is depends on the CPU; in general, newer x86s tend to have much larger WC buffers than the previous generation (e.g. Intel just enlarged them again in Prescott). Unlike all other stores on x86, they are also very lazily ordered and need explicit memory barriers. Normally this is used for frame buffers and other hardware mappings, but sometimes it can be useful for a lot of streaming data too. > Is there any other platform that has the same kind of block > stores sparc64 does (basically use L2 cache if line present, > else bypass L2 cache for the store and do not allocate L2 > cache lines for the data)? I bet ia64 does have something > like this. This still has the same problem: in the end the data is out of cache, and when someone else needs it later they eat large penalties. -Andi P.S.: I added a new experimental option to use unordered WC stores for writel(). Haven't benchmarked it much so far, though.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen @ 2004-08-12 20:30 ` David S. Miller 0 siblings, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 20:30 UTC (permalink / raw) To: Andi Kleen; +Cc: torvalds, linux-arch On Thu, 12 Aug 2004 22:00:25 +0200 Andi Kleen <ak@suse.de> wrote: > Well, the writes are usually faster. While they don't use the > cache they use special write combining buffers in the CPU > that hold the data until it can blast out a full cache line. Advantage > is that it doesn't have to read anything first. Sure. Sparc64 has this too; in fact it has a full 2K write cache to absorb all of the cpu's write traffic. > How effective this is depends on the CPU, in general newer > x86s tend to have much larger WC buffers than the previous > generation (e.g. Intel just enlarged them again in Prescott) > > Unlike all other stores on x86 they are also very lazily ordered > and need explicit memory barriers. The cache-bypassing 64-byte block stores behave this way on sparc64. > This still has the same problem: in the end the data > is out of cache and when someone else needs it later they eat > large penalties. If it was in the cache to begin with, it will stay there. This is the case the x86_64 bits lose for; they'll kick the lines out. If it is out of cache, no L2 cache lines are allocated. This is how x86_64 will perform. I think the difference in the "hit" case behavior could matter. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 19:50 ` clear_user_highpage() David S. Miller 2004-08-12 20:00 ` clear_user_highpage() Andi Kleen @ 2004-08-12 21:34 ` Matthew Wilcox 2004-08-13 8:16 ` clear_user_highpage() David Mosberger 1 sibling, 1 reply; 41+ messages in thread From: Matthew Wilcox @ 2004-08-12 21:34 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, torvalds, linux-arch On Thu, Aug 12, 2004 at 12:50:59PM -0700, David S. Miller wrote: > Is there any other platform that has the same kind of block > stores sparc64 does (basically use L2 cache if line present, > else bypass L2 cache for the store and do not allocate L2 > cache lines for the data)? I bet ia64 does have something > like this. Yes, almost exactly. You can specify the "nta" hint to stores which means "non-temporal at all levels". If the cache-line is already present in the cache at any level, it will not be demoted, but if it isn't present, it'll bypass the cache entirely. If you want to specifically retain a cache line at a particular level in cache, you can prefetch it into that level, then use .nta and the line won't move. That's all according to the architecture reference anyway. I don't know how much of that processors actually implement and how much they think they know better than the programmer ;-) -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox @ 2004-08-13 8:16 ` David Mosberger 0 siblings, 0 replies; 41+ messages in thread From: David Mosberger @ 2004-08-13 8:16 UTC (permalink / raw) To: Matthew Wilcox; +Cc: David S. Miller, Andi Kleen, torvalds, linux-arch >>>>> On Thu, 12 Aug 2004 22:34:03 +0100, Matthew Wilcox <willy@debian.org> said: Matthew> Yes, almost exactly. You can specify the "nta" hint to Matthew> stores which means "non-temporal at all levels". If the Matthew> cache-line is already present in the cache at any level, it Matthew> will not be demoted, but if it isn't present, it'll bypass Matthew> the cache entirely. The architecture leaves the details to the chip family. For Itanium 2 chips, a store with the ".nta" hint means (see [1]): L1 cache: don't allocate, don't update LRU bits L2 cache: allocate, don't update LRU bits L3 cache: don't allocate, don't update LRU bits The textual description is as follows: .nta: This hint means non-temporal locality in all levels of the cache hierarchy. For the Itanium 2 processor, this hint will cause the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). This line will not be allocated in the L3 cache. If present in any cache, it will not be deallocated from that cache, although sometimes lines are deallocated for coherency reasons. So it's not exactly like the SPARC64 case but it is quite similar in nature. --david [1] http://www.intel.com/design/itanium2/manuals/251110.htm ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds 2004-08-11 23:53 ` clear_user_highpage() David S. Miller @ 2004-08-12 0:00 ` Benjamin Herrenschmidt 2004-08-12 0:21 ` clear_user_highpage() Linus Torvalds 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III 2 siblings, 1 reply; 41+ messages in thread From: Benjamin Herrenschmidt @ 2004-08-12 0:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, Linux Arch list > It sucks because it eats CPU and memory bandwidth when it shouldn't be > eaten. It's a total disaster on SMP, but it's bad on UP too. Ok, agreed about the SMP case. > It sucks because it does bad things to cache behaviour. Sure, you'll move > the cost away from "clear_user_highpage", but the thing is, you will _not_ > move it into the idle time. What you will do is to move it into some > random time _after_ the idle time, when the idle thing has crapped all > over your caches. You can probably code it in such a way that it won't do that, using cache hints. > The thing is, you make your cache footprint per CPU _much_ bigger, and you > spread it out a lot over time too, so you make it even worse. > > The clearing will then be totally hidden in the profiles, because you will > have turned a nice and well-behaved "this is where the time goes" profile > into a mush of "we're taking cache misses at random times, and we don't > know why". > > That, btw, is a _classic_ mistake in profiling. Move the work around so > that it's not as visible any more. > > In other words, don't do it. It's a mistake. It is optimizing the profile > without actually optimizing what you want _done_. > > Btw, this is exactly what the totally brain-damaged slab stuff does. It > takes away the peaks, but does so by having worse cache access patterns > all around. > > Look at it this way: > > - it might be worth doing in big batches under some kind of user control, > when you really can _control_ that it happens at a good time. 
> > I _might_ buy into this argument. Make it a batch thing that really > screws the caches, but only does so very seldom, when the user asked > for it. > > - but we aren't supposed to have that much memory free _anyway_, and > trying to keep it around on a separate list is horrible for > fragmentation. So batching huge things up is likely not a good idea > either. > > - with caches growing larger, it's actually BETTER to clear the page at > usage time, because then the CPU that actually touches the page won't > have to bring the page in from memory. We'll blow one page of cache > by clearing it, but we will blow it in a "good" way - hopefully with > almost no memory traffic at all (ie the clear can be done as pure > invalidate cycles, no read-back into the CPU). Ok, the latter makes sense... especially since we could use the ppc dcbz instruction to "create blank cache lines" (not bothering at all about the previous content of the line), though I would expect any modern write-combining CPU to figure that out based on the access pattern and end up doing the same at the cache level. > And the thing is, the background clearing will just get worse and worse. > > In summary: it's a _good_ thing when you see a sharp peak in your > profiles, and you can say "I know exactly what that peak is for, and it's > doing exactly the work it should be doing and nothing else". > > Linus -- Benjamin Herrenschmidt <benh@kernel.crashing.org> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 0:00 ` clear_user_highpage() Benjamin Herrenschmidt @ 2004-08-12 0:21 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 0:21 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: David S. Miller, Linux Arch list On Thu, 12 Aug 2004, Benjamin Herrenschmidt wrote: > > Ok, the later makes sense... especially since we could use the ppc dcbz > instruction to "create blank cache lines" (not bothering at all about > the previous content of the line) ppc64 definitely already does that according to <asm/page.h> ;) > , though I would expect any modern > write combining CPU to figure that out based on the access pattern and > end up doing the same at the cache level Quite possibly. I certainly hope so, but I suspect especially for the memory clearing case it's just simpler for everybody to just tell the CPU to do it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
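The dcbz trick Ben and Linus discuss amounts to clearing the page one cache line at a time, letting the hardware establish each zeroed line without a read-back from memory. A minimal sketch follows; the 128-byte line size is an assumption (dcbz must be stepped by the CPU's real cache-block size), the non-ppc branch is a portable stand-in, and this is not any architecture's actual clear_page implementation:

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SIZE      4096
#define CACHELINE_SIZE 128   /* assumption: must match the real line size */

/* Clear a page one cache line at a time. On ppc each inner step would
 * be a single dcbz, which installs a zeroed, valid line in the cache
 * WITHOUT first reading the stale contents from memory - the "create
 * blank cache lines" behaviour Ben describes. Elsewhere a plain
 * memset stands in, and a write-combining CPU may detect the pattern
 * and achieve the same effect, as Ben speculates. */
static void clear_page_by_lines(void *page)
{
    char *p = page;
    for (size_t off = 0; off < PAGE_SIZE; off += CACHELINE_SIZE) {
#if defined(__powerpc__)
        __asm__ volatile("dcbz 0,%0" : : "r"(p + off) : "memory");
#else
        memset(p + off, 0, CACHELINE_SIZE);   /* portable stand-in */
#endif
    }
}
```

Note that dcbz only makes sense on cacheable memory; on caching-inhibited mappings it takes an alignment exception, which is one reason the real kernel keeps such code per-arch.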
* Re: clear_user_highpage() 2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds 2004-08-11 23:53 ` clear_user_highpage() David S. Miller 2004-08-12 0:00 ` clear_user_highpage() Benjamin Herrenschmidt @ 2004-08-12 0:46 ` William Lee Irwin III 2004-08-12 1:01 ` clear_user_highpage() David S. Miller 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 2 siblings, 2 replies; 41+ messages in thread From: William Lee Irwin III @ 2004-08-12 0:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: >> The PPC people used to zero out pages in the cpu idle loop >> and I'd definitely like to do something along those lines >> on sparc64 as well, I feel it would be extremely effective. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > No. It sucks. It sucks so bad it's not funny. > It sucks because it eats CPU and memory bandwidth when it shouldn't be > eaten. It's a total disaster on SMP, but it's bad on UP too. Results from prototype prezeroing patches (ca. 2001) showed that dedicating a cpu on a 16x machine to prezeroing userspace pages (doing no other work on that cpu) improved kernel compile (insert sound of projectile vomiting here) "benchmarks". This suggests cache pollution and scheduling latency can be circumvented under some circumstances. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > It sucks because it does bad things to cache behaviour. Sure, you'll move > the cost away from "clear_user_highpage", but the thing is, you will _not_ > move it into the idle time. What you will do is to move it into some > random time _after_ the idle time, when the idle thing has crapped all > over your caches. > The thing is, you make your cache footprint per CPU _much_ bigger, and you > spread it out a lot over time too, so you make it even worse. Uncached zeroing, dedicated cpus, or appropriate cache semantics (e.g. 
not allocating a cacheline, either via some special instruction or by the cache in general not allocating lines on some writes and/or zeroing writes that miss) negate this. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > The clearing will then be totally hidden in the profiles, because you will > have turned a nice and well-behaved "this is where the time goes" profile > into a mush of "we're taking cache misses at random times, and we don't > know why". > That, btw, is a _classic_ mistake in profiling. Move the work around so > that it's not as visible any more. > In other words, don't do it. It's a mistake. It is optimizing the profile > without actually optimizing what you want _done_. > Btw, this is exactly what the totally brain-damaged slab stuff does. It > takes away the peaks, but does so by having worse cache access patterns > all around. I beg to differ; where slab preconstruction has not been effective, that has had to do with the heaviness of the slab allocator itself, and when the slab allocator is circumvented, preconstruction is effective even where the allocator is otherwise too heavyweight. Zeroing pagetables is in fact the poster child for this, where almost all architectures have cached prezeroed pagetables forever. Reinstating caching of i386 pagetables improved SDET performance by a consistent (and hence statistically significant) margin of 1%-1.5%. One of the key aspects of an access pattern that makes preconstruction useful is that very little of the allocated memory is actually touched during typical accesses. Hence, the construction of the object pollutes the cache with numerous cachelines that are rarely touched. Objects as large as pages, e.g. pagetable pages, show this very well. Typical usage of the upper levels is sparse, and for smaller processes the lower levels are also sparsely used. Userspace likewise can't be assumed to reference an entire zeroed page allocated to it. 
Userspace can't be predicted but it is also typical there for only small portions of large data structures to be referenced. e.g. a large, say, PAGE_SIZE buffer is allocated for read() traffic, but all typical read()'s are only a few bytes in length. And in general the "precharging" stalls taking unnecessary misses for the cachelines of the object that are rarely accessed, pollutes the cache with those cachelines of the object that are rarely accessed, and burns a few extra cycles (dwarfed by the misses on the unnecessarily- touched cachelines) doing an unnecessary pass over the object. On Wed, Aug 11, 2004 at 04:46:10PM -0700, Linus Torvalds wrote: > Look at it this way: > - it might be worth doing in big batches under some kind of user control, > when you really can _control_ that it happens at a good time. > I _might_ buy into this argument. Make it a batch thing that really > screws the caches, but only does so very seldom, when the user asked > for it. > - but we aren't supposed to have that much memory free _anyway_, and > trying to keep it around on a separate list is horrible for > fragmentation. So batching huge things up is likely not a good idea > either. > - with caches growing larger, it's actually BETTER to clear the page at > usage time, because then the CPU that actually touches the page won't > have to bring in the page in from memory. We'll blow one page of cache > by clearing it, but we will blow it in a "good" way - hopefully with > almost no memory traffic at all (ie the clear can be done as pure > invalidate cycles, no read-back into the CPU). > And the thing is, the background clearing will just get worse and worse. > In summary: it's a _good_ thing when you see a sharp peak in your > profiles, and you can say "I know exactly what that peak is for, and it's > doing exactly the work it should be doing and nothing else". 
The real flaws I see in background zeroing are fragmentation and scheduling latency (or potential loss of cpus dedicated to the purpose). Preventing cache pollution is already a prerequisite for remotely non-naive implementations. The scheduling latency aspect is due to the fact that many cpus have caching semantics that require extremely slow uncached access to prevent cache pollution, and that page zeroing is a slow enough operation to noticeably stall rescheduling userspace. It's possible that this could be mitigated by incrementally zeroing pages and polling TIF_NEED_RESCHED between blocks of a page, but the background zeroing efforts went in a rather different, useless direction (dedicating cpus). The fragmentation bits are just as you say, an artifact of segregating a pool of pages from the general pool of free pages that can be coalesced. I haven't come up with any methods to address this. In general, I despise background processing and would rather see event-driven methods of accomplishing preconstruction, though I've no idea whatsoever how those would be carried out for userspace memory. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
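wli's suggestion of incrementally zeroing a page while polling TIF_NEED_RESCHED between blocks might look roughly like this. This is a user-space sketch under assumed names: need_resched() stands in for the kernel's test_thread_flag(TIF_NEED_RESCHED), and the chunk size is arbitrary:

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SIZE  4096
#define ZERO_CHUNK 1024   /* arbitrary block size between resched checks */

/* User-space stand-in for the kernel's test_thread_flag(TIF_NEED_RESCHED). */
static volatile bool resched_requested;
static bool need_resched(void) { return resched_requested; }

/* Zero a page in chunks, checking between chunks whether the scheduler
 * wants the CPU back. Returns the offset reached, so a caller could
 * resume (or abandon) a partially zeroed page instead of stalling
 * rescheduling for the full page-zeroing latency. */
static size_t zero_page_incremental(void *page, size_t start)
{
    char *p = page;
    size_t off = start;
    while (off < PAGE_SIZE) {
        memset(p + off, 0, ZERO_CHUNK);
        off += ZERO_CHUNK;
        if (need_resched())
            break;   /* bail out: latency beats throughput here */
    }
    return off;
}
```

The returned offset is the piece that makes resumption possible; a background zeroer would have to record it per page, which is part of the bookkeeping cost being debated.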
* Re: clear_user_highpage() 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III @ 2004-08-12 1:01 ` David S. Miller 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 1 sibling, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 1:01 UTC (permalink / raw) To: William Lee Irwin III; +Cc: torvalds, linux-arch On Wed, 11 Aug 2004 17:46:54 -0700 William Lee Irwin III <wli@holomorphy.com> wrote: > The scheduling latency aspect is due to the fact that many cpus have > caching semantics that require extremely slow uncached accesss to > prevent cache pollution, and that page zeroing is slow enough of an > operation to noticeably stall rescheduling userspace. This wouldn't be an issue on sparc64, as I've previously stated an entire page can be zero'd out, cache bypassed on miss, in 4400 cycles even when the L2 cache misses for the whole page. That would add less to rescheduling latency than that obtained from simply taking a hardware interrupt. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 0:46 ` clear_user_highpage() William Lee Irwin III 2004-08-12 1:01 ` clear_user_highpage() David S. Miller @ 2004-08-12 2:18 ` Linus Torvalds 2004-08-12 2:43 ` clear_user_highpage() David S. Miller ` (3 more replies) 1 sibling, 4 replies; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 2:18 UTC (permalink / raw) To: William Lee Irwin III; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, William Lee Irwin III wrote: > > Results from prototype prezeroing patches (ca. 2001) showed that > dedicating a cpu on a 16x machine to prezeroing userspace pages (doing > no other work on that cpu) improved kernel compile (insert sound of > projectile vomiting here) "benchmarks". This suggests cache pollution > and scheduling latency can be circumvented under some circumstances. Heh. And at what point does it become a problem? Caches are growing, at some point it is going to be a loss to zero memory on another CPU.. I really do believe (but can't back it up with any real numbers) that we want to try to keep pages in cache as long as possible. That means keeping the pages close to the last CPU that used them, btw. It would be interesting to see if we could make the buddy allocator more "per-cpu" friendly, for example - I suspect that would make much _more_ of a difference than pre-zeroing pages. As it is, the pages we allocate have _no_ CPU affinity (unlike kmalloc/slab), and as a result they aren't even very likely to be in the cache even if you have tons of cache on the CPU. And my whole argument against pre-zeroing really falls totally flat if the pages aren't in the cache. So I'd personally be a whole lot more interested in seeing whether we could have per-CPU pages than in pre-zeroing. Fragmentation of memory is the _big_ problem, of course. It comes up almost for _any_ page allocation issue. But it might be interesting to see if we could have a special per-cpu "page pool" for some usage. 
Sized fairly small - on the order of a few times the CPU cache size - and used for anonymous pages that we think might be short-lived. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds @ 2004-08-12 2:43 ` David S. Miller 2004-08-12 4:19 ` clear_user_highpage() Linus Torvalds 2004-08-12 2:57 ` clear_user_highpage() David S. Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 41+ messages in thread From: David S. Miller @ 2004-08-12 2:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > So I'd personally be a whole lot more interested in seeing whether we > could have per-CPU pages than in pre-zeroing. We have that cold/hot page thing in the current 2.6.x tree, or are you talking about something else? I'm talking about the struct per_cpu_pages stuff. It's the first thing buffered_rmqueue() checks when the request order of the page allocation is zero. ^ permalink raw reply [flat|nested] 41+ messages in thread
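The per-cpu front end Dave points at (struct per_cpu_pages, consulted by buffered_rmqueue() for order-0 requests) can be modelled as a small LIFO of recently freed pages checked before falling back to the buddy allocator. The following toy model uses illustrative names and sizing, not the kernel's:

```c
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>

#define PCP_HIGH 8   /* assumption: the kernel sizes these lists per zone */

struct pcp_list {
    void *pages[PCP_HIGH];
    int count;
};

/* Stand-in for the buddy allocator slow path. */
static void *buddy_alloc_page(void) { return malloc(4096); }

static void *alloc_page_hot(struct pcp_list *pcp)
{
    if (pcp->count > 0)
        return pcp->pages[--pcp->count];   /* likely still cache-warm */
    return buddy_alloc_page();
}

static void free_page_hot(struct pcp_list *pcp, void *page)
{
    if (pcp->count < PCP_HIGH)
        pcp->pages[pcp->count++] = page;   /* keep it cpu-local */
    else
        free(page);                        /* overflow back to buddy */
}
```

The LIFO order is the whole point for cache warmth: the most recently freed page (whose lines are most likely still resident) is the first one handed back out.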
* Re: clear_user_highpage() 2004-08-12 2:43 ` clear_user_highpage() David S. Miller @ 2004-08-12 4:19 ` Linus Torvalds 2004-08-12 4:46 ` clear_user_highpage() William Lee Irwin III 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-08-12 4:19 UTC (permalink / raw) To: David S. Miller; +Cc: wli, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: > > We have that cold/hot page thing in the current 2.6.x > tree, or are you talking about something else? You're right. It ended up never having problems (or they were worked out in the -mm tree), so I forgot all about it ;) How effective is it? Maybe the numbers that were done in 2001 aren't relevant any more? Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 4:19 ` clear_user_highpage() Linus Torvalds @ 2004-08-12 4:46 ` William Lee Irwin III 2004-08-15 6:22 ` clear_user_highpage() Andrew Morton 0 siblings, 1 reply; 41+ messages in thread From: William Lee Irwin III @ 2004-08-12 4:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: David S. Miller, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: >> We have that cold/hot page thing in the current 2.6.x >> tree, or are you talking about something else? On Wed, Aug 11, 2004 at 09:19:32PM -0700, Linus Torvalds wrote: > You're right. It ended up never having problems (or they were worked out > in the -mm tree), so I forgot all about it ;) > How effective is it? Maybe the numbers that were done in 2001 aren't > relevant any more? For lock amortization it's extremely effective. Its effects on caching have never been properly instrumented that I know of. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 4:46 ` clear_user_highpage() William Lee Irwin III @ 2004-08-15 6:22 ` Andrew Morton 2004-08-15 6:38 ` clear_user_highpage() William Lee Irwin III 0 siblings, 1 reply; 41+ messages in thread From: Andrew Morton @ 2004-08-15 6:22 UTC (permalink / raw) To: William Lee Irwin III; +Cc: torvalds, davem, linux-arch William Lee Irwin III <wli@holomorphy.com> wrote: > > On Wed, 11 Aug 2004, David S. Miller wrote: > >> We have that cold/hot page thing in the current 2.6.x > >> tree, or are you talking about something else? > > On Wed, Aug 11, 2004 at 09:19:32PM -0700, Linus Torvalds wrote: > > You're right. It ended up never having problems (or they were worked out > > in the -mm tree), so I forgot all about it ;) > > How effective is it? Maybe the numbers that were done in 2001 aren't > > relevant any more? > > For lock amortization it's extremely effective. Its effects on caching > have never been properly instrumented that I know of. No, we (me, mbligh) instrumented the crap out of it. It turned out that the cache affinity was of very marginal benefit, if any. I cooked up an artificial benchmark which consisted of writing 32k to a file, then truncating it back to zero, then repeating. Four instances of that, against four separate files on a 4-way, showed a large speedup - 2x or 3x, from memory. But for real-world workloads you really needed to squint to see anything at all. Which is why I dithered without sending it to Linus for a couple of months. Ended up merging it anyway because of some lock contention benefits, and because someone might have a workload which involves repeated write/truncate looping ;) ^ permalink raw reply [flat|nested] 41+ messages in thread
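One pass of the write-32k-then-truncate loop Andrew describes is easy to reconstruct; each pass dirties eight fresh pages of page cache and immediately frees them, which is exactly the allocate/free churn the hot/cold lists reward. A sketch of one iteration (file handling and sizes are illustrative, not Andrew's actual harness):

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <assert.h>

/* One iteration of the artificial benchmark: write 32k to the file,
 * then truncate it back to zero. Run four instances against four
 * separate files to reproduce the 4-way setup described. */
static int write_truncate_pass(int fd, const char *buf, size_t len)
{
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (ftruncate(fd, 0) < 0)
        return -1;
    return 0;
}
```

In a loop, every pass allocates page-cache pages for the write and releases them at the truncate, so the same few just-freed, cache-warm pages get recycled - the pattern behind the 2x-3x artificial speedup.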
* Re: clear_user_highpage() 2004-08-15 6:22 ` clear_user_highpage() Andrew Morton @ 2004-08-15 6:38 ` William Lee Irwin III 0 siblings, 0 replies; 41+ messages in thread From: William Lee Irwin III @ 2004-08-15 6:38 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, davem, linux-arch William Lee Irwin III <wli@holomorphy.com> wrote: >> For lock amortization it's extremely effective. Its effects on caching >> have never been properly instrumented that I know of. On Sat, Aug 14, 2004 at 11:22:23PM -0700, Andrew Morton wrote: > No, we (me, mbligh) instrumented the crap out of it. It turned out that > the cache affinity was of very marginal benefit, if any. > I cooked up an artificial benchmark which consisted of writing 32k to a > file, then truncating it back to zero, then repeating. Four instances of > that, against four separate files on 4-way showed a large speedup - 2x or > 3x, from memory. But for real-world workloads you really needed to squint > to see anything at all. > Which is why I dithered without sending it to Linus for a couple of months. > Ended up merging it anyway because of some lock contention benefits, and > because someone mught have a workload which involves repeated > write/truncate looping ;) I had more in mind that it had never been explained why the cache affinity was ineffective, which would seem to require getting some instrumentation of how often the lists were being turned over, how many remote frees are going on, how "out of order" frees are, etc. etc. What I heard at the time was that none of those were instrumented. -- wli ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: clear_user_highpage() 2004-08-12 2:18 ` clear_user_highpage() Linus Torvalds 2004-08-12 2:43 ` clear_user_highpage() David S. Miller @ 2004-08-12 2:57 ` David S. Miller 2004-08-12 3:20 ` clear_user_highpage() William Lee Irwin III 2004-08-13 21:41 ` clear_user_highpage() David S. Miller 3 siblings, 0 replies; 41+ messages in thread From: David S. Miller @ 2004-08-12 2:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > I really do believe (but can't back it up with any real numbers) that we > want to try to keep pages in cache as long as possible. That means keeping > the pages close to the last CPU that used them, btw. This reminded me of something. One place where things fall apart is for situations like a fork+exit benchmark such as lmbench's "lat_proc fork". Here is what happens:

    CPU 1                               CPU 2
    parent: alloc local cpu pagetable
    parent: init child page table
    parent: wait on child
                                        child: tlb miss, ref page tables
                                        child: exit_mmap
                                        child: free page tables to local cpu

It is exactly the most sub-optimal sequence of page table usage possible. CPU 1's cache empties constantly, while CPU 2's grows constantly. CPU 2 goes over its limit and starts feeding excess page table per-cpu cache pages into the generic page pool (and actually in 2.6.x into the per-cpu hot/cold page lists). Meanwhile CPU 1 is constantly going to the page allocator for page table pages since the per-cpu pgtable cache is empty. It's amusing, and I just wanted to bring it to light while we're discussing things of this nature. ^ permalink raw reply [flat|nested] 41+ messages in thread
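The workload Dave is describing is the classic lmbench fork+exit latency loop; a minimal reconstruction is below. This is just a sketch of the driving benchmark (iteration count and timing method are mine, not lmbench's), included to make the cross-CPU alloc/free pattern concrete: each iteration builds the child's page tables in the parent and tears them down in the child:

```c
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>
#include <assert.h>

/* Time a fork+exit loop and return the average microseconds per
 * iteration. On SMP, when parent and child land on different CPUs,
 * the page tables are allocated on one CPU and freed on the other -
 * the ping-pong Dave describes. */
static double fork_exit_latency_us(int iters)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);              /* child: exit_mmap frees the tables */
        if (pid > 0)
            waitpid(pid, NULL, 0); /* parent: wait on child */
    }
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / iters;
}
```

Pinning the loop to one CPU versus letting the scheduler spread parent and child is what exposes the difference between the "same per-cpu pagetable cache" case and the pathological one.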
* Re: clear_user_highpage()
  2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
  2004-08-12  2:43   ` clear_user_highpage() David S. Miller
  2004-08-12  2:57   ` clear_user_highpage() David S. Miller
@ 2004-08-12  3:20   ` William Lee Irwin III
  2004-08-13 21:41   ` clear_user_highpage() David S. Miller
  3 siblings, 0 replies; 41+ messages in thread
From: William Lee Irwin III @ 2004-08-12  3:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, linux-arch

On Wed, 11 Aug 2004, William Lee Irwin III wrote:
>> Results from prototype prezeroing patches (ca. 2001) showed that
>> dedicating a cpu on a 16x machine to prezeroing userspace pages (doing
>> no other work on that cpu) improved kernel compile (insert sound of
>> projectile vomiting here) "benchmarks". This suggests cache pollution
>> and scheduling latency can be circumvented under some circumstances.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> Heh.
> And at what point does it become a problem? Caches are growing, at some
> point it is going to be a loss to zero memory on another CPU..

The cache pollution and scheduling latencies would have been
introduced by earlier versions of the prototype prezeroing patch
(they should be inherent to most naive implementations).  The
implementor of those prototypes was unaware of PCD, PAT, and various
other tricks, so I'm rather suspicious of it all, and the result is
vaguely disgusting.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> I really do believe (but can't back it up with any real numbers) that we
> want to try to keep pages in cache as long as possible. That means keeping
> the pages close to the last CPU that used them, btw.
> It would be interesting to see if we could make the buddy allocator more
> "per-cpu" friendly, for example - I suspect that would make much _more_ of
> a difference than pre-zeroing pages.

Per-cpu zoning, perhaps?  The hot/cold pages bits seem to achieve
more in terms of lock amortization than cache warmth, probably due to
the lists being turned over too often.  Page allocation rates are
truly immense, but I've not checked the hot/cold list turnover rates
to see what's going on there, in part because out-of-order frees
spoil the naive accounting methods.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> As it is, the pages we allocate have _no_ CPU affinity (unlike
> kmalloc/slab), and as a result they aren't even very likely to be in the
> cache even if you have tons of cache on the CPU.
> And my whole argument against pre-zeroing really falls totally flat if the
> pages aren't in the cache.
> So I'd personally be a whole lot more interested in seeing whether we
> could have per-CPU pages than in pre-zeroing.

There are a few other points in the design space, e.g. batching, that
haven't been tried yet: e.g. in the fault handler, do write-through
zeroing of ZERO_BATCH_SIZE - 1 pages and a cached zero of the page to
be handed to userspace when some per-cpu pool of pages is empty, or
similar nonsense (maybe via schedule_work(), or queueing pages for
the idle task to process, or something else that sounds like a
plausible way to salvage things).

Truly speculative background zeroing (or "page scrubbing") is just
wrong, as various workloads, e.g. routing, have next to zero
userspace participation and may literally be interested in
eliminating the last userspace process running, or avoiding ever
running userspace altogether on very memory-constrained embedded
systems.  So I think that if there can be a proper prezeroing
implementation, it would only perform prezeroing in response to some
event or when guided by some prediction.  I guess it's a squishier
objection than "implementing it via $FOO got numbers $BAR", but
anyhow.

On Wed, Aug 11, 2004 at 07:18:18PM -0700, Linus Torvalds wrote:
> Fragmentation of memory is the _big_ problem, of course. It comes up
> almost for _any_ page allocation issue. But it might be interesting to see
> if we could have a special per-cpu "page pool" for some usage. Sized
> fairly small - on the order of a few times the CPU cache size - and used
> for anonymous pages that we think might be short-lived.

Well, regardless of whether zones per se are used, some larger
physically contiguous cpu-affine memory pools than the hot/cold page
lists sound very close to this ideal.  I think the important aspect
of their being physically contiguous is that the contiguity prevents
the things from fragmenting areas outside that physical region.

The flaw in all this is that there's no adequate (not even
approximate, that I know of) method of predicting lifetimes of
userspace pages, and recovering from these mispredictions seems to
typically involve... (cue Darth Vader dirge) ...background processing
things have to wait for.

-- wli
* Re: clear_user_highpage()
  2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
                     ` (2 preceding siblings ...)
  2004-08-12  3:20 ` clear_user_highpage() William Lee Irwin III
@ 2004-08-13 21:41 ` David S. Miller
  2004-08-16 13:00   ` clear_user_highpage() David Mosberger
  3 siblings, 1 reply; 41+ messages in thread
From: David S. Miller @ 2004-08-13 21:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 19:18:18 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> I really do believe (but can't back it up with any real numbers) that we
> want to try to keep pages in cache as long as possible. That means keeping
> the pages close to the last CPU that used them, btw.

So I did some testing.  I changed the cache-bypassing
clear_user_page() into one that uses normal stores and does allocate
in the L2 cache.

I ran the full build tests 3 times for each case, and the numbers
were consistent.  It makes the full build take a full minute longer.

And I truly believe this is because of the argument William and I are
making: a write protection fault does not mean the process is going
to access a majority of the data in that page any time soon at all.

{clear,copy}_user_page() is not some kind of "prefetch the whole page
into the cache" for the user.  It would be if the user were going to
access the entire thing in the near future, but I do not believe that
is the typical access pattern for fresh anonymous pages.
* Re: clear_user_highpage()
  2004-08-13 21:41 ` clear_user_highpage() David S. Miller
@ 2004-08-16 13:00   ` David Mosberger
  2004-08-22 19:51     ` clear_user_highpage() Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: David Mosberger @ 2004-08-16 13:00 UTC (permalink / raw)
To: David S. Miller; +Cc: Linus Torvalds, wli, linux-arch

>>>>> On Fri, 13 Aug 2004 14:41:15 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> On Wed, 11 Aug 2004 19:18:18 -0700 (PDT) Linus Torvalds
  DaveM> <torvalds@osdl.org> wrote:

  >> I really do believe (but can't back it up with any real numbers)
  >> that we want to try to keep pages in cache as long as
  >> possible. That means keeping the pages close to the last CPU that
  >> used them, btw.

  DaveM> So I did some testing.

  DaveM> I changed the cache-bypassing clear_user_page() into one that
  DaveM> uses normal stores and does allocate in the L2 cache.

  DaveM> I ran the full build tests 3 times for each case, and the
  DaveM> numbers were consistent.  It makes the full build take a full
  DaveM> minute longer.

Very interesting.  I tried something similar on a dual 1.5GHz
Itanium 2.  I tried two versions of clear_page: one with .nta (the
non-temporal hint, which is the default) and one without.  The result
for 5 runs of a fairly minimal kernel-compile (with make -j8):

        with .nta:                     without .nta:

 real   80.0 70.2 71.0 70.4 70.3       79.4 69.5 69.5 69.5 69.3
 sys     4.6  4.6  4.6  4.6  4.5        5.3  5.4  5.3  5.4  5.4

Note that the first run was with a cold page cache, hence the longer
runtime.

So, on average, dropping the .nta costs us 0.78 seconds of
kernel-time but, overall, the kernel-builds complete about 0.94
seconds faster.

Of course, it's just one (more) data-point...

--david
* Re: clear_user_highpage()
  2004-08-16 13:00 ` clear_user_highpage() David Mosberger
@ 2004-08-22 19:51   ` Linus Torvalds
  2005-09-17 19:01     ` clear_user_highpage() Andi Kleen
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2004-08-22 19:51 UTC (permalink / raw)
To: davidm; +Cc: David S. Miller, wli, linux-arch

On Mon, 16 Aug 2004, David Mosberger wrote:
>
> Very interesting. I tried something similar on a dual 1.5GHz Itanium
> 2. I tried two versions of clear_page: one with .nta (no-temporal
> affinity hint, which is the default) and one without. The result for
> 5 runs of a fairly minimal kernel-compile (with make -j8):
>
>         with .nta:                     without .nta:
>
>  real   80.0 70.2 71.0 70.4 70.3       79.4 69.5 69.5 69.5 69.3
>  sys     4.6  4.6  4.6  4.6  4.5        5.3  5.4  5.3  5.4  5.4
>
> Note that the first run was with cold page cache, hence the longer
> runtime.
>
> So, on average, dropping the .nta costs us 0.78 seconds of kernel-time
> but, overall, the kernel-builds complete about 0.94 seconds faster.

I obviously love the result, since it validates my theory that it's
better to have a nice hot "clear_page()" and avoid cache misses
elsewhere.  Score one for WAGging.

That said, I suspect it does so exactly because the Itanium 2 has
largish caches, and it's likely that smaller (or external, slower)
caches or other loads will see different behaviour.

My basic point stands: hotspots in profiling are NOT automatically a
sign of anything bad.  I'd rather have hotspots where you can say
"this is an important function" than try to smush the costs out.  I
personally prefer a profile that clearly shows where the work is
being done to one that has most of the costs in a long tail of random
functions.

		Linus
* Re: clear_user_highpage()
  2004-08-22 19:51 ` clear_user_highpage() Linus Torvalds
@ 2005-09-17 19:01   ` Andi Kleen
  2005-09-17 19:16     ` clear_user_highpage() Andi Kleen
  0 siblings, 1 reply; 41+ messages in thread
From: Andi Kleen @ 2005-09-17 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David S. Miller, wli, linux-arch

On Sun, 2004-08-22 at 12:51 -0700, Linus Torvalds wrote:
> I obviously love the result, since it validates my theory that it's better
> to have a nice hot "clear_page()" and avoid cache misses elsewhere. Score
> one for WAGging.

Experiences on Opteron have been similar - NT stores seem to be a
loss for normal clear/copy_page.  However I liked the recent results
of someone using them for write() only, by defining a special
copy_from_user_uncached().

I suspect that would even be a win in most cases, as long as you
don't do it for pipes, but only for file systems.

-Andi
* Re: clear_user_highpage()
  2005-09-17 19:01 ` clear_user_highpage() Andi Kleen
@ 2005-09-17 19:16   ` Andi Kleen
  0 siblings, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2005-09-17 19:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David S. Miller, wli, linux-arch

Sorry - somehow my mailer got disorganized and I ended up replying to
this really old email.  Please ignore.

[Well, actually the copy_from_user_uncached stuff is still
interesting...]

-Andi
end of thread, other threads: [~2005-09-17 19:16 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-11 23:15 clear_user_highpage() David S. Miller
2004-08-11 23:31 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-11 23:55 ` clear_user_highpage() David S. Miller
2004-08-12  0:03 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  1:18 ` clear_user_highpage() William Lee Irwin III
2004-08-12  2:11 ` clear_user_highpage() Andi Kleen
2004-08-12  9:23 ` clear_user_highpage() Martin Schwidefsky
2004-08-11 23:46 ` clear_user_highpage() Linus Torvalds
2004-08-11 23:53 ` clear_user_highpage() David S. Miller
2004-08-12  0:00 ` clear_user_highpage() Linus Torvalds
2004-08-12  0:06 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  0:24 ` clear_user_highpage() David S. Miller
2004-08-12  0:23 ` clear_user_highpage() David S. Miller
2004-08-12  1:46 ` clear_user_highpage() Linus Torvalds
2004-08-12  2:51 ` clear_user_highpage() David S. Miller
2004-08-16  1:58 ` clear_user_highpage() Paul Mackerras
2004-08-12  2:08 ` clear_user_highpage() Andi Kleen
2004-08-12  2:45 ` clear_user_highpage() David S. Miller
2004-08-12  9:09 ` clear_user_highpage() Andi Kleen
2004-08-12 19:50 ` clear_user_highpage() David S. Miller
2004-08-12 20:00 ` clear_user_highpage() Andi Kleen
2004-08-12 20:30 ` clear_user_highpage() David S. Miller
2004-08-12 21:34 ` clear_user_highpage() Matthew Wilcox
2004-08-13  8:16 ` clear_user_highpage() David Mosberger
2004-08-12  0:00 ` clear_user_highpage() Benjamin Herrenschmidt
2004-08-12  0:21 ` clear_user_highpage() Linus Torvalds
2004-08-12  0:46 ` clear_user_highpage() William Lee Irwin III
2004-08-12  1:01 ` clear_user_highpage() David S. Miller
2004-08-12  2:18 ` clear_user_highpage() Linus Torvalds
2004-08-12  2:43 ` clear_user_highpage() David S. Miller
2004-08-12  4:19 ` clear_user_highpage() Linus Torvalds
2004-08-12  4:46 ` clear_user_highpage() William Lee Irwin III
2004-08-15  6:22 ` clear_user_highpage() Andrew Morton
2004-08-15  6:38 ` clear_user_highpage() William Lee Irwin III
2004-08-12  2:57 ` clear_user_highpage() David S. Miller
2004-08-12  3:20 ` clear_user_highpage() William Lee Irwin III
2004-08-13 21:41 ` clear_user_highpage() David S. Miller
2004-08-16 13:00 ` clear_user_highpage() David Mosberger
2004-08-22 19:51 ` clear_user_highpage() Linus Torvalds
2005-09-17 19:01 ` clear_user_highpage() Andi Kleen
2005-09-17 19:16 ` clear_user_highpage() Andi Kleen