public inbox for linux-kernel@vger.kernel.org
* Q: PREFETCH_STRIDE/16
@ 2002-05-23 15:12 Hugh Dickins
  2002-05-23 16:26 ` David Mosberger
  0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2002-05-23 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrea Arcangeli, Linus Torvalds

Could anyone please shed light on PREFETCH_STRIDE,
and in particular its sole use:
		prefetchw(pmd+j+(PREFETCH_STRIDE/16));
in mm/memory.c: free_one_pgd().

That looks to me suspiciously like something inserted to suit
one particular architecture - ia64? is it really suitable for
others? is 4*L1_CACHE_SIZE really right for PREFETCH_STRIDE
on anything that prefetches except ia64? what's the "/ 16"?
shouldn't there be a "/ sizeof(pmd_t)" somewhere (PAE or not)?
is it right to prefetch each time around that loop? isn't it
appropriate only to the exit_mm (0 to TASK_SIZE) clearance?

All in all, I'm thinking that line shouldn't be there,
or not without a substantial comment...

Thanks,
Hugh


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 16:26 ` David Mosberger
@ 2002-05-23 16:21   ` David S. Miller
  2002-05-23 16:43     ` David Mosberger
  0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 16:21 UTC (permalink / raw)
  To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 09:26:54 -0700
   
   That code certainly wasn't optimized for ia64.  Furthermore, I also do
   not like the prefetch distance it's using.  In fact, in my ia64-patch,
   I use the following code instead:
   
   		prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));
   
   This is more sensible (because it really does prefetch by the
   PREFETCH_STRIDE distance) and it also happens to run (slightly) faster
   on Itanium.

All of these particular prefetches are amusing, with or without your
fix, considering there are other more powerful ways to optimize this
stuff. :-)



* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 15:12 Q: PREFETCH_STRIDE/16 Hugh Dickins
@ 2002-05-23 16:26 ` David Mosberger
  2002-05-23 16:21   ` David S. Miller
  0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 16:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, Andrea Arcangeli, Linus Torvalds

>>>>> On Thu, 23 May 2002 16:12:16 +0100 (BST), Hugh Dickins <hugh@veritas.com> said:

  Hugh> Could anyone please shed light on PREFETCH_STRIDE, and in
  Hugh> particular its sole use:
  Hugh> prefetchw(pmd+j+(PREFETCH_STRIDE/16)); in mm/memory.c:
  Hugh> free_one_pgd().

  Hugh> That looks to me suspiciously like something inserted to suit
  Hugh> one particular architecture - ia64? is it really suitable for
  Hugh> others? is 4*L1_CACHE_SIZE really right for PREFETCH_STRIDE on
  Hugh> anything that prefetches except ia64? what's the "/ 16"?
  Hugh> shouldn't there be a "/ sizeof(pmd_t)" somewhere (PAE or not)?
  Hugh> is it right to prefetch each time around that loop? isn't it
  Hugh> appropriate only to the exit_mm (0 to TASK_SIZE) clearance?

  Hugh> All in all, I'm thinking that line shouldn't be there, or not
  Hugh> without a substantial comment...

That code certainly wasn't optimized for ia64.  Furthermore, I also do
not like the prefetch distance it's using.  In fact, in my ia64-patch,
I use the following code instead:

		prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));

This is more sensible (because it really does prefetch by the
PREFETCH_STRIDE distance) and it also happens to run (slightly) faster
on Itanium.

	--david


* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 16:43     ` David Mosberger
@ 2002-05-23 16:34       ` David S. Miller
  2002-05-23 17:02         ` David Mosberger
  0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 16:34 UTC (permalink / raw)
  To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 09:43:54 -0700

     DaveM> All of these particular prefetches are amusing, with or
     DaveM> without your fix, considering there are other more powerful
     DaveM> ways to optimize this stuff. :-)
   
   What do you have in mind?

I mentioned this 1 or 2 days ago in the TLB thread with
Linus: you pessimistically maintain a tiny bitmap per
mm_struct which keeps track of where mappings actually
are.  You use some hash function on the virtual address
to determine the bit.  You clear it when the mm_struct is
new, and you just set bits when mappings are installed.
Very simple.

Then all of these "walk all valid page tables" loops that scan entire
mostly empty pages of pgd/pmd/pte entries for no reason can just check
the bitmap instead.

Most of the exit overhead is in clear_page_tables walking over entire
pages.  It effectively flushes the cache unless all you are doing is
fork/exit/fork/exit.


* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 16:21   ` David S. Miller
@ 2002-05-23 16:43     ` David Mosberger
  2002-05-23 16:34       ` David S. Miller
  0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 16:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 09:21:55 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> All of these particular prefetches are amusing, with or
  DaveM> without your fix, considering there are other more powerful
  DaveM> ways to optimize this stuff. :-)

What do you have in mind?

	--david


* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 17:02         ` David Mosberger
@ 2002-05-23 17:00           ` David S. Miller
  2002-05-23 18:32             ` David Mosberger
  0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 17:00 UTC (permalink / raw)
  To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 10:02:33 -0700

   Sounds like something worth experimenting with.  I doubt you could
   really avoid (effectively) flushing the caches, but even if there are
   just a few zero bits in the bitmap at the time of the tear-down, a
   fair amount of time could be saved.

You'd be surprised how many 0 bits there will be in the average
process.  Even if you bring in all of emacs, glibc, X11R6 libs
etc. and the anonymous memory, there are still a HUGE portion of the
address space totally unused.

But like you said, worth experimenting with :-) First test would be:
start with 1 unsigned long as the bitmask in mm_context_t.  Just
implement the bit setting part.  Then at exit() count how many 0 bits
are left, record this into some counter table which has one counter
for 0 --> N_BITS_IN_LONG.  Make some debug /proc thing which spits the
table out.  (hint: at fork, clear out the child's bitmask before
copy_page_range is run for best results :-)

You can use this to do various things and see how much there is to gain
by going to two unsigned longs, three, etc.

Then you can hack up the actual clear_page_tables optimization (to
start) and measure the result.


* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 16:34       ` David S. Miller
@ 2002-05-23 17:02         ` David Mosberger
  2002-05-23 17:00           ` David S. Miller
  0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 17:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 09:34:16 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> I mentioned this 1 or 2 days ago in the TLB thread with
  DaveM> Linus, you pessimistically maintain a tiny bitmap per
  DaveM> mm_struct which keeps track of where mappings actually are.
  DaveM> You use some hash function on the virtual address to
  DaveM> determine the bit.  You clear it when the mm_struct is new,
  DaveM> and you just set bits when mappings are installed.  Very
  DaveM> simple.

  DaveM> Then all of these "walk all valid page tables" loops that
  DaveM> scan entire mostly empty pages of pgd/pmd/pte entries for no
  DaveM> reason can just check the bitmap instead.

  DaveM> Most of the exit overhead is in clear_page_tables walking
  DaveM> over entire pages.  It effectively flushes the cache unless
  DaveM> all you are doing is fork/exit/fork/exit

Sounds like something worth experimenting with.  I doubt you could
really avoid (effectively) flushing the caches, but even if there are
just a few zero bits in the bitmap at the time of the tear-down, a
fair amount of time could be saved.

	--david


* Re: Q: PREFETCH_STRIDE/16
  2002-05-23 17:00           ` David S. Miller
@ 2002-05-23 18:32             ` David Mosberger
  0 siblings, 0 replies; 8+ messages in thread
From: David Mosberger @ 2002-05-23 18:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 10:00:58 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> You'd be surprised how many 0 bits there will be in the
  DaveM> average process.  Even if you bring in all of emacs, glibc,
  DaveM> X11R6 libs etc. and the anonymous memory, there are still a
  DaveM> HUGE portion of the address space totally unused.

But don't confuse the portion of address space used with the number of
page tables in use!  Even if you use a teeny tiny portion of a 64-bit
address space, you can still have hundreds or thousands of page table
pages.

Also, we should not ignore the advantages of the current scheme:

 o It's so straight-forward, it's virtually impossible to screw it up
   (with the hashed scheme, forgetting to set a bit just once could
   lead to very difficult-to-track-down bugs; been there, done that,
   in a slightly different context, and it was ugly...).

 o Performance is very predictable (basically linear in virtual address
   space in use).  There is a danger that a hashed scheme would be
   optimized for today's workloads.  As the working sets increase over
   the years, the hashed scheme could eventually break down and the
   worst part would be that it would be very hard to notice (only
   effect is bad performance for very large tasks; few benchmarks
   would probably catch such worse-than-optimal performance).

  DaveM> But like you said, worth experimenting with :-) First test
  DaveM> would be, start with 1 unsigned long as the bitmask in
  DaveM> mm_context_t.  Just implement the bit setting part.  Then at
  DaveM> exit() count how many 0 bits are left, record this into some
  DaveM> counter table which has one counter for 0 --> N_BITS_IN_LONG.
  DaveM> Make some debug /proc thing which spits the table out.
  DaveM> (hint: at fork, clear out the child's bitmask before
  DaveM> copy_page_range is run for best results :-)

  DaveM> You can use this to do various things and see how much there
  DaveM> is to gain by going to two unsigned longs, three, etc.

  DaveM> Then you can hack up the actual clear_page_tables
  DaveM> optimization (to start) and measure the result.

Yes.  Hopefully, someone with some spare time at hand can play with
this.

	--david


end of thread, other threads:[~2002-05-23 18:32 UTC | newest]

Thread overview: 8+ messages
-- links below jump to the message on this page --
2002-05-23 15:12 Q: PREFETCH_STRIDE/16 Hugh Dickins
2002-05-23 16:26 ` David Mosberger
2002-05-23 16:21   ` David S. Miller
2002-05-23 16:43     ` David Mosberger
2002-05-23 16:34       ` David S. Miller
2002-05-23 17:02         ` David Mosberger
2002-05-23 17:00           ` David S. Miller
2002-05-23 18:32             ` David Mosberger
