* Q: PREFETCH_STRIDE/16
@ 2002-05-23 15:12 Hugh Dickins
2002-05-23 16:26 ` David Mosberger
0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2002-05-23 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrea Arcangeli, Linus Torvalds
Could anyone please shed light on PREFETCH_STRIDE,
and in particular its sole use:
prefetchw(pmd+j+(PREFETCH_STRIDE/16));
in mm/memory.c: free_one_pgd().
That looks to me suspiciously like something inserted to suit
one particular architecture - ia64? is it really suitable for
others? is 4*L1_CACHE_SIZE really right for PREFETCH_STRIDE
on anything that prefetches except ia64? what's the "/ 16"?
shouldn't there be a "/ sizeof(pmd_t)" somewhere (PAE or not)?
is it right to prefetch each time around that loop? isn't it
appropriate only to the exit_mm (0 to TASK_SIZE) clearance?
All in all, I'm thinking that line shouldn't be there,
or not without a substantial comment...
Thanks,
Hugh
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 16:26 ` David Mosberger
@ 2002-05-23 16:21 ` David S. Miller
2002-05-23 16:43 ` David Mosberger
0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 16:21 UTC (permalink / raw)
To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds
From: David Mosberger <davidm@napali.hpl.hp.com>
Date: Thu, 23 May 2002 09:26:54 -0700
That code certainly wasn't optimized for ia64. Furthermore, I also do
not like the prefetch distance it's using. In fact, in my ia64-patch,
I use the following code instead:
prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));
This is more sensible (because it really does prefetch by the
PREFETCH_STRIDE distance) and it also happens to run (slightly) faster
on Itanium.
All of these particular prefetches are amusing, with or without your
fix, considering there are other more powerful ways to optimize this
stuff. :-)
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 15:12 Q: PREFETCH_STRIDE/16 Hugh Dickins
@ 2002-05-23 16:26 ` David Mosberger
2002-05-23 16:21 ` David S. Miller
0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 16:26 UTC (permalink / raw)
To: Hugh Dickins; +Cc: linux-kernel, Andrea Arcangeli, Linus Torvalds
>>>>> On Thu, 23 May 2002 16:12:16 +0100 (BST), Hugh Dickins <hugh@veritas.com> said:
Hugh> Could anyone please shed light on PREFETCH_STRIDE, and in
Hugh> particular its sole use:
Hugh> prefetchw(pmd+j+(PREFETCH_STRIDE/16));
Hugh> in mm/memory.c: free_one_pgd().
Hugh> That looks to me suspiciously like something inserted to suit
Hugh> one particular architecture - ia64? is it really suitable for
Hugh> others? is 4*L1_CACHE_SIZE really right for PREFETCH_STRIDE on
Hugh> anything that prefetches except ia64? what's the "/ 16"?
Hugh> shouldn't there be a "/ sizeof(pmd_t)" somewhere (PAE or not)?
Hugh> is it right to prefetch each time around that loop? isn't it
Hugh> appropriate only to the exit_mm (0 to TASK_SIZE) clearance?
Hugh> All in all, I'm thinking that line shouldn't be there, or not
Hugh> without a substantial comment...
That code certainly wasn't optimized for ia64. Furthermore, I also do
not like the prefetch distance it's using. In fact, in my ia64-patch,
I use the following code instead:
prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));
This is more sensible (because it really does prefetch by the
PREFETCH_STRIDE distance) and it also happens to run (slightly) faster
on Itanium.
--david
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 16:43 ` David Mosberger
@ 2002-05-23 16:34 ` David S. Miller
2002-05-23 17:02 ` David Mosberger
0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 16:34 UTC (permalink / raw)
To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds
From: David Mosberger <davidm@napali.hpl.hp.com>
Date: Thu, 23 May 2002 09:43:54 -0700
DaveM> All of these particular prefetches are amusing, with or
DaveM> without your fix, considering there are other more powerful
DaveM> ways to optimize this stuff. :-)
What do you have in mind?
I mentioned this 1 or 2 days ago in the TLB thread with
Linus, you pessimistically maintain a tiny bitmap per
mm_struct which keeps track of where mappings actually
are. You use some hash function on the virtual address
to determine the bit. You clear it when the mm_struct is
new, and you just set bits when mappings are installed.
Very simple.
Then all of these "walk all valid page tables" loops that scan entire
mostly empty pages of pgd/pmd/pte entries for no reason can just check
the bitmap instead.
Most of the exit overhead is in clear_page_tables walking over entire
pages. It effectively flushes the cache unless all you are doing is
fork/exit/fork/exit.
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 16:21 ` David S. Miller
@ 2002-05-23 16:43 ` David Mosberger
2002-05-23 16:34 ` David S. Miller
0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 16:43 UTC (permalink / raw)
To: David S. Miller; +Cc: davidm, hugh, linux-kernel, andrea, torvalds
>>>>> On Thu, 23 May 2002 09:21:55 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:
DaveM> All of these particular prefetches are amusing, with or
DaveM> without your fix, considering there are other more powerful
DaveM> ways to optimize this stuff. :-)
What do you have in mind?
--david
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 17:02 ` David Mosberger
@ 2002-05-23 17:00 ` David S. Miller
2002-05-23 18:32 ` David Mosberger
0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2002-05-23 17:00 UTC (permalink / raw)
To: davidm, davidm; +Cc: hugh, linux-kernel, andrea, torvalds
From: David Mosberger <davidm@napali.hpl.hp.com>
Date: Thu, 23 May 2002 10:02:33 -0700
Sounds like something worth experimenting with. I doubt you could
really avoid (effectively) flushing the caches, but even if there are
just a few zero bits in the bitmap at the time of the tear-down, a
fair amount of time could be saved.
You'd be surprised how many 0 bits there will be in the average
process. Even if you bring in all of emacs, glibc, X11R6 libs
etc. and the anonymous memory, there are still a HUGE portion of the
address space totally unused.
But like you said, worth experimenting with :-) First test would be,
start with 1 unsigned long as the bitmask in mm_context_t. Just
implement the bit setting part. Then at exit() count how many 0 bits
are left, record this into some counter table which has one counter
for 0 --> N_BITS_IN_LONG. Make some debug /proc thing which spits the
table out. (hint: at fork, clear out the child's bitmask before
copy_page_range is run for best results :-)
You can use this to do various things and see how much there is to gain
by going to two unsigned longs, three, etc.
Then you can hack up the actual clear_page_tables optimization (to
start) and measure the result.
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 16:34 ` David S. Miller
@ 2002-05-23 17:02 ` David Mosberger
2002-05-23 17:00 ` David S. Miller
0 siblings, 1 reply; 8+ messages in thread
From: David Mosberger @ 2002-05-23 17:02 UTC (permalink / raw)
To: David S. Miller; +Cc: davidm, davidm, hugh, linux-kernel, andrea, torvalds
>>>>> On Thu, 23 May 2002 09:34:16 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:
DaveM> I mentioned this 1 or 2 days ago in the TLB thread with
DaveM> Linus, you pessimistically maintain a tiny bitmap per
DaveM> mm_struct which keeps track of where mappings actually are.
DaveM> You use some hash function on the virtual address to
DaveM> determine the bit. You clear it when the mm_struct is new,
DaveM> and you just set bits when mappings are installed. Very
DaveM> simple.
DaveM> Then all of these "walk all valid page tables" loops that
DaveM> scan entire mostly empty pages of pgd/pmd/pte entries for no
DaveM> reason can just check the bitmap instead.
DaveM> Most of the exit overhead is in clear_page_tables walking
DaveM> over entire pages. It effectively flushes the cache unless
DaveM> all you are doing is fork/exit/fork/exit
Sounds like something worth experimenting with. I doubt you could
really avoid (effectively) flushing the caches, but even if there are
just a few zero bits in the bitmap at the time of the tear-down, a
fair amount of time could be saved.
--david
* Re: Q: PREFETCH_STRIDE/16
2002-05-23 17:00 ` David S. Miller
@ 2002-05-23 18:32 ` David Mosberger
0 siblings, 0 replies; 8+ messages in thread
From: David Mosberger @ 2002-05-23 18:32 UTC (permalink / raw)
To: David S. Miller; +Cc: davidm, davidm, hugh, linux-kernel, andrea, torvalds
>>>>> On Thu, 23 May 2002 10:00:58 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:
DaveM> You'd be surprised how many 0 bits there will be in the
DaveM> average process. Even if you bring in all of emacs, glibc,
DaveM> X11R6 libs etc. and the anonymous memory, there are still a
DaveM> HUGE portion of the address space totally unused.
But don't confuse the portion of address space used with the number of
page tables in use! Even if you use a teeny tiny portion of a 64-bit
address space, you can still have hundreds or thousands of page table
pages.
Also, we should not ignore the advantages of the current scheme:
o It's so straightforward, it's virtually impossible to screw it up
(with the hashed scheme, forgetting to set a bit just once could
lead to very difficult-to-track-down bugs; been there, done that,
in a slightly different context, and it was ugly...).
o Performance is very predictable (basically linear in virtual address
space in use). There is a danger that a hashed scheme would be
optimized for today's workloads. As the working sets increase over
the years, the hashed scheme could eventually break down and the
worst part would be that it would be very hard to notice (only
effect is bad performance for very large tasks; few benchmarks
would probably catch such worse-than-optimal performance).
DaveM> But like you said, worth experimenting with :-) First test
DaveM> would be, start with 1 unsigned long as the bitmask in
DaveM> mm_context_t. Just implement the bit setting part. Then at
DaveM> exit() count how many 0 bits are left, record this into some
DaveM> counter table which has one counter for 0 --> N_BITS_IN_LONG.
DaveM> Make some debug /proc thing which spits the table out.
DaveM> (hint: at fork, clear out the child's bitmask before
DaveM> copy_page_range is run for best results :-)
DaveM> You can use this to do various things and see how much there
DaveM> is to gain by going to two unsigned longs, three, etc.
DaveM> Then you can hack up the actual clear_page_tables
DaveM> optimization (to start) and measure the result.
Yes. Hopefully, someone with some spare time at hand can play with
this.
--david
Thread overview: 8+ messages
2002-05-23 15:12 Q: PREFETCH_STRIDE/16 Hugh Dickins
2002-05-23 16:26 ` David Mosberger
2002-05-23 16:21 ` David S. Miller
2002-05-23 16:43 ` David Mosberger
2002-05-23 16:34 ` David S. Miller
2002-05-23 17:02 ` David Mosberger
2002-05-23 17:00 ` David S. Miller
2002-05-23 18:32 ` David Mosberger