* Q: PREFETCH_STRIDE/16
From: Hugh Dickins @ 2002-05-23 15:12 UTC
To: linux-kernel; +Cc: Andrea Arcangeli, Linus Torvalds

Could anyone please shed light on PREFETCH_STRIDE, and in particular
its sole use:

	prefetchw(pmd+j+(PREFETCH_STRIDE/16));

in mm/memory.c: free_one_pgd().

That looks to me suspiciously like something inserted to suit one
particular architecture - ia64?  is it really suitable for others?
is 4*L1_CACHE_SIZE really right for PREFETCH_STRIDE on anything that
prefetches except ia64?  what's the "/ 16"?  shouldn't there be a
"/ sizeof(pmd_t)" somewhere (PAE or not)?  is it right to prefetch
each time around that loop?  isn't it appropriate only to the exit_mm
(0 to TASK_SIZE) clearance?

All in all, I'm thinking that line shouldn't be there, or not without
a substantial comment...

Thanks,
Hugh
* Re: Q: PREFETCH_STRIDE/16
From: David Mosberger @ 2002-05-23 16:26 UTC
To: Hugh Dickins; +Cc: linux-kernel, Andrea Arcangeli, Linus Torvalds

>>>>> On Thu, 23 May 2002 16:12:16 +0100 (BST), Hugh Dickins <hugh@veritas.com> said:

  Hugh> Could anyone please shed light on PREFETCH_STRIDE, and in
  Hugh> particular its sole use:
  Hugh> prefetchw(pmd+j+(PREFETCH_STRIDE/16)); in mm/memory.c:
  Hugh> free_one_pgd().
  Hugh> [...]
  Hugh> All in all, I'm thinking that line shouldn't be there, or not
  Hugh> without a substantial comment...

That code certainly wasn't optimized for ia64.  Furthermore, I also
do not like the prefetch distance it's using.  In fact, in my
ia64 patch, I use the following code instead:

	prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));

This is more sensible (because it really does prefetch by the
PREFETCH_STRIDE distance) and it also happens to run (slightly)
faster on Itanium.

	--david
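[A userspace sketch of the arithmetic behind Mosberger's fix.  The
L1_CACHE_BYTES value and the `lookahead_entries` helper are purely
illustrative, not kernel code; the entry size matches the 8-byte pmd_t
of a 64-bit port.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values: a 128-byte cache line, prefetching four lines
 * ahead, in the spirit of a "4 * L1_CACHE_BYTES" definition. */
#define L1_CACHE_BYTES	128
#define PREFETCH_STRIDE	(4 * L1_CACHE_BYTES)	/* 512 bytes ahead */

typedef unsigned long long pmd_t;	/* an 8-byte entry, as on a 64-bit port */

/* Number of pmd entries to look ahead so that the prefetch really
 * lands PREFETCH_STRIDE bytes beyond the current entry. */
static size_t lookahead_entries(size_t entry_size)
{
	return PREFETCH_STRIDE / entry_size;
}
```

With 8-byte entries, dividing by 16 looks only 32 entries (256 bytes)
ahead, i.e. half the intended distance; dividing by sizeof(*pmd) gives
64 entries, which is exactly PREFETCH_STRIDE bytes.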
* Re: Q: PREFETCH_STRIDE/16
From: David S. Miller @ 2002-05-23 16:21 UTC
To: davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 09:26:54 -0700

   That code certainly wasn't optimized for ia64.  Furthermore, I
   also do not like the prefetch distance it's using.  In fact, in my
   ia64 patch, I use the following code instead:

	prefetchw(pmd + j + PREFETCH_STRIDE/sizeof(*pmd));

   [...]

All of these particular prefetches are amusing, with or without your
fix, considering there are other more powerful ways to optimize this
stuff. :-)
* Re: Q: PREFETCH_STRIDE/16
From: David Mosberger @ 2002-05-23 16:43 UTC
To: David S. Miller; +Cc: davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 09:21:55 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> All of these particular prefetches are amusing, with or
  DaveM> without your fix, considering there are other more powerful
  DaveM> ways to optimize this stuff. :-)

What do you have in mind?

	--david
* Re: Q: PREFETCH_STRIDE/16
From: David S. Miller @ 2002-05-23 16:34 UTC
To: davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 09:43:54 -0700

   What do you have in mind?

I mentioned this 1 or 2 days ago in the TLB thread with Linus: you
pessimistically maintain a tiny bitmap per mm_struct which keeps
track of where mappings actually are.  You use some hash function on
the virtual address to determine the bit.  You clear it when the
mm_struct is new, and you just set bits when mappings are installed.
Very simple.

Then all of these "walk all valid page tables" loops that scan
entire, mostly empty pages of pgd/pmd/pte entries for no reason can
just check the bitmap instead.

Most of the exit overhead is in clear_page_tables walking over entire
pages.  It effectively flushes the cache unless all you are doing is
fork/exit/fork/exit.
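[A rough userspace sketch of the scheme DaveM describes.  All names
here - `mm_map_init`, `mm_map_note`, `addr_to_bit`, the one-bit-per-
chunk "hash" - are invented for illustration; the kernel has no such
API, and a 64-bit word stands in for the mm_context_t bitmask.]

```c
#include <assert.h>

#define MAP_BITS  64			/* one 64-bit word as the whole bitmap */
#define TASK_SIZE (1ULL << 47)		/* illustrative 64-bit user address-space size */

/* The simplest possible "hash function on the virtual address":
 * one bit per equal-sized chunk of the address space. */
static unsigned int addr_to_bit(unsigned long long vaddr)
{
	return (unsigned int)(vaddr / (TASK_SIZE / MAP_BITS));
}

/* Cleared when the mm_struct is new. */
static void mm_map_init(unsigned long long *map)
{
	*map = 0;
}

/* Set whenever a mapping is installed at vaddr. */
static void mm_map_note(unsigned long long *map, unsigned long long vaddr)
{
	*map |= 1ULL << addr_to_bit(vaddr);
}

/* Checked at teardown: a clear bit means the whole chunk of address
 * space can be skipped without scanning its pgd/pmd/pte pages. */
static int mm_map_maybe_mapped(unsigned long long map, unsigned long long vaddr)
{
	return (int)((map >> addr_to_bit(vaddr)) & 1);
}
```

A "walk all valid page tables" loop would then test one bit per chunk
and skip straight past the regions that were never touched, instead of
scanning their empty entries.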
* Re: Q: PREFETCH_STRIDE/16
From: David Mosberger @ 2002-05-23 17:02 UTC
To: David S. Miller; +Cc: davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 09:34:16 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> I mentioned this 1 or 2 days ago in the TLB thread with
  DaveM> Linus, you pessimistically maintain a tiny bitmap per
  DaveM> mm_struct which keeps track of where mappings actually are.
  DaveM> [...]
  DaveM> Most of the exit overhead is in clear_page_tables walking
  DaveM> over entire pages.  It effectively flushes the cache unless
  DaveM> all you are doing is fork/exit/fork/exit

Sounds like something worth experimenting with.  I doubt you could
really avoid (effectively) flushing the caches, but even if there are
just a few zero bits in the bitmap at the time of the tear-down, a
fair amount of time could be saved.

	--david
* Re: Q: PREFETCH_STRIDE/16
From: David S. Miller @ 2002-05-23 17:00 UTC
To: davidm; +Cc: hugh, linux-kernel, andrea, torvalds

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 23 May 2002 10:02:33 -0700

   Sounds like something worth experimenting with.  I doubt you could
   really avoid (effectively) flushing the caches, but even if there
   are just a few zero bits in the bitmap at the time of the
   tear-down, a fair amount of time could be saved.

You'd be surprised how many 0 bits there will be in the average
process.  Even if you bring in all of emacs, glibc, X11R6 libs etc.
and the anonymous memory, there is still a HUGE portion of the
address space totally unused.

But like you said, worth experimenting with :-)  First test would be:
start with 1 unsigned long as the bitmask in mm_context_t.  Just
implement the bit setting part.  Then at exit() count how many 0 bits
are left, and record this into some counter table which has one
counter for each value from 0 to N_BITS_IN_LONG.

Make some debug /proc thing which spits the table out.  (Hint: at
fork, clear out the child's bitmask before copy_page_range is run for
best results. :-)

You can use this to do various things and see how much there is to
gain by going to two unsigned longs, three, etc.

Then you can hack up the actual clear_page_tables optimization (to
start) and measure the result.
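[The exit-time statistic DaveM suggests boils down to counting the
bits that were never set.  A trivial sketch - the /proc counter table
and the mm_context_t hookup are not shown, and `count_zero_bits` is a
made-up name; a 64-bit word again stands in for the bitmask.]

```c
#include <assert.h>

#define MAP_BITS 64	/* one 64-bit bitmask word */

/* How many chunks of the address space were never mapped: each clear
 * bit is a span of page tables that teardown never needs to scan. */
static unsigned int count_zero_bits(unsigned long long map)
{
	unsigned int i, zeros = 0;

	for (i = 0; i < MAP_BITS; i++)
		if (!((map >> i) & 1))
			zeros++;
	return zeros;
}
```

At exit() one would bump counter_table[count_zero_bits(mask)] and dump
the table through the debug /proc file, to see whether one word is
enough or two or three would pay off.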
* Re: Q: PREFETCH_STRIDE/16
From: David Mosberger @ 2002-05-23 18:32 UTC
To: David S. Miller; +Cc: davidm, hugh, linux-kernel, andrea, torvalds

>>>>> On Thu, 23 May 2002 10:00:58 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> You'd be surprised how many 0 bits there will be in the
  DaveM> average process.  Even if you bring in all of emacs, glibc,
  DaveM> X11R6 libs etc. and the anonymous memory, there are still a
  DaveM> HUGE portion of the address space totally unused.

But don't confuse the portion of address space used with the number
of page tables in use!  Even if you use a teeny tiny portion of a
64-bit address space, you can still have hundreds or thousands of
page-table pages.

Also, we should not ignore the advantages of the current scheme:

 o It's so straight-forward, it's virtually impossible to screw it up
   (with the hashed scheme, forgetting to set a bit just once could
   lead to very difficult-to-track-down bugs; been there, done that,
   in a slightly different context, and it was ugly...).

 o Performance is very predictable (basically linear in the virtual
   address space in use).

There is a danger that a hashed scheme would be optimized for today's
workloads.  As working sets increase over the years, the hashed
scheme could eventually break down, and the worst part would be that
it would be very hard to notice (the only effect is bad performance
for very large tasks; few benchmarks would probably catch such
worse-than-optimal performance).

  DaveM> But like you said, worth experimenting with :-)  First test
  DaveM> would be, start with 1 unsigned long as the bitmask in
  DaveM> mm_context_t.  Just implement the bit setting part.
  DaveM> [...]
  DaveM> Then you can hack up the actual clear_page_tables
  DaveM> optimization (to start) and measure the result.

Yes.  Hopefully, someone with some spare time at hand can play with
this.

	--david
End of thread, other threads: [~2002-05-23 18:32 UTC | newest]

Thread overview: 8+ messages
2002-05-23 15:12 Q: PREFETCH_STRIDE/16  Hugh Dickins
2002-05-23 16:26 ` David Mosberger
2002-05-23 16:21 ` David S. Miller
2002-05-23 16:43 ` David Mosberger
2002-05-23 16:34 ` David S. Miller
2002-05-23 17:02 ` David Mosberger
2002-05-23 17:00 ` David S. Miller
2002-05-23 18:32 ` David Mosberger