* ppc44x - how do i optimize driver for tlb hits
From: Ayman El-Khashab @ 2010-09-23 15:12 UTC
To: linuxppc-dev

I've implemented a working driver on my 460EX. It allocates a couple of
buffers of 4MB each. I have a custom memcmp algorithm in asm that is
extremely fast in user space, but only half as fast when run on these
buffers.

My tests show that the algorithm is memory bandwidth bound. My guess is
that I am getting TLB or cache misses (my algo uses dcbt) that are
slowing performance. Curiously, in user space I can affect the
performance with small changes in the size of the buffer, e.g. 4MB + 32B
is fast, 4MB + 4K is much worse.

Can I adjust my driver code, which uses kmalloc, to make sure that the
ppc44x gets 4MB TLB entries for these buffers and that they stay put?

thanks
ayman
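As a rough illustration of the kind of allocation being described -- a
minimal sketch only; the names, sizes and error handling are assumptions
rather than the poster's actual driver code:

    #include <linux/slab.h>

    #define CMP_BUF_SIZE (4UL * 1024 * 1024)    /* 4MB per buffer, as above */

    static void *buf_a, *buf_b;

    static int cmpdrv_alloc_buffers(void)
    {
            /* GFP_KERNEL allocations come from lowmem, which (as the replies
             * below explain) the 44x kernel maps with bolted 256MB TLB
             * entries. */
            buf_a = kmalloc(CMP_BUF_SIZE, GFP_KERNEL);
            buf_b = kmalloc(CMP_BUF_SIZE, GFP_KERNEL);
            if (!buf_a || !buf_b) {
                    kfree(buf_a);
                    kfree(buf_b);
                    return -ENOMEM;
            }
            return 0;
    }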
* Re: ppc44x - how do i optimize driver for tlb hits
From: Benjamin Herrenschmidt @ 2010-09-23 22:01 UTC
To: Ayman El-Khashab; +Cc: linuxppc-dev

On Thu, 2010-09-23 at 10:12 -0500, Ayman El-Khashab wrote:
> I've implemented a working driver on my 460EX. It allocates a couple
> of buffers of 4MB each. I have a custom memcmp algorithm in asm that
> is extremely fast in user space, but only half as fast when run on
> these buffers.
>
> My tests show that the algorithm is memory bandwidth bound. My guess
> is that I am getting TLB or cache misses (my algo uses dcbt) that are
> slowing performance. Curiously, in user space I can affect the
> performance with small changes in the size of the buffer, e.g.
> 4MB + 32B is fast, 4MB + 4K is much worse.
>
> Can I adjust my driver code, which uses kmalloc, to make sure that
> the ppc44x gets 4MB TLB entries for these buffers and that they stay
> put?

Anything you allocate with kmalloc() is going to be mapped by bolted
256M TLB entries, so there should be no TLB misses happening in the
kernel case.

Cheers,
Ben.
* Re: ppc44x - how do i optimize driver for tlb hits
From: Ayman El-Khashab @ 2010-09-23 22:35 UTC
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

On Fri, Sep 24, 2010 at 08:01:04AM +1000, Benjamin Herrenschmidt wrote:
> Anything you allocate with kmalloc() is going to be mapped by bolted
> 256M TLB entries, so there should be no TLB misses happening in the
> kernel case.

Hi Ben, can you or somebody elaborate? I saw the pinned TLB in
44x_mmu.c. Perhaps I don't understand the code fully, but it appears to
map 256MB of "lowmem" into a pinned TLB. I am not sure what physical
address range lowmem refers to, but I assumed (possibly incorrectly)
that it is 0-256MB.

When I get the physical addresses for my buffers after kmalloc, they all
have addresses that are within my DRAM but start at about the 440MB
mark. I end up passing those physical addresses to my DMA engine.

When my compare runs, it spends a huge amount of time in the assembly
code doing memory fetches, which makes me think that there are either
tons of cache misses (despite the prefetching) or the entries have been
purged from the TLB and must be obtained again. As an experiment, I
disabled my cache prefetch code and the algo took forever. Next I
altered the asm to process the same amount of data but over a smaller
working set, so that less is fetched from main memory. That executed
very quickly. From that I drew the conclusion that the algorithm is
memory bandwidth limited.

In a standalone configuration (i.e. algorithm just using user memory,
everything else identical), the speedup is 2-3x. So the limitation is
not a hardware limit; it must be something that happens when I execute
the loads. (It is a compare algorithm, so it only does loads.)

Thanks
Ayman
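For reference, the usual way to hand a kmalloc'ed buffer to a DMA engine
is through the DMA mapping API rather than raw physical addresses, since
on the non-coherent 44x it also takes care of flushing/invalidating the
data cache around the transfer. A minimal sketch, where the device
pointer, transfer direction and error handling are assumptions:

    #include <linux/dma-mapping.h>

    static dma_addr_t map_for_dma(struct device *dev, void *buf, size_t len)
    {
            /* Flushes/invalidates the cache as needed and returns a bus
             * address the DMA engine can use. */
            dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

            if (dma_mapping_error(dev, handle))
                    return 0;

            /* ... program the DMA engine with 'handle', then call
             * dma_unmap_single() once the transfer has completed ... */
            return handle;
    }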
* Re: ppc44x - how do i optimize driver for tlb hits
From: Benjamin Herrenschmidt @ 2010-09-24 1:07 UTC
To: Ayman El-Khashab; +Cc: linuxppc-dev

On Thu, 2010-09-23 at 17:35 -0500, Ayman El-Khashab wrote:
> > Anything you allocate with kmalloc() is going to be mapped by bolted
> > 256M TLB entries, so there should be no TLB misses happening in the
> > kernel case.
>
> Hi Ben, can you or somebody elaborate? I saw the pinned TLB in
> 44x_mmu.c. Perhaps I don't understand the code fully, but it appears
> to map 256MB of "lowmem" into a pinned TLB. I am not sure what
> physical address range lowmem refers to, but I assumed (possibly
> incorrectly) that it is 0-256MB.

No. The first pinned entry (0...256M) is inserted by the asm code in
head_44x.S. The code in 44x_mmu.c will later map the rest of lowmem
(typically up to 768M but various settings can change that) using more
256M entries.

Basically, all of lowmem is permanently mapped with such entries.

> When I get the physical addresses for my buffers after kmalloc, they
> all have addresses that are within my DRAM but start at about the
> 440MB mark. I end up passing those physical addresses to my DMA
> engine.

Anything you get from kmalloc is going to come from lowmem, and thus be
covered by those bolted TLB entries.

> When my compare runs, it spends a huge amount of time in the assembly
> code doing memory fetches, which makes me think that there are either
> tons of cache misses (despite the prefetching) or the entries have
> been purged

What prefetching? I.e., the DMA operation -will- flush things out of the
cache due to the DMA being not cache coherent on 44x. The 440 also
doesn't have a working HW prefetch engine afaik (it should be disabled
in FW or early asm on 440 cores and fused out in HW on 460 cores afaik).

So only explicit SW prefetching will help.

> from the TLB and must be obtained again. As an experiment, I disabled
> my cache prefetch code and the algo took forever. Next I altered the
> asm to process the same amount of data but over a smaller working set,
> so that less is fetched from main memory. That executed very quickly.
> From that I drew the conclusion that the algorithm is memory bandwidth
> limited.

I don't know what exactly is going on; maybe your prefetch stride isn't
right for the HW setup, or something like that. You can use the xmon 'u'
command to look at the TLB contents. Check that we have the 256M entries
mapping your data; they should be there.

> In a standalone configuration (i.e. algorithm just using user memory,
> everything else identical), the speedup is 2-3x. So the limitation is
> not a hardware limit; it must be something that happens when I execute
> the loads. (It is a compare algorithm, so it only does loads.)

Cheers,
Ben.
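To make "explicit SW prefetching" concrete, here is a minimal sketch of
a compare loop that issues dcbt hints ahead of its loads. The 32-byte
stride matches the cache line size quoted later in the thread; the
prefetch distance and the plain memcmp are assumptions for illustration,
not the poster's actual algorithm:

    #include <string.h>

    #define CACHE_LINE      32
    #define PREFETCH_AHEAD  (8 * CACHE_LINE)   /* tuning guess */

    static int compare_with_prefetch(const unsigned char *a,
                                     const unsigned char *b,
                                     unsigned long len)
    {
            unsigned long i;

            for (i = 0; i < len; i += CACHE_LINE) {
                    /* Touch the lines we will need a few iterations from
                     * now. dcbt is only a hint, so prefetching a little
                     * past the end of the buffer is harmless. */
                    __asm__ __volatile__("dcbt 0,%0" : : "r"(a + i + PREFETCH_AHEAD));
                    __asm__ __volatile__("dcbt 0,%0" : : "r"(b + i + PREFETCH_AHEAD));

                    if (memcmp(a + i, b + i, CACHE_LINE) != 0)
                            return 1;   /* mismatch */
            }
            return 0;                   /* equal */
    }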
* Re: ppc44x - how do i optimize driver for tlb hits
From: Ayman El-Khashab @ 2010-09-24 2:58 UTC
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

On Fri, Sep 24, 2010 at 11:07:24AM +1000, Benjamin Herrenschmidt wrote:
> No. The first pinned entry (0...256M) is inserted by the asm code in
> head_44x.S. The code in 44x_mmu.c will later map the rest of lowmem
> (typically up to 768M but various settings can change that) using more
> 256M entries.

Thanks Ben, appreciate all your wisdom and insight.

Ok, so my 460EX board has 512MB total, so how does that figure into the
768M? Is there some other heuristic that determines how these are
mapped?

> Basically, all of lowmem is permanently mapped with such entries.
>
> Anything you get from kmalloc is going to come from lowmem, and thus
> be covered by those bolted TLB entries.

So is it reasonable to assume that everything on my system will come
from pinned TLB entries?

> What prefetching? I.e., the DMA operation -will- flush things out of
> the cache due to the DMA being not cache coherent on 44x. The 440 also
> doesn't have a working HW prefetch engine afaik (it should be disabled
> in FW or early asm on 440 cores and fused out in HW on 460 cores
> afaik).
>
> So only explicit SW prefetching will help.

The DMA is what I use in the "real world" case to get data into and out
of these buffers. However, I can disable the DMA completely and do only
the kmalloc. In this case I still see the same poor performance. My
prefetching is part of my algo, using dcbt instructions. I know the
instructions are effective because without them the algo is much slower.
So yes, my prefetches are explicit.

> I don't know what exactly is going on; maybe your prefetch stride
> isn't right for the HW setup, or something like that. You can use the
> xmon 'u' command to look at the TLB contents. Check that we have the
> 256M entries mapping your data; they should be there.

Ok, I will give that a try... In addition, is there an easy way to use
any sort of gprof-like tool to see the system performance? What about
looking at the 44x performance counters in some meaningful way? All the
experiments point to the fetching being slower in the full program than
in the algo testbench, so I want to determine what could cause that.

thanks
ayman
* Re: ppc44x - how do i optimize driver for tlb hits
From: Benjamin Herrenschmidt @ 2010-09-24 4:43 UTC
To: Ayman El-Khashab; +Cc: linuxppc-dev

> > No. The first pinned entry (0...256M) is inserted by the asm code in
> > head_44x.S. The code in 44x_mmu.c will later map the rest of lowmem
> > (typically up to 768M but various settings can change that) using
> > more 256M entries.
>
> Thanks Ben, appreciate all your wisdom and insight.
>
> Ok, so my 460EX board has 512MB total, so how does that figure into
> the 768M? Is there some other heuristic that determines how these are
> mapped?

Not really, it all fits in lowmem so it will be mapped with two pinned
256M entries.

Basically, we try to map all memory with those entries in the linear
mapping. But since we only have 1G of address space available when
PAGE_OFFSET is c0000000, and we need some of that for vmalloc, ioremap,
etc., we thus limit that mapping to 768M currently.

If you have more memory, you will see only 768M unless you use
CONFIG_HIGHMEM, which allows the kernel to exploit more physical memory.
In this case, only the first 768M are permanently mapped (and
accessible), but you can allocate pages in "highmem" which can still be
mapped into user space but need kmap/kunmap calls to be accessed by the
kernel.

However, in your case you don't need highmem; everything fits in lowmem,
so the kernel will just use 2x256M of bolted TLB entries to map that
permanently. Note also that kmalloc() always returns lowmem.

> So is it reasonable to assume that everything on my system will come
> from pinned TLB entries?

Yes.

> The DMA is what I use in the "real world" case to get data into and
> out of these buffers. However, I can disable the DMA completely and do
> only the kmalloc. In this case I still see the same poor performance.
> My prefetching is part of my algo, using dcbt instructions. I know the
> instructions are effective because without them the algo is much
> slower. So yes, my prefetches are explicit.

Could be some "effect" of the cache structure, L2 cache, cache geometry
(number of ways etc.). You might be able to alleviate that by changing
the "stride" of your prefetch.

Unfortunately, I'm not familiar enough with the 440 micro-architecture
and its caches to be able to help you much here.

> Ok, I will give that a try... In addition, is there an easy way to use
> any sort of gprof-like tool to see the system performance? What about
> looking at the 44x performance counters in some meaningful way? All
> the experiments point to the fetching being slower in the full program
> than in the algo testbench, so I want to determine what could cause
> that.

Does it have any useful performance counters? I didn't think it did but
I may be mistaken.

Cheers,
Ben.
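As a small illustration of the highmem point (hypothetical code, not
something this driver needs, since its whole 512MB is lowmem): a page
allocated from highmem has no permanent kernel mapping and must be
temporarily mapped before the kernel can touch it, unlike kmalloc'ed
lowmem which is always reachable through the bolted entries.

    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void touch_highmem_page(void)
    {
            struct page *page = alloc_page(GFP_HIGHUSER);  /* may be highmem */
            void *va;

            if (!page)
                    return;

            va = kmap(page);                /* temporary kernel mapping */
            memset(va, 0, PAGE_SIZE);
            kunmap(page);                   /* drop the mapping again */

            __free_page(page);
    }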
* Re: ppc44x - how do i optimize driver for tlb hits
From: Josh Boyer @ 2010-09-24 10:30 UTC
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Ayman El-Khashab

On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote:
>> The DMA is what I use in the "real world" case to get data into and
>> out of these buffers. However, I can disable the DMA completely and
>> do only the kmalloc. In this case I still see the same poor
>> performance. My prefetching is part of my algo, using dcbt
>> instructions. I know the instructions are effective because without
>> them the algo is much slower. So yes, my prefetches are explicit.
>
> Could be some "effect" of the cache structure, L2 cache, cache
> geometry (number of ways etc.). You might be able to alleviate that by
> changing the "stride" of your prefetch.
>
> Unfortunately, I'm not familiar enough with the 440 micro-architecture
> and its caches to be able to help you much here.

Also, doesn't kmalloc have a limit to the size of the request it will
let you allocate? I know in the distant past you could allocate 128K
with kmalloc, and 2M with an explicit call to get_free_pages. Anything
larger than that had to use vmalloc. The limit might indeed be higher
now, but a 4MB kmalloc buffer sounds very large, given that it would be
contiguous pages. Two of them even less so.

>> Ok, I will give that a try... In addition, is there an easy way to
>> use any sort of gprof-like tool to see the system performance? What
>> about looking at the 44x performance counters in some meaningful way?
>> All the experiments point to the fetching being slower in the full
>> program than in the algo testbench, so I want to determine what could
>> cause that.
>
> Does it have any useful performance counters? I didn't think it did
> but I may be mistaken.

No, it doesn't.

josh
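For what it's worth, a hedged sketch of the alternatives Josh mentions:
a physically contiguous allocation via __get_free_pages() (a 4MB request
is order 10 with 4K pages, right at the usual page-allocator limit) with
a vmalloc() fallback. The function name is made up for illustration, and
a vmalloc'ed buffer would need a scatter-gather list before a DMA engine
could use it:

    #include <linux/gfp.h>
    #include <linux/vmalloc.h>

    #define CMP_BUF_SIZE  (4UL * 1024 * 1024)
    #define CMP_BUF_ORDER 10            /* 4MB / 4KB pages = 2^10 pages */

    static void *alloc_big_buffer(void)
    {
            /* Physically contiguous; may fail if memory is fragmented or
             * the order exceeds what the page allocator allows. */
            unsigned long addr = __get_free_pages(GFP_KERNEL, CMP_BUF_ORDER);

            if (addr)
                    return (void *)addr;

            /* Virtually contiguous only -- fine for CPU access, but not
             * directly usable as a single DMA target. */
            return vmalloc(CMP_BUF_SIZE);
    }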
* Re: ppc44x - how do i optimize driver for tlb hits
From: Ayman El-Khashab @ 2010-09-24 13:08 UTC
To: Josh Boyer; +Cc: linuxppc-dev

On Fri, Sep 24, 2010 at 06:30:34AM -0400, Josh Boyer wrote:
> On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote:
> > Could be some "effect" of the cache structure, L2 cache, cache
> > geometry (number of ways etc.). You might be able to alleviate that
> > by changing the "stride" of your prefetch.

My original theory was that it was having lots of cache misses. But the
algorithm is fast standalone, even though its buffers are large enough
(4MB) that much of the cache gets flushed and replaced with my data
there too. The cache is 32K, 8-way, 32B/line, and I've crafted the
algorithm around those parameters.

> > Unfortunately, I'm not familiar enough with the 440
> > micro-architecture and its caches to be able to help you much here.
>
> Also, doesn't kmalloc have a limit to the size of the request it will
> let you allocate? I know in the distant past you could allocate 128K
> with kmalloc, and 2M with an explicit call to get_free_pages. Anything
> larger than that had to use vmalloc. The limit might indeed be higher
> now, but a 4MB kmalloc buffer sounds very large, given that it would
> be contiguous pages. Two of them even less so.

I thought so too, but at least in the current implementation we found
empirically that we could kmalloc up to, but no more than, 4MB.

We have also tried an approach using user memory, then calling
"get_user_pages" and building a scatter-gather list. We found that the
compare code doesn't perform any better.

I suppose another option is to use the kernel profiling option I always
see but have never used. Is that a viable option to figure out what is
happening here?

ayman
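A rough sketch of the get_user_pages / scatter-gather approach mentioned
above, for illustration only. The get_user_pages() signature shown is
the one from kernels of that era (it has since changed), the function
name is made up, uaddr is assumed page-aligned, and real code would also
release the pinned pages on the error path and after the DMA completes:

    #include <linux/mm.h>
    #include <linux/sched.h>
    #include <linux/scatterlist.h>

    static int pin_user_buffer(unsigned long uaddr, int nr_pages,
                               struct page **pages, struct scatterlist *sg)
    {
            int i, got;

            down_read(&current->mm->mmap_sem);
            got = get_user_pages(current, current->mm, uaddr, nr_pages,
                                 0 /* read-only: a compare only loads */,
                                 0, pages, NULL);
            up_read(&current->mm->mmap_sem);
            if (got < nr_pages)
                    return -EFAULT;

            sg_init_table(sg, nr_pages);
            for (i = 0; i < nr_pages; i++)
                    sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

            /* dma_map_sg() would then produce bus addresses for the DMA
             * engine from this list. */
            return 0;
    }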
* Re: ppc44x - how do i optimize driver for tlb hits
From: Benjamin Herrenschmidt @ 2010-09-24 22:11 UTC
To: Ayman El-Khashab; +Cc: linuxppc-dev

On Fri, 2010-09-24 at 08:08 -0500, Ayman El-Khashab wrote:
> I suppose another option is to use the kernel profiling option I
> always see but have never used. Is that a viable option to figure out
> what is happening here?

With perf and stochastic sampling? If you sample fast enough... but
you'll mostly point to your routine I suppose... though it might tell
you statistically where in your code, which -might- help.

Cheers,
Ben.
* Re: ppc44x - how do i optimize driver for tlb hits
From: Ayman El-Khashab @ 2010-10-03 19:13 UTC
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

On Sat, Sep 25, 2010 at 08:11:04AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-09-24 at 08:08 -0500, Ayman El-Khashab wrote:
> > I suppose another option is to use the kernel profiling option I
> > always see but have never used. Is that a viable option to figure
> > out what is happening here?
>
> With perf and stochastic sampling? If you sample fast enough... but
> you'll mostly point to your routine I suppose... though it might tell
> you statistically where in your code, which -might- help.

Thanks. I didn't end up profiling it because we found the biggest
culprit. Basically, we were mapping this memory in kernel space, and as
long as we did ONLY that, everything was fine. But then we would mmap
the physical addresses into user space. Using MAP_SHARED made it
extremely slow; using MAP_PRIVATE made it very fast. So it works, but
why is MAP_SHARED that much slower?

The other optimization was a change in the algorithm to take advantage
of the L2 prefetching. Since we were operating on many simultaneous
streams, it seems that the cache performance was not good.

thanks
ame
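For reference, a minimal user-space sketch of the two mappings being
compared. The device node name and offset are made up, and what each
flag actually ends up mapping depends on how the driver implements its
mmap handler (with MAP_PRIVATE, any writes would go to private
copy-on-write pages rather than the driver's buffer):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BUF_SIZE (4UL * 1024 * 1024)

    int main(void)
    {
            int fd = open("/dev/cmpdrv", O_RDWR);
            if (fd < 0)
                    return 1;

            /* The mapping the thread found to be slow ... */
            void *shared = mmap(NULL, BUF_SIZE, PROT_READ, MAP_SHARED, fd, 0);

            /* ... and the one that turned out to be fast. */
            void *priv = mmap(NULL, BUF_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);

            printf("MAP_SHARED=%p MAP_PRIVATE=%p\n", shared, priv);

            if (shared != MAP_FAILED)
                    munmap(shared, BUF_SIZE);
            if (priv != MAP_FAILED)
                    munmap(priv, BUF_SIZE);
            close(fd);
            return 0;
    }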
* Re: ppc44x - how do i optimize driver for tlb hits
From: Benjamin Herrenschmidt @ 2010-10-03 22:38 UTC
To: Ayman El-Khashab; +Cc: linuxppc-dev

On Sun, 2010-10-03 at 14:13 -0500, Ayman El-Khashab wrote:
> Thanks. I didn't end up profiling it because we found the biggest
> culprit. Basically, we were mapping this memory in kernel space, and
> as long as we did ONLY that, everything was fine. But then we would
> mmap the physical addresses into user space. Using MAP_SHARED made it
> extremely slow; using MAP_PRIVATE made it very fast. So it works, but
> why is MAP_SHARED that much slower?

I don't see any reason off hand why this would be the case. Can you
inspect the contents of the TLB with either xmon or whatever HW debugger
you may have at hand, and show me what difference there is between an
entry for your workload coming from MAP_SHARED vs. one coming from
MAP_PRIVATE?

> The other optimization was a change in the algorithm to take advantage
> of the L2 prefetching. Since we were operating on many simultaneous
> streams, it seems that the cache performance was not good.

Cheers,
Ben.