linuxppc-dev.lists.ozlabs.org archive mirror
From: Ayman El-Khashab <ayman@elkhashab.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@ozlabs.org
Subject: Re: ppc44x - how do i optimize driver for tlb hits
Date: Thu, 23 Sep 2010 21:58:50 -0500	[thread overview]
Message-ID: <20100924025849.GA5619@crust.elkhashab.com> (raw)
In-Reply-To: <1285290444.14081.6.camel@pasglop>

On Fri, Sep 24, 2010 at 11:07:24AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-09-23 at 17:35 -0500, Ayman El-Khashab wrote:
> > > Anything you allocate with kmalloc() is going to be mapped by bolted
> > > 256M TLB entries, so there should be no TLB misses happening in the
> > > kernel case.
> > > 
> > 
> > Hi Ben, can you or somebody elaborate?  I saw the pinned tlb in
> > 44x_mmu.c.
> > Perhaps I don't understand the code fully, but it appears to map 256MB
> > of "lowmem" into a pinned tlb.  I am not sure what phys address lowmem
> > means, but I assumed (possibly incorrectly) that it is 0-256MB. 
> 
> No. The first pinned entry (0...256M) is inserted by the asm code in
> head_44x.S. The code in 44x_mmu.c will later map the rest of lowmem
> (typically up to 768M but various settings can change that) using more
> 256M entries.

Thanks Ben, appreciate all your wisdom and insight.

Ok, so my 460ex board has 512MB total, so how does that figure into 
the 768M?  Is there some other heuristic that determines how these
are mapped? 

> Basically, all of lowmem is permanently mapped with such entries. 
> 
> > When I get the physical addresses for my buffers after kmalloc, they
> > all have addresses that are within my DRAM but start at about the
> > 440MB mark. I end up passing those phys addresses to my DMA engine.
> 
> Anything you get from kmalloc is going to come from lowmem, and thus be
> covered by those bolted TLB entries.

So is it reasonable to assume that everything on my system will come from
pinned TLB entries?

> 
> > When my compare runs it takes a huge amount of time in the assembly
> > code doing memory fetches which makes me think that there are either
> > tons of cache misses (despite the prefetching) or the entries have
> > been purged
> 
> What prefetching ? IE. The DMA operation -will- flush things out of the
> cache due to the DMA being not cache coherent on 44x. The 440 also
> doesn't have a working HW prefetch engine afaik (it should be disabled
> in FW or early asm on 440 cores and fused out in HW on 460 cores afaik).
>
> So only explicit SW prefetching will help.
> 

The DMA is what I use in the "real world case" to get data into and out 
of these buffers.  However, I can disable the DMA completely and do only
the kmalloc.  In this case I still see the same poor performance.  My
prefetching is part of my algo using the dcbt instructions.  I know the
instructions are effective b/c without them the algo is much less 
performant.  So yes, my prefetches are explicit.

> > from the TLB and must be obtained again.  As an experiment, I disabled
> > my cache prefetch code and the algo took forever.  Next I altered the
> > asm to do the same amount of data but a smaller amount over and over 
> > so that less is fetched from main memory.  That executed very quickly.
> > From that I drew the conclusion that the algorithm is memory
> > bandwidth limited.
> 
> I don't know what exactly is going on, maybe your prefetch stride isn't
> right for the HW setup, or something like that. You can use xmon 'u'
> command to look at the TLB content. Check that we have the 256M entries
> mapping your data, they should be there.

Ok, I will give that a try ... in addition, is there an easy way to use
any sort of gprof-like tool to see the system performance?  What about
looking at the 44x performance counters in some meaningful way?  All
the experiments point to the fetching being slower in the full program
as opposed to the algo in a testbench, so I want to determine what it is
that could cause that.

thanks
ayman

Thread overview: 11+ messages
2010-09-23 15:12 ppc44x - how do i optimize driver for tlb hits Ayman El-Khashab
2010-09-23 22:01 ` Benjamin Herrenschmidt
2010-09-23 22:35   ` Ayman El-Khashab
2010-09-24  1:07     ` Benjamin Herrenschmidt
2010-09-24  2:58       ` Ayman El-Khashab [this message]
2010-09-24  4:43         ` Benjamin Herrenschmidt
2010-09-24 10:30           ` Josh Boyer
2010-09-24 13:08             ` Ayman El-Khashab
2010-09-24 22:11               ` Benjamin Herrenschmidt
2010-10-03 19:13                 ` Ayman El-Khashab
2010-10-03 22:38                   ` Benjamin Herrenschmidt
