From: Tom Evans <tom_usenet@optusnet.com.au>
To: xenomai@xenomai.org
Subject: Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
Date: Fri, 17 Oct 2014 10:14:44 +1100
Message-ID: <54405164.3090205@optusnet.com.au>
In-Reply-To: <20141016205621.GC24110@csclub.uwaterloo.ca>

On 17/10/14 07:56, Lennart Sorensen wrote:
> On Thu, Oct 16, 2014 at 08:58:19PM +0200, Gilles Chanteperdrix wrote:
>> ... After implementing a routine to average pixels
>> from a bayer pattern on cortex A8 (where I could use NEON) I got a
>> factor gain of 2 or 3, far from what could have been expected from
>> processing 16 pixels at once,

How big is your data set? You are probably overflowing the L2 cache.

Work out how many pixels per second you're processing and then compare it to 
the memory bandwidth. You may be surprised at how slow the memory system is.

Download, compile and run this program:

http://www.cwi.nl/~manegold/Calibrator/

   root@triton1:/tmp# nice --20 ./calibrator 800 1700k report

   caches:
   level  size    linesize   miss-latency        replace-time
     1     32 KB  128 bytes   12.70 ns =  10 cy   13.40 ns =  11 cy
     2    256 KB   64 bytes  191.21 ns = 153 cy  194.37 ns = 155 cy

   TLBs:
   level #entries  pagesize  miss-latency
     1       32       4 KB    57.65 ns =  46 cy

Miss L1 and wait 10 clocks. Miss L2 and wait 153 clocks! Step through memory 
4k at a time and wait 46 clocks for the TLB to reload.
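
The nanosecond and cycle columns in the report can be cross-checked with a small conversion. This is only a sketch: it assumes the "800" passed to calibrator is the CPU clock in MHz, which the ns/cycle pairs in the report are consistent with, and the helper name ns_to_cycles is made up for illustration.

```c
#include <assert.h>

/* Convert a latency in nanoseconds to CPU cycles, rounded to nearest.
 * clock_mhz is an assumption: 800 is taken from the calibrator command
 * line above and treated as the CPU clock in MHz. */
static long ns_to_cycles(double latency_ns, double clock_mhz)
{
    assert(latency_ns >= 0.0 && clock_mhz > 0.0);
    /* ns * (cycles per microsecond) / 1000 = cycles */
    return (long)(latency_ns * clock_mhz / 1000.0 + 0.5);
}
```

Feeding it the L2 miss latency, ns_to_cycles(191.21, 800.0), reproduces the 153 cycles in the table.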

>> and I got a biggest gain by inserting the non-NEON "pld"
>> instruction at key points (which I could do in the non NEON
>> code as well).

With a 153-clock latency on an L2 miss, PLD will have a large effect if you 
can issue it early enough. You should preload multiple cache lines ahead 
and not just a few words.
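
A minimal sketch of preloading several cache lines ahead from C, using the GCC/Clang builtin __builtin_prefetch (on ARM targets this is typically emitted as PLD). The 64-byte line size is taken from the L2 row of the report; the distance of 4 lines and the function name are assumptions for illustration, and the right distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* assumption: L2 line size from the report */
#define PREFETCH_LINES   4   /* assumption: tune this by measurement */

/* Sum a byte buffer, issuing a prefetch a few cache lines ahead of the
 * byte currently being read. */
static uint32_t sum_with_prefetch(const uint8_t *src, size_t len)
{
    const size_t ahead = (size_t)PREFETCH_LINES * CACHE_LINE_BYTES;
    uint32_t sum = 0;

    assert(src != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* Once every CACHE_LINE_BYTES bytes, request data further on,
         * but never form an address past the end of the buffer. */
        if (i % CACHE_LINE_BYTES == 0 && i + ahead < len)
            __builtin_prefetch(src + i + ahead, 0, 0);
        sum += src[i];
    }
    return sum;
}
```

The prefetch only changes timing, not the result, so the function returns the same sum with or without it.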

>> I also do not really understand how NEON accelerates memcpy,
>> why is a NEON multiple registers load/store faster than
>> ldm/stm, is not it a problem in ldm/stm rather than a
>> virtue of NEON?

The following should be a good reference, but doesn't answer this question. It 
says there is no difference, but that's not what we're seeing.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/kihAsZfdS5wTMO.html

The faster Neon copy indicates a problem with the ARM architecture itself. 
Whenever the ARM CPU performs a memcpy(), the sequence is (read(src); 
read(dst); write(dst)). The cache design means that the destination cache line 
is READ before being written, so the memcpy() speed is 1/3 of the basic memory 
speed.

The PPC architecture provides DCBZ and friends. During a memcpy() you perform 
a DCBZ on the destination which is a "promise" to the CPU that you're going to 
write the entire cache line so it doesn't have to be read first.

Neon performs the operations a cache line at a time and gets rid of the 
redundant read operation, so it runs faster by 3/2. The previous link implies 
this might require the correct CPU configuration (Neon bypassing L1).
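
The 1/3 and 3/2 figures follow from counting cache-line transfers per line copied: three (read source, read destination, write destination) against two once the destination read is gone. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>

/* Copy throughput as a fraction of raw memory speed, given how many
 * cache-line transfers the memory system performs per line copied. */
static double copy_fraction(int transfers_per_line)
{
    assert(transfers_per_line > 0);
    return 1.0 / transfers_per_line;
}

/* Speedup from going from `before` to `after` transfers per line. */
static double copy_speedup(int before, int after)
{
    assert(before > 0 && after > 0);
    return (double)before / (double)after;
}
```

With these, copy_fraction(3) gives the one-third rate and copy_speedup(3, 2) gives the factor of 3/2.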

>> All this to say, is NEON that useful?

We're performing alpha blending with 32-bit pixels and our Neon code is able 
to do that at the same speed as a CPU-driven memcpy(). It is a lot faster than 
my poor attempts at alpha-blending 4 bytes per pixel in C. Our Neon memcpy() 
(copying 800x480 32-bit pixels at 20Hz to /dev/fb0) is 50% faster than the 
alternative.
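
To compare a workload like this with the memory bandwidth, first work out the payload rate. A sketch for the frame copy described above; the function name is made up, and the result counts only the bytes written to the framebuffer, not any extra source or destination reads.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bytes per second for a stream of fixed-size frames. */
static uint64_t frame_bytes_per_second(uint32_t width, uint32_t height,
                                       uint32_t bytes_per_pixel,
                                       uint32_t frames_per_second)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel * frames_per_second;
}
```

For 800x480 pixels at 4 bytes each and 20 Hz, frame_bytes_per_second(800, 480, 4, 20) comes to 30,720,000 bytes per second.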

I also had a case, when performing an affine transform (rotation), where 
the speed dropped to 29% at a specific rotation angle because a 300-pixel 
"walk" through memory aliased in the caches. That was fixed by performing the 
transform in 48 by 48 pixel "tiles".
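
A sketch of the tiling idea, using a fixed 90-degree clockwise rotation as a stand-in for the general affine transform (this is an assumed simplification, not the code described above). The point is the loop structure: the outer loops walk 48 by 48 tiles and the inner loops stay inside one tile, so each tile's source and destination pixels span a limited range of addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 48  /* tile edge in pixels, the size quoted above */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Rotate a w x h image of 32-bit pixels 90 degrees clockwise into dst,
 * which is h pixels wide and w pixels tall, one tile at a time. */
static void rotate90_tiled(const uint32_t *src, uint32_t *dst,
                           size_t w, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t ty = 0; ty < h; ty += TILE) {
        for (size_t tx = 0; tx < w; tx += TILE) {
            /* Clip the tile at the right and bottom image edges. */
            size_t y_end = min_size(ty + TILE, h);
            size_t x_end = min_size(tx + TILE, w);
            for (size_t y = ty; y < y_end; y++)
                for (size_t x = tx; x < x_end; x++)
                    dst[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```

The output is identical to an untiled rotation; only the order in which memory is visited changes.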

> Well the vfp on the Cortex-A8 is rather slow,

10 CPU clocks per instruction instead of 1 clock on the other chips.

Tom


