Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.

From: Tom Evans <tom_usenet@optusnet.com.au>
To: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
Cc: xenomai@xenomai.org
Subject: Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
Date: Sat, 18 Oct 2014 01:08:57 +1100
Message-ID: <544122F9.9060203@optusnet.com.au>
In-Reply-To: <20141017070201.GI30661@sisyphus.hd.free.fr>

On 17/10/2014 6:02 PM, Gilles Chanteperdrix wrote:
> On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
>> On 17/10/14 16:34, Gilles Chanteperdrix wrote:
>>>> Work out how many pixels per second you're processing and then
>>>> compare it to the memory bandwidth.

That would still be an interesting number to measure and quote.
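
A minimal sketch of that comparison, assuming each pixel is touched once per frame. The function names and the frame geometry in the usage note are hypothetical and not taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second moved if every pixel of every frame is touched once. */
static uint64_t pixel_bytes_per_second(uint32_t width, uint32_t height,
                                       uint32_t bytes_per_pixel,
                                       uint32_t frames_per_second)
{
    return (uint64_t)width * height * bytes_per_pixel * frames_per_second;
}

/* Fraction of a measured memory bandwidth consumed by that traffic. */
static double bandwidth_fraction(uint64_t traffic_bytes_per_s,
                                 uint64_t memory_bytes_per_s)
{
    assert(memory_bytes_per_s != 0);
    return (double)traffic_bytes_per_s / (double)memory_bytes_per_s;
}
```

For example, a hypothetical 640x480 stream at 2 bytes per pixel and 30 frames per second is 18432000 bytes/s, about a third of a 55MB/s copy rate.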

>> It might be better to FLUSH the entire cache, perform a L2-sized
>> transfer and then flush it again. The flushes *might* be to linear
>> addresses in open pages.

Thinking more about this, it would be better to flush the entire cache, 
then perform a preload-and-read pass (to load the cache from complete 
open rows in on-page RAM), then loop reading and writing (from cache to 
cache), and finally loop flushing the destination cache lines into open 
rows of the RAM.

This would be easy on the PPC. It has six "User level cache 
instructions". Even the Coldfire has a "CPUSHL" user-mode instruction.

This seems to be impossible from user space on the ARM: as far as I can 
tell, all 13 of the "Cache and branch predictor maintenance operations, 
VMSA" instructions "can be executed only by software executing at PL1 
or higher". The only user-space ones are PLD, PLDW and PLI, so I'd have 
to write a kernel driver to copy user memory and worry about the page 
translation.
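
A rough sketch of the part that can be done from user space: the preload pass followed by the copy. It uses the GCC/Clang builtin `__builtin_prefetch`, which on ARM is typically compiled to PLD, but that depends on the compiler and target. The line size and chunk size are assumptions to be tuned for the core, the function name is hypothetical, and the flush steps are left out because there is no user-mode instruction for them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PRELOAD_LINE_BYTES 32    /* assumption: cache line size of the target */
#define PRELOAD_CHUNK_BYTES 4096 /* assumption: small enough to stay cached */

/* Copy n bytes, hinting each source chunk into the cache before copying it. */
static void copy_with_preload(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    while (n > 0) {
        size_t chunk = n < PRELOAD_CHUNK_BYTES ? n : PRELOAD_CHUNK_BYTES;

        /* Preload pass: one read hint per cache line of this chunk. */
        for (size_t off = 0; off < chunk; off += PRELOAD_LINE_BYTES)
            __builtin_prefetch(s + off, 0, 3);

        /* Copy pass: the source lines should now be in, or on their way to, the cache. */
        memcpy(d, s, chunk);

        d += chunk;
        s += chunk;
        n -= chunk;
    }
}
```

The prefetch is only a hint, so the result is identical to a plain memcpy() and any gain has to be measured on the actual hardware.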

>> I got my fastest memcpy() speed on an MCF5329 by reading 2k to the
>> stack (in static ram in the CPU) and then writing that back out.
>> Copying twice was a LOT faster than any other method.

That was a 240MHz Coldfire with 80MHz 32-bit SDR memory. It started out 
at 33MB/s, got to 39MB/s by using the multiple-register move instruction 
(movem.l), and peaked at 55MB/s copying via the stack in internal SRAM. 
memcpy() from internal RAM to internal RAM managed 304MB/s!

RAM could be read at 87MB/s (due to the lack of pipelining in this CPU) 
but could be written at 207MB/s.
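
The copy-via-the-stack idea can be sketched in portable C as below. This is an illustration only, not the code that produced the numbers in this message: the measured version used movem.l, whereas this sketch stages through a 2k local buffer with plain memcpy() calls, and the function name is hypothetical. It only pays off if the stack sits in fast on-chip SRAM, as it did on the MCF5329.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_BYTES 2048 /* 2k staging buffer, as described in the post */

/* Copy n bytes by staging each block in a local (stack) buffer. */
static void memcpy_via_stack(void *dst, const void *src, size_t n)
{
    unsigned char bounce[BOUNCE_BYTES]; /* assumed to live in on-chip SRAM */
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    while (n > 0) {
        size_t chunk = n < BOUNCE_BYTES ? n : BOUNCE_BYTES;

        memcpy(bounce, s, chunk); /* external RAM is only read here */
        memcpy(d, bounce, chunk); /* external RAM is only written here */

        d += chunk;
        s += chunk;
        n -= chunk;
    }
}
```

Each pass touches external RAM in one direction only, which matches the separate read and write figures quoted above.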

Function             kB/s   Memclk/cache line
=============================================
memcpy_gcc_4_4       30883  41.45
memcpy_gcc_4_3_O1    33382  38.34
memcpy_gcc_4_3_O2    33385  38.34
memcpy_gcc_2         33390  38.33

memcpy(132096)       33379  38.35
memcpy_moveml        39752  32.20
memcpy_dma           43709  29.28
memcpy_moveml_32     49618  25.80
memcpy_stack         52912  24.19
memcpy_moveml_192    54052  23.68
memcpy_moveml_48     54093  23.66
memcpy_stack_48      54997  23.27
memcpy_stack_32_mis  55079  23.24
memcpy_stack_32      55125  23.22
memcpy_stack_192     55736  22.97
memcpy_moveml_96_ps  56739  22.56
memRead_stack_32     85017  15.06
memRead_moveml_32    87141  14.69
memWrite_stack_32   196864   6.50
memWrite_moveml_32  207535   6.17
memcpy_stack_stack  304368   4.21 (12.62 CPU clocks)
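
For anyone reproducing the second column, here is a small sketch of the arithmetic. The 80MHz memory clock and the 240MHz CPU clock come from the text above. The 16-byte cache line and kB meaning 1000 bytes are assumptions inferred from the fact that they reproduce the table's figures, and the function names are hypothetical.

```c
#include <assert.h>

#define MEM_CLOCK_HZ 80000000.0   /* 80MHz SDR memory clock, from the post */
#define TABLE_LINE_BYTES 16.0     /* assumption: inferred from the table */
#define CPU_CLOCKS_PER_MEMCLK 3.0 /* 240MHz CPU / 80MHz memory */

/* Memory clocks spent per cache line at a given throughput in kB/s. */
static double memclk_per_cache_line(double kb_per_s)
{
    assert(kb_per_s > 0.0);
    double lines_per_s = kb_per_s * 1000.0 / TABLE_LINE_BYTES;
    return MEM_CLOCK_HZ / lines_per_s;
}

/* The same cost expressed in CPU clocks. */
static double cpu_clocks_per_cache_line(double kb_per_s)
{
    return memclk_per_cache_line(kb_per_s) * CPU_CLOCKS_PER_MEMCLK;
}
```

With these assumptions, 30883 kB/s works out to about 41.45 memory clocks per line, and 304368 kB/s to about 4.21 memory clocks or 12.62 CPU clocks, in line with the first and last rows.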

> Actually, I am wrong, I was only reading the image,
> not writing to it, simply computing a very reduced
> averaged image, so there were some writes from time
> to time, but very rarely.

That should get the best performance. It's not a common operation in 
what I'm doing, which is alpha-blending graphics over each other.

>> Miss L1 and wait 10 clocks. Miss L2 and wait 153 clocks! Step
>> through memory 4k at a time and wait 46 clocks for the TLB to
>> reload.
>
> That does not prove that the memory system is slow, that
> proves that the processor access to memory is slow. But
> why is that?

The memory controller may not be capable of keeping multiple banks open, 
or even of keeping rows open. It takes a long time to close an open row 
with a precharge and then open the next one.

Don't even think about reading or writing peripheral pins. I worked on 
an ARM chip (PXA) that took 200 CPU clocks to read or write a port 
register. It was actually recommended to program the DMA controller to 
read and write the ports and to interrupt the CPU when done!

That previously-quoted ARM FAQ on memory copying suggests the same thing 
(DMA or Preload Engine) for copying memory while the CPU goes and does 
something else.

> But I see your point, the problem is not NEON, but the
> way the processor handles memory and cache.

The frustrating thing is the missing user-mode cache control instructions.

Tom
