From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <544122F9.9060203@optusnet.com.au>
Date: Sat, 18 Oct 2014 01:08:57 +1100
From: Tom Evans
MIME-Version: 1.0
References: <543E4B9F.60602@cgglobal.com> <543E5380.3080700@cgglobal.com>
 <543F70C5.1090005@xenomai.org> <543F7ED2.2080201@xenomai.org>
 <20141016181711.GB24110@csclub.uwaterloo.ca>
 <20141016185819.GF30661@sisyphus.hd.free.fr>
 <20141016205621.GC24110@csclub.uwaterloo.ca>
 <54405164.3090205@optusnet.com.au>
 <20141017053423.GG30661@sisyphus.hd.free.fr>
 <5440BB6B.9030807@optusnet.com.au>
 <20141017070201.GI30661@sisyphus.hd.free.fr>
In-Reply-To: <20141017070201.GI30661@sisyphus.hd.free.fr>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected
 switch to secondary mode.
List-Id: Discussions about the Xenomai project
To: Gilles Chanteperdrix
Cc: xenomai@xenomai.org

On 17/10/2014 6:02 PM, Gilles Chanteperdrix wrote:
> On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
>> On 17/10/14 16:34, Gilles Chanteperdrix wrote:
>>>> Work out how many pixels per second you're processing and then
>>>> compare it to the memory bandwidth.

That would still be an interesting number to measure and quote.

>> It might be better to FLUSH the entire cache, perform a L2-sized
>> transfer and then flush it again. The flushes *might* be to linear
>> addresses in open pages.

Thinking more about this it would be better to flush the entire cache
and then perform a preload-and-read pass (to load the cache from
complete open rows in on-page RAM) then loop reading and writing (from
cache to cache) and then loop flushing the destination cache lines into
open rows of the RAM.

This would be easy on the PPC. It has six "User level cache
instructions". Even the Coldfire has a "CPUSHL" user-mode instruction.
This seems to be impossible from user-space on the ARM, for as far as I
can tell all of the 13 "Cache and branch predictor maintenance
operations, VMSA" instructions "can be executed only by software
executing at PL1 or higher". The only user-space ones are PLD, PLDW and
PLI. So I'd have to write a kernel driver to copy user memory and worry
about the page translation.

>> I got my fastest memcpy() speed on an MCF5329 by reading 2k to the
>> stack (in static ram in the CPU) and then writing that back out.
>> Copying twice was a LOT faster than any other method.

240MHz Coldfire with 80MHz 32-bit SDR memory. It started out at 33MB/s,
got to 39MB/s by using the multiple register move instruction, and
peaked at 55MB/s copying via the stack in internal SRAM. Memcpy() from
internal RAM to internal RAM managed 304MB/s! RAM could be read at
87MB/s (due to the lack of pipelining in this CPU) but could be written
at 207MB/s.

Function                kB/s   Memclk/cache line
=============================================
memcpy_gcc_4_4         30883   41.45
memcpy_gcc_4_3_O1      33382   38.34
memcpy_gcc_4_3_O2      33385   38.34
memcpy_gcc_2           33390   38.33
memcpy(132096)         33379   38.35
memcpy_moveml          39752   32.20
memcpy_dma             43709   29.28
memcpy_moveml_32       49618   25.80
memcpy_stack           52912   24.19
memcpy_moveml_192      54052   23.68
memcpy_moveml_48       54093   23.66
memcpy_stack_48        54997   23.27
memcpy_stack_32_mis    55079   23.24
memcpy_stack_32        55125   23.22
memcpy_stack_192       55736   22.97
memcpy_moveml_96_ps    56739   22.56
memRead_stack_32       85017   15.06
memRead_moveml_32      87141   14.69
memWrite_stack_32     196864    6.50
memWrite_moveml_32    207535    6.17
memcpy_stack_stack    304368    4.21 (12.62 CPU clocks)

> Actually, I am wrong, I was only reading the image,
> not writing to it, simply computing a very reduced
> averaged image, so there were some writes from time
> to time, but very rarely.

That should get the best performance. That's not a common operation in
what I'm doing which is alpha-blending graphics over each other.

> > Miss L1 and wait 10 clocks.
> > Miss L2 and wait 153 clocks! Step
> > through memory 4k at a time and wait 46 clocks for the TLB to
> > reload.
>
> That does not prove that the memory system is slow, that
> proves that the processor access to memory is slow. But
> why is that?

The memory controller may not be capable of keeping multiple banks
open, or even rows open. It takes a long time to close an open row with
the precharge and to then open the next one.

Don't even think about reading or writing peripheral pins. I worked on
an ARM chip (PXA) that took 200 CPU Clocks to read or write a port
register. It was actually recommended to program the DMA controller to
read and write the ports and to interrupt the CPU when done!

That previously-quoted ARM FAQ on memory copying suggests the same thing
(DMA or Preload Engine) for copying memory while the CPU goes and does
something else.

> But I see your point, the problem is not NEON, but the
> way the processor handles memory and cache.

The frustrating thing is the missing user-mode cache control
instructions.

Tom
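
For anyone checking the benchmark table: the "Memclk/cache line" column
appears to follow directly from the kB/s column. The sketch below
reproduces it from the 80MHz memory clock and 240MHz CPU clock given in
the message. It assumes that kB means 1000 bytes and that a cache line is
16 bytes; both are inferred from the figures in the table, not stated in
the message, and the function names are made up for illustration.

```c
#include <assert.h>

/* Clock rates given in the message. */
#define MEM_CLOCK_HZ 80000000.0
#define CPU_CLOCK_HZ 240000000.0

/* Assumptions inferred from the table, not stated in the message. */
#define BYTES_PER_KB 1000.0
#define CACHE_LINE_BYTES 16.0

/* Memory clocks spent per cache line at a throughput given in kB/s. */
double memclk_per_cache_line(double kb_per_s)
{
    assert(kb_per_s > 0.0);
    double lines_per_s = kb_per_s * BYTES_PER_KB / CACHE_LINE_BYTES;
    return MEM_CLOCK_HZ / lines_per_s;
}

/* The same cost expressed in CPU clocks (3 CPU clocks per memory clock). */
double cpuclk_per_cache_line(double kb_per_s)
{
    return memclk_per_cache_line(kb_per_s) * (CPU_CLOCK_HZ / MEM_CLOCK_HZ);
}
```

Under those assumptions memclk_per_cache_line(30883.0) comes out at about
41.45, matching the memcpy_gcc_4_4 row, and
cpuclk_per_cache_line(304368.0) at about 12.62, matching the "CPU clocks"
note on the memcpy_stack_stack row.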