linux-arm-kernel.lists.infradead.org archive mirror
From: bill4carson@gmail.com (bill4carson)
To: linux-arm-kernel@lists.infradead.org
Subject: Query about float point operation on Cortex A9
Date: Thu, 03 Nov 2011 16:21:28 +0800
Message-ID: <4EB24F08.9050008@gmail.com>

Hi, all

I ran the STREAM benchmark on an ARM Versatile Express Cortex-A9x4 tile with
Linux version 3.1.0+, to which I added hugetlb support. With huge pages, the
performance improvement surprisingly shows up *only* in the "Copy" function
(2.7% ~ 10.7%), while "Scale", "Add" and "Triad" barely beat 4k pages.

Profiling with OProfile shows that the __adddf3/__muldf3 operations eat most
of the CPU cycles, as can be seen in the logs below. Apparently these two
operations wipe out the benefit brought by hugetlb. To make the hugetlb
benchmarking more convincing for the community to accept, I compiled the
source with -mfloat-abi=softfp; unfortunately I get illegal instruction
errors on every run.

Is there anything else that needs to be done to execute instructions such as
vldr/vmul.f64/vadd.f64?
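
One thing that may be worth checking first (an assumption, not something the
logs below confirm): an illegal instruction on vadd.f64 can mean the kernel
was built without VFP support (CONFIG_VFP), so the coprocessor is never
enabled for user space. On ARM Linux the "Features" line of /proc/cpuinfo
normally lists "vfp" when the kernel has enabled it. A minimal sketch of that
check, where has_vfp is a hypothetical helper name:

```python
def has_vfp(cpuinfo_text):
    """Return True if a 'Features' line in the given text lists 'vfp'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "Features" line is of interest; match the exact token
        # "vfp" so that "vfpv3" alone does not count.
        if sep and key.strip() == "Features":
            if "vfp" in value.split():
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # No /proc/cpuinfo on this system: treat as "not reported".
        text = ""
    print("vfp reported by kernel:", has_vfp(text))
```

If "vfp" is missing there, the kernel configuration is the more likely place
to look than the compiler flags.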


root@localhost:/root> ./run_huge.sh
Profiler running.
hugectl: WARNING: data and bss remapped together in the default hugepage size
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2097152, Offset = 0
Total memory required = 48.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 9 microseconds.
Each test below will take on the order of 267867 microseconds.
    (= 29763 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         209.0475       0.1613       0.1605       0.1624
Scale:         95.3991       0.3819       0.3517       0.4015
Add:          125.4612       0.4245       0.4012       0.4696
Triad:         62.7207       0.8052       0.8025       0.8094
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Stopping profiling.
Overflow stats not available
CPU: ARM Cortex-A9, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
warning: the last modified time of the binary file does not match that of the sample file for /root/stream_c_2M.exe
Either this is the wrong binary or the binary has been modified since the sample file was created.
samples  %        symbol name
235526   46.7135  __adddf3
185916   36.8740  __muldf3
37467     7.4311  tuned_STREAM_Copy
14374     2.8509  tuned_STREAM_Triad
11627     2.3061  tuned_STREAM_Add
10815     2.1450  tuned_STREAM_Scale
6066      1.2031  main
2399      0.4758  checkSTREAMresults
1        2.0e-04  __divdf3
1        2.0e-04  __floatsidf
1        2.0e-04  mysecond
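
To put a single number on how much of this profile is software
floating-point, the sample counts can be summed as in the rough sketch below.
The counts are copied from the report above; the SOFT_FLOAT set and the
soft_float_share helper are names made up for this illustration.

```python
# Sample counts copied from the huge-page OProfile report above.
HUGE_PROFILE = {
    "__adddf3": 235526,
    "__muldf3": 185916,
    "tuned_STREAM_Copy": 37467,
    "tuned_STREAM_Triad": 14374,
    "tuned_STREAM_Add": 11627,
    "tuned_STREAM_Scale": 10815,
    "main": 6066,
    "checkSTREAMresults": 2399,
    "__divdf3": 1,
    "__floatsidf": 1,
    "mysecond": 1,
}

# Double-precision helper routines that appear in the report.
SOFT_FLOAT = {"__adddf3", "__muldf3", "__divdf3", "__floatsidf"}


def soft_float_share(profile):
    """Percentage of all samples that landed in the soft-float helpers."""
    total = sum(profile.values())
    soft = sum(n for name, n in profile.items() if name in SOFT_FLOAT)
    return 100.0 * soft / total


if __name__ == "__main__":
    print("soft-float share: %.2f%%" % soft_float_share(HUGE_PROFILE))
```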



root@localhost:/root> ./run_4k.sh
Profiler running.
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2097152, Offset = 0
Total memory required = 48.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 9 microseconds.
Each test below will take on the order of 268604 microseconds.
    (= 29844 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         265.7375       0.1269       0.1263       0.1275
Scale:         99.3782       0.3583       0.3376       0.3796
Add:          132.0995       0.3940       0.3810       0.4173
Triad:         71.8116       0.7040       0.7009       0.7082
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Stopping profiling.
Overflow stats not available
CPU: ARM Cortex-A9, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
warning: the last modified time of the binary file does not match that of the sample file for /root/stream_c_2M.exe
Either this is the wrong binary or the binary has been modified since the sample file was created.
samples  %        symbol name
274767   46.7235  __adddf3
216873   36.8788  __muldf3
43709     7.4326  tuned_STREAM_Copy
16682     2.8367  tuned_STREAM_Triad
13579     2.3091  tuned_STREAM_Add
12587     2.1404  tuned_STREAM_Scale
7079      1.2038  main
2791      0.4746  checkSTREAMresults
1        1.7e-04  __divdf3
1        1.7e-04  __floatsidf
1        1.7e-04  mysecond
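
For comparing the two rate tables in these logs side by side, a small sketch
like the following may help. The rates are copied from the two runs above;
relative_change is a hypothetical helper, and a positive value means the
huge-page run was faster than the 4k run for that function.

```python
# "Rate (MB/s)" columns copied from the two STREAM runs above.
HUGE = {"Copy": 209.0475, "Scale": 95.3991, "Add": 125.4612, "Triad": 62.7207}
FOUR_K = {"Copy": 265.7375, "Scale": 99.3782, "Add": 132.0995, "Triad": 71.8116}


def relative_change(huge, base):
    """Per-function change of the huge-page rate relative to the 4k rate, in percent."""
    return {name: 100.0 * (huge[name] - base[name]) / base[name] for name in base}


if __name__ == "__main__":
    for name, pct in relative_change(HUGE, FOUR_K).items():
        print("%-6s %+7.2f%%" % (name, pct))
```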

-- 
I am a slow learner
but I will keep trying to fight for my dreams!

--bill
