From: Stephen Boyd <sboyd@codeaurora.org>
To: Daniel Thompson <daniel.thompson@linaro.org>,
Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
John Stultz <john.stultz@linaro.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"patches@linaro.org" <patches@linaro.org>,
"linaro-kernel@lists.linaro.org" <linaro-kernel@lists.linaro.org>,
Sumit Semwal <sumit.semwal@linaro.org>,
Steven Rostedt <rostedt@goodmis.org>,
Russell King <linux@arm.linux.org.uk>,
Catalin Marinas <Catalin.Marinas@arm.com>
Subject: Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage
Date: Mon, 09 Feb 2015 18:37:35 -0800 [thread overview]
Message-ID: <54D96EEF.40302@codeaurora.org> (raw)
In-Reply-To: <54D88222.8040000@linaro.org>
On 02/09/15 01:47, Daniel Thompson wrote:
> On 09/02/15 09:28, Will Deacon wrote:
>> On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
>>> Currently sched_clock(), a very hot code path, is not optimized to
>>> minimise its cache profile. In particular:
>>>
>>> 1. cd is not ____cacheline_aligned,
>>>
>>> 2. struct clock_data does not distinguish between hotpath and
>>> coldpath data, reducing locality of reference in the hotpath,
>>>
>>> 3. Some hotpath data is missing from struct clock_data and is marked
>>> __read_mostly (which more or less guarantees it will not share a
>>> cache line with cd).
>>>
>>> This patch corrects these problems by extracting all hotpath data
>>> into a separate structure and using ____cacheline_aligned to ensure
>>> the hotpath uses a single (64 byte) cache line.
>> Have you got any performance figures for this change, or is this just a
>> theoretical optimisation? It would be interesting to see what effect this
>> has on systems with 32-byte cachelines and also scenarios where there's
>> contention on the sequence counter.
> Most of my testing has focused on proving the NMI safety parts of the
> patch work as advertised, so it's mostly theoretical.
>
> However there are some numbers from simple tight loop calls to
> sched_clock (Stephen Boyd's results are more interesting than mine
> because I observe pretty wild quantization effects that render the
> results hard to trust):
> http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265
>
> Not sure what figures would be useful for a contended sequence
> counter. Firstly, the counter is taken for write at 7/8 of the wrap
> time, so even for the fastest timers the interval is likely to be >3s
> and the write itself is of very short duration. Additionally, the NMI
> safety changes make it
> possible to read the timer whilst it is being updated so it is only
> during the very short struct-copy/write/struct-copy/write update
> sequence that we will observe the extra cache line used for a read.
> Benchmarks that show the effect of update are therefore non-trivial to
> construct.
>
Here's the raw numbers for the tight loop. I noticed that if I don't use
perf I get a larger number of calls per 10s, most likely because we
aren't doing anything else. These are very lightly loaded systems, i.e.
busybox ramdisk with nothing going on. Kernel version is v3.19-rc4. The
CPU is Krait on both msm8960 and msm8974, except that msm8974 has the ARM
architected timer backing sched_clock() vs. our own custom timer IP on
msm8960. The cache line size is 64 bytes. I also ran it on msm8660 which
is a Scorpion CPU with the same timer as msm8960 (custom timer IP) and a
cache line size of 32 bytes. Unfortunately nobody has ported Scorpion
over to perf events, so we don't have hardware events.
msm8960 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14528449 calls in 10000000290 ns
Made 14528925 calls in 10000000142 ns
Made 14524549 calls in 10000000587 ns
Made 14528164 calls in 10000000734 ns
Made 14524468 calls in 10000000290 ns
Made 14527198 calls in 10000000438 ns
Made 14523508 calls in 10000000734 ns
Made 14527894 calls in 10000000290 ns
Made 14529609 calls in 10000000734 ns
Made 14523114 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.635016 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 16.16% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003806350 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
921924235 instructions # 0.23 insns per cycle ( +- 0.01% )
0 branches # 0.000 K/sec
58521151 branch-misses # 5.846 M/sec ( +- 0.01% )
10.011767657 seconds time elapsed ( +- 0.00% )
msm8960 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 19626366 calls in 10000000587 ns
Made 19623708 calls in 10000000142 ns
Made 19623282 calls in 10000000290 ns
Made 19625304 calls in 10000000290 ns
Made 19625151 calls in 10000000291 ns
Made 19624906 calls in 10000000290 ns
Made 19625383 calls in 10000000143 ns
Made 19625235 calls in 10000000290 ns
Made 19624969 calls in 10000000290 ns
Made 19625209 calls in 10000000438 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.883401 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 15.88% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003901511 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1164635790 instructions # 0.29 insns per cycle ( +- 0.00% )
0 branches # 0.000 K/sec
20039814 branch-misses # 2.002 M/sec ( +- 0.00% )
10.012092383 seconds time elapsed ( +- 0.00% )
msm8974 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289694 calls in 10000000083 ns
Made 21289072 calls in 10000000082 ns
Made 21289550 calls in 10000000395 ns
Made 21288892 calls in 10000000291 ns
Made 21288987 calls in 10000000135 ns
Made 21289140 calls in 10000000395 ns
Made 21289161 calls in 10000000395 ns
Made 21288911 calls in 10000000239 ns
Made 21289204 calls in 10000000135 ns
Made 21288738 calls in 10000000135 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10003.839348 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
4 context-switches # 0.000 K/sec ( +- 3.70% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6146323757 cycles # 0.614 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1155527762 instructions # 0.19 insns per cycle ( +- 0.00% )
107186099 branches # 10.714 M/sec ( +- 0.00% )
35548359 branch-misses # 33.17% of all branches ( +- 0.00% )
10.004769053 seconds time elapsed ( +- 0.00% )
msm8974 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289357 calls in 10000000239 ns
Made 21384961 calls in 10000000396 ns
Made 22105925 calls in 10000000238 ns
Made 27384126 calls in 10000000239 ns
Made 22107737 calls in 10000000134 ns
Made 21368867 calls in 10000000239 ns
Made 22106065 calls in 10000000395 ns
Made 27384196 calls in 10000000083 ns
Made 22107334 calls in 10000000291 ns
Made 21365426 calls in 10000000291 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10003.753333 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 18.18% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6837664600 cycles # 0.684 GHz ( +- 6.74% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1148993903 instructions # 0.17 insns per cycle ( +- 3.32% )
115049358 branches # 11.501 M/sec ( +- 3.31% )
42520513 branch-misses # 36.96% of all branches ( +- 5.00% )
10.004769533 seconds time elapsed ( +- 0.00% )
msm8660 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14099029 calls in 10000000586 ns
Made 14099227 calls in 10000000735 ns
Made 14098763 calls in 10000000439 ns
Made 14099042 calls in 10000000291 ns
Made 14099273 calls in 10000000290 ns
Made 14100377 calls in 10000000586 ns
Made 14100183 calls in 10000000586 ns
Made 14099220 calls in 10000000586 ns
Made 14098853 calls in 10000000587 ns
Made 14099368 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10006.700528 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
11 context-switches # 0.001 K/sec ( +- 10.38% )
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec
10.008796161 seconds time elapsed ( +- 0.00% )
msm8660 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 20555901 calls in 10000000438 ns
Made 15510019 calls in 10000000142 ns
Made 15510371 calls in 10000000587 ns
Made 15509184 calls in 10000000439 ns
Made 15509068 calls in 10000000291 ns
Made 15510719 calls in 10000000439 ns
Made 15508899 calls in 10000000291 ns
Made 15509206 calls in 10000000587 ns
Made 15509057 calls in 10000000290 ns
Made 15509178 calls in 10000000735 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.491416 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
13 context-switches # 0.001 K/sec ( +- 10.82% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec
10.011834087 seconds time elapsed ( +- 0.00% )
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project