From: Akira Yokosawa <akiyks@gmail.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: perfbook@vger.kernel.org, Akira Yokosawa <akiyks@gmail.com>
Subject: [PATCH 06/10] treewide: Use \Power{} macro for POWER CPU family
Date: Fri, 6 Oct 2017 00:54:38 +0900
Message-ID: <7873ebb2-714c-e68d-3130-1ed7710bbde4@gmail.com>
In-Reply-To: <d36d262c-2fe4-284c-7818-516a4ad59ec7@gmail.com>
From c7255fa8b6fc7835c0eb6ab524aed3349cea1dca Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Sun, 1 Oct 2017 12:40:18 +0900
Subject: [PATCH 06/10] treewide: Use \Power{} macro for POWER CPU family
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
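Reviewer's sketch (not part of the commit): a minimal standalone file showing how the \Power{} macro added to perfbook.tex by this patch is expected to expand. The macro definition is copied from the perfbook.tex hunk below; the surrounding document and the sample sentences are assumptions made only for illustration.

```latex
\documentclass{article}

% Definition as added to perfbook.tex by this patch.
\newcommand{\Power}[1]{POWER#1}

\begin{document}

% With a generation number: typesets "POWER5" and "POWER6".
A single \Power{5} CPU, or a 4.7\,GHz \Power{6} system.

% With an empty argument: typesets "POWER". Because the braces end
% the argument, the following space or punctuation is preserved.
The \Power{} family, ARM and \Power{}, and \Power{}'s cumulativity.

\end{document}
```

Note that the patch changes only typeset text: the \label{} and \ref{} keys that contain "Power-6" or "POWER / PowerPC" are left as they were, so existing cross-references keep resolving.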
appendix/toyrcu/toyrcu.tex | 26 +++++++++++++-------------
count/count.tex | 6 +++---
intro/intro.tex | 2 +-
memorder/memorder.tex | 30 +++++++++++++++---------------
perfbook.tex | 1 +
toolsoftrade/toolsoftrade.tex | 4 ++--
6 files changed, 35 insertions(+), 34 deletions(-)
diff --git a/appendix/toyrcu/toyrcu.tex b/appendix/toyrcu/toyrcu.tex
index db45fad..2c65f74 100644
--- a/appendix/toyrcu/toyrcu.tex
+++ b/appendix/toyrcu/toyrcu.tex
@@ -73,7 +73,7 @@ Of course, only one RCU reader may be in its read-side critical section
at a time, which almost entirely defeats the purpose of RCU.
In addition, the lock operations in \co{rcu_read_lock()} and
\co{rcu_read_unlock()} are extremely heavyweight,
-with read-side overhead ranging from about 100~nanoseconds on a single Power5
+with read-side overhead ranging from about 100~nanoseconds on a single \Power{5}
CPU up to more than 17~\emph{microseconds} on a 64-CPU system.
Worse yet,
these same lock operations permit \co{rcu_read_lock()}
@@ -216,7 +216,7 @@ with a single global lock.
Furthermore, the read-side overhead, though high at roughly 140 nanoseconds,
remains at about 140 nanoseconds regardless of the number of CPUs.
However, the update-side overhead ranges from about 600 nanoseconds
-on a single Power5 CPU
+on a single \Power{5} CPU
up to more than 100 \emph{microseconds} on 64 CPUs.
\QuickQuiz{}
@@ -368,7 +368,7 @@ However, this implementations still has some serious shortcomings.
First, the atomic operations in \co{rcu_read_lock()} and
\co{rcu_read_unlock()} are still quite heavyweight,
with read-side overhead ranging from about 100~nanoseconds on
-a single Power5 CPU up to almost 40~\emph{microseconds}
+a single \Power{5} CPU up to almost 40~\emph{microseconds}
on a 64-CPU system.
This means that the RCU read-side critical sections
have to be extremely long in order to get any real
@@ -718,9 +718,9 @@ In fact, they are more complex than those
of the single-counter variant shown in
Figure~\ref{fig:app:toyrcu:RCU Implementation Using Single Global Reference Counter},
with the read-side primitives consuming about 150~nanoseconds on a single
-Power5 CPU and almost 40~\emph{microseconds} on a 64-CPU system.
+\Power{5} CPU and almost 40~\emph{microseconds} on a 64-CPU system.
The update-side \co{synchronize_rcu()} primitive is more costly as
-well, ranging from about 200~nanoseconds on a single Power5 CPU to
+well, ranging from about 200~nanoseconds on a single \Power{5} CPU to
more than 40~\emph{microseconds} on a 64-CPU system.
This means that the RCU read-side critical sections
have to be extremely long in order to get any real
@@ -963,9 +963,9 @@ environments.
That said, the read-side primitives scale very nicely, requiring about
115~nanoseconds regardless of whether running on a single-CPU or a 64-CPU
-Power5 system.
+\Power{5} system.
As noted above, the \co{synchronize_rcu()} primitive does not scale,
-ranging in overhead from almost a microsecond on a single Power5 CPU
+ranging in overhead from almost a microsecond on a single \Power{5} CPU
up to almost 200~microseconds on a 64-CPU system.
This implementation could conceivably form the basis for a
production-quality user-level RCU implementation.
@@ -1340,9 +1340,9 @@ destruction will not be reordered into the preceding loop.
This approach achieves much better read-side performance, incurring
roughly 63~nanoseconds of overhead regardless of the number of
-Power5 CPUs.
+\Power{5} CPUs.
Updates incur more overhead, ranging from about 500~nanoseconds on
-a single Power5 CPU to more than 100~\emph{microseconds} on 64
+a single \Power{5} CPU to more than 100~\emph{microseconds} on 64
such CPUs.
\QuickQuiz{}
@@ -1542,9 +1542,9 @@ This approach achieves read-side performance almost equal to that
shown in
Section~\ref{sec:app:toyrcu:RCU Based on Free-Running Counter}, incurring
roughly 65~nanoseconds of overhead regardless of the number of
-Power5 CPUs.
+\Power{5} CPUs.
Updates again incur more overhead, ranging from about 600~nanoseconds on
-a single Power5 CPU to more than 100~\emph{microseconds} on 64
+a single \Power{5} CPU to more than 100~\emph{microseconds} on 64
such CPUs.
\QuickQuiz{}
@@ -1866,11 +1866,11 @@ This implementation has blazingly fast read-side primitives, with
an \co{rcu_read_lock()}-\co{rcu_read_unlock()} round trip incurring
an overhead of roughly 50~\emph{picoseconds}.
The \co{synchronize_rcu()} overhead ranges from about 600~nanoseconds
-on a single-CPU Power5 system up to more than 100~microseconds on
+on a single-CPU \Power{5} system up to more than 100~microseconds on
a 64-CPU system.
\QuickQuiz{}
- To be sure, the clock frequencies of Power
+ To be sure, the clock frequencies of \Power{}
systems in 2008 were quite high, but even a 5\,GHz clock
frequency is insufficient to allow
loops to be executed in 50~picoseconds!
diff --git a/count/count.tex b/count/count.tex
index 73b6866..a38aba1 100644
--- a/count/count.tex
+++ b/count/count.tex
@@ -3330,7 +3330,7 @@ will expand on these lessons.
\path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} &
5.7 ns & 354 ns & 501 ns \\
\end{tabular}
-\caption{Statistical Counter Performance on Power-6}
+\caption{Statistical Counter Performance on \Power{6}}
\label{tab:count:Statistical Counter Performance on Power-6}
\end{table*}
@@ -3410,14 +3410,14 @@ courtesy of eventual consistency.
\path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} &
Y & 10.2 ns & 370 ns & 54,000 ns \\
\end{tabular}
-\caption{Limit Counter Performance on Power-6}
+\caption{Limit Counter Performance on \Power{6}}
\label{tab:count:Limit Counter Performance on Power-6}
\end{table*}
Figure~\ref{tab:count:Limit Counter Performance on Power-6}
shows the performance of the parallel limit-counting algorithms.
Exact enforcement of the limits incurs a substantial performance
-penalty, although on this 4.7\,GHz Power-6 system that penalty can be reduced
+penalty, although on this 4.7\,GHz \Power{6} system that penalty can be reduced
by substituting signals for atomic operations.
All of these implementations suffer from read-side lock contention
in the face of concurrent readers.
diff --git a/intro/intro.tex b/intro/intro.tex
index 8bed518..293a02f 100644
--- a/intro/intro.tex
+++ b/intro/intro.tex
@@ -77,7 +77,7 @@ that of a bicycle, courtesy of Moore's Law.
Papers calling out the advantages of multicore CPUs were published
as early as 1996~\cite{Olukotun96}.
IBM introduced simultaneous multi-threading
-into its high-end POWER family in 2000, and multicore in 2001.
+into its high-end \Power{} family in 2000, and multicore in 2001.
Intel introduced hyperthreading into its commodity Pentium line in
November 2000, and both AMD and Intel introduced
dual-core CPUs in 2005.
diff --git a/memorder/memorder.tex b/memorder/memorder.tex
index 7dc3fb4..944c17a 100644
--- a/memorder/memorder.tex
+++ b/memorder/memorder.tex
@@ -314,7 +314,7 @@ synchronization primitives (such as locking and RCU)
that are responsible for maintaining the illusion of ordering through use of
\emph{memory barriers} (for example, \co{smp_mb()} in the Linux kernel).
These memory barriers can be explicit instructions, as they are on
-ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions,
+ARM, \Power{}, Itanium, and Alpha, or they can be implied by other instructions,
as they often are on x86.
Since these standard synchronization primitives preserve the illusion of
ordering, your path of least resistance is to simply use these primitives,
@@ -827,7 +827,7 @@ if the shared variable had changed before entry into the loop.
This allows us to plot each CPU's view of the value of \co{state.variable}
over a 532-nanosecond time period, as shown in
Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}.
-This data was collected in 2006 on 1.5\,GHz POWER5 system with 8 cores,
+This data was collected in 2006 on 1.5\,GHz \Power{5} system with 8 cores,
each containing a pair of hardware threads.
CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test.
The timebase counter period was about 5.32\,ns, sufficiently fine-grained
@@ -2043,7 +2043,7 @@ communicated to \co{P1()} long before it was communicated to \co{P2()}.
\QuickQuizAnswer{
You need to face the fact that it really can trigger.
Akira Yokosawa used the \co{litmus7} tool to run this litmus test
- on a Power8 system.
+ on a \Power{8} system.
Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause.
Thus, triggering the \co{exists} clause is not merely a one-in-a-million
occurrence, but rather a one-in-a-hundred-million occurrence.
@@ -3707,7 +3707,7 @@ dependencies.
\rotatebox{90}{PA-RISC CPUs}
\end{picture}
& \begin{picture}(6,60)(0,0)
- \rotatebox{90}{POWER}
+ \rotatebox{90}{\Power{}}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{SPARC TSO}
@@ -4134,7 +4134,7 @@ For more on Alpha, see its reference manual~\cite{ALPHA2002}.
The ARM family of CPUs is extremely popular in embedded applications,
particularly for power-constrained applications such as cellphones.
-Its memory model is similar to that of Power
+Its memory model is similar to that of \Power{}
(see Section~\ref{sec:memorder:POWER / PowerPC}, but ARM uses a
different set of memory-barrier instructions~\cite{ARMv7A:2010}:
@@ -4144,7 +4144,7 @@ different set of memory-barrier instructions~\cite{ARMv7A:2010}:
subsequent operations of the same type.
The ``type'' of operations can be all operations or can be
restricted to only writes (similar to the Alpha \co{wmb}
- and the POWER \co{eieio} instructions).
+ and the \Power{} \co{eieio} instructions).
In addition, ARM allows cache coherence to have one of three
scopes: single processor, a subset of the processors
(``inner'') and global (``outer'').
@@ -4168,7 +4168,7 @@ None of these instructions exactly match the semantics of Linux's
\co{DMB}.
The \co{DMB} and \co{DSB} instructions have a recursive definition
of accesses ordered before and after the barrier, which has an effect
-similar to that of POWER's cumulativity.
+similar to that of \Power{}'s cumulativity.
ARM also implements control dependencies, so that if a conditional
branch depends on a load, then any store executed after that conditional
@@ -4292,7 +4292,7 @@ memory barriers.
\subsection{MIPS}
The MIPS memory model~\cite[Table 6.6]{MIPSvII-A-2015}
-appears to resemble that of ARM, Itanium, and Power,
+appears to resemble that of ARM, Itanium, and \Power{},
being weakly ordered by default, but respecting dependencies.
MIPS has a wide variety of memory-barrier instructions, but ties them
not to hardware considerations, but rather to the use cases provided
@@ -4325,7 +4325,7 @@ in a manner similar to the ARM64 additions:
Informal discussions with MIPS architects indicates that MIPS has a
definition of transitivity or cumulativity similar to that of
-ARM and Power.
+ARM and \Power{}.
However, it appears that different MIPS implementations can have
different memory-ordering properties, so it is important to consult
the documentation for the specific MIPS implementation you are using.
@@ -4339,10 +4339,10 @@ no code, however, they do use the gcc {\tt memory} attribute to disable
compiler optimizations that would reorder code across the memory
barrier.
-\subsection{POWER / PowerPC}
+\subsection{\Power{} / PowerPC}
\label{sec:memorder:POWER / PowerPC}
-The POWER and PowerPC\textsuperscript{\textregistered}
+The \Power{} and PowerPC\textsuperscript{\textregistered}
CPU families have a wide variety of memory-barrier
instructions~\cite{PowerPC94,MichaelLyons05a}:
\begin{description}
@@ -4388,7 +4388,7 @@ The \co{smp_mb()} instruction is also defined to be the {\tt sync}
instruction, but both \co{smp_rmb()} and \co{rmb()} are defined to
be the lighter-weight {\tt lwsync} instruction.
-Power features ``cumulativity'', which can be used to obtain
+\Power{} features ``cumulativity'', which can be used to obtain
transitivity.
When used properly, any code seeing the results of an earlier
code fragment will also see the accesses that this earlier code
@@ -4396,11 +4396,11 @@ fragment itself saw.
Much more detail is available from
McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}.
-Power respects control dependencies in much the same way that ARM
-does, with the exception that the Power \co{isync} instruction
+\Power{} respects control dependencies in much the same way that ARM
+does, with the exception that the \Power{} \co{isync} instruction
is substituted for the ARM \co{ISB} instruction.
-Many members of the POWER architecture have incoherent instruction
+Many members of the \Power{} architecture have incoherent instruction
caches, so that a store to memory will not necessarily be reflected
in the instruction cache.
Thankfully, few people write self-modifying code these days, but JITs
diff --git a/perfbook.tex b/perfbook.tex
index da9cfa8..cc4f4b0 100644
--- a/perfbook.tex
+++ b/perfbook.tex
@@ -138,6 +138,7 @@
\newcommand{\qop}[1]{{\sffamily #1}} % QC operator such as H, T, S, etc.
\DeclareRobustCommand{\euler}{\ensuremath{\mathrm{e}}}
+\newcommand{\Power}[1]{POWER#1}
\newcommand{\Epigraph}[2]{\epigraphhead[65]{\rmfamily\epigraph{#1}{#2}}}
diff --git a/toolsoftrade/toolsoftrade.tex b/toolsoftrade/toolsoftrade.tex
index 9cf3312..97a37d3 100644
--- a/toolsoftrade/toolsoftrade.tex
+++ b/toolsoftrade/toolsoftrade.tex
@@ -1038,7 +1038,7 @@ Line~39 moves the lock-acquisition count to this thread's element of the
\end{figure}
Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability}
-shows the results of running this test on a 64-core Power-5 system
+shows the results of running this test on a 64-core \Power{5} system
with two hardware threads per core for a total of 128 software-visible
CPUs.
The \co{thinktime} parameter was zero for all these tests, and the
@@ -1137,7 +1137,7 @@ This situation will only get worse as you add CPUs.
} \QuickQuizEnd
\QuickQuiz{}
- Power-5 is several years old, and new hardware should
+ \Power{5} is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
\QuickQuizAnswer{
--
2.7.4