* [PATCH -perfbook 1/4] cpu: Improve layout and consistency of Tables 3.1, E.1, and E.2
From: Akira Yokosawa @ 2023-02-14 10:05 UTC (permalink / raw)
To: Paul E. McKenney, Leonardo Bras; +Cc: perfbook, Akira Yokosawa
To make these tables fit the column width in 2c builds, make the
following changes:
- Move prefixes such as "Same-CPU" and "In-Core" to a separate
row.
- Add \midrule between different classes of counterpart CPUs.
- Stop coloring alternate rows.
- Shrink "CPUs" column width by spanning two rows.
Define \tcresizewidth{} ("tc" stands for "two column") and use it
for slightly wide tables in 2c builds.
To improve consistency among these tables:
- Capitalize "In-Core", "Off-Core", "Off-System", and "Blind CAS".
Reported-by: Leonardo Bras <leobras.c@gmail.com>
Link: [1] https://www.spinics.net/lists/perfbook/msg03827.html
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
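A minimal standalone sketch of how \tcresizewidth{} is meant to be used,
for anyone who wants to try the layout outside of perfbook. The
\IfTwoColumn definition below is a hypothetical stand-in (perfbook
provides its own), the caption is made up, and table-format is narrowed
to 3.1 because only the Off-Core rows are reproduced:

```latex
\documentclass[twocolumn]{article}
\usepackage{booktabs}  % \toprule, \midrule, \bottomrule
\usepackage{graphicx}  % \resizebox
\usepackage{siunitx}   % S column type

\makeatletter
% Hypothetical stand-in for perfbook's \IfTwoColumn (assumption):
% pick the first argument in two-column layout, the second otherwise.
\newcommand{\IfTwoColumn}[2]{\if@twocolumn #1\else #2\fi}
\makeatother

% Same definition as in the perfbook-lt.tex hunk of this patch.
\IfTwoColumn{
\newcommand{\tcresizewidth}[1]{\resizebox{\columnwidth}{!}{#1}}
}{
\newcommand{\tcresizewidth}[1]{#1}
}

\begin{document}
\begin{table}
\centering\small
\tcresizewidth{
\begin{tabular}{ll S[table-format = 3.1] S[table-format = 3.1] r}
  \toprule
  \multicolumn{2}{l}{Operation}
    & \multicolumn{1}{r}{Cost (ns)}
    & \multicolumn{1}{r}{Ratio}
    & CPUs \\
  \midrule
  % Class name on its own row; CPU list spans two rows.
  \multicolumn{2}{l}{Off-Core}
    &       &       & 1--27 \\
    & Blind CAS &  47.5 &  99.8 & 225--251 \\
    & CAS       & 101.9 & 214.0 & \\
  \bottomrule
\end{tabular}
}
\caption{Excerpt for illustration only}
\end{table}
\end{document}
```

In a one-column build the macro simply passes its argument through, so
the table keeps its natural width.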
cpu/overheads.tex | 143 +++++++++++++++++++++++++++++-----------------
perfbook-lt.tex | 6 ++
2 files changed, 97 insertions(+), 52 deletions(-)
diff --git a/cpu/overheads.tex b/cpu/overheads.tex
index a89c71158bf9..7ae99ed6cb7b 100644
--- a/cpu/overheads.tex
+++ b/cpu/overheads.tex
@@ -133,44 +133,62 @@ optimization.
\subsection{Costs of Operations}
\label{sec:cpu:Costs of Operations}
-\begin{table*}
-\rowcolors{1}{}{lightgray}
+\begin{table}
+%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
-\ebresizewidth{
+\tcresizewidth{
\begin{tabular}
{
- l
+ ll
S[table-format = 9.1]
S[table-format = 9.1]
r
}
\toprule
- Operation & \multicolumn{1}{r}{Cost (ns)}
+ \multicolumn{2}{l}{Operation}
+ & \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}}
& CPUs \\
\midrule
- Clock period & 0.5 & 1.0 & \\
- Same-CPU CAS & 7.0 & 14.6 & 0 \\
- Same-CPU lock & 15.4 & 32.3 & 0 \\
- In-core blind CAS & 7.2 & 15.2 & 224 \\
- In-core CAS & 18.0 & 37.7 & 224 \\
- Off-core blind CAS & 47.5 & 99.8 & 1--27,225--251 \\
- Off-core CAS & 101.9 & 214.0 & 1--27,225--251 \\
- Off-socket blind CAS & 148.8 & 312.5 & 28--111,252--335 \\
- Off-socket CAS & 442.9 & 930.1 & 28--111,252--335 \\
- Cross-interconnect blind CAS & 336.6 & 706.8 & 112--223,336--447 \\
- Cross-interconnect CAS & 944.8 & 1984.2 & 112--223,336--447 \\
+ \multicolumn{2}{l}{Clock period}
+ & 0.5 & 1.0 & \\
+ \midrule
+ \multicolumn{2}{l}{Same-CPU}
+ & & & 0 \\
+ & CAS & 7.0 & 14.6 & \\
+ & lock & 15.4 & 32.3 & \\
\midrule
- Off-System & & & \\
- Comms Fabric & 5 000 & 10 500 & \\
- Global Comms & 195 000 000 & 409 500 000 & \\
+ \multicolumn{2}{l}{In-Core}
+ & & & 224 \\
+ & Blind CAS& 7.2 & 15.2 & \\
+ & CAS & 18.0 & 37.7 & \\
+ \midrule
+ \multicolumn{2}{l}{Off-Core}
+ & & & 1--27 \\
+ & Blind CAS& 47.5 & 99.8 & 225--251 \\
+ & CAS & 101.9 & 214.0 & \\
+ \midrule
+ \multicolumn{2}{l}{Off-Socket}
+ & & & 28--111 \\
+ & Blind CAS& 148.8 & 312.5 & 252--335 \\
+ & CAS & 442.9 & 930.1 & \\
+ \midrule
+ \multicolumn{2}{l}{Cross-Interconnect}
+ & & & 112--223 \\
+ & Blind CAS& 336.6 & 706.8 & 336--447 \\
+ & CAS & 944.8 & 1984.2 & \\
+ \midrule
+ \multicolumn{2}{l}{Off-System}
+ & & & \\
+ & Comms Fabric & 5 000 & 10 500 & \\
+ & Global Comms & 195 000 000 & 409 500 000 & \\
\bottomrule
\end{tabular}
}
\caption{CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10\,GHz}
\label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
-\end{table*}
+\end{table}
The overheads of some common operations important to parallel programs are
displayed in
@@ -311,36 +329,47 @@ thousand clock cycles.
\end{enumerate}
\begin{table}
-\rowcolors{1}{}{lightgray}
+%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
\begin{tabular}
{
- l
+ ll
S[table-format = 9.1]
S[table-format = 9.1]
}
\toprule
- Operation & \multicolumn{1}{r}{Cost (ns)}
+ \multicolumn{2}{l}{Operation}
+ & \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}} \\
\midrule
- Clock period & 0.4 & 1.0 \\
- Same-CPU CAS & 12.2 & 33.8 \\
- Same-CPU lock & 25.6 & 71.2 \\
- Blind CAS & 12.9 & 35.8 \\
- CAS & 7.0 & 19.4 \\
+ \multicolumn{2}{l}{Clock period}
+ & 0.4 & 1.0 \\
+ \midrule
+ \multicolumn{2}{l}{Same-CPU}
+ & & \\
+ & CAS & 12.2 & 33.8 \\
+ & lock & 25.6 & 71.2 \\
+ \midrule
+ \multicolumn{2}{l}{In-Core}
+ & & \\
+ & Blind CAS & 12.9 & 35.8 \\
+ & CAS & 7.0 & 19.4 \\
\midrule
- Off-Core & & \\
- Blind CAS & 31.2 & 86.6 \\
- CAS & 31.2 & 86.5 \\
+ \multicolumn{2}{l}{Off-Core}
+ & & \\
+ & Blind CAS & 31.2 & 86.6 \\
+ & CAS & 31.2 & 86.5 \\
\midrule
- Off-Socket & & \\
- Blind CAS & 92.4 & 256.7 \\
- CAS & 95.9 & 266.4 \\
+ \multicolumn{2}{l}{Off-Socket}
+ & & \\
+ & Blind CAS & 92.4 & 256.7 \\
+ & CAS & 95.9 & 266.4 \\
\midrule
- Off-System & & \\
- Comms Fabric & 2 600 & 7 220 \\
- Global Comms & 195 000 000 & 542 000 000 \\
+ \multicolumn{2}{l}{Off-System}
+ & & \\
+ & Comms Fabric & 2 600 & 7 220 \\
+ & Global Comms & 195 000 000 & 542 000 000 \\
\bottomrule
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8\,GHz Intel X5550 (Nehalem) System}
@@ -366,38 +395,48 @@ thousand clock cycles.
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
down to and including the two ``Off-core'' rows.
-\begin{table*}
-\rowcolors{1}{}{lightgray}
+\begin{table}
+%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
+\tcresizewidth{
\begin{tabular}
{
- l
+ ll
S[table-format = 9.1]
S[table-format = 9.1]
r
}
\toprule
- Operation & \multicolumn{1}{r}{Cost (ns)}
+ \multicolumn{2}{l}{Operation}
+ & \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}}
& CPUs \\
\midrule
- Clock period & 0.5 & 1.0 & \\
- Same-CPU CAS & 6.2 & 13.6 & 0 \\
- Same-CPU lock & 13.5 & 29.6 & 0 \\
- In-core blind CAS & 6.5 & 14.3 & 6 \\
- In-core CAS & 16.2 & 35.6 & 6 \\
- Off-core blind CAS & 22.2 & 48.8 & 1--5,7--11 \\
- Off-core CAS & 53.6 & 117.9 & 1--5,7--11 \\
+ \multicolumn{2}{l}{Clock period}
+ & 0.5 & 1.0 & \\
+ \midrule
+ \multicolumn{2}{l}{Same-CPU} & & & 0 \\
+ & CAS & 6.2 & 13.6 & \\
+ & lock & 13.5 & 29.6 & \\
+ \midrule
+ \multicolumn{2}{l}{In-Core} & & & 6 \\
+ & Blind CAS & 6.5 & 14.3 & \\
+ & CAS & 16.2 & 35.6 & \\
+ \midrule
+ \multicolumn{2}{l}{Off-Core} & & & 1--5 \\
+ & Blind CAS & 22.2 & 48.8 & 7--11 \\
+ & CAS & 53.6 & 117.9 & \\
\midrule
- Off-System & & & \\
- Comms Fabric & 5 000 & 11 000 & \\
- Global Comms & 195 000 000 & 429 000 000 & \\
+ \multicolumn{2}{l}{Off-System}& & & \\
+ & Comms Fabric & 5 000 & 11 000 & \\
+ & Global Comms & 195 000 000 & 429 000 000 & \\
\bottomrule
\end{tabular}
+}
\caption{CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20\,GHz}
\label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}
-\end{table*}
+\end{table}
Furthermore, newer small-scale single-socket systems such
as the laptop on which I am typing this also have more
diff --git a/perfbook-lt.tex b/perfbook-lt.tex
index 13dd88b32d94..f970212cb194 100644
--- a/perfbook-lt.tex
+++ b/perfbook-lt.tex
@@ -533,6 +533,12 @@
\newcommand\ebFloatBarrier{}
}
+\IfTwoColumn{
+\newcommand{\tcresizewidth}[1]{\resizebox{\columnwidth}{!}{#1}}
+}{
+\newcommand{\tcresizewidth}[1]{#1}
+}
+
% Glossaries dictionary and custom settings
\input{glsdict}
--
2.25.1
* [PATCH -perfbook 2/4] cpu: Use 'on-core' rather than 'in-core'
From: Akira Yokosawa @ 2023-02-14 10:07 UTC (permalink / raw)
To: Paul E. McKenney, Leonardo Bras; +Cc: perfbook, Akira Yokosawa
The antonym of "off-core" should be "on-core" rather than "in-core".
Consistently use "on-core" in the overheads section.
Similarly, say "on-socket" rather than "in-socket".
Also for consistency, replace "single-CPU CAS" with "same-CPU CAS".
Also, the QQz added in commit 34cc066b1d95 ("cpu: Add a QQz on table
E.1") capitalized some of the related words in running text.
Lowercase them for consistency.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
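As a quick sanity check of the figures quoted in the reworded sentence
below, the On-Core rows of the Xeon Platinum 8176 table give the
following differences. Attributing the whole gap to the extra load is a
simplification on my part, not a measurement:

```latex
\[
  t_{\mathrm{CAS}} - t_{\mathrm{blind\ CAS}}
    = 18.0\,\mathrm{ns} - 7.2\,\mathrm{ns}
    = 10.8\,\mathrm{ns}
\]
\[
  r_{\mathrm{CAS}} - r_{\mathrm{blind\ CAS}}
    = 37.7 - 15.2
    = 22.5 \quad \mbox{(clock periods)}
\]
```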
cpu/overheads.tex | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/cpu/overheads.tex b/cpu/overheads.tex
index 7ae99ed6cb7b..af17b3cfdf2f 100644
--- a/cpu/overheads.tex
+++ b/cpu/overheads.tex
@@ -159,7 +159,7 @@ optimization.
& CAS & 7.0 & 14.6 & \\
& lock & 15.4 & 32.3 & \\
\midrule
- \multicolumn{2}{l}{In-Core}
+ \multicolumn{2}{l}{On-Core}
& & & 224 \\
& Blind CAS& 7.2 & 15.2 & \\
& CAS & 18.0 & 37.7 & \\
@@ -223,7 +223,7 @@ The lock operation is more expensive than CAS because it requires two
atomic operations on the lock data structure, one for acquisition and
the other for release.
-In-core operations involving interactions between the hardware threads
+On-core operations involving interactions between the hardware threads
sharing a single core are about the same cost as same-CPU operations.
This should not be too surprising, given that these two hardware threads
also share the full cache hierarchy.
@@ -253,10 +253,10 @@ failing.
The key point is that there are now two accesses to the memory location,
the load and the CAS\@.
-Thus, it is not surprising that in-core blind CAS consumes only about
-seven nanoseconds, while in-core CAS consumes about 18 nanoseconds.
+Thus, it is not surprising that on-core blind CAS consumes only about
+seven nanoseconds, while on-core CAS consumes about 18 nanoseconds.
The non-blind case's extra load does not come for free.
-That said, the overhead of these operations are similar to single-CPU
+That said, the overhead of these operations are similar to same-CPU
CAS and lock, respectively.
\QuickQuiz{
@@ -351,7 +351,7 @@ thousand clock cycles.
& CAS & 12.2 & 33.8 \\
& lock & 25.6 & 71.2 \\
\midrule
- \multicolumn{2}{l}{In-Core}
+ \multicolumn{2}{l}{On-Core}
& & \\
& Blind CAS & 12.9 & 35.8 \\
& CAS & 7.0 & 19.4 \\
@@ -393,7 +393,7 @@ thousand clock cycles.
which represents a much smaller system with only 16~hardware threads.
A similar view is provided by the rows of
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
- down to and including the two ``Off-core'' rows.
+ down to and including the two ``Off-Core'' rows.
\begin{table}
%\rowcolors{1}{}{lightgray}
@@ -420,7 +420,7 @@ thousand clock cycles.
& CAS & 6.2 & 13.6 & \\
& lock & 13.5 & 29.6 & \\
\midrule
- \multicolumn{2}{l}{In-Core} & & & 6 \\
+ \multicolumn{2}{l}{On-Core} & & & 6 \\
& Blind CAS & 6.5 & 14.3 & \\
& CAS & 16.2 & 35.6 & \\
\midrule
@@ -470,7 +470,7 @@ thousand clock cycles.
\QuickQuizE{
\Cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
in the answer to \QuickQuizARef{\QspeedOfLightAtoms} says that
- In-Core CAS is faster than both of Same-CPU CAS and In-Core Blind CAS\@.
+ on-core CAS is faster than both of same-CPU CAS and on-core blind CAS\@.
What is happening there?
}\QuickQuizAnswerE{
I \emph{was} surprised by the data I obtained and did a rigorous
@@ -508,7 +508,7 @@ First, there are only two CPUs within a given core and only 56 within
a given socket, compared to 448 across the system.
Second, as shown in
\cref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz},
-the in-core caches are quite small compared to the in-socket caches, which
+the on-core caches are quite small compared to the on-socket caches, which
are in turn quite small compared to the 1.4\,TB of memory configured on
this system.
Third, again referring to the figure, the caches are organized as
--
2.25.1
* [PATCH -perfbook 3/4] cpu: Add page reference to Table E.1 in QQz 3.8
From: Akira Yokosawa @ 2023-02-14 10:08 UTC (permalink / raw)
To: Paul E. McKenney, Leonardo Bras; +Cc: perfbook, Akira Yokosawa
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
cpu/overheads.tex | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/cpu/overheads.tex b/cpu/overheads.tex
index af17b3cfdf2f..ba4a33a4c45c 100644
--- a/cpu/overheads.tex
+++ b/cpu/overheads.tex
@@ -469,8 +469,10 @@ thousand clock cycles.
%
\QuickQuizE{
\Cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
- in the answer to \QuickQuizARef{\QspeedOfLightAtoms} says that
- on-core CAS is faster than both of same-CPU CAS and on-core blind CAS\@.
+ in the answer to \QuickQuizARef{\QspeedOfLightAtoms} on
+ \cpageref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
+ says that on-core CAS is faster than both of same-CPU CAS and
+ on-core blind CAS\@.
What is happening there?
}\QuickQuizAnswerE{
I \emph{was} surprised by the data I obtained and did a rigorous
--
2.25.1
* [PATCH -perfbook 4/4] toolsoftrade: Fix QQz macro in QQz series
From: Akira Yokosawa @ 2023-02-14 10:09 UTC (permalink / raw)
To: Paul E. McKenney, Leonardo Bras; +Cc: perfbook, Akira Yokosawa
The final QQz in a QQz series should start with "\QuickQuizE{" rather
than "\QuickQuizM{".
This typo resulted in an extra "," in the anchor box:
+----------------------+
| QQ 4.20, 4.21, 4.22, |
+----------------------+
Fix it.
Fixes: a39fae0f30a9 ("treewide: Use macros for consecutive quick quizzes")
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
toolsoftrade/toolsoftrade.tex | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/toolsoftrade/toolsoftrade.tex b/toolsoftrade/toolsoftrade.tex
index fee34ae8a1df..8c9384049e75 100644
--- a/toolsoftrade/toolsoftrade.tex
+++ b/toolsoftrade/toolsoftrade.tex
@@ -947,7 +947,7 @@ in fact degraded by about 10\,\% from ideal.
sections are described in \cref{chp:Deferred Processing}.
}\QuickQuizEndM
%
-\QuickQuizM{
+\QuickQuizE{
The system used is a few years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
--
2.25.1
* Re: [PATCH -perfbook 0/4] cpu/overheads: Make tables more consistent
From: Paul E. McKenney @ 2023-02-14 23:13 UTC (permalink / raw)
To: Akira Yokosawa; +Cc: Leonardo Bras, perfbook
On Tue, Feb 14, 2023 at 07:03:12PM +0900, Akira Yokosawa wrote:
> Hi,
>
> This patch set is a follow-up to commit 34cc066b1d95 ("cpu: Add a QQz
> on table E.1").
>
> Leo posted a patch to improve Table E.1 at [1].
>
> I have gone further to improve consistency among tables on overheads
> of atomic operations.
>
> Patch 1/4 changes the layout of those tables and adjusts upper-/
> lower-casing (with Reported-by: from Leo).
>
> I found the use of "In-Core" and "Off-Core" in the tables somewhat
> strange. Usually, I find pairs of "On/Off" and "In/Out" more natural.
> In overheads.tex, "On-Core" vs "Off-Core" looks better to me.
>
> So Patch 2/4 replaces "in-core" with "on-core", and does similar
> replacements.
>
> Patch 3/4 adds a page reference in QQz 3.8.
>
> Patch 4/4 is an independent change fixing an issue in -nq builds.
> I noticed an extra "," in the anchor box containing QQz 4.20, 4.21,
> and 4.22 while skimming through QQz's in those builds.
>
> [1]: https://www.spinics.net/lists/perfbook/msg03827.html
Queued and pushed, thank you!
Thanx, Paul
> Thanks, Akira
> --
> Akira Yokosawa (4):
> cpu: Improve layout and consistency of Tables 3.1, E.1, and E.2
> cpu: Use 'on-core' rather than 'in-core'
> cpu: Add page reference to Table E.1 in QQz 3.8
> toolsoftrade: Fix QQz macro in QQz series
>
> cpu/overheads.tex | 161 +++++++++++++++++++++-------------
> perfbook-lt.tex | 6 ++
> toolsoftrade/toolsoftrade.tex | 2 +-
> 3 files changed, 108 insertions(+), 61 deletions(-)
>
>
> base-commit: 50bb4adf51f683ba2c7d90f269632b6f4b4d4704
> --
> 2.25.1
>