* [PATCH v6 0/3] x86/fpu Measure the Latency of XSAVES and
@ 2023-09-01 14:34 Yi Sun
2023-09-01 14:34 ` [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS Yi Sun
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Yi Sun @ 2023-09-01 14:34 UTC (permalink / raw)
To: dave.hansen, tglx, linux-kernel, x86
Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, dave.hansen,
yi.sun, Yi Sun
This series implements a solution for measuring the latency of the
XSAVES/XRSTORS instructions on x86 processors. The motivation is to
provide a way to identify potential bottlenecks and anomalies in the
implementation of these instructions on different x86 hardware
platforms, to spot when the cost of XSAVES/XRSTORS is increasing or
decreasing, and to capture the RFBM and XINUSE values to help figure
out the reason if there are any issues.
The solution consists of two parts:
1. Two new tracepoints that calculate delta TSC as the latency.
2. A user-space tool that consumes trace logs and shows results.
We conducted extensive internal testing and compared several latency
computation methodologies; we believe there is still room for
improvement. The script for consuming trace logs uses sqlite3 to
display statistics such as the maximum and minimum latency, the 97%
tail latency, etc.
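As a rough sketch of what the consuming script computes, the statistics can be pictured in a few lines of illustrative Python (the series itself uses sqlite3; the 97% cut-off below mimics the script's round-half-up via bc, and all names here are this sketch's own, not part of the series):

```python
def latency_stats(samples, tail=0.97):
    """Compute min/max/avg plus the average of the lowest `tail`
    fraction of samples; dropping the top 3% discards outliers
    caused by interrupts or other noise."""
    ordered = sorted(samples)
    # Round half up, as the script's bc expression does.
    keep = max(1, int(len(ordered) * tail + 0.5))
    kept = ordered[:keep]
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": round(sum(ordered) / len(ordered)),
        "tail_avg": round(sum(kept) / len(kept)),
    }

# 99 ordinary samples plus one interrupt-inflated outlier:
stats = latency_stats([100] * 99 + [10000])
```

Here the plain average (199) is dragged up by the single outlier, while the 97% tail average stays at 100, which is why the script reports both.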
- Change from v5 to v6:
- Rebase patch to the most recent tag v6.5-rc7, no conflict.
- Add a column 'count' to the output of the script.
- Change from v4 to v5:
- Rebase patch to the most recent tag v6.1-rc2, no conflict.
- Change from v3 to v4:
- Rebase patch set to the latest tag v6.0-rc3, replacing
ALTERNATIVE_2 with ALTERNATIVE_3
- Change from v2 to v3:
- Move the script xsave-latency-trace.sh to folder tools/testing/fpu/
(Ingo Molnar)
- Change from v1 to v2:
- Reword the comments. (Laight, David; Mehta, Sohil)
- Rename all variable 'dtsc' to 'latency'. (Mehta, Sohil)
Yi Sun (3):
x86/fpu: Measure the Latency of XSAVE and XRSTOR
tools/testing/fpu: Add script to consume trace log of xsave latency
tools/testing/fpu: Add a 'count' column.
arch/x86/include/asm/trace/fpu.h | 35 ++++
arch/x86/kernel/fpu/xstate.h | 49 ++++-
tools/testing/fpu/xsave-latency-trace.sh | 234 +++++++++++++++++++++++
3 files changed, 313 insertions(+), 5 deletions(-)
create mode 100755 tools/testing/fpu/xsave-latency-trace.sh
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-01 14:34 [PATCH v6 0/3] x86/fpu Measure the Latency of XSAVES and Yi Sun
@ 2023-09-01 14:34 ` Yi Sun
2023-09-02 10:49 ` Ingo Molnar
2023-09-01 14:34 ` [PATCH v6 2/3] tools/testing/fpu: Add script to consume trace log of xsaves latency Yi Sun
2023-09-01 14:34 ` [PATCH v6 3/3] tools/testing/fpu: Add a 'count' column Yi Sun
2 siblings, 1 reply; 12+ messages in thread
From: Yi Sun @ 2023-09-01 14:34 UTC (permalink / raw)
To: dave.hansen, tglx, linux-kernel, x86
Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, dave.hansen,
yi.sun, Yi Sun
Add two trace points, x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
The latency dumped by the new trace points can tell when XSAVES/XRSTORS
are getting more or less expensive, and also records the RFBM
(requested-feature bitmap) and XINUSE so the reason can be worked out.
Calculate the latency of the XSAVES and XRSTORS instructions within a
single trace event each. Another option considered was to have two
separate trace events marking the start and finish of XSAVES/XRSTORS,
with the latency calculated from the two trace points in user space;
however, that added significant overhead from the trace function
itself. In internal testing, the single-trace-point option implemented
here proved to avoid most of the overhead introduced by the trace
function.
Make use of trace_clock() to calculate the latency; it is based on
cpu_clock(), with at most ~1 jiffy of skew between CPUs.
CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are required, and the
compiler will get rid of all the extra cruft when either of the two
configs is disabled.
If both configs are enabled, xsave/xrstor_tracing_enabled() reduces
to a static check that tracing is enabled. Thus, in the fast path
there are only two additional static checks.
Since trace points can be enabled dynamically, a trace event could be
enabled concurrently while the code is checking
tracepoint_enabled(trace_event). Hence there is a small chance of a
single noisy result, 'trace_clock() - (-1)', at the moment the
x86_fpu_latency_* trace points are enabled. Leave the noise there
rather than adding extra conditions around the x86_fpu_latency_*
calls; guarding against a one-off value is not worth it, and it is
easy to filter out in the consuming script or any other user-space
tool.
The trace log looks like the following:
x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202
x86_fpu_latency_xrstor: x86/fpu: latency:99 RFBM:0x202e7 XINUSE:0x202
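Lines in this format can be picked apart, and the one-off enable-time noise dropped, along the following lines (an illustrative Python sketch; the regex, helper names, and the sanity bound are this sketch's own choices, not part of the patch):

```python
import re

# Matches the tail of a line as emitted by the tracepoints above.
PAT = re.compile(
    r"(\w+): x86/fpu: latency:(\d+) RFBM:(0x[0-9a-f]+) XINUSE:(0x[0-9a-f]+)$"
)

def parse(line):
    """Return (event, latency, rfbm, xinuse) from one trace line."""
    m = PAT.search(line)
    if not m:
        return None
    event, lat, rfbm, xinuse = m.groups()
    return event, int(lat), int(rfbm, 16), int(xinuse, 16)

def drop_enable_noise(latencies, bound=10**9):
    # If the tracepoint was flipped on between the two enabled
    # checks, tc stayed -1 (huge as u64) and the sample is absurdly
    # large; drop anything above an arbitrary sanity bound.
    return [lat for lat in latencies if lat < bound]

rec = parse("x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202")
clean = drop_enable_noise([100, 99, 2**64 - 150])
```

This is essentially what the script in patch 2 does with a bash regex and sqlite3.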
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 4645a6334063..5f7cb633df09 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,41 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
TP_ARGS(fpu)
);
+DECLARE_EVENT_CLASS(x86_fpu_latency,
+ TP_PROTO(struct fpstate *fpstate, u64 latency),
+ TP_ARGS(fpstate, latency),
+
+ TP_STRUCT__entry(
+ __field(struct fpstate *, fpstate)
+ __field(u64, latency)
+ __field(u64, rfbm)
+ __field(u64, xinuse)
+ ),
+
+ TP_fast_assign(
+ __entry->fpstate = fpstate;
+ __entry->latency = latency;
+ __entry->rfbm = fpstate->xfeatures;
+ __entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+ ),
+
+ TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+ __entry->latency,
+ __entry->rfbm,
+ __entry->xinuse
+ )
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xsave,
+ TP_PROTO(struct fpstate *fpstate, u64 latency),
+ TP_ARGS(fpstate, latency)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xrstor,
+ TP_PROTO(struct fpstate *fpstate, u64 latency),
+ TP_ARGS(fpstate, latency)
+);
+
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH asm/trace/
#undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..5a8321d9f7db 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -5,6 +5,9 @@
#include <asm/cpufeature.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/xcr.h>
+#include <asm/trace/fpu.h>
+
+#include <linux/trace_clock.h>
#ifdef CONFIG_X86_64
DECLARE_PER_CPU(u64, xfd_state);
@@ -68,6 +71,20 @@ static inline u64 xfeatures_mask_independent(void)
return XFEATURE_MASK_INDEPENDENT;
}
+static inline bool xsave_tracing_enabled(void)
+{
+ if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+ return false;
+ return tracepoint_enabled(x86_fpu_latency_xsave);
+}
+
+static inline bool xrstor_tracing_enabled(void)
+{
+ if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+ return false;
+ return tracepoint_enabled(x86_fpu_latency_xrstor);
+}
+
/* XSAVE/XRSTOR wrapper functions */
#ifdef CONFIG_X86_64
@@ -113,7 +130,7 @@ static inline u64 xfeatures_mask_independent(void)
* original instruction which gets replaced. We need to use it here as the
* address of the instruction where we might get an exception at.
*/
-#define XSTATE_XSAVE(st, lmask, hmask, err) \
+#define __XSTATE_XSAVE(st, lmask, hmask, err) \
asm volatile(ALTERNATIVE_3(XSAVE, \
XSAVEOPT, X86_FEATURE_XSAVEOPT, \
XSAVEC, X86_FEATURE_XSAVEC, \
@@ -126,11 +143,22 @@ static inline u64 xfeatures_mask_independent(void)
: "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
: "memory")
+#define XSTATE_XSAVE(fps, lmask, hmask, err) \
+ do { \
+ struct fpstate *f = fps; \
+ u64 tc = -1; \
+ if (xsave_tracing_enabled()) \
+ tc = trace_clock(); \
+ __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
+ if (xsave_tracing_enabled()) \
+ trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
+ } while (0)
+
/*
* Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
* XSAVE area format.
*/
-#define XSTATE_XRESTORE(st, lmask, hmask) \
+#define __XSTATE_XRESTORE(st, lmask, hmask) \
asm volatile(ALTERNATIVE(XRSTOR, \
XRSTORS, X86_FEATURE_XSAVES) \
"\n" \
@@ -140,6 +168,17 @@ static inline u64 xfeatures_mask_independent(void)
: "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
: "memory")
+#define XSTATE_XRESTORE(fps, lmask, hmask) \
+ do { \
+ struct fpstate *f = fps; \
+ u64 tc = -1; \
+ if (xrstor_tracing_enabled()) \
+ tc = trace_clock(); \
+ __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
+ if (xrstor_tracing_enabled()) \
+ trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
+ } while (0)
+
#if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
#else
@@ -184,7 +223,7 @@ static inline void os_xsave(struct fpstate *fpstate)
WARN_ON_FPU(!alternatives_patched);
xfd_validate_state(fpstate, mask, false);
- XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
+ XSTATE_XSAVE(fpstate, lmask, hmask, err);
/* We should never fault when copying to a kernel buffer: */
WARN_ON_FPU(err);
@@ -201,7 +240,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
u32 hmask = mask >> 32;
xfd_validate_state(fpstate, mask, true);
- XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+ XSTATE_XRESTORE(fpstate, lmask, hmask);
}
/* Restore of supervisor state. Does not require XFD */
@@ -211,7 +250,7 @@ static inline void os_xrstor_supervisor(struct fpstate *fpstate)
u32 lmask = mask;
u32 hmask = mask >> 32;
- XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+ XSTATE_XRESTORE(fpstate, lmask, hmask);
}
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v6 2/3] tools/testing/fpu: Add script to consume trace log of xsaves latency
2023-09-01 14:34 [PATCH v6 0/3] x86/fpu Measure the Latency of XSAVES and Yi Sun
2023-09-01 14:34 ` [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS Yi Sun
@ 2023-09-01 14:34 ` Yi Sun
2023-09-01 14:34 ` [PATCH v6 3/3] tools/testing/fpu: Add a 'count' column Yi Sun
2 siblings, 0 replies; 12+ messages in thread
From: Yi Sun @ 2023-09-01 14:34 UTC (permalink / raw)
To: dave.hansen, tglx, linux-kernel, x86
Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, dave.hansen,
yi.sun, Yi Sun
Consume the trace log dumped by the tracepoints x86_fpu_latency_xsave
and x86_fpu_latency_xrstor, and calculate latency ranges for each
RFBM/XINUSE combination, including min, max, average and the 97% tail
latency. The average of the 97% tail latency is added to remove
unreasonable data introduced by interrupts or other noise; with
experimental code that disables interrupts before the latency
calculation, it is apparent that the top 3% of samples are the ones
being filtered out.
Make use of sqlite3 to make the data statistics more efficient and
concise. The output looks as follows:
EVENTs RFBM XINUSE lat_min lat_max lat_avg lat_avg(97%)
---------------------- ------- ------ ------- ------- ------- ------------
x86_fpu_latency_xrstor 0x206e7 0x0 364 364 364 364
x86_fpu_latency_xrstor 0x206e7 0x202 112 1152 300 276
x86_fpu_latency_xsave 0x206e7 0x202 80 278 141 137
x86_fpu_latency_xsave 0x206e7 0x246 108 234 180 177
The XSAVES/XRSTORS latency trace logs can be obtained in two ways:
1. Generated by Kernel debugfs
echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable
cat /sys/kernel/debug/tracing/trace_pipe > trace-log
2. Generated by helper tool like 'trace-cmd'
trace-cmd record -e x86_fpu -F <command>
trace-cmd report > trace-log
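The per-combination bucketing the script performs with sqlite3's GROUP BY can be pictured with a small in-memory sketch (illustrative Python, not the tool itself; record layout and names are assumptions of this sketch):

```python
from collections import defaultdict

def group_latencies(records):
    """records: iterable of (event, rfbm, xinuse, latency) tuples.
    Returns {(event, rfbm, xinuse): [latencies, ...]} -- the same
    per-combination buckets the script builds one table per combo
    for before computing its statistics."""
    groups = defaultdict(list)
    for event, rfbm, xinuse, lat in records:
        groups[(event, rfbm, xinuse)].append(lat)
    return dict(groups)

groups = group_latencies([
    ("x86_fpu_latency_xsave",  0x206e7, 0x202, 80),
    ("x86_fpu_latency_xsave",  0x206e7, 0x202, 278),
    ("x86_fpu_latency_xrstor", 0x206e7, 0x0,   364),
])
```

Each bucket then feeds the min/max/avg/tail computation, one output row per (event, RFBM, XINUSE) key.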
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>
diff --git a/tools/testing/fpu/xsave-latency-trace.sh b/tools/testing/fpu/xsave-latency-trace.sh
new file mode 100755
index 000000000000..d45563984fd6
--- /dev/null
+++ b/tools/testing/fpu/xsave-latency-trace.sh
@@ -0,0 +1,227 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# (c) 2022 Yi Sun <yi.sun@intel.com>
+
+trace_log=$1
+trace_lat_log=".trace_lat_log"
+db_name="db_trace"
+db_file="${db_name}.db"
+table_raw="t_trace"
+table_tail="t_trace_tail"
+table_results="t_results"
+events="x86_fpu_latency_xsave|x86_fpu_latency_xrstor"
+
+# The regex for the trace log. The rough pattern:
+# (proc) (No.cpu) (flags) (timestamp): (tracepoint): latency:(123) RFBM:0x(123) XINUSE:0x(123)$
+# Fold the regex into 3 parts making it easier to read.
+regex1="([^\ ]*)[[:space:]]*\[([0-9]+)\][[:space:]]*(.....\ )?[[:space:]]*"
+regex2="([0-9.]*):[[:space:]]*([^\ :]*):.*latency:([0-9]*)[[:space:]]*"
+regex3="RFBM:(0x[0-9a-f]*)[[:space:]]*XINUSE:(0x[0-9a-f]*)$"
+
+function usage() {
+ echo "This script consumes the tracepoint data, and dumps out the"
+ echo "latency ranges for each RFBM combination."
+ echo "Usage:"
+ echo "$0 <trace-log>"
+ echo " trace-log:"
+ echo " Either generated by Kernel sysfs:"
+ echo " echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable"
+ echo " cat /sys/kernel/debug/tracing/trace_pipe > trace-log"
+ echo ""
+ echo " Or generate by helper tool like 'trace-cmd':"
+ echo " trace-cmd record -e x86_fpu"
+ echo " trace-cmd report > trace-log"
+}
+
+# Check the dependent tools
+# {@}: a list of third-party tools
+function check_packages() {
+ for pack in "$@"; do
+ which $pack >& /dev/null
+ if [[ $? != 0 ]]; then
+ echo "Please install $pack before running this script."
+ exit 1
+ fi
+ done
+}
+
+# Run SQL command with sqlite3
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD() {
+ sqlite3 $db_file "$*"
+}
+
+# Run SQL command with sqlite3 and format the output with headers and column.
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD_HEADER() {
+ sqlite3 -column -header $db_file "$*"
+}
+
+# Create a table in the DB
+# ${1}: name of the table
+function create_table() {
+ if [[ "$1" == "" ]]; then
+ echo "Empty table name!"
+ exit 1
+ fi
+ SQL_CMD "create table $1 (
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
+ process TEXT,
+ cpu INT,
+ timestamp FLOAT,
+ event_name TEXT,
+ lat INT,
+ RFBM INT,
+ XINUSE INT);"
+}
+
+# Round to the nearest whole number
+# ${1}: a float number
+# Output: integer
+function round() {
+ echo "scale=0; ($1+0.5)/1" | bc
+}
+
+# Insert a record in the trace table
+#
+# process cpu timestamp event_name lat RFBM XINUSE
+# $2 $3 $4 $5 $6 $7 $8
+
+function insert_line() {
+ if [[ "$1" == "" ]]; then
+ echo "Empty table name!"
+ exit 1
+ fi
+ SQL_CMD "INSERT INTO $1 (process, cpu, timestamp, event_name, lat, RFBM, XINUSE)
+ VALUES (\"$2\", $3, $4, \"$5\", $6, $7, $8);"
+}
+
+# Show the results of the trace statistics
+function get_latency_stat() {
+ SQL_CMD "create table $table_results (
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
+ event_name TEXT,
+ RFBM INT,
+ XINUSE INT,
+ lat_min INT,
+ lat_max INT,
+ lat_avg INT,
+ lat_tail_avg INT);"
+
+ for((i=0; i<$cnt; i++));do
+ event_name=`get_comb_item $table_raw $i event_name`
+ RFBM=`get_comb_item $table_raw $i RFBM`
+ XINUSE=`get_comb_item $table_raw $i XINUSE`
+ lat_min=`get_comb_item $table_raw $i min\(lat\)`
+ lat_max=`get_comb_item $table_raw $i max\(lat\)`
+ lat_avg=`get_comb_item $table_raw $i avg\(lat\)`
+ lat_tail_avg=`get_comb_item $table_tail $i avg\(lat\)`
+
+ lat_avg=`round $lat_avg`
+ lat_tail_avg=`round $lat_tail_avg`
+
+ SQL_CMD "INSERT INTO $table_results
+ (event_name, RFBM,XINUSE, lat_min, lat_max, lat_avg, lat_tail_avg)
+ VALUES (\"$event_name\", $RFBM, $XINUSE, $lat_min, $lat_max,
+ $lat_avg, $lat_tail_avg);"
+ done
+
+ SQL_CMD_HEADER "select event_name[EVENTs],printf('0x%x',RFBM)[RFBM],
+ printf('0x%x',XINUSE)[XINUSE],lat_min,lat_max,lat_avg,
+ lat_tail_avg[lat_avg(97%)]
+ from $table_results;"
+}
+
+# Get the count of the combinations of event_name, RFBM, XINUSE among all latency trace records
+function get_combs_cnt() {
+ SQL_CMD "SELECT event_name, RFBM, XINUSE from $table_raw
+ group by event_name,RFBM,XINUSE;" | wc -l
+}
+
+# Get a specified combination from a table
+# ${1}: name of table
+# ${2}: the index of the combination of event_name, RFBM, XINUSE
+# ${3}: the columns to be shown
+function get_comb_item() {
+ table=$1
+ cnt=$2
+ col=$3
+ SQL_CMD "SELECT $col from $table group by event_name,RFBM,XINUSE limit $cnt,1;"
+}
+
+# Get count of the records in a given table
+# ${1}: name of the table
+function get_rows_cnt() {
+ table=$1
+ SQL_CMD "SELECT count(*) from $table;"
+}
+
+# Generate a new table from the raw trace table, removing the top 3% of traces.
+function gen_tail_lat() {
+ cnt=`get_combs_cnt`
+ create_table $table_tail
+
+ for((i=0; i<$cnt; i++));do
+ create_table t$i
+ event_name=`get_comb_item $table_raw $i event_name`
+ RFBM=`get_comb_item $table_raw $i RFBM`
+ XINUSE=`get_comb_item $table_raw $i XINUSE`
+
+ SQL_CMD "insert into t$i(process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+ select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+ from $table_raw where event_name=\"$event_name\" and RFBM=$RFBM and
+ XINUSE=$XINUSE ORDER BY lat ASC;"
+
+ row=`get_rows_cnt t$i`
+ row=`echo "scale=0; ($row*0.97 + 0.5)/1" | bc`
+
+ SQL_CMD "insert into $table_tail
+ (process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+ select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+ from t$i limit 0,$row;"
+ done
+
+}
+
+if [[ ! -e "$trace_log" || $# != 1 ]];then
+ usage
+ exit 1
+fi
+
+# Check dependency
+# Make sure having following packages
+check_packages sqlite3 bc wc cut
+
+# Filter trace log keeping latency related lines only
+grep -E "$events" $trace_log > $trace_lat_log
+cnt_lines=`wc -l $trace_lat_log | cut -d' ' -f1`
+# Remove the old db file if it existed before creating
+[[ -f $db_file ]] && rm -rf $db_file
+
+create_table $table_raw
+
+# Read each line from the temp file and insert into the table
+i=0
+while IFS= read -r line;
+do
+ ((i = i + 1))
+ echo -ne "(${i}/$cnt_lines) Importing trace log into database!\r"
+ if [[ "$line" =~ ${regex1}${regex2}${regex3} ]]; then
+ pname=${BASH_REMATCH[1]}
+ cpu=${BASH_REMATCH[2]}
+ ts=${BASH_REMATCH[4]}
+ ename=${BASH_REMATCH[5]}
+ lat=${BASH_REMATCH[6]}
+ ((rfbm=${BASH_REMATCH[7]}))
+ ((xinuse=${BASH_REMATCH[8]}))
+
+ insert_line $table_raw $pname $cpu $ts $ename $lat $rfbm $xinuse
+ fi
+done < $trace_lat_log
+
+gen_tail_lat
+get_latency_stat
+
+# Cleanup
+rm -rf $trace_lat_log $db_file
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v6 3/3] tools/testing/fpu: Add a 'count' column.
2023-09-01 14:34 [PATCH v6 0/3] x86/fpu Measure the Latency of XSAVES and Yi Sun
2023-09-01 14:34 ` [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS Yi Sun
2023-09-01 14:34 ` [PATCH v6 2/3] tools/testing/fpu: Add script to consume trace log of xsaves latency Yi Sun
@ 2023-09-01 14:34 ` Yi Sun
2 siblings, 0 replies; 12+ messages in thread
From: Yi Sun @ 2023-09-01 14:34 UTC (permalink / raw)
To: dave.hansen, tglx, linux-kernel, x86
Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, dave.hansen,
yi.sun, Yi Sun
Show the total number of records for each event/RFBM/XINUSE
combination, so users can identify which data is noise and focus on
the useful data.
A sample of the output is as follows:
EVENTs RFBM XINUSE CNT min max avg avg(97%)
---...------- ------- ------- ------ ---- ---- ---- --------
x86..._xrstor 0x20ee7 0x0 1 xx xx xx xx
x86..._xrstor 0x20ee7 0x2a2 997860 xx xxx xx xx
x86..._xsave 0x20ee7 0x2a2 45 xx xx xx xx
x86..._xsave 0x20ee7 0x2e6 146613 xx xxxx xx xx
x86..._xsave 0x20ee7 0x202a2 31555 xx xxx xx xx
x86..._xsave 0x60ee7 0x602a2 818608 xx xxxx xx xx
x86..._xsave 0x60ee7 0x602a3 1 xx xx xx xx
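A reader of the new CNT column would, for example, dismiss the single-sample rows above as noise; a sketch of that filtering (illustrative Python, with an arbitrary threshold chosen for the example):

```python
from collections import Counter

def flag_noise(combos, min_count=10):
    """combos: one (event, rfbm, xinuse) key per trace record.
    Returns the combinations seen fewer than min_count times,
    which a user reading the CNT column would likely ignore."""
    counts = Counter(combos)
    return {combo for combo, n in counts.items() if n < min_count}

noise = flag_noise(
    [("x86_fpu_latency_xrstor", 0x20ee7, 0x0)]
    + [("x86_fpu_latency_xrstor", 0x20ee7, 0x2a2)] * 50
)
```

The lone XINUSE=0x0 record is flagged, while the well-populated combination is kept for analysis.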
Signed-off-by: Yi Sun <yi.sun@intel.com>
diff --git a/tools/testing/fpu/xsave-latency-trace.sh b/tools/testing/fpu/xsave-latency-trace.sh
index d45563984fd6..b2f7c3d0dd65 100755
--- a/tools/testing/fpu/xsave-latency-trace.sh
+++ b/tools/testing/fpu/xsave-latency-trace.sh
@@ -99,11 +99,14 @@ function insert_line() {
# Show the results of the trace statistics
function get_latency_stat() {
+ cnt=`get_combs_cnt`
+
SQL_CMD "create table $table_results (
id INTEGER PRIMARY KEY AUTOINCREMENT,
event_name TEXT,
RFBM INT,
XINUSE INT,
+ CNT INT,
lat_min INT,
lat_max INT,
lat_avg INT,
@@ -121,14 +124,18 @@ function get_latency_stat() {
lat_avg=`round $lat_avg`
lat_tail_avg=`round $lat_tail_avg`
+ count=`SQL_CMD "SELECT count(*) from $table_raw
+ where event_name=\"$event_name\" and RFBM=$RFBM and
+ XINUSE=$XINUSE;"`
+
SQL_CMD "INSERT INTO $table_results
- (event_name, RFBM,XINUSE, lat_min, lat_max, lat_avg, lat_tail_avg)
- VALUES (\"$event_name\", $RFBM, $XINUSE, $lat_min, $lat_max,
+ (event_name,RFBM,XINUSE, CNT, lat_min, lat_max, lat_avg, lat_tail_avg)
+ VALUES (\"$event_name\", $RFBM, $XINUSE, $count, $lat_min, $lat_max,
$lat_avg, $lat_tail_avg);"
done
SQL_CMD_HEADER "select event_name[EVENTs],printf('0x%x',RFBM)[RFBM],
- printf('0x%x',XINUSE)[XINUSE],lat_min,lat_max,lat_avg,
+ printf('0x%x',XINUSE)[XINUSE],CNT,lat_min,lat_max,lat_avg,
lat_tail_avg[lat_avg(97%)]
from $table_results;"
}
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-01 14:34 ` [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS Yi Sun
@ 2023-09-02 10:49 ` Ingo Molnar
2023-09-02 19:09 ` Andi Kleen
2023-09-06 8:47 ` Yi Sun
0 siblings, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2023-09-02 10:49 UTC (permalink / raw)
To: Yi Sun
Cc: dave.hansen, tglx, linux-kernel, x86, sohil.mehta, ak,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
* Yi Sun <yi.sun@intel.com> wrote:
> +#define XSTATE_XSAVE(fps, lmask, hmask, err) \
> + do { \
> + struct fpstate *f = fps; \
> + u64 tc = -1; \
> + if (xsave_tracing_enabled()) \
> + tc = trace_clock(); \
> + __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
> + if (xsave_tracing_enabled()) \
> + trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
> + } while (0)
> +
> /*
> * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
> * XSAVE area format.
> */
> -#define XSTATE_XRESTORE(st, lmask, hmask) \
> +#define __XSTATE_XRESTORE(st, lmask, hmask) \
> asm volatile(ALTERNATIVE(XRSTOR, \
> XRSTORS, X86_FEATURE_XSAVES) \
> "\n" \
> @@ -140,6 +168,17 @@ static inline u64 xfeatures_mask_independent(void)
> : "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
> : "memory")
>
> +#define XSTATE_XRESTORE(fps, lmask, hmask) \
> + do { \
> + struct fpstate *f = fps; \
> + u64 tc = -1; \
> + if (xrstor_tracing_enabled()) \
> + tc = trace_clock(); \
> + __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
> + if (xrstor_tracing_enabled()) \
> + trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
> + } while (0)
> +
> #if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
> extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
> #else
> @@ -184,7 +223,7 @@ static inline void os_xsave(struct fpstate *fpstate)
> WARN_ON_FPU(!alternatives_patched);
> xfd_validate_state(fpstate, mask, false);
>
> - XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
> + XSTATE_XSAVE(fpstate, lmask, hmask, err);
>
> /* We should never fault when copying to a kernel buffer: */
> WARN_ON_FPU(err);
> @@ -201,7 +240,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
> u32 hmask = mask >> 32;
>
> xfd_validate_state(fpstate, mask, true);
> - XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
> + XSTATE_XRESTORE(fpstate, lmask, hmask);
> }
Instead of adding overhead to the regular FPU context saving/restoring code
paths, could you add a helper function that has tracing code included, but
which isn't otherwise used - and leave the regular code with no tracing
overhead?
This puts a bit of a long-term maintenance focus on making sure that the
traced functionality won't bitrot, but I'd say that's preferable to adding
tracing overhead.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-02 10:49 ` Ingo Molnar
@ 2023-09-02 19:09 ` Andi Kleen
2023-09-06 9:18 ` Yi Sun
2023-09-06 8:47 ` Yi Sun
1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2023-09-02 19:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Yi Sun, dave.hansen, tglx, linux-kernel, x86, sohil.mehta,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
> Instead of adding overhead to the regular FPU context saving/restoring code
> paths, could you add a helper function that has tracing code included, but
> which isn't otherwise used - and leave the regular code with no tracing
> overhead?
>
> This puts a bit of a long-term maintenance focus on making sure that the
> traced functionality won't bitrot, but I'd say that's preferable to adding
> tracing overhead.
Or just use PT
% sudo perf record --kcore -e intel_pt/cyc=1,cyc_thresh=1/k --filter 'filter save_fpregs_to_fpstate' -a sleep 5
% sudo perf script --insn-trace --xed -F -comm,-tid,-dso,-sym,-symoff,+ipc
[000] 677203.751913565: ffffffffa7046230 nopw %ax, (%rax)
[000] 677203.751913565: ffffffffa7046234 nopl %eax, (%rax,%rax,1)
[000] 677203.751913565: ffffffffa7046239 mov %rdi, %rcx
[000] 677203.751913565: ffffffffa704623c nopl %eax, (%rax,%rax,1)
[000] 677203.751913565: ffffffffa7046241 movq 0x10(%rdi), %rsi
[000] 677203.751913565: ffffffffa7046245 movq 0x8(%rsi), %rax
[000] 677203.751913565: ffffffffa7046249 leaq 0x40(%rsi), %rdi
[000] 677203.751913565: ffffffffa704624d mov %rax, %rdx
[000] 677203.751913565: ffffffffa7046250 shr $0x20, %rdx
[000] 677203.751913565: ffffffffa7046254 xsaves64 (%rdi)
[000] 677203.751913565: ffffffffa7046258 xor %edi, %edi
[000] 677203.751913565: ffffffffa704625a movq 0x10(%rcx), %rax
[000] 677203.751913565: ffffffffa704625e testb $0xc0, 0x240(%rax)
[000] 677203.751913636: ffffffffa7046265 jz 0xffffffffa7046285 IPC: 0.16 (14/85)
...
So it took 85 cycles here.
(it includes a few extra instructions, but I bet they're less than what
ftrace adds. This example is for XSAVE, but can be similarly extended
for XRSTOR)
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-02 10:49 ` Ingo Molnar
2023-09-02 19:09 ` Andi Kleen
@ 2023-09-06 8:47 ` Yi Sun
2023-09-15 9:54 ` Ingo Molnar
1 sibling, 1 reply; 12+ messages in thread
From: Yi Sun @ 2023-09-06 8:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: dave.hansen, tglx, linux-kernel, x86, sohil.mehta, ak,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
On 02.09.2023 12:49, Ingo Molnar wrote:
>
>* Yi Sun <yi.sun@intel.com> wrote:
>
>> +#define XSTATE_XSAVE(fps, lmask, hmask, err) \
>> + do { \
>> + struct fpstate *f = fps; \
>> + u64 tc = -1; \
>> + if (xsave_tracing_enabled()) \
>> + tc = trace_clock(); \
>> + __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
>> + if (xsave_tracing_enabled()) \
>> + trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
>> + } while (0)
>> +
>> /*
>> * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
>> * XSAVE area format.
>> */
>> -#define XSTATE_XRESTORE(st, lmask, hmask) \
>> +#define __XSTATE_XRESTORE(st, lmask, hmask) \
>> asm volatile(ALTERNATIVE(XRSTOR, \
>> XRSTORS, X86_FEATURE_XSAVES) \
>> "\n" \
>> @@ -140,6 +168,17 @@ static inline u64 xfeatures_mask_independent(void)
>> : "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
>> : "memory")
>>
>> +#define XSTATE_XRESTORE(fps, lmask, hmask) \
>> + do { \
>> + struct fpstate *f = fps; \
>> + u64 tc = -1; \
>> + if (xrstor_tracing_enabled()) \
>> + tc = trace_clock(); \
>> + __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
>> + if (xrstor_tracing_enabled()) \
>> + trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
>> + } while (0)
>> +
>> #if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
>> extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
>> #else
>> @@ -184,7 +223,7 @@ static inline void os_xsave(struct fpstate *fpstate)
>> WARN_ON_FPU(!alternatives_patched);
>> xfd_validate_state(fpstate, mask, false);
>>
>> - XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
>> + XSTATE_XSAVE(fpstate, lmask, hmask, err);
>>
>> /* We should never fault when copying to a kernel buffer: */
>> WARN_ON_FPU(err);
>> @@ -201,7 +240,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
>> u32 hmask = mask >> 32;
>>
>> xfd_validate_state(fpstate, mask, true);
>> - XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
>> + XSTATE_XRESTORE(fpstate, lmask, hmask);
>> }
>
>Instead of adding overhead to the regular FPU context saving/restoring code
>paths, could you add a helper function that has tracing code included, but
>which isn't otherwise used - and leave the regular code with no tracing
>overhead?
>
>This puts a bit of a long-term maintenance focus on making sure that the
>traced functionality won't bitrot, but I'd say that's preferable to adding
>tracing overhead.
>
Hi Ingo,
It is actually a helper: I renamed the original macro to
__XSTATE_XSAVE, and it is only used in this particular context.
Furthermore, according to the kernel's static-keys documentation, the
xrstor_tracing_enabled() condition introduces only minimal overhead
when tracing is disabled, so I believe the performance impact is
negligible in that case.
Please let me know if you have any questions or concerns about this.
Thanks
--Sun, Yi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-02 19:09 ` Andi Kleen
@ 2023-09-06 9:18 ` Yi Sun
2023-09-06 18:49 ` Dave Hansen
0 siblings, 1 reply; 12+ messages in thread
From: Yi Sun @ 2023-09-06 9:18 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, dave.hansen, tglx, linux-kernel, x86, sohil.mehta,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
On 02.09.2023 12:09, Andi Kleen wrote:
>> Instead of adding overhead to the regular FPU context saving/restoring code
>> paths, could you add a helper function that has tracing code included, but
>> which isn't otherwise used - and leave the regular code with no tracing
>> overhead?
>>
>> This puts a bit of a long-term maintenance focus on making sure that the
>> traced functionality won't bitrot, but I'd say that's preferable to adding
>> tracing overhead.
>
>Or just use PT
>
>% sudo perf record --kcore -e intel_pt/cyc=1,cyc_thresh=1/k --filter 'filter save_fpregs_to_fpstate' -a sleep 5
>% sudo perf script --insn-trace --xed -F -comm,-tid,-dso,-sym,-symoff,+ipc
>[000] 677203.751913565: ffffffffa7046230 nopw %ax, (%rax)
>[000] 677203.751913565: ffffffffa7046234 nopl %eax, (%rax,%rax,1)
>[000] 677203.751913565: ffffffffa7046239 mov %rdi, %rcx
>[000] 677203.751913565: ffffffffa704623c nopl %eax, (%rax,%rax,1)
>[000] 677203.751913565: ffffffffa7046241 movq 0x10(%rdi), %rsi
>[000] 677203.751913565: ffffffffa7046245 movq 0x8(%rsi), %rax
>[000] 677203.751913565: ffffffffa7046249 leaq 0x40(%rsi), %rdi
>[000] 677203.751913565: ffffffffa704624d mov %rax, %rdx
>[000] 677203.751913565: ffffffffa7046250 shr $0x20, %rdx
>[000] 677203.751913565: ffffffffa7046254 xsaves64 (%rdi)
>[000] 677203.751913565: ffffffffa7046258 xor %edi, %edi
>[000] 677203.751913565: ffffffffa704625a movq 0x10(%rcx), %rax
>[000] 677203.751913565: ffffffffa704625e testb $0xc0, 0x240(%rax)
>[000] 677203.751913636: ffffffffa7046265 jz 0xffffffffa7046285 IPC: 0.16 (14/85)
>...
>
>
>So it took 85 cycles here.
>
>(it includes a few extra instructions, but I bet they're less than what
>ftrace adds. This example is for XSAVE, but can be similarly extended
>for XRSTOR)
>
Hi Andi,
Thank you for your guidance on Intel PT.
I recall that we have discussed this topic via email before.
I compared two methods of calculating the latency:
1. Use perf-intel-pt with a function filter.
2. Compute the TSC delta explicitly in the kernel and emit it via a
single tracepoint, as this patch does.
My findings are:
1. Intel PT is the most accurate method, but it is likely a one-time
exercise, because filtering by function requires rebuilding the kernel
with 'os_xsave' and 'os_xrstor' changed from 'inline' to 'noinline'.
2. I collected latency data with both methods, and the method in this
patch achieves results close to those from Intel PT, while introducing
only a negligible performance impact when the trace is disabled, as I
explained to Ingo earlier.
Hope this clarifies my approach. We are using this patch set to run
tests on Intel's brand-new chipsets.
Thanks
--Sun, Yi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-06 9:18 ` Yi Sun
@ 2023-09-06 18:49 ` Dave Hansen
2023-09-06 22:02 ` Ingo Molnar
0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2023-09-06 18:49 UTC (permalink / raw)
To: Yi Sun, Andi Kleen
Cc: Ingo Molnar, tglx, linux-kernel, x86, sohil.mehta, ilpo.jarvinen,
heng.su, tony.luck, dave.hansen, yi.sun
On 9/6/23 02:18, Yi Sun wrote:
> Or just use PT
I'd really like to be able to use this mechanism across a wide range of
systems over time and vendors. For instance, if Intel's AVX512 XSAVE
implementation is much faster than AMD's, it would be nice to show some
apples-to-apples data to motivate AMD to do better. We can't do that
with PT.
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-06 18:49 ` Dave Hansen
@ 2023-09-06 22:02 ` Ingo Molnar
2023-09-08 0:24 ` Yi Sun
0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2023-09-06 22:02 UTC (permalink / raw)
To: Dave Hansen
Cc: Yi Sun, Andi Kleen, tglx, linux-kernel, x86, sohil.mehta,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
* Dave Hansen <dave.hansen@intel.com> wrote:
> On 9/6/23 02:18, Yi Sun wrote:
> > Or just use PT
>
> I'd really like to be able to use this mechanism across a wide range of
> systems over time and vendors. For instance, if Intel's AVX512 XSAVE
> implementation is much faster than AMD's, it would be nice to show some
> apples-to-apples data to motivate AMD to do better. We can't do that
> with PT.
Ack - and with the explicit tooling support, it's also very easy to provide
such numbers.
As long as the regular FPU code paths do not get new tracing overhead
added, this looks like a useful tool.
Thanks,
Ingo
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-06 22:02 ` Ingo Molnar
@ 2023-09-08 0:24 ` Yi Sun
0 siblings, 0 replies; 12+ messages in thread
From: Yi Sun @ 2023-09-08 0:24 UTC (permalink / raw)
To: Ingo Molnar, dave.hansen
Cc: Andi Kleen, tglx, linux-kernel, x86, sohil.mehta, ilpo.jarvinen,
heng.su, tony.luck, dave.hansen, yi.sun
On 07.09.2023 00:02, Ingo Molnar wrote:
>
>* Dave Hansen <dave.hansen@intel.com> wrote:
>
>> On 9/6/23 02:18, Yi Sun wrote:
>> > Or just use PT
>>
>> I'd really like to be able to use this mechanism across a wide range of
>> systems over time and vendors. For instance, if Intel's AVX512 XSAVE
>> implementation is much faster than AMD's, it would be nice to show some
>> apples-to-apples data to motivate AMD to do better. We can't do that
>> with PT.
>
>Ack - and with the explicit tooling support, it's also very easy to provide
>such numbers.
>
>As long as the regular FPU code paths do not get new tracing overhead
>added, this looks like a useful tool.
>
>Thanks,
>
> Ingo
Hi Ingo and Dave,
We have been running the tool since kernel v5.19, and the patch set has
been tested in our internal repository for several months.
Additionally, we have implemented a suite of user-space micro-benchmarks
that works with this kernel trace, and it will be open-sourced soon.
With all these tools we have obtained useful data and have not
encountered any issues.
It would be great for the quality of x86 SIMD state handling if the
patch set could make it into kernel v6.6.
Thanks
--Sun, Yi
* Re: [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS
2023-09-06 8:47 ` Yi Sun
@ 2023-09-15 9:54 ` Ingo Molnar
0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2023-09-15 9:54 UTC (permalink / raw)
To: Yi Sun
Cc: dave.hansen, tglx, linux-kernel, x86, sohil.mehta, ak,
ilpo.jarvinen, heng.su, tony.luck, dave.hansen, yi.sun
* Yi Sun <yi.sun@intel.com> wrote:
> > Instead of adding overhead to the regular FPU context saving/restoring
> > code paths, could you add a helper function that has tracing code
> > included, but which isn't otherwise used - and leave the regular code
> > with no tracing overhead?
> Furthermore, according to Documentation/static-keys.txt, the condition
> xrstor_tracing_enabled() would introduce only minimal overhead when the
> trace is disabled; I believe the performance impact is negligible in
> that case.
Why introduce *any* extra overhead if it's possible to test the
functionality separately? The stated goal of the series is only to measure
FPU context switch performance, which doesn't require extra added overhead
to the actual context switch path.
[ Or if you want to convince reviewers that the overhead is indeed minimal,
please provide before/after generated assembly of the affected code that
demonstrates minimal impact. ]
Thanks,
Ingo
end of thread, other threads:[~2023-09-15 9:55 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-01 14:34 [PATCH v6 0/3] x86/fpu Measure the Latency of XSAVES and Yi Sun
2023-09-01 14:34 ` [PATCH v6 1/3] x86/fpu: Measure the Latency of XSAVES and XRSTORS Yi Sun
2023-09-02 10:49 ` Ingo Molnar
2023-09-02 19:09 ` Andi Kleen
2023-09-06 9:18 ` Yi Sun
2023-09-06 18:49 ` Dave Hansen
2023-09-06 22:02 ` Ingo Molnar
2023-09-08 0:24 ` Yi Sun
2023-09-06 8:47 ` Yi Sun
2023-09-15 9:54 ` Ingo Molnar
2023-09-01 14:34 ` [PATCH v6 2/3] tools/testing/fpu: Add script to consume trace log of xsaves latency Yi Sun
2023-09-01 14:34 ` [PATCH v6 3/3] tools/testing/fpu: Add a 'count' column Yi Sun
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).