public inbox for linux-kernel@vger.kernel.org
* [PATCH v7 0/3] x86/fpu Measure the Latency of XSAVES and XRSTORS
@ 2023-09-21  6:28 Yi Sun
  2023-09-21  6:28 ` [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Yi Sun @ 2023-09-21  6:28 UTC (permalink / raw)
  To: dave.hansen, mingo, linux-kernel, x86
  Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, yi.sun,
	yu.c.chen, Yi Sun

This series implements a solution for measuring the latency of the
XSAVES/XRSTORS instructions on x86 processors. The motivation is to
provide a way to identify potential bottlenecks and anomalies in the
implementation of these instructions on different x86 hardware
platforms, to detect when the cost of XSAVES/XRSTORS is increasing or
decreasing, and to capture the RFBM and XINUSE bitmaps to help figure
out the reason for any issues.

The solution enables comparison of FPU save/restore latency across
different systems from various vendors over a long period of time.

The solution consists of two parts:
1. Two new tracepoints that calculate delta TSC as the latency.
2. A user-space tool that consumes trace logs and shows results.

We conducted extensive internal testing and compared several latency
computation methodologies; we believe the approach can still be
improved. The script that consumes the trace logs uses sqlite3 to
display statistics such as minimum, maximum, and average latency, as
well as the 97% tail latency.
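As a rough illustration of the statistics the script reports, here is a
plain shell sketch with made-up sample latencies (independent of the
actual sqlite3 implementation):

```shell
# Sketch: min/max/avg plus the average of the lowest 97% of samples
# (the "97% tail latency" column). The sample values are made up.
lats="100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 5000"

sorted=$(printf '%s\n' $lats | sort -n)
n=$(printf '%s\n' $lats | wc -l)
keep=$(( (n * 97 + 50) / 100 ))   # round(n * 0.97): drop the top 3%

echo "$sorted" | awk 'NR==1{min=$1} {max=$1; sum+=$1}
	END{printf "min:%d max:%d avg:%d\n", min, max, sum/NR}'
echo "$sorted" | head -n "$keep" | awk '{sum+=$1}
	END{printf "avg(97%%):%d\n", sum/NR}'
```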

- Change from v6 to v7:
  - Rebase patch to the most recent tag v6.6-rc1, no conflict.
  - Remove all the additional crust when CONFIG_X86_DEBUG_FPU is disabled.
    (Ingo Molnar)

- Change from v5 to v6:
  - Rebase patch to the most recent tag v6.5-rc7, no conflict.
  - Add a column 'count' to the output of the script.

- Change from v4 to v5:
  - Rebase patch to the most recent tag v6.1-rc2, no conflict.

- Change from v3 to v4:
  - Rebase patch set to the latest tag v6.0-rc3, replacing
    ALTERNATIVE_2 with ALTERNATIVE_3

- Change from v2 to v3:
  - Move the script xsave-latency-trace.sh to folder tools/testing/fpu/
    (Ingo Molnar)

- Change from v1 to v2:
  - Reword the comments. (Laight, David; Mehta, Sohil)
  - Rename all variable 'dtsc' to 'latency'. (Mehta, Sohil)

Yi Sun (3):
  x86/fpu: Measure the Latency of XSAVE and XRSTOR
  tools/testing/fpu: Add script to consume trace log of xsave latency
  tools/testing/fpu: Add a 'count' column.

 arch/x86/include/asm/trace/fpu.h         |  37 ++++
 arch/x86/kernel/fpu/xstate.h             |  42 +++-
 tools/testing/fpu/xsave-latency-trace.sh | 234 +++++++++++++++++++++++
 3 files changed, 308 insertions(+), 5 deletions(-)
 create mode 100755 tools/testing/fpu/xsave-latency-trace.sh

-- 
2.34.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2023-09-21  6:28 [PATCH v7 0/3] x86/fpu Measure the Latency of XSAVES and XRSTORS Yi Sun
@ 2023-09-21  6:28 ` Yi Sun
  2023-09-21  7:03   ` Ingo Molnar
  2023-09-21  6:28 ` [PATCH v7 2/3] tools/testing/fpu: Add script to consume trace log of xsave latency Yi Sun
  2023-09-21  6:29 ` [PATCH v7 3/3] tools/testing/fpu: Add a 'count' column Yi Sun
  2 siblings, 1 reply; 8+ messages in thread
From: Yi Sun @ 2023-09-21  6:28 UTC (permalink / raw)
  To: dave.hansen, mingo, linux-kernel, x86
  Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, yi.sun,
	yu.c.chen, Yi Sun

Add two trace points, x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
The latency reported by the new trace points shows when XSAVE/XRSTOR
become more or less expensive, and includes the RFBM
(requested-feature bitmap) and XINUSE values to help figure out why.

Calculate the latency of the XSAVE and XRSTOR instructions within a
single trace event each. Another option considered was two separate
trace events marking the start and finish of XSAVE/XRSTOR, with the
latency calculated from the two timestamps in user space, but the
trace functions themselves added significant overhead.

In internal testing, the single-trace-point approach implemented here
avoided most of the overhead introduced by the trace functions.

Make use of trace_clock() to calculate the latency. It is based on
cpu_clock(), whose precision across CPUs is at most ~1 jiffy.

CONFIG_X86_DEBUG_FPU is required; the compiler gets rid of all the
extra code when CONFIG_X86_DEBUG_FPU is disabled.

When tracing is compiled in as well, tracepoint_enabled() reduces to a
static check of whether the trace event is enabled. Thus, the fast
path only gains two additional static checks.

Since trace points can be enabled dynamically, a trace event may be
enabled concurrently while the code is checking
tracepoint_enabled(trace_event). Hence there is a small chance of a
single noisy result, 'trace_clock() - (-1)', at the moment the
x86_fpu_latency_* trace points are enabled. Leave that noise in rather
than adding extra conditions around the x86_fpu_latency_* calls; it is
not worth guarding against a one-off sample, which is easy to filter
out in the consuming script or another user-space tool.
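That one-off sample is easy to drop in post-processing, since
trace_clock() - (-1) is roughly the raw clock value and therefore
absurdly large. A sketch with made-up values (the cutoff is an
arbitrary illustration):

```shell
# Sketch: filter out the single bogus latency produced when the trace
# point is enabled concurrently. Values and cutoff are illustrative.
printf '%s\n' 100 99 141 49514320123456 137 |
	awk '$1 < 1000000'
```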

The trace log looks like the following:
  x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202
  x86_fpu_latency_xrstor: x86/fpu: latency:99 RFBM:0x202e7 XINUSE:0x202

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>

diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 4645a6334063..0640fe79edf3 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,43 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
 	TP_ARGS(fpu)
 );
 
+#if defined(CONFIG_X86_DEBUG_FPU)
+DECLARE_EVENT_CLASS(x86_fpu_latency,
+	TP_PROTO(struct fpstate *fpstate, u64 latency),
+	TP_ARGS(fpstate, latency),
+
+	TP_STRUCT__entry(
+		__field(struct fpstate *, fpstate)
+		__field(u64, latency)
+		__field(u64, rfbm)
+		__field(u64, xinuse)
+	),
+
+	TP_fast_assign(
+		__entry->fpstate = fpstate;
+		__entry->latency = latency;
+		__entry->rfbm = fpstate->xfeatures;
+		__entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+	),
+
+	TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+		__entry->latency,
+		__entry->rfbm,
+		__entry->xinuse
+	)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xsave,
+	TP_PROTO(struct fpstate *fpstate, u64 latency),
+	TP_ARGS(fpstate, latency)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xrstor,
+	TP_PROTO(struct fpstate *fpstate, u64 latency),
+	TP_ARGS(fpstate, latency)
+);
+#endif
+
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH asm/trace/
 #undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..aa997fb86537 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -5,6 +5,9 @@
 #include <asm/cpufeature.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/xcr.h>
+#include <asm/trace/fpu.h>
+
+#include <linux/trace_clock.h>
 
 #ifdef CONFIG_X86_64
 DECLARE_PER_CPU(u64, xfd_state);
@@ -113,7 +116,7 @@ static inline u64 xfeatures_mask_independent(void)
  * original instruction which gets replaced. We need to use it here as the
  * address of the instruction where we might get an exception at.
  */
-#define XSTATE_XSAVE(st, lmask, hmask, err)				\
+#define __XSTATE_XSAVE(st, lmask, hmask, err)				\
 	asm volatile(ALTERNATIVE_3(XSAVE,				\
 				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
 				   XSAVEC,   X86_FEATURE_XSAVEC,	\
@@ -130,7 +133,7 @@ static inline u64 xfeatures_mask_independent(void)
  * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
  * XSAVE area format.
  */
-#define XSTATE_XRESTORE(st, lmask, hmask)				\
+#define __XSTATE_XRESTORE(st, lmask, hmask)				\
 	asm volatile(ALTERNATIVE(XRSTOR,				\
 				 XRSTORS, X86_FEATURE_XSAVES)		\
 		     "\n"						\
@@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#if defined(CONFIG_X86_DEBUG_FPU)
+#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
+			tc = trace_clock();				\
+		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
+		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
+			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
+	} while (0)
+
+#define XSTATE_XRESTORE(fps, lmask, hmask)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
+			tc = trace_clock();				\
+		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
+		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
+			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
+	} while (0)
+#else
+#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
+		__XSTATE_XSAVE(&(fps)->regs.xsave, lmask, hmask, err)
+#define XSTATE_XRESTORE(fps, lmask, hmask)				\
+		__XSTATE_XRESTORE(&(fps)->regs.xsave, lmask, hmask)
+#endif
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
 extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
 #else
@@ -184,7 +216,7 @@ static inline void os_xsave(struct fpstate *fpstate)
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
-	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
+	XSTATE_XSAVE(fpstate, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */
 	WARN_ON_FPU(err);
@@ -201,7 +233,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
 	u32 hmask = mask >> 32;
 
 	xfd_validate_state(fpstate, mask, true);
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /* Restore of supervisor state. Does not require XFD */
@@ -211,7 +243,7 @@ static inline void os_xrstor_supervisor(struct fpstate *fpstate)
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v7 2/3] tools/testing/fpu: Add script to consume trace log of xsave latency
  2023-09-21  6:28 [PATCH v7 0/3] x86/fpu Measure the Latency of XSAVES and XRSTORS Yi Sun
  2023-09-21  6:28 ` [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
@ 2023-09-21  6:28 ` Yi Sun
  2023-09-21  6:29 ` [PATCH v7 3/3] tools/testing/fpu: Add a 'count' column Yi Sun
  2 siblings, 0 replies; 8+ messages in thread
From: Yi Sun @ 2023-09-21  6:28 UTC (permalink / raw)
  To: dave.hansen, mingo, linux-kernel, x86
  Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, yi.sun,
	yu.c.chen, Yi Sun

Consume the trace log dumped by the trace points x86_fpu_latency_xsave
and x86_fpu_latency_xrstor, and calculate latency statistics for each
RFBM and XINUSE combination: min, max, average, and 97% tail average.

Add the average of the lowest 97% of samples to remove unreasonable
data introduced by interrupts or other noise. With experimental code
that disables interrupts around the latency calculation, it is clear
that the 3% tail being filtered out is indeed such noise.

Make use of sqlite3 to keep the statistics efficient and concise. The
output looks like the following:

EVENTs                	RFBM   	XINUSE	lat_min	lat_max	lat_avg	lat_avg(97%)
----------------------	-------	------	-------	-------	-------	------------
x86_fpu_latency_xrstor	0x206e7	0x0   	364    	364    	364    	364
x86_fpu_latency_xrstor	0x206e7	0x202 	112    	1152   	300    	276
x86_fpu_latency_xsave 	0x206e7	0x202 	80     	278    	141    	137
x86_fpu_latency_xsave 	0x206e7	0x246 	108    	234    	180    	177

The XSAVE/XRSTOR latency trace log can be obtained in two ways:
1. Via the kernel's debugfs:
  echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable
  cat /sys/kernel/debug/tracing/trace_pipe > trace-log

2. Via a helper tool like 'trace-cmd':
   trace-cmd record -e x86_fpu -F <command>
   trace-cmd report > trace-log
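For illustration, a single trace line can be parsed in bash roughly the
way the script does it. This is a sketch: the regex is a simplified
variant of the script's, and the sample line layout is assumed from the
commit message above.

```shell
# bash sketch (uses [[ =~ ]] and BASH_REMATCH); the sample line is made up.
line='  <...>-2864  [002] d..1  49.514320: x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202'
regex='([0-9.]+): *([^ :]+):.*latency:([0-9]+) +RFBM:(0x[0-9a-f]+) +XINUSE:(0x[0-9a-f]+)$'
if [[ "$line" =~ $regex ]]; then
	ts=${BASH_REMATCH[1]} ename=${BASH_REMATCH[2]} lat=${BASH_REMATCH[3]}
	rfbm=${BASH_REMATCH[4]} xinuse=${BASH_REMATCH[5]}
	echo "$ename lat=$lat RFBM=$rfbm XINUSE=$xinuse"
fi
```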

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>

diff --git a/tools/testing/fpu/xsave-latency-trace.sh b/tools/testing/fpu/xsave-latency-trace.sh
new file mode 100755
index 000000000000..d45563984fd6
--- /dev/null
+++ b/tools/testing/fpu/xsave-latency-trace.sh
@@ -0,0 +1,227 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# (c) 2022 Yi Sun <yi.sun@intel.com>
+
+trace_log=$1
+trace_lat_log=".trace_lat_log"
+db_name="db_trace"
+db_file="${db_name}.db"
+table_raw="t_trace"
+table_tail="t_trace_tail"
+table_results="t_results"
+events="x86_fpu_latency_xsave|x86_fpu_latency_xrstor"
+
+# The regex for the trace log. The rough pattern:
+# (proc) (No.cpu) (flags) (timestamp): (tracepoint): latency:(123) RFBM:0x(123) XINUSE:0x(123)$
+# Fold the regex into 3 parts making it easier to read.
+regex1="([^\ ]*)[[:space:]]*\[([0-9]+)\][[:space:]]*(.....\ )?[[:space:]]*"
+regex2="([0-9.]*):[[:space:]]*([^\ :]*):.*latency:([0-9]*)[[:space:]]*"
+regex3="RFBM:(0x[0-9a-f]*)[[:space:]]*XINUSE:(0x[0-9a-f]*)$"
+
+function usage() {
+	echo "This script consumes the tracepoint data, and dumps out the"
+	echo "latency ranges for each RFBM combination."
+	echo "Usage:"
+	echo "$0 <trace-log>"
+	echo "   trace-log:"
+	echo "     Either generated via the kernel's debugfs:"
+	echo "       echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable"
+	echo "       cat /sys/kernel/debug/tracing/trace_pipe > trace-log"
+	echo ""
+	echo "     Or generated by a helper tool like 'trace-cmd':"
+	echo "       trace-cmd record -e x86_fpu"
+	echo "       trace-cmd report > trace-log"
+}
+
+# Check the dependent tools
+# ${@}: a list of third-party tools
+function check_packages() {
+	for pack in "$@"; do
+		which $pack >& /dev/null
+		if [[ $? != 0 ]]; then
+			echo "Please install $pack before running this script."
+			exit 1
+		fi
+	done
+}
+
+# Run SQL command with sqlite3
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD() {
+	sqlite3 $db_file "$*"
+}
+
+# Run SQL command with sqlite3 and format the output with headers and column.
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD_HEADER() {
+	sqlite3 -column -header $db_file "$*"
+}
+
+# Create a table in the DB
+# ${1}: name of the table
+function create_table() {
+	if [[ "$1" == "" ]]; then
+		echo "Empty table name!"
+		exit 1
+	fi
+	SQL_CMD "create table $1 (
+		id INTEGER PRIMARY KEY AUTOINCREMENT,
+		process TEXT,
+		cpu INT,
+		timestamp FLOAT,
+		event_name TEXT,
+		lat INT,
+		RFBM INT,
+		XINUSE INT);"
+}
+
+# Round to the nearest whole number
+# ${1}: a float number
+# Output: integer
+function round() {
+	echo "scale=0; ($1+0.5)/1" | bc
+}
+
+# Insert a record in the trace table
+#
+# process cpu  timestamp  event_name  lat  RFBM  XINUSE
+# $2      $3   $4         $5          $6   $7    $8
+
+function insert_line() {
+	if [[ "$1" == "" ]]; then
+		echo "Empty table name!"
+		exit 1
+	fi
+	SQL_CMD "INSERT INTO $1 (process, cpu, timestamp, event_name, lat, RFBM, XINUSE)
+		VALUES (\"$2\", $3, $4, \"$5\", $6, $7, $8);"
+}
+
+# Show the results of the trace statistics
+function get_latency_stat() {
+	SQL_CMD "create table $table_results (
+		id INTEGER PRIMARY KEY AUTOINCREMENT,
+		event_name TEXT,
+		RFBM INT,
+		XINUSE INT,
+		lat_min INT,
+		lat_max INT,
+		lat_avg INT,
+		lat_tail_avg INT);"
+
+	for((i=0; i<$cnt; i++));do
+		event_name=`get_comb_item $table_raw $i event_name`
+		RFBM=`get_comb_item $table_raw $i RFBM`
+		XINUSE=`get_comb_item $table_raw $i XINUSE`
+		lat_min=`get_comb_item $table_raw $i min\(lat\)`
+		lat_max=`get_comb_item $table_raw $i max\(lat\)`
+		lat_avg=`get_comb_item $table_raw $i avg\(lat\)`
+		lat_tail_avg=`get_comb_item $table_tail $i avg\(lat\)`
+
+		lat_avg=`round $lat_avg`
+		lat_tail_avg=`round $lat_tail_avg`
+
+		SQL_CMD "INSERT INTO $table_results
+			(event_name, RFBM,XINUSE, lat_min, lat_max, lat_avg, lat_tail_avg)
+			VALUES (\"$event_name\", $RFBM, $XINUSE, $lat_min, $lat_max,
+			$lat_avg, $lat_tail_avg);"
+	done
+
+	SQL_CMD_HEADER "select event_name[EVENTs],printf('0x%x',RFBM)[RFBM],
+			printf('0x%x',XINUSE)[XINUSE],lat_min,lat_max,lat_avg,
+			lat_tail_avg[lat_avg(97%)]
+			from $table_results;"
+}
+
+# Get the count of distinct (event_name, RFBM, XINUSE) combinations among all latency trace records
+function get_combs_cnt() {
+	SQL_CMD "SELECT event_name, RFBM, XINUSE from $table_raw
+		group by event_name,RFBM,XINUSE;" | wc -l
+}
+
+# Get a specified combination from a table
+# ${1}: name of table
+# ${2}: the index of the (event_name, RFBM, XINUSE) combination
+# ${3}: the columns to be shown
+function get_comb_item() {
+	table=$1
+	cnt=$2
+	col=$3
+	SQL_CMD "SELECT $col from $table group by event_name,RFBM,XINUSE limit $cnt,1;"
+}
+
+# Get count of the records in a given table
+# ${1}: name of the table
+function get_rows_cnt() {
+	table=$1
+	SQL_CMD "SELECT count(*) from $table;"
+}
+
+# Generate a new table from the raw trace table, dropping the top 3% tail samples.
+function gen_tail_lat() {
+	cnt=`get_combs_cnt`
+	create_table $table_tail
+
+	for((i=0; i<$cnt; i++));do
+		create_table t$i
+		event_name=`get_comb_item $table_raw $i event_name`
+		RFBM=`get_comb_item $table_raw $i RFBM`
+		XINUSE=`get_comb_item $table_raw $i XINUSE`
+
+		SQL_CMD "insert into t$i(process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+			select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+			from $table_raw where event_name=\"$event_name\" and RFBM=$RFBM and
+			XINUSE=$XINUSE ORDER BY lat ASC;"
+
+		row=`get_rows_cnt t$i`
+		row=`echo "scale=0; ($row*0.97 + 0.5)/1" | bc`
+
+		SQL_CMD "insert into $table_tail
+			(process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+			select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+			from t$i limit 0,$row;"
+	done
+
+}
+
+if [[ ! -e "$trace_log" || $# != 1 ]];then
+	usage
+	exit 1
+fi
+
+# Check dependencies
+# Make sure the following packages are available
+check_packages sqlite3 bc wc cut
+
+# Filter trace log keeping latency related lines only
+grep -E "$events" $trace_log > $trace_lat_log
+cnt_lines=`wc -l $trace_lat_log | cut -d' ' -f1`
+# Remove any old db file before creating a new one
+[[ -f $db_file ]] && rm -rf $db_file
+
+create_table $table_raw
+
+# Read each line from the temp file and insert into the table
+i=0
+while IFS= read -r line;
+do
+	((i = i + 1))
+	echo -ne "(${i}/$cnt_lines) Importing trace log into database!\r"
+	if [[ "$line" =~ ${regex1}${regex2}${regex3} ]]; then
+		pname=${BASH_REMATCH[1]}
+		cpu=${BASH_REMATCH[2]}
+		ts=${BASH_REMATCH[4]}
+		ename=${BASH_REMATCH[5]}
+		lat=${BASH_REMATCH[6]}
+		((rfbm=${BASH_REMATCH[7]}))
+		((xinuse=${BASH_REMATCH[8]}))
+
+		insert_line $table_raw $pname $cpu $ts $ename $lat $rfbm $xinuse
+	fi
+done < $trace_lat_log
+
+gen_tail_lat
+get_latency_stat
+
+# Cleanup
+rm -rf $trace_lat_log $db_file
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v7 3/3] tools/testing/fpu: Add a 'count' column.
  2023-09-21  6:28 [PATCH v7 0/3] x86/fpu Measure the Latency of XSAVES and XRSTORS Yi Sun
  2023-09-21  6:28 ` [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
  2023-09-21  6:28 ` [PATCH v7 2/3] tools/testing/fpu: Add script to consume trace log of xsave latency Yi Sun
@ 2023-09-21  6:29 ` Yi Sun
  2 siblings, 0 replies; 8+ messages in thread
From: Yi Sun @ 2023-09-21  6:29 UTC (permalink / raw)
  To: dave.hansen, mingo, linux-kernel, x86
  Cc: sohil.mehta, ak, ilpo.jarvinen, heng.su, tony.luck, yi.sun,
	yu.c.chen, Yi Sun

Show the total number of samples for each event/RFBM/XINUSE
combination, so users can identify which data is noise and focus on
the useful data.

Signed-off-by: Yi Sun <yi.sun@intel.com>

diff --git a/tools/testing/fpu/xsave-latency-trace.sh b/tools/testing/fpu/xsave-latency-trace.sh
index d45563984fd6..b2f7c3d0dd65 100755
--- a/tools/testing/fpu/xsave-latency-trace.sh
+++ b/tools/testing/fpu/xsave-latency-trace.sh
@@ -99,11 +99,14 @@ function insert_line() {
 
 # Show the results of the trace statistics
 function get_latency_stat() {
+	cnt=`get_combs_cnt`
+
 	SQL_CMD "create table $table_results (
 		id INTEGER PRIMARY KEY AUTOINCREMENT,
 		event_name TEXT,
 		RFBM INT,
 		XINUSE INT,
+		CNT INT,
 		lat_min INT,
 		lat_max INT,
 		lat_avg INT,
@@ -121,14 +124,18 @@ function get_latency_stat() {
 		lat_avg=`round $lat_avg`
 		lat_tail_avg=`round $lat_tail_avg`
 
+		count=`SQL_CMD "SELECT count(*) from $table_raw
+			where event_name=\"$event_name\" and RFBM=$RFBM and
+			XINUSE=$XINUSE;"`
+
 		SQL_CMD "INSERT INTO $table_results
-			(event_name, RFBM,XINUSE, lat_min, lat_max, lat_avg, lat_tail_avg)
-			VALUES (\"$event_name\", $RFBM, $XINUSE, $lat_min, $lat_max,
+			(event_name,RFBM,XINUSE, CNT, lat_min, lat_max, lat_avg, lat_tail_avg)
+			VALUES (\"$event_name\", $RFBM, $XINUSE, $count, $lat_min, $lat_max,
 			$lat_avg, $lat_tail_avg);"
 	done
 
 	SQL_CMD_HEADER "select event_name[EVENTs],printf('0x%x',RFBM)[RFBM],
-			printf('0x%x',XINUSE)[XINUSE],lat_min,lat_max,lat_avg,
+			printf('0x%x',XINUSE)[XINUSE],CNT,lat_min,lat_max,lat_avg,
 			lat_tail_avg[lat_avg(97%)]
 			from $table_results;"
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2023-09-21  6:28 ` [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
@ 2023-09-21  7:03   ` Ingo Molnar
  2023-09-21  8:24     ` Yi Sun
  2023-09-21 16:52     ` Andi Kleen
  0 siblings, 2 replies; 8+ messages in thread
From: Ingo Molnar @ 2023-09-21  7:03 UTC (permalink / raw)
  To: Yi Sun
  Cc: dave.hansen, linux-kernel, x86, sohil.mehta, ak, ilpo.jarvinen,
	heng.su, tony.luck, yi.sun, yu.c.chen


* Yi Sun <yi.sun@intel.com> wrote:

> @@ -113,7 +116,7 @@ static inline u64 xfeatures_mask_independent(void)
>   * original instruction which gets replaced. We need to use it here as the
>   * address of the instruction where we might get an exception at.
>   */
> -#define XSTATE_XSAVE(st, lmask, hmask, err)				\
> +#define __XSTATE_XSAVE(st, lmask, hmask, err)				\
>  	asm volatile(ALTERNATIVE_3(XSAVE,				\
>  				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
>  				   XSAVEC,   X86_FEATURE_XSAVEC,	\
> @@ -130,7 +133,7 @@ static inline u64 xfeatures_mask_independent(void)
>   * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
>   * XSAVE area format.
>   */
> -#define XSTATE_XRESTORE(st, lmask, hmask)				\
> +#define __XSTATE_XRESTORE(st, lmask, hmask)				\
>  	asm volatile(ALTERNATIVE(XRSTOR,				\
>  				 XRSTORS, X86_FEATURE_XSAVES)		\
>  		     "\n"						\
> @@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
>  		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
>  		     : "memory")
>  
> +#if defined(CONFIG_X86_DEBUG_FPU)
> +#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
> +	do {								\
> +		struct fpstate *f = fps;				\
> +		u64 tc = -1;						\
> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
> +			tc = trace_clock();				\
> +		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
> +			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
> +	} while (0)
> +
> +#define XSTATE_XRESTORE(fps, lmask, hmask)				\
> +	do {								\
> +		struct fpstate *f = fps;				\
> +		u64 tc = -1;						\
> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
> +			tc = trace_clock();				\
> +		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
> +			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\

This v7 version does not adequately address the review feedback I gave for 
v6: it adds tracing overhead to potential hot paths, and putting it behind 
CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto 
enabled on all major distros...

It seems unnecessarily complex: why does it have to measure latency 
directly? Tracepoints *by default* come with event timestamps. A latency 
measurement tool should be able to subtract two timestamps to extract the 
latency between two tracepoints...

In fact, function tracing is enabled on all major Linux distros:

  kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic 
  CONFIG_HAVE_FUNCTION_TRACER=y
  CONFIG_FUNCTION_TRACER=y

Why not just enable function tracing for the affected FPU context switching 
functions?

The relevant functions are already standalone in a typical kernel:

   xsave:                # ffffffff8103cfe0 T save_fpregs_to_fpstate
   xrstor:               # ffffffff8103d160 T restore_fpregs_from_fpstate
   xrstor_supervisor:    # ffffffff8103dc50 T fpu__clear_user_states

... and FPU context switching overhead dominates the cost of these 
functions.
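
For instance (illustrative only, untested; assumes root and tracefs
mounted at the usual debugfs path), function_graph tracing would report
a per-call duration for exactly these functions:

```
cd /sys/kernel/debug/tracing
echo save_fpregs_to_fpstate restore_fpregs_from_fpstate > set_graph_function
echo function_graph > current_tracer
cat trace_pipe
```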

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2023-09-21  7:03   ` Ingo Molnar
@ 2023-09-21  8:24     ` Yi Sun
  2023-09-21 16:52     ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Yi Sun @ 2023-09-21  8:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dave.hansen, linux-kernel, x86, sohil.mehta, ak, ilpo.jarvinen,
	heng.su, tony.luck, yi.sun, yu.c.chen

On 21.09.2023 09:03, Ingo Molnar wrote:
>
>* Yi Sun <yi.sun@intel.com> wrote:
>
>> -#define XSTATE_XRESTORE(st, lmask, hmask)				\
>> +#define __XSTATE_XRESTORE(st, lmask, hmask)				\
>>  	asm volatile(ALTERNATIVE(XRSTOR,				\
>>  				 XRSTORS, X86_FEATURE_XSAVES)		\
>>  		     "\n"						\
>> @@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
>>  		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
>>  		     : "memory")
>>
>> +#if defined(CONFIG_X86_DEBUG_FPU)
>> +#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
>> +	do {								\
>> +		struct fpstate *f = fps;				\
>> +		u64 tc = -1;						\
>> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
>> +			tc = trace_clock();				\
>> +		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
>> +		if (tracepoint_enabled(x86_fpu_latency_xsave))		\
>> +			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
>> +	} while (0)
>> +
>> +#define XSTATE_XRESTORE(fps, lmask, hmask)				\
>> +	do {								\
>> +		struct fpstate *f = fps;				\
>> +		u64 tc = -1;						\
>> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
>> +			tc = trace_clock();				\
>> +		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
>> +		if (tracepoint_enabled(x86_fpu_latency_xrstor))		\
>> +			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
>
>This v7 version does not adequately address the review feedback I gave for
>v6: it adds tracing overhead to potential hot paths, and putting it behind
>CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto
>enabled on all major distros...
Hi Ingo,

Appreciate your comments!

What do you think if I were to add a new config here instead of
CONFIG_X86_DEBUG_FPU, and set it as the default value of n?
This way, all the overhead will be removed from the hot paths.
```
#else
#define XSTATE_XSAVE(fps, lmask, hmask, err)
                 __XSTATE_XSAVE(&(fps)->regs.xsave, lmask, hmask, err)
#define XSTATE_XRESTORE(fps, lmask, hmask)
                 __XSTATE_XRESTORE(&(fps)->regs.xsave, lmask, hmask)
#endif
```
>
>It seems unnecessarily complex: why does it have to measure latency
>directly? Tracepoints *by default* come with event timestamps. A latency
>measurement tool should be able to subtract two timestamps to extract the
>latency between two tracepoints...
>
>In fact, function tracing is enabled on all major Linux distros:
>
>  kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
>  CONFIG_HAVE_FUNCTION_TRACER=y
>  CONFIG_FUNCTION_TRACER=y
>
>Why not just enable function tracing for the affected FPU context switching
>functions?
>
>The relevant functions are already standalone in a typical kernel:
>
>   xsave:                # ffffffff8103cfe0 T save_fpregs_to_fpstate
>   xrstor:               # ffffffff8103d160 T restore_fpregs_from_fpstate
>   xrstor_supervisor:    # ffffffff8103dc50 T fpu__clear_user_states
>
>... and FPU context switching overhead dominates the cost of these
>functions.
>
I am aware of those tracing facilities, but we add new trace points
for two reasons:
1. As described in the commit message, it is more accurate to
calculate the latency within a single trace event, which does NOT
include the latency of the trace functions themselves. In our
experiments, the cost of the trace functions is much larger than the
latency of XSAVES.

Calculating the latency by subtracting two timestamps necessarily
includes the latency of the trace points, which would drown out the
actual latency data.

2. We want to measure with finer granularity. The newly added trace
points dump XINUSE and RFBM each time they are called, so we can
collect the latency data and analyze it grouped by each FPU feature
indicated in the XCR0 bits.

Thanks
    --Sun, Yi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2023-09-21  7:03   ` Ingo Molnar
  2023-09-21  8:24     ` Yi Sun
@ 2023-09-21 16:52     ` Andi Kleen
  2023-09-22  3:44       ` Yi Sun
  1 sibling, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2023-09-21 16:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yi Sun, dave.hansen, linux-kernel, x86, sohil.mehta,
	ilpo.jarvinen, heng.su, tony.luck, yi.sun, yu.c.chen

> It seems unnecessarily complex: why does it have to measure latency 
> directly? Tracepoints *by default* come with event timestamps. A latency 
> measurement tool should be able to subtract two timestamps to extract the 
> latency between two tracepoints...
> 
> In fact, function tracing is enabled on all major Linux distros:
> 
>   kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic 
>   CONFIG_HAVE_FUNCTION_TRACER=y
>   CONFIG_FUNCTION_TRACER=y
> 
> Why not just enable function tracing for the affected FPU context switching 
> functions?

Or use PT address filters to get it even accurately, as described
in [1]. In any case I agree the trace points are not needed.

-Andi

[1] https://lore.kernel.org/lkml/ZPOIVmC6aY9GBtdJ@tassilo/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2023-09-21 16:52     ` Andi Kleen
@ 2023-09-22  3:44       ` Yi Sun
  0 siblings, 0 replies; 8+ messages in thread
From: Yi Sun @ 2023-09-22  3:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, dave.hansen, linux-kernel, x86, sohil.mehta,
	ilpo.jarvinen, heng.su, tony.luck, yi.sun, yu.c.chen

On 21.09.2023 09:52, Andi Kleen wrote:
>> It seems unnecessarily complex: why does it have to measure latency
>> directly? Tracepoints *by default* come with event timestamps. A latency
>> measurement tool should be able to subtract two timestamps to extract the
>> latency between two tracepoints...
>>
>> In fact, function tracing is enabled on all major Linux distros:
>>
>>   kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
>>   CONFIG_HAVE_FUNCTION_TRACER=y
>>   CONFIG_FUNCTION_TRACER=y
>>
>> Why not just enable function tracing for the affected FPU context switching
>> functions?
>
>Or use PT address filters to get it even accurately, as described
>in [1]. In any case I agree the trace points are not needed.
>
Hi Andi,

Here, we have implemented latency measurement for all x86 platforms, not
just for Intel but also for other vendors like AMD. This allows us to 
obtain comparable data.

Thanks
    --Sun, Yi

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-09-22  3:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-21  6:28 [PATCH v7 0/3] x86/fpu Measure the Latency of XSAVES and XRSTORS Yi Sun
2023-09-21  6:28 ` [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
2023-09-21  7:03   ` Ingo Molnar
2023-09-21  8:24     ` Yi Sun
2023-09-21 16:52     ` Andi Kleen
2023-09-22  3:44       ` Yi Sun
2023-09-21  6:28 ` [PATCH v7 2/3] tools/testing/fpu: Add script to consume trace log of xsave latency Yi Sun
2023-09-21  6:29 ` [PATCH v7 3/3] tools/testing/fpu: Add a 'count' column Yi Sun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox