* [PATCH v3 0/2] selftests/resctrl: SNC kernel support discovery
@ 2024-07-01 14:17 Maciej Wieczor-Retman
2024-07-01 14:18 ` [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled Maciej Wieczor-Retman
2024-07-01 14:18 ` [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages Maciej Wieczor-Retman
0 siblings, 2 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-01 14:17 UTC (permalink / raw)
To: fenghua.yu, reinette.chatre, shuah
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Changes v3:
- Reworked patch 2.
- Changed minor things in patch 1 like function name and made
corrections to the patch message.
Changes v2:
- Removed patches 2 and 3 since now this part will be supported by the
kernel.
Sub-Numa Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
SNC support in the kernel is currently in review [1]. With SNC enabled
and kernel support in place all the tests will function normally (aside
from effective cache size). There might be a problem when SNC is enabled
but the system is still using an older kernel version without SNC
support. Currently the only message displayed in that situation is a
guess that SNC might be enabled and is causing issues. That message also
is displayed whenever the test fails on an Intel platform.
Add a mechanism to discover kernel support for SNC which will add more
meaning and certainty to the error message.
Add runtime SNC mode detection and verify how reliable that information
is.
Series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also ran with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240628215619.76401-1-tony.luck@intel.com/
Previous versions of this series:
[v1] https://lore.kernel.org/all/cover.1709721159.git.maciej.wieczor-retman@intel.com/
[v2] https://lore.kernel.org/all/cover.1715769576.git.maciej.wieczor-retman@intel.com/
Maciej Wieczor-Retman (2):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: Adjust SNC support messages
tools/testing/selftests/resctrl/cache.c | 3 +
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 4 +
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 8 +
.../testing/selftests/resctrl/resctrl_tests.c | 7 +
tools/testing/selftests/resctrl/resctrlfs.c | 138 ++++++++++++++++++
7 files changed, 166 insertions(+), 4 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled
2024-07-01 14:17 [PATCH v3 0/2] selftests/resctrl: SNC kernel support discovery Maciej Wieczor-Retman
@ 2024-07-01 14:18 ` Maciej Wieczor-Retman
2024-07-02 22:20 ` Reinette Chatre
2024-07-01 14:18 ` [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages Maciej Wieczor-Retman
1 sibling, 1 reply; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-01 14:18 UTC (permalink / raw)
To: shuah, reinette.chatre, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two, three or four
nodes.
When SNC mode is enabled the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.
Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0, with the number of CPUs on node0.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Co-developed-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
Changelog v3:
- Added comment to cache size modification to better explain why it is
needed there. (Reinette)
- Fix facts in patch message. (Reinette)
- Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
tools/testing/selftests/resctrl/resctrl.h | 4 ++
tools/testing/selftests/resctrl/resctrlfs.c | 70 +++++++++++++++++++++
2 files changed, 74 insertions(+)
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 2dda56084588..851b37c9c38a 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -11,6 +11,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
@@ -43,6 +44,8 @@
#define DEFAULT_SPAN (250 * MB)
+#define MAX_SNC 4
+
/*
* user_params: User supplied parameters
* @cpu: CPU number to which the benchmark will be bound to
@@ -120,6 +123,7 @@ extern volatile int *value_sink;
extern char llc_occup_path[1024];
+int snc_nodes_per_l3_cache(void);
int get_vendor(void);
bool check_resctrlfs_support(void);
int filter_dmesg(void);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 250c320349a7..18a31a2ba7b3 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -156,6 +156,65 @@ int get_domain_id(const char *resource, int cpu_no, int *domain_id)
return 0;
}
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static unsigned int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * If some CPUs are offline the numbers may not be exact multiples of each
+ * other. Any offline CPUs on node0 will be also gone from shared_cpu_map of
+ * CPU0 but offline CPUs from other nodes will only make the cache_cpus value
+ * lower. Still try to get the ratio right by preventing the second possibility.
+ */
+int snc_nodes_per_l3_cache(void)
+{
+ int node_cpus, cache_cpus, i, ret = 1;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ ksft_print_msg("Could not determine Sub-NUMA Cluster mode!\n");
+ return 1;
+ }
+
+ for (i = 1; i <= MAX_SNC ; i++) {
+ if (i * node_cpus >= cache_cpus) {
+ ret = i;
+ break;
+ }
+ }
+
+ return ret;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -211,6 +270,17 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
break;
}
+ /*
+ * The amount of cache represented by each bit in the masks
+ * in the schemata file is reduced by a factor equal to SNC
+ * nodes per L3 cache.
+ * E.g. on a SNC-2 system with a 100MB L3 cache a test that
+ * allocates memory from its local SNC node (default behavior
+ * without using libnuma) will only see 50 MB llc_occupancy
+ * with a fully populated L3 mask in the schemata file.
+ */
+ if (cache_num == 3)
+ *cache_size /= snc_nodes_per_l3_cache();
return 0;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-01 14:17 [PATCH v3 0/2] selftests/resctrl: SNC kernel support discovery Maciej Wieczor-Retman
2024-07-01 14:18 ` [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled Maciej Wieczor-Retman
@ 2024-07-01 14:18 ` Maciej Wieczor-Retman
2024-07-01 16:04 ` Luck, Tony
2024-07-02 22:21 ` Reinette Chatre
1 sibling, 2 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-01 14:18 UTC (permalink / raw)
To: shuah, reinette.chatre, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Resctrl selftest prints a message on test failure that Sub-Numa
Clustering (SNC) could be enabled and points the user to check theirs BIOS
settings. No actual check is performed before printing that message so
it is not very accurate in pinpointing a problem.
Figuring out if SNC is enabled is only one part of the problem, the
others being whether the detected SNC mode is reliable and whether the
kernel supports SNC in resctrl.
When there is SNC support for kernel's resctrl subsystem and SNC is
enabled then sub node files are created for each node in the resctrlfs.
The sub node files exist in each regular node's L3 monitoring directory.
The reliable path to check for existence of sub node files is
/sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00.
To check if SNC detection is reliable one can check the
/sys/devices/system/cpu/offline file. If it's empty, it means all cores
are operational and the ratio should be calculated correctly. If it has
any contents, it means the detected SNC mode can't be trusted and should
be disabled.
Add helpers for detecting SNC mode and checking its reliability.
Detect SNC mode once and let other tests inherit that information.
Add messages to alert the user when SNC detection could return incorrect
results.
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
Changelog v3:
- Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
- Add printing the discovered SNC mode. (Reinette)
- Change method of kernel support discovery from cache sizes to
existance of sub node files.
- Check if SNC detection is unreliable.
- Move SNC detection to only the first run_single_test() instead on
error at the end of test runs.
- Add global value to remind user at the end of relevant tests if SNC
detection was found to be unreliable.
- Redo the patch message after the changes.
Changelog v2:
- Move snc_ways() checks from individual tests into
snc_kernel_support().
- Write better comment for snc_kernel_support().
tools/testing/selftests/resctrl/cache.c | 3 +
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 4 ++
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 4 ++
.../testing/selftests/resctrl/resctrl_tests.c | 7 ++
tools/testing/selftests/resctrl/resctrlfs.c | 70 ++++++++++++++++++-
7 files changed, 93 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
index 1ff1104e6575..9885d64b8a21 100644
--- a/tools/testing/selftests/resctrl/cache.c
+++ b/tools/testing/selftests/resctrl/cache.c
@@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
cache_span);
+ if (snc_unreliable)
+ ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
+
}
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index 0c045080d808..588543ae2654 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
goto out;
ret = check_results(¶m, span, n);
- if (ret && (get_vendor() == ARCH_INTEL))
- ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+ if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out:
free(span_str);
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index ab8496a4925b..c91e85f11285 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -108,6 +108,8 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
ksft_print_msg("avg_diff_per: %d%%\n", avg_diff_per);
ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
+ if (snc_unreliable)
+ ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
if (avg_diff_per > MAX_DIFF_PERCENT)
ret = true;
}
@@ -179,6 +181,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
return ret;
ret = check_results();
+ if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
return ret;
}
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index 6b5a3b52d861..562b02118270 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -43,6 +43,8 @@ show_bw_info(unsigned long *bw_imc, unsigned long *bw_resc, size_t span)
ksft_print_msg("Span (MB): %zu\n", span / MB);
ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
+ if (snc_unreliable)
+ ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
return ret;
}
@@ -147,8 +149,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
return ret;
ret = check_results(DEFAULT_SPAN);
- if (ret && (get_vendor() == ARCH_INTEL))
- ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+ if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
return ret;
}
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 851b37c9c38a..fa44e1cde21b 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -121,10 +121,13 @@ struct perf_event_read {
*/
extern volatile int *value_sink;
+extern int snc_unreliable;
+
extern char llc_occup_path[1024];
int snc_nodes_per_l3_cache(void);
int get_vendor(void);
+int get_snc_mode(void);
bool check_resctrlfs_support(void);
int filter_dmesg(void);
int get_domain_id(const char *resource, int cpu_no, int *domain_id);
@@ -167,6 +170,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
int signal_handler_register(const struct resctrl_test *test);
void signal_handler_unregister(void);
unsigned int count_bits(unsigned long n);
+int snc_kernel_support(void);
void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
void perf_event_initialize_read_format(struct perf_event_read *pe_read);
diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
index ecbb7605a981..b17560bbaf5c 100644
--- a/tools/testing/selftests/resctrl/resctrl_tests.c
+++ b/tools/testing/selftests/resctrl/resctrl_tests.c
@@ -12,6 +12,7 @@
/* Volatile memory sink to prevent compiler optimizations */
static volatile int sink_target;
+static int snc_mode;
volatile int *value_sink = &sink_target;
static struct resctrl_test *resctrl_tests[] = {
@@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
if (test->disabled)
return;
+ if (!snc_mode) {
+ snc_mode = get_snc_mode();
+ if (snc_mode > 1)
+ ksft_print_msg("SNC-%d mode discovered!\n", snc_mode);
+ }
+
if (!test_vendor_specific_check(test)) {
ksft_test_result_skip("Hardware does not support %s\n", test->name);
return;
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 18a31a2ba7b3..004fb6649789 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -13,6 +13,8 @@
#include "resctrl.h"
+int snc_unreliable;
+
static int find_resctrl_mount(char *buffer)
{
FILE *mounts;
@@ -280,7 +282,7 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
* with a fully populated L3 mask in the schemata file.
*/
if (cache_num == 3)
- *cache_size /= snc_nodes_per_l3_cache();
+ *cache_size /= get_snc_mode();
return 0;
}
@@ -939,3 +941,69 @@ unsigned int count_bits(unsigned long n)
return count;
}
+
+static bool cpus_offline_empty(void)
+{
+ char offline_cpus_str[64];
+ FILE *fp;
+
+ fp = fopen("/sys/devices/system/cpu/offline", "r");
+ if (fscanf(fp, "%s", offline_cpus_str) < 0) {
+ if (!errno) {
+ fclose(fp);
+ return 1;
+ }
+ ksft_perror("Could not read offline CPUs file!");
+ }
+
+ fclose(fp);
+
+ return 0;
+}
+
+int get_snc_mode(void)
+{
+ static int snc_mode;
+
+ if (!snc_mode) {
+ snc_mode = snc_nodes_per_l3_cache();
+ if (!cpus_offline_empty()) {
+ ksft_print_msg("Runtime SNC detection unreliable due to offline CPUs!\n");
+ ksft_print_msg("Setting SNC mode to disabled.\n");
+ snc_mode = 1;
+ snc_unreliable = 1;
+ }
+ }
+
+ return snc_mode;
+}
+
+/**
+ * snc_kernel_support - Compare system reported cache size and resctrl
+ * reported cache size to get an idea if SNC is supported on the kernel side.
+ *
+ * Return: 0 if not supported, 1 if SNC is disabled or SNC is both enabled and
+ * supported, < 0 on failure.
+ */
+int snc_kernel_support(void)
+{
+ char node_path[PATH_MAX];
+ struct stat statbuf;
+ int ret;
+
+ ret = get_snc_mode();
+ /*
+ * If SNC is disabled then its kernel support isn't important. If value
+ * is smaller than 1 an error happened.
+ */
+ if (ret <= 1)
+ return ret;
+
+ snprintf(node_path, sizeof(node_path), "%s/%s/%s", RESCTRL_PATH, "mon_data",
+ "mon_L3_00/mon_sub_L3_00");
+
+ if (!stat(node_path, &statbuf))
+ return 1;
+
+ return 0;
+}
--
2.45.2
^ permalink raw reply related [flat|nested] 18+ messages in thread
* RE: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-01 14:18 ` [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages Maciej Wieczor-Retman
@ 2024-07-01 16:04 ` Luck, Tony
2024-07-02 21:26 ` Reinette Chatre
2024-07-03 6:50 ` Maciej Wieczor-Retman
2024-07-02 22:21 ` Reinette Chatre
1 sibling, 2 replies; 18+ messages in thread
From: Luck, Tony @ 2024-07-01 16:04 UTC (permalink / raw)
To: Wieczor-Retman, Maciej, shuah@kernel.org, Chatre, Reinette,
Yu, Fenghua
Cc: linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
+static bool cpus_offline_empty(void)
+{
+ char offline_cpus_str[64];
+ FILE *fp;
+
+ fp = fopen("/sys/devices/system/cpu/offline", "r");
Check for fp == NULL before using it.
+ if (fscanf(fp, "%s", offline_cpus_str) < 0) {
fscanf() seems like a heavy hammer.
if (fgets(offline_cpus_str, sizeof(offline_cpus_str), fp) == NULL) {
+ if (!errno) {
Don't need an errno check (seems dubious mixing errno with stdio).
+ fclose(fp);
+ return 1;
return true;
+ }
+ ksft_perror("Could not read offline CPUs file!");
+ }
+
+ fclose(fp);
+
+ return 0;
return false;
+}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-01 16:04 ` Luck, Tony
@ 2024-07-02 21:26 ` Reinette Chatre
2024-07-02 21:46 ` Tony Luck
2024-07-03 6:50 ` Maciej Wieczor-Retman
1 sibling, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-02 21:26 UTC (permalink / raw)
To: Luck, Tony, Wieczor-Retman, Maciej, shuah@kernel.org, Yu, Fenghua
Cc: linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
Hi Tony,
On 7/1/24 9:04 AM, Luck, Tony wrote:
> +static bool cpus_offline_empty(void)
> +{
> + char offline_cpus_str[64];
> + FILE *fp;
> +
> + fp = fopen("/sys/devices/system/cpu/offline", "r");
>
> Check for fp == NULL before using it.
>
> + if (fscanf(fp, "%s", offline_cpus_str) < 0) {
>
> fscanf() seems like a heavy hammer.
Do you perhaps have any recommendations that should be used instead of
fscanf()? I checked with stat() but could not see a difference between
file with a CPU and a file without. Other alternative is
open()/read()/close()? Looks like when there are no offline CPUs then
the file will only contain '\n' so it may be possible to read one byte
from the file and confirm it is '\n' as a check for "cpus_offline_empty()".
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-02 21:26 ` Reinette Chatre
@ 2024-07-02 21:46 ` Tony Luck
2024-07-02 22:01 ` Reinette Chatre
0 siblings, 1 reply; 18+ messages in thread
From: Tony Luck @ 2024-07-02 21:46 UTC (permalink / raw)
To: Reinette Chatre
Cc: Wieczor-Retman, Maciej, shuah@kernel.org, Yu, Fenghua,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
On Tue, Jul 02, 2024 at 02:26:25PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 7/1/24 9:04 AM, Luck, Tony wrote:
> > +static bool cpus_offline_empty(void)
> > +{
> > + char offline_cpus_str[64];
> > + FILE *fp;
> > +
> > + fp = fopen("/sys/devices/system/cpu/offline", "r");
> >
> > Check for fp == NULL before using it.
> >
> > + if (fscanf(fp, "%s", offline_cpus_str) < 0) {
> >
> > fscanf() seems like a heavy hammer.
>
> Do you perhaps have any recommendations that should be used instead of
> fscanf()? I checked with stat() but could not see a difference between
> file with a CPU and a file without. Other alternative is
> open()/read()/close()? Looks like when there are no offline CPUs then
> the file will only contain '\n' so it may be possible to read one byte
> from the file and confirm it is '\n' as a check for "cpus_offline_empty()".
Sorry. I replied with Outlook and didn't quote things properly
so my alternate suggestion didn't stand out. Here it is again:
if (fgets(offline_cpus_str, sizeof(offline_cpus_str), fp) == NULL) {
fclose(fp);
return true;
}
But that was with the assumption that /sys/devices/system/cpu/offline
would be empty. Not that it would conatain a single "\n" as you say
above.
So fgets( ...) followed with "if (offline_cpus_str[0] == '\n') "?
-Tony
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-02 21:46 ` Tony Luck
@ 2024-07-02 22:01 ` Reinette Chatre
2024-07-02 22:03 ` Luck, Tony
0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-02 22:01 UTC (permalink / raw)
To: Tony Luck
Cc: Wieczor-Retman, Maciej, shuah@kernel.org, Yu, Fenghua,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
Hi Tony,
On 7/2/24 2:46 PM, Tony Luck wrote:
> On Tue, Jul 02, 2024 at 02:26:25PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 7/1/24 9:04 AM, Luck, Tony wrote:
>>> +static bool cpus_offline_empty(void)
>>> +{
>>> + char offline_cpus_str[64];
>>> + FILE *fp;
>>> +
>>> + fp = fopen("/sys/devices/system/cpu/offline", "r");
>>>
>>> Check for fp == NULL before using it.
>>>
>>> + if (fscanf(fp, "%s", offline_cpus_str) < 0) {
>>>
>>> fscanf() seems like a heavy hammer.
>>
>> Do you perhaps have any recommendations that should be used instead of
>> fscanf()? I checked with stat() but could not see a difference between
>> file with a CPU and a file without. Other alternative is
>> open()/read()/close()? Looks like when there are no offline CPUs then
>> the file will only contain '\n' so it may be possible to read one byte
>> from the file and confirm it is '\n' as a check for "cpus_offline_empty()".
>
> Sorry. I replied with Outlook and didn't quote things properly
> so my alternate suggestion didn't stand out. Here it is again:
>
> if (fgets(offline_cpus_str, sizeof(offline_cpus_str), fp) == NULL) {
> fclose(fp);
> return true;
> }
Apologies, missed this.
>
> But that was with the assumption that /sys/devices/system/cpu/offline
> would be empty. Not that it would conatain a single "\n" as you say
> above.
>
> So fgets( ...) followed with "if (offline_cpus_str[0] == '\n') "?
How about simplifying it more to "if (fgetc(fp) == '\n')" ?
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-02 22:01 ` Reinette Chatre
@ 2024-07-02 22:03 ` Luck, Tony
2024-07-03 6:44 ` Maciej Wieczor-Retman
0 siblings, 1 reply; 18+ messages in thread
From: Luck, Tony @ 2024-07-02 22:03 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Wieczor-Retman, Maciej, shuah@kernel.org, Yu, Fenghua,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
>> So fgets( ...) followed with "if (offline_cpus_str[0] == '\n') "?
>
> How about simplifying it more to "if (fgetc(fp) == '\n')" ?
Reinette,
Even better.
-Tony
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled
2024-07-01 14:18 ` [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled Maciej Wieczor-Retman
@ 2024-07-02 22:20 ` Reinette Chatre
2024-07-03 6:43 ` Maciej Wieczor-Retman
0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-02 22:20 UTC (permalink / raw)
To: Maciej Wieczor-Retman, shuah, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Hi Maciej,
On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
...
> +int snc_nodes_per_l3_cache(void)
> +{
> + int node_cpus, cache_cpus, i, ret = 1;
> +
> + node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> + cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
> +
> + if (!node_cpus || !cache_cpus) {
> + ksft_print_msg("Could not determine Sub-NUMA Cluster mode!\n");
nit: could you please check this series and remove exclamation marks from the messages
printed to user space? Just stating the error should be sufficient.
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-01 14:18 ` [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages Maciej Wieczor-Retman
2024-07-01 16:04 ` Luck, Tony
@ 2024-07-02 22:21 ` Reinette Chatre
2024-07-03 7:43 ` Maciej Wieczór-Retman
1 sibling, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-02 22:21 UTC (permalink / raw)
To: Maciej Wieczor-Retman, shuah, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Hi Maciej,
On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
> Resctrl selftest prints a message on test failure that Sub-Numa
> Clustering (SNC) could be enabled and points the user to check theirs BIOS
theirs BIOS -> their BIOS?
> settings. No actual check is performed before printing that message so
> it is not very accurate in pinpointing a problem.
>
> Figuring out if SNC is enabled is only one part of the problem, the
> others being whether the detected SNC mode is reliable and whether the
> kernel supports SNC in resctrl.
>
> When there is SNC support for kernel's resctrl subsystem and SNC is
> enabled then sub node files are created for each node in the resctrlfs.
> The sub node files exist in each regular node's L3 monitoring directory.
> The reliable path to check for existence of sub node files is
> /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00.
>
> To check if SNC detection is reliable one can check the
> /sys/devices/system/cpu/offline file. If it's empty, it means all cores
> are operational and the ratio should be calculated correctly. If it has
> any contents, it means the detected SNC mode can't be trusted and should
> be disabled.
>
> Add helpers for detecting SNC mode and checking its reliability.
>
> Detect SNC mode once and let other tests inherit that information.
>
> Add messages to alert the user when SNC detection could return incorrect
> results.
>
> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
> ---
> Changelog v3:
> - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
> - Add printing the discovered SNC mode. (Reinette)
> - Change method of kernel support discovery from cache sizes to
> existance of sub node files.
> - Check if SNC detection is unreliable.
> - Move SNC detection to only the first run_single_test() instead on
> error at the end of test runs.
> - Add global value to remind user at the end of relevant tests if SNC
> detection was found to be unreliable.
> - Redo the patch message after the changes.
>
> Changelog v2:
> - Move snc_ways() checks from individual tests into
> snc_kernel_support().
> - Write better comment for snc_kernel_support().
>
> tools/testing/selftests/resctrl/cache.c | 3 +
> tools/testing/selftests/resctrl/cmt_test.c | 4 +-
> tools/testing/selftests/resctrl/mba_test.c | 4 ++
> tools/testing/selftests/resctrl/mbm_test.c | 6 +-
> tools/testing/selftests/resctrl/resctrl.h | 4 ++
> .../testing/selftests/resctrl/resctrl_tests.c | 7 ++
> tools/testing/selftests/resctrl/resctrlfs.c | 70 ++++++++++++++++++-
> 7 files changed, 93 insertions(+), 5 deletions(-)
>
> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
> index 1ff1104e6575..9885d64b8a21 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
> ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
> ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
> cache_span);
> + if (snc_unreliable)
> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
The message abour SNC detection being unreliable is already printed at beginning of every
test so I do not think it is necessary to print it again at this point.
> +
> }
> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
> index 0c045080d808..588543ae2654 100644
> --- a/tools/testing/selftests/resctrl/cmt_test.c
> +++ b/tools/testing/selftests/resctrl/cmt_test.c
> @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
> goto out;
>
> ret = check_results(¶m, span, n);
> - if (ret && (get_vendor() == ARCH_INTEL))
> - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
This message does seem to still be applicable if snc_unreliable == 1.
> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>
> out:
> free(span_str);
> diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
> index ab8496a4925b..c91e85f11285 100644
> --- a/tools/testing/selftests/resctrl/mba_test.c
> +++ b/tools/testing/selftests/resctrl/mba_test.c
> @@ -108,6 +108,8 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
> ksft_print_msg("avg_diff_per: %d%%\n", avg_diff_per);
> ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
> ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
> + if (snc_unreliable)
> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
(here I also think this is unnecessary)
> if (avg_diff_per > MAX_DIFF_PERCENT)
> ret = true;
> }
> @@ -179,6 +181,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
> return ret;
>
> ret = check_results();
> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>
> return ret;
> }
> diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
> index 6b5a3b52d861..562b02118270 100644
> --- a/tools/testing/selftests/resctrl/mbm_test.c
> +++ b/tools/testing/selftests/resctrl/mbm_test.c
> @@ -43,6 +43,8 @@ show_bw_info(unsigned long *bw_imc, unsigned long *bw_resc, size_t span)
> ksft_print_msg("Span (MB): %zu\n", span / MB);
> ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
> ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
> + if (snc_unreliable)
> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>
> return ret;
> }
> @@ -147,8 +149,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
> return ret;
>
> ret = check_results(DEFAULT_SPAN);
> - if (ret && (get_vendor() == ARCH_INTEL))
> - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
This message does seem to still be applicable if snc_unreliable == 1.
> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>
> return ret;
> }
> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> index 851b37c9c38a..fa44e1cde21b 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -121,10 +121,13 @@ struct perf_event_read {
> */
> extern volatile int *value_sink;
>
> +extern int snc_unreliable;
> +
> extern char llc_occup_path[1024];
>
> int snc_nodes_per_l3_cache(void);
> int get_vendor(void);
> +int get_snc_mode(void);
> bool check_resctrlfs_support(void);
> int filter_dmesg(void);
> int get_domain_id(const char *resource, int cpu_no, int *domain_id);
> @@ -167,6 +170,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
> int signal_handler_register(const struct resctrl_test *test);
> void signal_handler_unregister(void);
> unsigned int count_bits(unsigned long n);
> +int snc_kernel_support(void);
>
> void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
> void perf_event_initialize_read_format(struct perf_event_read *pe_read);
> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
> index ecbb7605a981..b17560bbaf5c 100644
> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
> @@ -12,6 +12,7 @@
>
> /* Volatile memory sink to prevent compiler optimizations */
> static volatile int sink_target;
> +static int snc_mode;
This global seems unnecessary (more later) and also potentially confusing since
the get_snc_mode() has a function local static variable of same name.
> volatile int *value_sink = &sink_target;
>
> static struct resctrl_test *resctrl_tests[] = {
> @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
> if (test->disabled)
> return;
>
> + if (!snc_mode) {
> + snc_mode = get_snc_mode();
> + if (snc_mode > 1)
From what I can tell this is the only place the global is used and this can just be:
if (get_snc_mode() > 1)
> + ksft_print_msg("SNC-%d mode discovered!\n", snc_mode);
> + }
> +
> if (!test_vendor_specific_check(test)) {
> ksft_test_result_skip("Hardware does not support %s\n", test->name);
> return;
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> index 18a31a2ba7b3..004fb6649789 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -13,6 +13,8 @@
>
> #include "resctrl.h"
>
> +int snc_unreliable;
> +
> static int find_resctrl_mount(char *buffer)
> {
> FILE *mounts;
> @@ -280,7 +282,7 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
> * with a fully populated L3 mask in the schemata file.
> */
> if (cache_num == 3)
> - *cache_size /= snc_nodes_per_l3_cache();
> + *cache_size /= get_snc_mode();
hmmm ... it is not ideal to use second patch to change something from first patch.
Just having a single snc_nodes_per_l3_cache() will eliminate this change (more below).
> return 0;
> }
>
> @@ -939,3 +941,69 @@ unsigned int count_bits(unsigned long n)
>
> return count;
> }
> +
> +static bool cpus_offline_empty(void)
> +{
> + char offline_cpus_str[64];
> + FILE *fp;
> +
> + fp = fopen("/sys/devices/system/cpu/offline", "r");
> + if (fscanf(fp, "%s", offline_cpus_str) < 0) {
> + if (!errno) {
> + fclose(fp);
> + return 1;
> + }
> + ksft_perror("Could not read offline CPUs file!");
> + }
> +
> + fclose(fp);
> +
> + return 0;
> +}
> +
> +int get_snc_mode(void)
> +{
> + static int snc_mode;
> +
> + if (!snc_mode) {
> + snc_mode = snc_nodes_per_l3_cache();
> + if (!cpus_offline_empty()) {
> + ksft_print_msg("Runtime SNC detection unreliable due to offline CPUs!\n");
> + ksft_print_msg("Setting SNC mode to disabled.\n");
> + snc_mode = 1;
> + snc_unreliable = 1;
> + }
> + }
> +
> + return snc_mode;
> +}
I think the SNC detection will be easier to understand if it is done in a single
place. Can the static local variable and checks using the offline file instead be included in
existing snc_nodes_per_l3_cache()?
> +
> +/**
> + * snc_kernel_support - Compare system reported cache size and resctrl
> + * reported cache size to get an idea if SNC is supported on the kernel side.
This comment does not seem to match what the function does.
> + *
> + * Return: 0 if not supported, 1 if SNC is disabled or SNC is both enabled and
> + * supported, < 0 on failure.
> + */
> +int snc_kernel_support(void)
> +{
> + char node_path[PATH_MAX];
> + struct stat statbuf;
> + int ret;
> +
> + ret = get_snc_mode();
> + /*
> + * If SNC is disabled then its kernel support isn't important. If value
> + * is smaller than 1 an error happened.
How can a value smaller than 1 be returned?
> + */
> + if (ret <= 1)
> + return ret;
> +
> + snprintf(node_path, sizeof(node_path), "%s/%s/%s", RESCTRL_PATH, "mon_data",
> + "mon_L3_00/mon_sub_L3_00");
> +
> + if (!stat(node_path, &statbuf))
> + return 1;
> +
> + return 0;
> +}
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled
2024-07-02 22:20 ` Reinette Chatre
@ 2024-07-03 6:43 ` Maciej Wieczor-Retman
0 siblings, 0 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-03 6:43 UTC (permalink / raw)
To: Reinette Chatre
Cc: shuah, fenghua.yu, linux-kernel, linux-kselftest, ilpo.jarvinen,
tony.luck
On 2024-07-02 at 15:20:22 -0700, Reinette Chatre wrote:
>Hi Maciej,
>
>On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>...
>> +int snc_nodes_per_l3_cache(void)
>> +{
>> + int node_cpus, cache_cpus, i, ret = 1;
>> +
>> + node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
>> + cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
>> +
>> + if (!node_cpus || !cache_cpus) {
>> + ksft_print_msg("Could not determine Sub-NUMA Cluster mode!\n");
>
>nit: could you please check this series and remove exclamation marks from the messages
>printed to user space? Just stating the error should be sufficient.
Sure, will do.
>
>Reinette
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-02 22:03 ` Luck, Tony
@ 2024-07-03 6:44 ` Maciej Wieczor-Retman
0 siblings, 0 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-03 6:44 UTC (permalink / raw)
To: Luck, Tony
Cc: Chatre, Reinette, shuah@kernel.org, Yu, Fenghua,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
On 2024-07-03 at 00:03:58 +0200, Luck, Tony wrote:
>>> So fgets( ...) followed with "if (offline_cpus_str[0] == '\n') "?
>>
>> How about simplifying it more to "if (fgetc(fp) == '\n')" ?
>
>Reinette,
>
>Even better.
>
>-Tony
Thanks, I'll try using this instead of fscanf.
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-01 16:04 ` Luck, Tony
2024-07-02 21:26 ` Reinette Chatre
@ 2024-07-03 6:50 ` Maciej Wieczor-Retman
1 sibling, 0 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-03 6:50 UTC (permalink / raw)
To: Luck, Tony
Cc: shuah@kernel.org, Chatre, Reinette, Yu, Fenghua,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
ilpo.jarvinen@linux.intel.com
Thanks for the review,
On 2024-07-01 at 18:04:00 +0200, Luck, Tony wrote:
>+static bool cpus_offline_empty(void)
>+{
>+ char offline_cpus_str[64];
>+ FILE *fp;
>+
>+ fp = fopen("/sys/devices/system/cpu/offline", "r");
>
>Check for fp == NULL before using it.
Thanks, will add it.
>
>+ if (fscanf(fp, "%s", offline_cpus_str) < 0) {
>
>fscanf() seems like a heavy hammer.
>
> if (fgets(offline_cpus_str, sizeof(offline_cpus_str), fp) == NULL) {
>+ if (!errno) {
>
>Don't need an errno check (seems dubious mixing errno with stdio).
fscanf() returns "-1" when nothing was read as well as if there was an error.
But when nothing was read the errno is "0" instead of some error code so I could
differentiate the cases that way. The fgetc() you settled on shouldn't have this
problem.
>
>+ fclose(fp);
>+ return 1;
>
> return true;
>
>+ }
>+ ksft_perror("Could not read offline CPUs file!");
>+ }
>+
>+ fclose(fp);
>+
>+ return 0;
>
> return false;
>+}
>
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-02 22:21 ` Reinette Chatre
@ 2024-07-03 7:43 ` Maciej Wieczór-Retman
2024-07-03 20:51 ` Reinette Chatre
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Wieczór-Retman @ 2024-07-03 7:43 UTC (permalink / raw)
To: Reinette Chatre, shuah, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Hi, and thanks for the review,
On 3.07.2024 00:21, Reinette Chatre wrote:
> Hi Maciej,
>
> On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>> Resctrl selftest prints a message on test failure that Sub-Numa
>> Clustering (SNC) could be enabled and points the user to check theirs BIOS
>
> theirs BIOS -> their BIOS?
Right, thanks.
>
>> settings. No actual check is performed before printing that message so
>> it is not very accurate in pinpointing a problem.
>>
>> Figuring out if SNC is enabled is only one part of the problem, the
>> others being whether the detected SNC mode is reliable and whether the
>> kernel supports SNC in resctrl.
>>
>> When there is SNC support for kernel's resctrl subsystem and SNC is
>> enabled then sub node files are created for each node in the resctrlfs.
>> The sub node files exist in each regular node's L3 monitoring directory.
>> The reliable path to check for existence of sub node files is
>> /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00.
>>
>> To check if SNC detection is reliable one can check the
>> /sys/devices/system/cpu/offline file. If it's empty, it means all cores
>> are operational and the ratio should be calculated correctly. If it has
>> any contents, it means the detected SNC mode can't be trusted and should
>> be disabled.
>>
>> Add helpers for detecting SNC mode and checking its reliability.
>>
>> Detect SNC mode once and let other tests inherit that information.
>>
>> Add messages to alert the user when SNC detection could return incorrect
>> results.
>>
>> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
>> ---
>> Changelog v3:
>> - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette)
>> - Add printing the discovered SNC mode. (Reinette)
>> - Change method of kernel support discovery from cache sizes to
>> existance of sub node files.
>> - Check if SNC detection is unreliable.
>> - Move SNC detection to only the first run_single_test() instead on
>> error at the end of test runs.
>> - Add global value to remind user at the end of relevant tests if SNC
>> detection was found to be unreliable.
>> - Redo the patch message after the changes.
>>
>> Changelog v2:
>> - Move snc_ways() checks from individual tests into
>> snc_kernel_support().
>> - Write better comment for snc_kernel_support().
>>
>> tools/testing/selftests/resctrl/cache.c | 3 +
>> tools/testing/selftests/resctrl/cmt_test.c | 4 +-
>> tools/testing/selftests/resctrl/mba_test.c | 4 ++
>> tools/testing/selftests/resctrl/mbm_test.c | 6 +-
>> tools/testing/selftests/resctrl/resctrl.h | 4 ++
>> .../testing/selftests/resctrl/resctrl_tests.c | 7 ++
>> tools/testing/selftests/resctrl/resctrlfs.c | 70 ++++++++++++++++++-
>> 7 files changed, 93 insertions(+), 5 deletions(-)
>>
>> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>> index 1ff1104e6575..9885d64b8a21 100644
>> --- a/tools/testing/selftests/resctrl/cache.c
>> +++ b/tools/testing/selftests/resctrl/cache.c
>> @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>> ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>> ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>> cache_span);
>> + if (snc_unreliable)
>> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>
> The message abour SNC detection being unreliable is already printed at beginning of every
> test so I do not think it is necessary to print it again at this point.
>
The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
variable inside get_snc_mode() prevents the detection from running more than once when other places
call get_snc_mode() (like when the cache size is adjusted).
And as we discussed last time it's beneficial to put error messages at the end of the test in case the
user misses the initial warning at the very beginning.
>> +
>> }
>> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>> index 0c045080d808..588543ae2654 100644
>> --- a/tools/testing/selftests/resctrl/cmt_test.c
>> +++ b/tools/testing/selftests/resctrl/cmt_test.c
>> @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>> goto out;
>> ret = check_results(¶m, span, n);
>> - if (ret && (get_vendor() == ARCH_INTEL))
>> - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>
> This message does seem to still be applicable if snc_unreliable == 1.
I was going for this one error message to specifically catch the kernel
not having snc support for resctrl while snc is enabled. While the
above message could be true when snc_unreliable == 1, it doesn't have to.
SNC might not be enabled at all so there would be no reason to send the user
to check their BIOS - instead they can learn they have offline CPUs and they can
work on fixing that. In my opinion it could be beneficial to have more specialized
messages in the selftests to help users diagnose problems quicker.
Having only this one message wihtout the "if snc unreliable" messages would
mean nothing would get printed at the end on success with unreliable SNC detection.
>
>> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
>> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>> out:
>> free(span_str);
>> diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
>> index ab8496a4925b..c91e85f11285 100644
>> --- a/tools/testing/selftests/resctrl/mba_test.c
>> +++ b/tools/testing/selftests/resctrl/mba_test.c
>> @@ -108,6 +108,8 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
>> ksft_print_msg("avg_diff_per: %d%%\n", avg_diff_per);
>> ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
>> ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
>> + if (snc_unreliable)
>> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>
> (here I also think this is unnecessary)
Same as above.
>
>> if (avg_diff_per > MAX_DIFF_PERCENT)
>> ret = true;
>> }
>> @@ -179,6 +181,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
>> return ret;
>> ret = check_results();
>> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
>> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>> return ret;
>> }
>> diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
>> index 6b5a3b52d861..562b02118270 100644
>> --- a/tools/testing/selftests/resctrl/mbm_test.c
>> +++ b/tools/testing/selftests/resctrl/mbm_test.c
>> @@ -43,6 +43,8 @@ show_bw_info(unsigned long *bw_imc, unsigned long *bw_resc, size_t span)
>> ksft_print_msg("Span (MB): %zu\n", span / MB);
>> ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
>> ksft_print_msg("avg_bw_resc: %lu\n", avg_bw_resc);
>> + if (snc_unreliable)
>> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>> return ret;
>> }
>> @@ -147,8 +149,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
>> return ret;
>> ret = check_results(DEFAULT_SPAN);
>> - if (ret && (get_vendor() == ARCH_INTEL))
>> - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>
> This message does seem to still be applicable if snc_unreliable == 1.
Same as above.
>
>> + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
>> + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>> return ret;
>> }
>> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
>> index 851b37c9c38a..fa44e1cde21b 100644
>> --- a/tools/testing/selftests/resctrl/resctrl.h
>> +++ b/tools/testing/selftests/resctrl/resctrl.h
>> @@ -121,10 +121,13 @@ struct perf_event_read {
>> */
>> extern volatile int *value_sink;
>> +extern int snc_unreliable;
>> +
>> extern char llc_occup_path[1024];
>> int snc_nodes_per_l3_cache(void);
>> int get_vendor(void);
>> +int get_snc_mode(void);
>> bool check_resctrlfs_support(void);
>> int filter_dmesg(void);
>> int get_domain_id(const char *resource, int cpu_no, int *domain_id);
>> @@ -167,6 +170,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
>> int signal_handler_register(const struct resctrl_test *test);
>> void signal_handler_unregister(void);
>> unsigned int count_bits(unsigned long n);
>> +int snc_kernel_support(void);
>> void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
>> void perf_event_initialize_read_format(struct perf_event_read *pe_read);
>> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
>> index ecbb7605a981..b17560bbaf5c 100644
>> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
>> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
>> @@ -12,6 +12,7 @@
>> /* Volatile memory sink to prevent compiler optimizations */
>> static volatile int sink_target;
>> +static int snc_mode;
>
> This global seems unnecessary (more later) and also potentially confusing since
> the get_snc_mode() has a function local static variable of same name.
>
>> volatile int *value_sink = &sink_target;
>> static struct resctrl_test *resctrl_tests[] = {
>> @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>> if (test->disabled)
>> return;
>> + if (!snc_mode) {
>> + snc_mode = get_snc_mode();
>> + if (snc_mode > 1)
>
>
> From what I can tell this is the only place the global is used and this can just be:
> if (get_snc_mode() > 1)
I wanted to print the message below only on the first call to run_single_test() and then
print relevant warnings at the very end of each test. I thought that was your intention
when we discussed what messages are supposed to be printed and when in v2 of this series.
Do you think it would be better to just print this message at the start of each test?
Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
a second local static variable into get_snc_mode() that would control whether or not
the message should be printed?
>
>> + ksft_print_msg("SNC-%d mode discovered!\n", snc_mode);
>> + }
>> +
>> if (!test_vendor_specific_check(test)) {
>> ksft_test_result_skip("Hardware does not support %s\n", test->name);
>> return;
>> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
>> index 18a31a2ba7b3..004fb6649789 100644
>> --- a/tools/testing/selftests/resctrl/resctrlfs.c
>> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
>> @@ -13,6 +13,8 @@
>> #include "resctrl.h"
>> +int snc_unreliable;
>> +
>> static int find_resctrl_mount(char *buffer)
>> {
>> FILE *mounts;
>> @@ -280,7 +282,7 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
>> * with a fully populated L3 mask in the schemata file.
>> */
>> if (cache_num == 3)
>> - *cache_size /= snc_nodes_per_l3_cache();
>> + *cache_size /= get_snc_mode();
>
> hmmm ... it is not ideal to use second patch to change something from first patch.
> Just having a single snc_nodes_per_l3_cache() will eliminate this change (more below).
>
>> return 0;
>> }
>> @@ -939,3 +941,69 @@ unsigned int count_bits(unsigned long n)
>> return count;
>> }
>> +
>> +static bool cpus_offline_empty(void)
>> +{
>> + char offline_cpus_str[64];
>> + FILE *fp;
>> +
>> + fp = fopen("/sys/devices/system/cpu/offline", "r");
>> + if (fscanf(fp, "%s", offline_cpus_str) < 0) {
>> + if (!errno) {
>> + fclose(fp);
>> + return 1;
>> + }
>> + ksft_perror("Could not read offline CPUs file!");
>> + }
>> +
>> + fclose(fp);
>> +
>> + return 0;
>> +}
>> +
>> +int get_snc_mode(void)
>> +{
>> + static int snc_mode;
>> +
>> + if (!snc_mode) {
>> + snc_mode = snc_nodes_per_l3_cache();
>> + if (!cpus_offline_empty()) {
>> + ksft_print_msg("Runtime SNC detection unreliable due to offline CPUs!\n");
>> + ksft_print_msg("Setting SNC mode to disabled.\n");
>> + snc_mode = 1;
>> + snc_unreliable = 1;
>> + }
>> + }
>> +
>> + return snc_mode;
>> +}
>
> I think the SNC detection will be easier to understand if it is done in a single
> place. Can the static local variable and checks using the offline file instead be included in
> existing snc_nodes_per_l3_cache()?
Sure, that sounds good.
>
>> +
>> +/**
>> + * snc_kernel_support - Compare system reported cache size and resctrl
>> + * reported cache size to get an idea if SNC is supported on the kernel side.
>
> This comment does not seem to match what the function does.
Oops, sorry, will fix it.
>
>> + *
>> + * Return: 0 if not supported, 1 if SNC is disabled or SNC is both enabled and
>> + * supported, < 0 on failure.
>> + */
>> +int snc_kernel_support(void)
>> +{
>> + char node_path[PATH_MAX];
>> + struct stat statbuf;
>> + int ret;
>> +
>> + ret = get_snc_mode();
>> + /*
>> + * If SNC is disabled then its kernel support isn't important. If value
>> + * is smaller than 1 an error happened.
>
> How can a value smaller than 1 be returned?
I think I left it here by accident because I was experimenting with other ways
of detecting the snc mode and then it could return errors. Will remove it.
>
>> + */
>> + if (ret <= 1)
>> + return ret;
>> +
>> + snprintf(node_path, sizeof(node_path), "%s/%s/%s", RESCTRL_PATH, "mon_data",
>> + "mon_L3_00/mon_sub_L3_00");
>> +
>> + if (!stat(node_path, &statbuf))
>> + return 1;
>> +
>> + return 0;
>> +}
>
>
> Reinette
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-03 7:43 ` Maciej Wieczór-Retman
@ 2024-07-03 20:51 ` Reinette Chatre
2024-07-04 7:23 ` Maciej Wieczor-Retman
0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-03 20:51 UTC (permalink / raw)
To: Maciej Wieczór-Retman, shuah, fenghua.yu
Cc: linux-kernel, linux-kselftest, ilpo.jarvinen, tony.luck
Hi Maciej,
On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
> On 3.07.2024 00:21, Reinette Chatre wrote:
>> On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>>> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>>> index 1ff1104e6575..9885d64b8a21 100644
>>> --- a/tools/testing/selftests/resctrl/cache.c
>>> +++ b/tools/testing/selftests/resctrl/cache.c
>>> @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>>> ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>>> ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>>> cache_span);
>>> + if (snc_unreliable)
>>> + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>>
>> The message abour SNC detection being unreliable is already printed at beginning of every
>> test so I do not think it is necessary to print it again at this point.
>>
>
> The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
There is more about this later, but this can be printed at start of each test.
> That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
> run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
> variable inside get_snc_mode() prevents the detection from running more than once when other places
> call get_snc_mode() (like when the cache size is adjusted).
The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the
local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test()
can just do a:
int snc_mode; /* new name welcome */
snc_mode = snc_nodes_per_l3_cache();
if (snc_mode > 1)
ksft_print_msg("SNC-%d mode discovered\n", snc_mode);
else if (snc_unreliable)
ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
>
> And as we discussed last time it's beneficial to put error messages at the end of the test in case the
> user misses the initial warning at the very beginning.
Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info()
as well as from run*test(). I expect "the end" to be a single place.
>>> }
>>> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>>> index 0c045080d808..588543ae2654 100644
>>> --- a/tools/testing/selftests/resctrl/cmt_test.c
>>> +++ b/tools/testing/selftests/resctrl/cmt_test.c
>>> @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>>> goto out;
>>> ret = check_results(¶m, span, n);
>>> - if (ret && (get_vendor() == ARCH_INTEL))
>>> - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>>
>> This message does seem to still be applicable if snc_unreliable == 1.
>
> I was going for this one error message to specifically catch the kernel
> not having snc support for resctrl while snc is enabled. While the
> above message could be true when snc_unreliable == 1, it doesn't have to.
If a test fails when snc_unreliable == 1 then nothing is certain and some generic message
is needed.
> SNC might not be enabled at all so there would be no reason to send the user
> to check their BIOS - instead they can learn they have offline CPUs and they can
> work on fixing that. In my opinion it could be beneficial to have more specialized
> messages in the selftests to help users diagnose problems quicker.
My goal is indeed to have specialized messages. There cannot be a specialized message
if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
enabled and it is up to the user to investigate further.
>
> Having only this one message wihtout the "if snc unreliable" messages would
> mean nothing would get printed at the end on success with unreliable SNC detection.
Having a pass/fail is what user will focus on. If the test passes then SNC detection
should not matter. The messages are just there to help user root cause where a failure
may be.
...
>>
>>> volatile int *value_sink = &sink_target;
>>> static struct resctrl_test *resctrl_tests[] = {
>>> @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>>> if (test->disabled)
>>> return;
>>> + if (!snc_mode) {
>>> + snc_mode = get_snc_mode();
>>> + if (snc_mode > 1)
>>
>>
>> From what I can tell this is the only place the global is used and this can just be:
>> if (get_snc_mode() > 1)
>
> I wanted to print the message below only on the first call to run_single_test() and then
> print relevant warnings at the very end of each test. I thought that was your intention
> when we discussed what messages are supposed to be printed and when in v2 of this series.
>
> Do you think it would be better to just print this message at the start of each test?
Yes. If there is a problem with a test the user could be expected to start tracing back
messages printed from beginning of failing test.
> Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
> a second local static variable into get_snc_mode() that would control whether or not
> the message should be printed?
I do not see where more local static variables may be needed.
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-03 20:51 ` Reinette Chatre
@ 2024-07-04 7:23 ` Maciej Wieczor-Retman
2024-07-08 16:39 ` Reinette Chatre
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-04 7:23 UTC (permalink / raw)
To: Reinette Chatre
Cc: shuah, fenghua.yu, linux-kernel, linux-kselftest, ilpo.jarvinen,
tony.luck
Hi Reinette,
On 2024-07-03 at 13:51:03 -0700, Reinette Chatre wrote:
>Hi Maciej,
>
>On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
>> On 3.07.2024 00:21, Reinette Chatre wrote:
>> > On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>
>> > > diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
>> > > index 1ff1104e6575..9885d64b8a21 100644
>> > > --- a/tools/testing/selftests/resctrl/cache.c
>> > > +++ b/tools/testing/selftests/resctrl/cache.c
>> > > @@ -186,4 +186,7 @@ void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool
>> > > ksft_print_msg("Average LLC val: %llu\n", avg_llc_val);
>> > > ksft_print_msg("Cache span (%s): %zu\n", lines ? "lines" : "bytes",
>> > > cache_span);
>> > > + if (snc_unreliable)
>> > > + ksft_print_msg("SNC detection unreliable due to offline CPUs!\n");
>> >
>> > The message abour SNC detection being unreliable is already printed at beginning of every
>> > test so I do not think it is necessary to print it again at this point.
>> >
>>
>> The "SNC detection was unreliable" only gets printed on the first execution of run_single_test().
>
>There is more about this later, but this can be printed at start of each test.
Okay, thanks.
>
>> That's what the global snc_mode was for mostly, it starts initialized to 0, and then on the first
>> run of run_single_test() it is set so other tests don't call get_snc_mode(). And then the local static
>> variable inside get_snc_mode() prevents the detection from running more than once when other places
>> call get_snc_mode() (like when the cache size is adjusted).
>
>The shadowing of variables can get confusing. I think the global snc_mode is not necessary, having the
>local static variable within snc_nodes_per_l3_cache() should be sufficient and run_single_test()
>can just do a:
>
> int snc_mode; /* new name welcome */
>
> snc_mode = snc_nodes_per_l3_cache();
> if (snc_mode > 1)
> ksft_print_msg("SNC-%d mode discovered\n", snc_mode);
> else if (snc_unreliable)
> ksft_print_msg("SNC detection unreliable due to offline CPUs. Test results may not be accurate if SNC enabled.\n");
This sounds good, thanks.
>
>>
>> And as we discussed last time it's beneficial to put error messages at the end of the test in case the
>> user misses the initial warning at the very beginning.
>
>Right. What I found unexpected was that it is done "at the end" but from two places, from the show*info()
>as well as from run*test(). I expect "the end" to be a single place.
Okay, I'll remove messages from show*info().
>
>> > > }
>> > > diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
>> > > index 0c045080d808..588543ae2654 100644
>> > > --- a/tools/testing/selftests/resctrl/cmt_test.c
>> > > +++ b/tools/testing/selftests/resctrl/cmt_test.c
>> > > @@ -175,8 +175,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
>> > > goto out;
>> > > ret = check_results(¶m, span, n);
>> > > - if (ret && (get_vendor() == ARCH_INTEL))
>> > > - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>> >
>> > This message does seem to still be applicable if snc_unreliable == 1.
>>
>> I was going for this one error message to specifically catch the kernel
>> not having snc support for resctrl while snc is enabled. While the
>> above message could be true when snc_unreliable == 1, it doesn't have to.
>
>If a test fails when snc_unreliable == 1 then nothing is certain and some generic message
>is needed.
Right
>
>> SNC might not be enabled at all so there would be no reason to send the user
>> to check their BIOS - instead they can learn they have offline CPUs and they can
>> work on fixing that. In my opinion it could be beneficial to have more specialized
>> messages in the selftests to help users diagnose problems quicker.
>
>My goal is indeed to have specialized messages. There cannot be a specialized message
>if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
>enabled and it is up to the user to investigate further.
How about this in cmt_run_test() for example:
if (snc_unreliable)
ksft_print_msg("Intel CMT may be inaccurate or inefficient when Sub-NUMA Clustering is enabled and not properly detected.\n");
else if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
This way there is a generic message when snc_unreliable == 1.
And as you mentioned at the end of this email, the user can be expected to
backtrack to the beginning of the test if there are any problems so they can
discover the exact source of the issue - offline cpus.
>
>>
>> Having only this one message wihtout the "if snc unreliable" messages would
>> mean nothing would get printed at the end on success with unreliable SNC detection.
>
>Having a pass/fail is what user will focus on. If the test passes then SNC detection
>should not matter. The messages are just there to help user root cause where a failure
>may be.
My train of thought was that if test passes with broken SNC detection it means
SNC was used inefficiently right? (either the portion of L3 used was bigger or
smaller than that allocated for one cluster)
It's not exactly a failure but I thought it deserves a warning at the very end
to alert the user.
If you don't think the warning should be printed on success I guess the
condition can be:
if (ret && snc_unreliable)
and the user can look at the start of the test if they care about more
information. And the message can lose the "inefficient" word since it would only
happen on error.
>
>...
>> >
>> > > volatile int *value_sink = &sink_target;
>> > > static struct resctrl_test *resctrl_tests[] = {
>> > > @@ -123,6 +124,12 @@ static void run_single_test(const struct resctrl_test *test, const struct user_p
>> > > if (test->disabled)
>> > > return;
>> > > + if (!snc_mode) {
>> > > + snc_mode = get_snc_mode();
>> > > + if (snc_mode > 1)
>> >
>> >
>> > From what I can tell this is the only place the global is used and this can just be:
>> > if (get_snc_mode() > 1)
>>
>> I wanted to print the message below only on the first call to run_single_test() and then
>> print relevant warnings at the very end of each test. I thought that was your intention
>> when we discussed what messages are supposed to be printed and when in v2 of this series.
>>
>> Do you think it would be better to just print this message at the start of each test?
>
>Yes. If there is a problem with a test the user could be expected to start tracing back
>messages printed from beginning of failing test.
I tried to reply to this comment with the suggestion above.
>
>> Or should I make "snc_mode" into local static inside run_single_test()? Or maybe add
>> a second local static variable into get_snc_mode() that would control whether or not
>> the message should be printed?
>
>I do not see where more local static variables may be needed.
Since you agreed with the previous paragraph this one doesn't matter.
>
>Reinette
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-04 7:23 ` Maciej Wieczor-Retman
@ 2024-07-08 16:39 ` Reinette Chatre
2024-07-09 7:49 ` Maciej Wieczor-Retman
0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2024-07-08 16:39 UTC (permalink / raw)
To: Maciej Wieczor-Retman
Cc: shuah, fenghua.yu, linux-kernel, linux-kselftest, ilpo.jarvinen,
tony.luck
Hi Maciej,
On 7/4/24 12:23 AM, Maciej Wieczor-Retman wrote:
> On 2024-07-03 at 13:51:03 -0700, Reinette Chatre wrote:
>> On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
>>> On 3.07.2024 00:21, Reinette Chatre wrote:
>>>> On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
...
>>> SNC might not be enabled at all so there would be no reason to send the user
>>> to check their BIOS - instead they can learn they have offline CPUs and they can
>>> work on fixing that. In my opinion it could be beneficial to have more specialized
>>> messages in the selftests to help users diagnose problems quicker.
>>
>> My goal is indeed to have specialized messages. There cannot be a specialized message
>> if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
>> enabled and it is up to the user to investigate further.
>
> How about this in cmt_run_test() for example:
>
> if (snc_unreliable)
> ksft_print_msg("Intel CMT may be inaccurate or inefficient when Sub-NUMA Clustering is enabled and not properly detected.\n");
It is ok with me if you want to keep the message even if the test succeeds. Without seeing
the new implementation it is unclear to me why the SNC check below is guarded by an ARCH_INTEL
check while the one above is not. Ideally this should be consistent to not confuse how
the architectures need to be treated here.
The message does sound a bit vague to me about being able to detect SNC. How about something
like:
Sub-NUMA Clustering could not be detected properly (see earlier messages for details).
Intel CMT may be inaccurate.
> else if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
> ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
The "Check BIOS configuration" guidance is not clear to me. If the kernel does not
support SNC then the user could also be guided to upgrade their kernel? Perhaps that
snippet can just be dropped? To be more specific that SNC enabling is not a kernel
configuration but a system configuration this can read (please feel free to improve):
Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.
> This way there is a generic message when snc_unreliable == 1.
>
> And as you mentioned at the end of this email, the user can be expected to
> backtrack to the beginning of the test if there are any problems so they can
> discover the exact source of the issue - offline cpus.
>
>>
>>>
>>> Having only this one message wihtout the "if snc unreliable" messages would
>>> mean nothing would get printed at the end on success with unreliable SNC detection.
>>
>> Having a pass/fail is what user will focus on. If the test passes then SNC detection
>> should not matter. The messages are just there to help user root cause where a failure
>> may be.
>
> My train of thought was that if test passes with broken SNC detection it means
> SNC was used inefficiently right? (either the portion of L3 used was bigger or
> smaller than that allocated for one cluster)
>
> It's not exactly a failure but I thought it deserves a warning at the very end
> to alert the user.
>
> If you don't think the warning should be printed on success I guess the
> condition can be:
> if (ret && snc_unreliable)
> and the user can look at the start of the test if they care about more
> information. And the message can lose the "inefficient" word since it would only
> happen on error.
Reinette
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages
2024-07-08 16:39 ` Reinette Chatre
@ 2024-07-09 7:49 ` Maciej Wieczor-Retman
0 siblings, 0 replies; 18+ messages in thread
From: Maciej Wieczor-Retman @ 2024-07-09 7:49 UTC (permalink / raw)
To: Reinette Chatre
Cc: shuah, fenghua.yu, linux-kernel, linux-kselftest, ilpo.jarvinen,
tony.luck
Hello,
On 2024-07-08 at 09:39:02 -0700, Reinette Chatre wrote:
>Hi Maciej,
>
>On 7/4/24 12:23 AM, Maciej Wieczor-Retman wrote:
>> On 2024-07-03 at 13:51:03 -0700, Reinette Chatre wrote:
>> > On 7/3/24 12:43 AM, Maciej Wieczór-Retman wrote:
>> > > On 3.07.2024 00:21, Reinette Chatre wrote:
>> > > > On 7/1/24 7:18 AM, Maciej Wieczor-Retman wrote:
>
>...
>
>> > > SNC might not be enabled at all so there would be no reason to send the user
>> > > to check their BIOS - instead they can learn they have offline CPUs and they can
>> > > work on fixing that. In my opinion it could be beneficial to have more specialized
>> > > messages in the selftests to help users diagnose problems quicker.
>> >
>> > My goal is indeed to have specialized messages. There cannot be a specialized message
>> > if snc_reliable == 1. In this case it needs to be generic since SNC may or may not be
>> > enabled and it is up to the user to investigate further.
>>
>> How about this in cmt_run_test() for example:
>>
>> if (snc_unreliable)
>> ksft_print_msg("Intel CMT may be inaccurate or inefficient when Sub-NUMA Clustering is enabled and not properly detected.\n");
>
>It is ok with me if you want to keep the message even if the test succeeds. Without seeing
>the new implementation it is unclear to me why the SNC check below is guarded by an ARCH_INTEL
>check while the one above is not. Ideally this should be consistent to not confuse how
>the architectures need to be treated here.
Right, I'll add the get_vendor() check to this too.
>
>The message does sound a bit vague to me about being able to detect SNC. How about something
>like:
> Sub-NUMA Clustering could not be detected properly (see earlier messages for details).
> Intel CMT may be inaccurate.
It sounds good, I'll change the message to this.
>> else if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support())
>> ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
>
>The "Check BIOS configuration" guidance is not clear to me. If the kernel does not
>support SNC then the user could also be guided to upgrade their kernel? Perhaps that
>snippet can just be dropped? To be more specific that SNC enabling is not a kernel
>configuration but a system configuration this can read (please feel free to improve):
> Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.
I suppose you're right, this does look better. Thanks!
>
>
>Reinette
--
Kind regards
Maciej Wieczór-Retman
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-07-09 7:50 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-01 14:17 [PATCH v3 0/2] selftests/resctrl: SNC kernel support discovery Maciej Wieczor-Retman
2024-07-01 14:18 ` [PATCH v3 1/2] selftests/resctrl: Adjust effective L3 cache size with SNC enabled Maciej Wieczor-Retman
2024-07-02 22:20 ` Reinette Chatre
2024-07-03 6:43 ` Maciej Wieczor-Retman
2024-07-01 14:18 ` [PATCH v3 2/2] selftests/resctrl: Adjust SNC support messages Maciej Wieczor-Retman
2024-07-01 16:04 ` Luck, Tony
2024-07-02 21:26 ` Reinette Chatre
2024-07-02 21:46 ` Tony Luck
2024-07-02 22:01 ` Reinette Chatre
2024-07-02 22:03 ` Luck, Tony
2024-07-03 6:44 ` Maciej Wieczor-Retman
2024-07-03 6:50 ` Maciej Wieczor-Retman
2024-07-02 22:21 ` Reinette Chatre
2024-07-03 7:43 ` Maciej Wieczór-Retman
2024-07-03 20:51 ` Reinette Chatre
2024-07-04 7:23 ` Maciej Wieczor-Retman
2024-07-08 16:39 ` Reinette Chatre
2024-07-09 7:49 ` Maciej Wieczor-Retman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).