Igt-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH i-g-t] RFC: igt-runner fact checking
@ 2024-10-28  7:08 Peter Senna Tschudin
  2024-10-28 10:06 ` Zbigniew Kempczyński
  2024-10-28 12:53 ` Krzysztofik, Janusz
  0 siblings, 2 replies; 4+ messages in thread
From: Peter Senna Tschudin @ 2024-10-28  7:08 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org; +Cc: Lucas De Marchi

WHAT: THIS IS AN RFC

 These functions collect facts, such as "Is the i915 kmod loaded?" when
 igt-runner starts and then recheck the facts after every test is run.
 Currently the code will abort on fact changes, but this can obviously
 be altered to suit our needs.

TODO

 I have only implemented facts for checking if gpu kmods are loaded. I
 want facts for:
  - GPUs on the PCI bus: pci address and card info: pciid, vendor,
    model, codename
  - DRM Cards: association between card number and pci address

 In the future I want to divide facts into two groups: static and dynamic.
 Static facts are these I talk about here, facts that are expected to be
 relevant to all tests such as which GPUs are present at the PCI bus.

 Dynamic facts are those defined by the tests themselves. The idea is to
 create a convenient mechanism for tests to define facts, and to have
 igt-runner to "dynamically" add these before start running the tests.

 I also want to have a section on results.json to include the facts.

WHY

 Currently igt-runner expects each test to clean up after themselves to
 not change the "state" of the machine. However, tests do change the
 "state" of the machine causing negative side effects downstream.
 As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
 will load gpu kmods and will not unload them after it is done. Then
 trying to run "igt@xe_module_load@reload-no-display" will fail because
 the i915 driver is loaded.

HOW

 Notice that when creating a new fact we provide a name for the fact
 and a function that can get the value. Also notice that we cannot
 change a fact after it is initially set. The rationale is that the
 facts will be set before the first test is run, then they will be
 verified after each test.

EXAMPLE

 $ cat test-list
 igt@core_hotunplug@hotrebind
 igt@xe_module_load@reload-no-display

 $ sudo IGT_TEST_ROOT='/home/gta/UPSTREAM/igt-gpu-tools/build/tests/' \
   ./build/runner/igt_runner --per-test-timeout 400 \
   --test-list /home/gta/igt/test-list /home/gta/igt/30

 Number of facts: 3
 Fact kernel.kmod_is_loaded.amdgpu: false
 Fact kernel.kmod_is_loaded.i915: false
 Fact kernel.kmod_is_loaded.xe: false
 [117.461584] [1/2] core_hotunplug (hotrebind)
 [130.057183] Facts changed by last test:
  kernel.kmod_is_loaded.amdgpu changed from false to true.
  kernel.kmod_is_loaded.i915 changed from false to true.
  kernel.kmod_is_loaded.xe changed from false to true.
 
 (igt_runner:2246) CRITICAL: Test assertion failure function execute, file ../runner/executor.c:2717:
 (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
 (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
 Stack trace:
   #0 ../lib/igt_core.c:2051 __igt_fail_assert()
   #1 ../runner/executor.c:1020 execute()
   #2 ../runner/runner.c:40 main()
   #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
   #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
   #5 [_start+0x25]
 Test (null) failed.
 **** DEBUG ****
 (igt_runner:2246) INFO: Number of facts: 3
 (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.amdgpu: false
 (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.i915: false
 (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.xe: false
 (igt_runner:2246) CRITICAL: Test assertion failure function execute, file ../runner/executor.c:2717:
 (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
 (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
 (igt_runner:2246) igt_core-INFO: Stack trace:
 (igt_runner:2246) igt_core-INFO:   #0 ../lib/igt_core.c:2051 __igt_fail_assert()
 (igt_runner:2246) igt_core-INFO:   #1 ../runner/executor.c:1020 execute()
 (igt_runner:2246) igt_core-INFO:   #2 ../runner/runner.c:40 main()
 (igt_runner:2246) igt_core-INFO:   #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
 (igt_runner:2246) igt_core-INFO:   #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
 (igt_runner:2246) igt_core-INFO:   #5 [_start+0x25]
 ****  END  ****
 FAIL (-1.000s)

Signed-off-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
---
 runner/executor.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)

diff --git a/runner/executor.c b/runner/executor.c
index ac73e1dde..f151898ef 100644
--- a/runner/executor.c
+++ b/runner/executor.c
@@ -36,10 +36,185 @@
 #include "output_strings.h"
 #include "runnercomms.h"
 
+#include "igt_kmod.h"
+
 #define KMSG_HEADER "[IGT] "
 #define KMSG_WARN 4
 #define GRACEFUL_EXITCODE -SIGHUP
 
+
+/* Fact checking functions.
+ *
+ * WHAT: THIS IS AN RFC
+ *
+ * These functions collect facts, such as "Is the i915 kmod loaded?" when
+ * igt-runner starts and then recheck the facts after every test is run.
+ * Currently the code will abort on fact changes, but this can obviously
+ * be altered to suit our needs.
+ *
+ * TODO
+ *
+ * I have only implemented facts for checking if gpu kmods are loaded. I
+ * want facts for:
+ *  - GPUs on the PCI bus: pci address and card info: pciid, vendor,
+ *    model, codename
+ *  - DRM Cards: association between card number and pci address
+ *
+ * In the future I want to divide facts into two groups: static and dynamic.
+ * Static facts are these I talk about here, facts that are expected to be
+ * relevant to all tests such as which GPUs are present at the PCI bus.
+ *
+ * Dynamic facts are those defined by the tests themselves. The idea is to
+ * create a convenient mechanism for tests to define facts, and to have
+ * igt-runner to "dynamically" add these before start running the tests.
+ *
+ * I also want to have a section on results.json to include the facts.
+ *
+ * WHY
+ *
+ * Currently igt-runner expects each test to clean up after themselves to
+ * not change the "state" of the machine. However, tests do change the
+ * "state" of the machine causing negative side effects downstream.
+ * As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
+ * will load gpu kmods and will not unload them after it is done. Then
+ * trying to run "igt@xe_module_load@reload-no-display" will fail because
+ * the i915 driver is loaded.
+ *
+ * HOW
+ *
+ * Notice that when creating a new fact we provide a name for the fact
+ * and a function that can get the value. Also notice that we cannot
+ * change a fact after it is initially set. The rationale is that the
+ * facts will be set before the first test is run, then they will be
+ * verified after each test.
+ *
+ */
+
+/* Type for the fact checking function. It returns the value. Returns
+ * NULL on failure.
+ */
+typedef const char* (*fact_checker)(const char *name);
+
+/* Struct for each fact with a name, a pointer to the checker function,
+ * a bool value, and a string value.
+ */
+typedef struct {
+	const char *name;
+	const char *initial_value; /* Supposed to be set only once */
+	fact_checker checker;
+} igt_runner_fact;
+
+/* Struct for the fact list */
+typedef struct {
+	igt_runner_fact *facts;
+	int num_facts;
+} igt_runner_fact_list;
+
+/* Add a new fact if it doesn't exist. Return false on failure such as when
+ * the fact already exist.
+ */
+static bool set_fact(igt_runner_fact_list *list, const char *name, fact_checker checker)
+{
+	/* Check if the fact exist */
+	for (int i = 0; i < list->num_facts; i++) {
+		if (strcmp(list->facts[i].name, name) == 0)
+			return false;
+	}
+
+	/* Add the fact */
+	list->num_facts++;
+	list->facts = realloc(list->facts, list->num_facts * sizeof(igt_runner_fact));
+	list->facts[list->num_facts - 1].name = name;
+	list->facts[list->num_facts - 1].checker = checker;
+	list->facts[list->num_facts - 1].initial_value = checker(name);
+
+	/* Check if initial_value is NULL */
+	if (list->facts[list->num_facts - 1].initial_value == NULL)
+		return false;
+
+	return true;
+}
+
+/* Get a fact by name. Return NULL if fact is not found. */
+static igt_runner_fact *get_fact(igt_runner_fact_list *list, const char *name)
+{
+	for (int i = 0; i < list->num_facts; i++) {
+		if (strcmp(list->facts[i].name, name) == 0)
+			return &list->facts[i];
+	}
+
+	return NULL;
+}
+
+/* print_all_facts: pretty print all facts */
+static void print_all_facts(igt_runner_fact_list *list)
+{
+	igt_info("Number of facts: %d\n", list->num_facts);
+	for (int i = 0; i < list->num_facts; i++)
+		igt_info("Fact %s: %s\n", list->facts[i].name, list->facts[i].initial_value);
+}
+
+/* get_fact_changes() check if there are fact changes compared to *list.
+ * Returns:
+ *  - NULL if there are no changes
+ *  - A string with the changes if there are any
+ *
+ * Free the returned string with g_free()
+ */
+static char *get_fact_changes(igt_runner_fact_list *list)
+{
+	char *changes = NULL;
+	const char *new_value = NULL;
+
+	for (int i = 0; i < list->num_facts; i++) {
+		new_value = list->facts[i].checker(list->facts[i].name);
+		if (list->facts[i].initial_value != new_value) {
+			if (changes == NULL) {
+				changes = malloc(1);
+				changes[0] = '\0';
+			}
+
+			changes = g_strdup_printf("%s %s changed from %s to %s.\n",
+						  changes,
+						  list->facts[i].name,
+						  list->facts[i].initial_value,
+						  new_value);
+		}
+	}
+
+	return changes;
+}
+
+/* Check if the kernel module at the end of name is loaded */
+static const char *fact_check_kmod_loaded(const char *name)
+{
+	const char *module = NULL;
+
+	/* Name will be something like "kernel.kmod_is_loaded.i915".
+	 * Save the string after the last dot on module.
+	 */
+	module = strrchr(name, '.');
+	if (module == NULL)
+		return NULL;
+
+	module++; /* Skip the dot */
+
+	/* Call igt_kmod_is_loaded and convert bool to string */
+	return igt_kmod_is_loaded(module) ? "true" : "false";
+}
+
+/* Function to init all the facts */
+static void init_facts(igt_runner_fact_list *list)
+{
+	/* Init the list */
+	list->num_facts = 0;
+	list->facts = NULL;
+
+	set_fact(list, "kernel.kmod_is_loaded.amdgpu", fact_check_kmod_loaded);
+	set_fact(list, "kernel.kmod_is_loaded.i915", fact_check_kmod_loaded);
+	set_fact(list, "kernel.kmod_is_loaded.xe", fact_check_kmod_loaded);
+}
+
 static struct {
 	int *fds;
 	size_t num_dogs;
@@ -2306,6 +2481,13 @@ bool execute(struct execute_state *state,
 	sigset_t sigmask;
 	double time_spent = 0.0;
 	bool status = true;
+	igt_runner_fact_list facts;
+
+	/* Init the facts */
+	init_facts(&facts);
+
+	/* Print facts */
+	print_all_facts(&facts);
 
 	if (state->dry) {
 		outf("Dry run, not executing. Invoke igt_resume if you want to execute.\n");
@@ -2437,6 +2619,7 @@ bool execute(struct execute_state *state,
 		char *job_name;
 		int result;
 		bool already_written = false;
+		char *fact_changes = NULL;
 
 		if (should_die_because_signal(sigfd)) {
 			status = false;
@@ -2525,6 +2708,14 @@ bool execute(struct execute_state *state,
 			state->time_left = time_left;
 			return execute(state, settings, job_list);
 		}
+
+		/* Abort if facts changed */
+		fact_changes = get_fact_changes(&facts);
+		if (fact_changes) {
+			/* Print fact_changes */
+			errf("Facts changed by last test:\n%s\n", fact_changes);
+			igt_assert(fact_changes == NULL);
+		}
 	}
 
 	if ((timefd = openat(resdirfd, "endtime.txt", O_CREAT | O_WRONLY | O_EXCL, 0666)) >= 0) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH i-g-t] RFC: igt-runner fact checking
  2024-10-28  7:08 [PATCH i-g-t] RFC: igt-runner fact checking Peter Senna Tschudin
@ 2024-10-28 10:06 ` Zbigniew Kempczyński
  2024-10-28 12:53 ` Krzysztofik, Janusz
  1 sibling, 0 replies; 4+ messages in thread
From: Zbigniew Kempczyński @ 2024-10-28 10:06 UTC (permalink / raw)
  To: Peter Senna Tschudin; +Cc: igt-dev@lists.freedesktop.org, Lucas De Marchi

On Mon, Oct 28, 2024 at 08:08:52AM +0100, Peter Senna Tschudin wrote:
> WHAT: THIS IS AN RFC
> 
>  These functions collect facts, such as "Is the i915 kmod loaded?" when
>  igt-runner starts and then recheck the facts after every test is run.
>  Currently the code will abort on fact changes, but this can obviously
>  be altered to suit our needs.
> 
> TODO
> 
>  I have only implemented facts for checking if gpu kmods are loaded. I
>  want facts for:
>   - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>     model, codename
>   - DRM Cards: association between card number and pci address
> 
>  In the future I want to divide facts into two groups: static and dynamic.
>  Static facts are these I talk about here, facts that are expected to be
>  relevant to all tests such as which GPUs are present at the PCI bus.
> 
>  Dynamic facts are those defined by the tests themselves. The idea is to
>  create a convenient mechanism for tests to define facts, and to have
>  igt-runner to "dynamically" add these before start running the tests.
> 
>  I also want to have a section on results.json to include the facts.

Generally I think gathering few important facts like driver modules
loaded before executing the test and drm / pci state is a good idea.
But as this will happen before each test run it must be quick and doesn't
extend much test execution time. And only static facts before execution
makes sense for me. Apart of module loading/unloading most tests requires
drm device, that's it. Test itself is using igt_require() conditions to
check if what it expects is fine. I would drop idea of dynamic facts.
And some hint - use udev to collect drm/pci state without parsing sysfs.

--
Zbigniew

> 
> WHY
> 
>  Currently igt-runner expects each test to clean up after themselves to
>  not change the "state" of the machine. However, tests do change the
>  "state" of the machine causing negative side effects downstream.
>  As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
>  will load gpu kmods and will not unload them after it is done. Then
>  trying to run "igt@xe_module_load@reload-no-display" will fail because
>  the i915 driver is loaded.
> 
> HOW
> 
>  Notice that when creating a new fact we provide a name for the fact
>  and a function that can get the value. Also notice that we cannot
>  change a fact after it is initially set. The rationale is that the
>  facts will be set before the first test is run, then they will be
>  verified after each test.
> 
> EXAMPLE
> 
>  $ cat test-list
>  igt@core_hotunplug@hotrebind
>  igt@xe_module_load@reload-no-display
> 
>  $ sudo IGT_TEST_ROOT='/home/gta/UPSTREAM/igt-gpu-tools/build/tests/' \
>    ./build/runner/igt_runner --per-test-timeout 400 \
>    --test-list /home/gta/igt/test-list /home/gta/igt/30
> 
>  Number of facts: 3
>  Fact kernel.kmod_is_loaded.amdgpu: false
>  Fact kernel.kmod_is_loaded.i915: false
>  Fact kernel.kmod_is_loaded.xe: false
>  [117.461584] [1/2] core_hotunplug (hotrebind)
>  [130.057183] Facts changed by last test:
>   kernel.kmod_is_loaded.amdgpu changed from false to true.
>   kernel.kmod_is_loaded.i915 changed from false to true.
>   kernel.kmod_is_loaded.xe changed from false to true.
>  
>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file ../runner/executor.c:2717:
>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>  Stack trace:
>    #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>    #1 ../runner/executor.c:1020 execute()
>    #2 ../runner/runner.c:40 main()
>    #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
>    #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>    #5 [_start+0x25]
>  Test (null) failed.
>  **** DEBUG ****
>  (igt_runner:2246) INFO: Number of facts: 3
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.amdgpu: false
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.i915: false
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.xe: false
>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file ../runner/executor.c:2717:
>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>  (igt_runner:2246) igt_core-INFO: Stack trace:
>  (igt_runner:2246) igt_core-INFO:   #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>  (igt_runner:2246) igt_core-INFO:   #1 ../runner/executor.c:1020 execute()
>  (igt_runner:2246) igt_core-INFO:   #2 ../runner/runner.c:40 main()
>  (igt_runner:2246) igt_core-INFO:   #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
>  (igt_runner:2246) igt_core-INFO:   #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>  (igt_runner:2246) igt_core-INFO:   #5 [_start+0x25]
>  ****  END  ****
>  FAIL (-1.000s)
> 
> Signed-off-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
> ---
>  runner/executor.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 191 insertions(+)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index ac73e1dde..f151898ef 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -36,10 +36,185 @@
>  #include "output_strings.h"
>  #include "runnercomms.h"
>  
> +#include "igt_kmod.h"
> +
>  #define KMSG_HEADER "[IGT] "
>  #define KMSG_WARN 4
>  #define GRACEFUL_EXITCODE -SIGHUP
>  
> +
> +/* Fact checking functions.
> + *
> + * WHAT: THIS IS AN RFC
> + *
> + * These functions collect facts, such as "Is the i915 kmod loaded?" when
> + * igt-runner starts and then recheck the facts after every test is run.
> + * Currently the code will abort on fact changes, but this can obviously
> + * be altered to suit our needs.
> + *
> + * TODO
> + *
> + * I have only implemented facts for checking if gpu kmods are loaded. I
> + * want facts for:
> + *  - GPUs on the PCI bus: pci address and card info: pciid, vendor,
> + *    model, codename
> + *  - DRM Cards: association between card number and pci address
> + *
> + * In the future I want to divide facts into two groups: static and dynamic.
> + * Static facts are these I talk about here, facts that are expected to be
> + * relevant to all tests such as which GPUs are present at the PCI bus.
> + *
> + * Dynamic facts are those defined by the tests themselves. The idea is to
> + * create a convenient mechanism for tests to define facts, and to have
> + * igt-runner to "dynamically" add these before start running the tests.
> + *
> + * I also want to have a section on results.json to include the facts.
> + *
> + * WHY
> + *
> + * Currently igt-runner expects each test to clean up after themselves to
> + * not change the "state" of the machine. However, tests do change the
> + * "state" of the machine causing negative side effects downstream.
> + * As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
> + * will load gpu kmods and will not unload them after it is done. Then
> + * trying to run "igt@xe_module_load@reload-no-display" will fail because
> + * the i915 driver is loaded.
> + *
> + * HOW
> + *
> + * Notice that when creating a new fact we provide a name for the fact
> + * and a function that can get the value. Also notice that we cannot
> + * change a fact after it is initially set. The rationale is that the
> + * facts will be set before the first test is run, then they will be
> + * verified after each test.
> + *
> + */
> +
> +/* Type for the fact checking function. It returns the value. Returns
> + * NULL on failure.
> + */
> +typedef const char* (*fact_checker)(const char *name);
> +
> +/* Struct for each fact with a name, a pointer to the checker function,
> + * a bool value, and a string value.
> + */
> +typedef struct {
> +	const char *name;
> +	const char *initial_value; /* Supposed to be set only once */
> +	fact_checker checker;
> +} igt_runner_fact;
> +
> +/* Struct for the fact list */
> +typedef struct {
> +	igt_runner_fact *facts;
> +	int num_facts;
> +} igt_runner_fact_list;
> +
> +/* Add a new fact if it doesn't exist. Return false on failure such as when
> + * the fact already exist.
> + */
> +static bool set_fact(igt_runner_fact_list *list, const char *name, fact_checker checker)
> +{
> +	/* Check if the fact exist */
> +	for (int i = 0; i < list->num_facts; i++) {
> +		if (strcmp(list->facts[i].name, name) == 0)
> +			return false;
> +	}
> +
> +	/* Add the fact */
> +	list->num_facts++;
> +	list->facts = realloc(list->facts, list->num_facts * sizeof(igt_runner_fact));
> +	list->facts[list->num_facts - 1].name = name;
> +	list->facts[list->num_facts - 1].checker = checker;
> +	list->facts[list->num_facts - 1].initial_value = checker(name);
> +
> +	/* Check if initial_value is NULL */
> +	if (list->facts[list->num_facts - 1].initial_value == NULL)
> +		return false;
> +
> +	return true;
> +}
> +
> +/* Get a fact by name. Return NULL if fact is not found. */
> +static igt_runner_fact *get_fact(igt_runner_fact_list *list, const char *name)
> +{
> +	for (int i = 0; i < list->num_facts; i++) {
> +		if (strcmp(list->facts[i].name, name) == 0)
> +			return &list->facts[i];
> +	}
> +
> +	return NULL;
> +}
> +
> +/* print_all_facts: pretty print all facts */
> +static void print_all_facts(igt_runner_fact_list *list)
> +{
> +	igt_info("Number of facts: %d\n", list->num_facts);
> +	for (int i = 0; i < list->num_facts; i++)
> +		igt_info("Fact %s: %s\n", list->facts[i].name, list->facts[i].initial_value);
> +}
> +
> +/* get_fact_changes() check if there are fact changes compared to *list.
> + * Returns:
> + *  - NULL if there are no changes
> + *  - A string with the changes if there are any
> + *
> + * Free the returned string with g_free()
> + */
> +static char *get_fact_changes(igt_runner_fact_list *list)
> +{
> +	char *changes = NULL;
> +	const char *new_value = NULL;
> +
> +	for (int i = 0; i < list->num_facts; i++) {
> +		new_value = list->facts[i].checker(list->facts[i].name);
> +		if (list->facts[i].initial_value != new_value) {
> +			if (changes == NULL) {
> +				changes = malloc(1);
> +				changes[0] = '\0';
> +			}
> +
> +			changes = g_strdup_printf("%s %s changed from %s to %s.\n",
> +						  changes,
> +						  list->facts[i].name,
> +						  list->facts[i].initial_value,
> +						  new_value);
> +		}
> +	}
> +
> +	return changes;
> +}
> +
> +/* Check if the kernel module at the end of name is loaded */
> +static const char *fact_check_kmod_loaded(const char *name)
> +{
> +	const char *module = NULL;
> +
> +	/* Name will be something like "kernel.kmod_is_loaded.i915".
> +	 * Save the string after the last dot on module.
> +	 */
> +	module = strrchr(name, '.');
> +	if (module == NULL)
> +		return NULL;
> +
> +	module++; /* Skip the dot */
> +
> +	/* Call igt_kmod_is_loaded and convert bool to string */
> +	return igt_kmod_is_loaded(module) ? "true" : "false";
> +}
> +
> +/* Function to init all the facts */
> +static void init_facts(igt_runner_fact_list *list)
> +{
> +	/* Init the list */
> +	list->num_facts = 0;
> +	list->facts = NULL;
> +
> +	set_fact(list, "kernel.kmod_is_loaded.amdgpu", fact_check_kmod_loaded);
> +	set_fact(list, "kernel.kmod_is_loaded.i915", fact_check_kmod_loaded);
> +	set_fact(list, "kernel.kmod_is_loaded.xe", fact_check_kmod_loaded);
> +}
> +
>  static struct {
>  	int *fds;
>  	size_t num_dogs;
> @@ -2306,6 +2481,13 @@ bool execute(struct execute_state *state,
>  	sigset_t sigmask;
>  	double time_spent = 0.0;
>  	bool status = true;
> +	igt_runner_fact_list facts;
> +
> +	/* Init the facts */
> +	init_facts(&facts);
> +
> +	/* Print facts */
> +	print_all_facts(&facts);
>  
>  	if (state->dry) {
>  		outf("Dry run, not executing. Invoke igt_resume if you want to execute.\n");
> @@ -2437,6 +2619,7 @@ bool execute(struct execute_state *state,
>  		char *job_name;
>  		int result;
>  		bool already_written = false;
> +		char *fact_changes = NULL;
>  
>  		if (should_die_because_signal(sigfd)) {
>  			status = false;
> @@ -2525,6 +2708,14 @@ bool execute(struct execute_state *state,
>  			state->time_left = time_left;
>  			return execute(state, settings, job_list);
>  		}
> +
> +		/* Abort if facts changed */
> +		fact_changes = get_fact_changes(&facts);
> +		if (fact_changes) {
> +			/* Print fact_changes */
> +			errf("Facts changed by last test:\n%s\n", fact_changes);
> +			igt_assert(fact_changes == NULL);
> +		}
>  	}
>  
>  	if ((timefd = openat(resdirfd, "endtime.txt", O_CREAT | O_WRONLY | O_EXCL, 0666)) >= 0) {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH i-g-t] RFC: igt-runner fact checking
  2024-10-28  7:08 [PATCH i-g-t] RFC: igt-runner fact checking Peter Senna Tschudin
  2024-10-28 10:06 ` Zbigniew Kempczyński
@ 2024-10-28 12:53 ` Krzysztofik, Janusz
  2024-10-30  7:15   ` Peter Senna Tschudin
  1 sibling, 1 reply; 4+ messages in thread
From: Krzysztofik, Janusz @ 2024-10-28 12:53 UTC (permalink / raw)
  To: Peter Senna Tschudin; +Cc: igt-dev@lists.freedesktop.org, De Marchi, Lucas

On Monday, 28 October 2024 08:08:52 CET Peter Senna Tschudin wrote:
> WHAT: THIS IS AN RFC
> 
>  These functions collect facts, such as "Is the i915 kmod loaded?" when
>  igt-runner starts and then recheck the facts after every test is run.
>  Currently the code will abort on fact changes, but this can obviously
>  be altered to suit our needs.

I think it would be more helpful if we tried to restore a clean environment, 
not only check if it hadn't changed.

> 
> TODO
> 
>  I have only implemented facts for checking if gpu kmods are loaded. 

Tests can do their job regardless of any GPU modules loaded or not, I 
believe.

>  I
>  want facts for:
>   - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>     model, codename

What test scenarios you think we may expect to affect hardware configuration?

>   - DRM Cards: association between card number and pci address

That may change, and tests should still work, unless a DRM card number is 
specified in device filter, but that should be avoided, at least in CI.  Tests 
can rescan for actual DRM device numbers.

> 
>  In the future I want to divide facts into two groups: static and dynamic.
>  Static facts are these I talk about here, facts that are expected to be
>  relevant to all tests such as which GPUs are present at the PCI bus.
> 
>  Dynamic facts are those defined by the tests themselves. The idea is to
>  create a convenient mechanism for tests to define facts, and to have
>  igt-runner to "dynamically" add these before start running the tests.

Why igt_runner?  Now individual tests decide if the environment, including 
available GPU devices, meets their specific requirements, and those 
requirements may be different for each test.  Each test knows best its 
requirements and can check if met.  Why are you going to check all 
conditions, collected from all tests that are going to be executed, before 
execution of each test, regardless of its actual requirements?

> 
>  I also want to have a section on results.json to include the facts.
> 
> WHY
> 
>  Currently igt-runner expects each test to clean up after themselves to
>  not change the "state" of the machine. However, tests do change the
>  "state" of the machine causing negative side effects downstream.

And that's what can be wrong, I believe, and something to be fixed.  Tests 
should restore a clean environment.  Tests know best what they change, should 
then know best what needs to be restored, and should do that.  If they don't 
then we should fix them.

>  As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
>  will load gpu kmods and will not unload them after it is done. Then
>  trying to run "igt@xe_module_load@reload-no-display" will fail because
>  the i915 driver is loaded.

Ok, now I can see what issue you are trying to resolve, but you are missing 
the point, I believe.  That's an issue of core_hotunplug unexpectedly 
exercising i915 instead of xe driver in the first place, and that's what we 
should try to fix.

Since core_hotunplug test uses DRIVER_ANY, that can happen if no GPU modules 
are loaded.  If there was a driver already bound to a GPU device then the test 
would use that device, if not then it loads all available GPU drivers first, 
including i915 and xe.  As soon as i915 is bound to the GPU, xe driver is 
blocked, hardware agnostic tests start exercising i915 instead of xe, and xe 
specific tests either skip, or maybe fail on unexpected and unrecognised 
conditions, unforeseen then impossible to address correctly.  As soon as we 
have root cause of that issue resolved, other tests should work as expected 
again.

If you want i915 not interfere with xe in CI.xe* runs then you may try one of:
- xe in front of i915 in lib/drmtest.c:modules[]:973 (shouldn't hurt i915 CI 
  runs since xe is not built for them),
- # CONFIG_DRM_I915 not set in the CI.xe* kernel .config (already different 
  from i915),
- module_blacklist=i915 via the CI.xe* kernel command line, or
- IGT_FORCE_DRIVER=xe via the CI.xe* environment variables.

> 
> HOW
> 
>  Notice that when creating a new fact we provide a name for the fact
>  and a function that can get the value. Also notice that we cannot
>  change a fact after it is initially set. The rationale is that the
>  facts will be set before the first test is run, then they will be
>  verified after each test.


> 
> EXAMPLE
> 
>  $ cat test-list
>  igt@core_hotunplug@hotrebind
>  igt@xe_module_load@reload-no-display
> 
>  $ sudo IGT_TEST_ROOT='/home/gta/UPSTREAM/igt-gpu-tools/build/tests/' \
>    ./build/runner/igt_runner --per-test-timeout 400 \
>    --test-list /home/gta/igt/test-list /home/gta/igt/30
> 
>  Number of facts: 3
>  Fact kernel.kmod_is_loaded.amdgpu: false
>  Fact kernel.kmod_is_loaded.i915: false
>  Fact kernel.kmod_is_loaded.xe: false
>  [117.461584] [1/2] core_hotunplug (hotrebind)
>  [130.057183] Facts changed by last test:
>   kernel.kmod_is_loaded.amdgpu changed from false to true.
>   kernel.kmod_is_loaded.i915 changed from false to true.
>   kernel.kmod_is_loaded.xe changed from false to true.
>  
>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file 
../runner/executor.c:2717:
>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>  Stack trace:
>    #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>    #1 ../runner/executor.c:1020 execute()
>    #2 ../runner/runner.c:40 main()
>    #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
>    #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>    #5 [_start+0x25]
>  Test (null) failed.
>  **** DEBUG ****
>  (igt_runner:2246) INFO: Number of facts: 3
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.amdgpu: false
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.i915: false
>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.xe: false
>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file 
../runner/executor.c:2717:
>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>  (igt_runner:2246) igt_core-INFO: Stack trace:
>  (igt_runner:2246) igt_core-INFO:   #0 ../lib/igt_core.c:2051 
__igt_fail_assert()
>  (igt_runner:2246) igt_core-INFO:   #1 ../runner/executor.c:1020 execute()
>  (igt_runner:2246) igt_core-INFO:   #2 ../runner/runner.c:40 main()
>  (igt_runner:2246) igt_core-INFO:   #3 ../sysdeps/nptl/
libc_start_call_main.h:58 __libc_start_call_main()
>  (igt_runner:2246) igt_core-INFO:   #4 ../csu/libc-start.c:128 
__libc_start_main@@GLIBC_2.34()
>  (igt_runner:2246) igt_core-INFO:   #5 [_start+0x25]
>  ****  END  ****
>  FAIL (-1.000s)

From CI perspective, what failed?  A test executed just before the failure, 
potentially already reported as successful?  A test about to be run next?  
In fact, the core_hotunplug test hasn't failed, only some modules were loaded, 
and unwanted i915 among them and bound to the device, because the CI 
environment was not protected from that unwanted scenario to happen.

Besides, you would have to negotiate with CI and Bug Filing, and also with 
developers resolving issues, if and how they would handle such "failures", 
reported for (null) tests.

Thanks,
Janusz

> 
> Signed-off-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
> ---
>  runner/executor.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 191 insertions(+)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index ac73e1dde..f151898ef 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -36,10 +36,185 @@
>  #include "output_strings.h"
>  #include "runnercomms.h"
>  
> +#include "igt_kmod.h"
> +
>  #define KMSG_HEADER "[IGT] "
>  #define KMSG_WARN 4
>  #define GRACEFUL_EXITCODE -SIGHUP
>  
> +
> +/* Fact checking functions.
> + *
> + * WHAT: THIS IS AN RFC
> + *
> + * These functions collect facts, such as "Is the i915 kmod loaded?" when
> + * igt-runner starts and then recheck the facts after every test is run.
> + * Currently the code will abort on fact changes, but this can obviously
> + * be altered to suit our needs.
> + *
> + * TODO
> + *
> + * I have only implemented facts for checking if gpu kmods are loaded. I
> + * want facts for:
> + *  - GPUs on the PCI bus: pci address and card info: pciid, vendor,
> + *    model, codename
> + *  - DRM Cards: association between card number and pci address
> + *
> + * In the future I want to divide facts into two groups: static and 
dynamic.
> + * Static facts are these I talk about here, facts that are expected to be
> + * relevant to all tests such as which GPUs are present at the PCI bus.
> + *
> + * Dynamic facts are those defined by the tests themselves. The idea is to
> + * create a convenient mechanism for tests to define facts, and to have
> + * igt-runner to "dynamically" add these before start running the tests.
> + *
> + * I also want to have a section on results.json to include the facts.
> + *
> + * WHY
> + *
> + * Currently igt-runner expects each test to clean up after themselves to
> + * not change the "state" of the machine. However, tests do change the
> + * "state" of the machine causing negative side effects downstream.
> + * As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
> + * will load gpu kmods and will not unload them after it is done. Then
> + * trying to run "igt@xe_module_load@reload-no-display" will fail because
> + * the i915 driver is loaded.
> + *
> + * HOW
> + *
> + * Notice that when creating a new fact we provide a name for the fact
> + * and a function that can get the value. Also notice that we cannot
> + * change a fact after it is initially set. The rationale is that the
> + * facts will be set before the first test is run, then they will be
> + * verified after each test.
> + *
> + */
> +
> +/* Type for the fact checking function. It returns the value. Returns
> + * NULL on failure.
> + */
> +typedef const char* (*fact_checker)(const char *name);
> +
> +/* Struct for each fact with a name, a pointer to the checker function,
> + * a bool value, and a string value.
> + */
> +typedef struct {
> +	const char *name;
> +	const char *initial_value; /* Supposed to be set only once */
> +	fact_checker checker;
> +} igt_runner_fact;
> +
> +/* Struct for the fact list */
> +typedef struct {
> +	igt_runner_fact *facts;
> +	int num_facts;
> +} igt_runner_fact_list;
> +
> +/* Add a new fact if it doesn't exist. Return false on failure such as when
> + * the fact already exist.
> + */
> +static bool set_fact(igt_runner_fact_list *list, const char *name, 
fact_checker checker)
> +{
> +	/* Check if the fact exist */
> +	for (int i = 0; i < list->num_facts; i++) {
> +		if (strcmp(list->facts[i].name, name) == 0)
> +			return false;
> +	}
> +
> +	/* Add the fact */
> +	list->num_facts++;
> +	list->facts = realloc(list->facts, list->num_facts * 
sizeof(igt_runner_fact));
> +	list->facts[list->num_facts - 1].name = name;
> +	list->facts[list->num_facts - 1].checker = checker;
> +	list->facts[list->num_facts - 1].initial_value = checker(name);
> +
> +	/* Check if initial_value is NULL */
> +	if (list->facts[list->num_facts - 1].initial_value == NULL)
> +		return false;
> +
> +	return true;
> +}
> +
> +/* Get a fact by name. Return NULL if fact is not found. */
> +static igt_runner_fact *get_fact(igt_runner_fact_list *list, const char 
*name)
> +{
> +	for (int i = 0; i < list->num_facts; i++) {
> +		if (strcmp(list->facts[i].name, name) == 0)
> +			return &list->facts[i];
> +	}
> +
> +	return NULL;
> +}
> +
> +/* print_all_facts: pretty print all facts */
> +static void print_all_facts(igt_runner_fact_list *list)
> +{
> +	igt_info("Number of facts: %d\n", list->num_facts);
> +	for (int i = 0; i < list->num_facts; i++)
> +		igt_info("Fact %s: %s\n", list->facts[i].name, list-
>facts[i].initial_value);
> +}
> +
> +/* get_fact_changes() check if there are fact changes compared to *list.
> + * Returns:
> + *  - NULL if there are no changes
> + *  - A string with the changes if there are any
> + *
> + * Free the returned string with g_free()
> + */
> +static char *get_fact_changes(igt_runner_fact_list *list)
> +{
> +	char *changes = NULL;
> +	const char *new_value = NULL;
> +
> +	for (int i = 0; i < list->num_facts; i++) {
> +		new_value = list->facts[i].checker(list->facts[i].name);
> +		if (list->facts[i].initial_value != new_value) {
> +			if (changes == NULL) {
> +				changes = malloc(1);
> +				changes[0] = '\0';
> +			}
> +
> +			changes = g_strdup_printf("%s %s changed from %s 
to %s.\n",
> +						  changes,
> +						  list-
>facts[i].name,
> +						  list-
>facts[i].initial_value,
> +						  new_value);
> +		}
> +	}
> +
> +	return changes;
> +}
> +
> +/* Check if the kernel module at the end of name is loaded */
> +static const char *fact_check_kmod_loaded(const char *name)
> +{
> +	const char *module = NULL;
> +
> +	/* Name will be something like "kernel.kmod_is_loaded.i915".
> +	 * Save the string after the last dot on module.
> +	 */
> +	module = strrchr(name, '.');
> +	if (module == NULL)
> +		return NULL;
> +
> +	module++; /* Skip the dot */
> +
> +	/* Call igt_kmod_is_loaded and convert bool to string */
> +	return igt_kmod_is_loaded(module) ? "true" : "false";
> +}
> +
> +/* Function to init all the facts */
> +static void init_facts(igt_runner_fact_list *list)
> +{
> +	/* Init the list */
> +	list->num_facts = 0;
> +	list->facts = NULL;
> +
> +	set_fact(list, "kernel.kmod_is_loaded.amdgpu", 
fact_check_kmod_loaded);
> +	set_fact(list, "kernel.kmod_is_loaded.i915", 
fact_check_kmod_loaded);
> +	set_fact(list, "kernel.kmod_is_loaded.xe", fact_check_kmod_loaded);
> +}
> +
>  static struct {
>  	int *fds;
>  	size_t num_dogs;
> @@ -2306,6 +2481,13 @@ bool execute(struct execute_state *state,
>  	sigset_t sigmask;
>  	double time_spent = 0.0;
>  	bool status = true;
> +	igt_runner_fact_list facts;
> +
> +	/* Init the facts */
> +	init_facts(&facts);
> +
> +	/* Print facts */
> +	print_all_facts(&facts);
>  
>  	if (state->dry) {
>  		outf("Dry run, not executing. Invoke igt_resume if you 
want to execute.\n");
> @@ -2437,6 +2619,7 @@ bool execute(struct execute_state *state,
>  		char *job_name;
>  		int result;
>  		bool already_written = false;
> +		char *fact_changes = NULL;
>  
>  		if (should_die_because_signal(sigfd)) {
>  			status = false;
> @@ -2525,6 +2708,14 @@ bool execute(struct execute_state *state,
>  			state->time_left = time_left;
>  			return execute(state, settings, job_list);
>  		}
> +
> +		/* Abort if facts changed */
> +		fact_changes = get_fact_changes(&facts);
> +		if (fact_changes) {
> +			/* Print fact_changes */
> +			errf("Facts changed by last test:\n%s\n", 
fact_changes);
> +			igt_assert(fact_changes == NULL);
> +		}
>  	}
>  
>  	if ((timefd = openat(resdirfd, "endtime.txt", O_CREAT | O_WRONLY | 
O_EXCL, 0666)) >= 0) {
> 

---------------------------------------------------------------------
Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
Spolka oswiadcza, ze posiada status duzego przedsiebiorcy w rozumieniu ustawy z dnia 8 marca 2013 r. o przeciwdzialaniu nadmiernym opoznieniom w transakcjach handlowych.

Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH i-g-t] RFC: igt-runner fact checking
  2024-10-28 12:53 ` Krzysztofik, Janusz
@ 2024-10-30  7:15   ` Peter Senna Tschudin
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Senna Tschudin @ 2024-10-30  7:15 UTC (permalink / raw)
  To: Krzysztofik, Janusz; +Cc: igt-dev@lists.freedesktop.org, De Marchi, Lucas

Dear Janusz,

Thank you for the review!

I believe that I can clarify the following points. The igt_runner_facts are:
- primarily a logging feature: to measure and report changes to the environment
- generic: I build the argument based on a real example, but I am making igt_runner_facts generic
- time based: knowing when a fact changes matter to help us debug issues
- not test requirements. The line is blurry, but we are after what can unexpectedly change on the environment. Think of the disappearing gpu problem. 


On 28.10.2024 13:53, Krzysztofik, Janusz wrote:
> On Monday, 28 October 2024 08:08:52 CET Peter Senna Tschudin wrote:
>> WHAT: THIS IS AN RFC
>>
>>  These functions collect facts, such as "Is the i915 kmod loaded?" when
>>  igt-runner starts and then recheck the facts after every test is run.
>>  Currently the code will abort on fact changes, but this can obviously
>>  be altered to suit our needs.
> 
> I think it would be more helpful if we tried to restore a clean environment, 
> not only check if it hadn't changed.

That is not the point. We currently work on the assumption that every test is a well behaved citizen. But because we do not measure the cleanness of the environment, we do not actually know how clean the system is.

> 
>>
>> TODO
>>
>>  I have only implemented facts for checking if gpu kmods are loaded. 
> 
> Tests can do their job regardless of any GPU modules loaded or not, I 
> believe.

No, and when a test causes trouble downstream it is very hard to debug because we have no idea about what changed. Ask me how I know is time consuming :-)

> 
>>  I
>>  want facts for:
>>   - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>>     model, codename
> 
> What test scenarios you think we may expect to affect hardware configuration?

I know of two: pre production GPUs may vanish from the PCI bus, and require a reboot to fix. Some tests may make the GPU unresponsive in ways that are recoverable without a reboot.

But you are walking an orthogonal path here. The proposal is to have a mechanism to measure how clean the environment is, and to log changes to help developers debug issues.

> 
>>   - DRM Cards: association between card number and pci address
> 
> That may change, and tests should still work, unless a DRM card number is 
> specified in device filter, but that should be avoided, at least in CI.  Tests 
> can rescan for actual DRM device numbers.
> 
>>
>>  In the future I want to divide facts into two groups: static and dynamic.
>>  Static facts are these I talk about here, facts that are expected to be
>>  relevant to all tests such as which GPUs are present at the PCI bus.
>>
>>  Dynamic facts are those defined by the tests themselves. The idea is to
>>  create a convenient mechanism for tests to define facts, and to have
>>  igt-runner to "dynamically" add these before start running the tests.
> 
> Why igt_runner?  Now individual tests decide if the environment, including 
> available GPU devices, meets their specific requirements, and those 
> requirements may be different for each test.  Each test knows best its 
> requirements and can check if met.  Why are you going to check all 
> conditions, collected from all tests that are going to be executed, before 
> execution of each test, regardless of its actual requirements?

I only care about igt_runner because the problems I am trying to detect only manifest when at least two tests run. Test 1 causes a problem that prevents test 2 from running successfully. Sometimes there are a hundred tests between test 1 and test 2...

I do not know where you took the idea that igt_runner facts will "check all conditions, collected from all tests that are going to be executed, before execution of each test". It will not.

igt_runner facts will check facts, such as the ones I explicitly describe on the commit message.

> 
>>
>>  I also want to have a section on results.json to include the facts.
>>
>> WHY
>>
>>  Currently igt-runner expects each test to clean up after themselves to
>>  not change the "state" of the machine. However, tests do change the
>>  "state" of the machine causing negative side effects downstream.
> 
> And that's what can be wrong, I believe, and something to be fixed.  Tests 
> should restore a clean environment.  Tests know best what they change, should 
> then know best what needs to be restored, and should do that.  If they don't 
> then we should fix them.

Yes! Having the facts measured and reported will help us identify others that need fixing.

> 
>>  As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
>>  will load gpu kmods and will not unload them after it is done. Then
>>  trying to run "igt@xe_module_load@reload-no-display" will fail because
>>  the i915 driver is loaded.
> 
> Ok, now I can see what issue you are trying to resolve, but you are missing 
> the point, I believe.  That's an issue of core_hotunplug unexpectedly 
> exercising i915 instead of xe driver in the first place, and that's what we 
> should try to fix.

I am well on track, thanks.

The first step to improve an issue is to measure the problem. Currently we have tests tainting the environment and it is a lot of work to detect it. My proposal is to measure changes to the environment and log them explicitly to help us debug issues.

> 
> Since core_hotunplug test uses DRIVER_ANY, that can happen if no GPU modules 
> are loaded.  If there was a driver already bound to a GPU device then the test 
> would use that device, if not then it loads all available GPU drivers first, 
> including i915 and xe.  As soon as i915 is bound to the GPU, xe driver is 
> blocked, hardware agnostic tests start exercising i915 instead of xe, and xe 
> specific tests either skip, or maybe fail on unexpected and unrecognised 
> conditions, unforeseen then impossible to address correctly.  As soon as we 
> have root cause of that issue resolved, other tests should work as expected 
> again.
> 
> If you want i915 not interfere with xe in CI.xe* runs then you may try one of:
> - xe in front of i915 in lib/drmtest.c:modules[]:973 (shouldn't hurt i915 CI 
>   runs since xe is not built for them),
> - # CONFIG_DRM_I915 not set in the CI.xe* kernel .config (already different 
>   from i915),
> - module_blacklist=i915 via the CI.xe* kernel command line, or
> - IGT_FORCE_DRIVER=xe via the CI.xe* environment variables.

From my point of view this is unimportant in the context here. I am suggesting a generic logging mechanism to make it explicit when a test misbehaves. You are suggesting how to fix one specific problem.

> 
>>
>> HOW
>>
>>  Notice that when creating a new fact we provide a name for the fact
>>  and a function that can get the value. Also notice that we cannot
>>  change a fact after it is initially set. The rationale is that the
>>  facts will be set before the first test is run, then they will be
>>  verified after each test.
> 
> 
>>
>> EXAMPLE
>>
>>  $ cat test-list
>>  igt@core_hotunplug@hotrebind
>>  igt@xe_module_load@reload-no-display
>>
>>  $ sudo IGT_TEST_ROOT='/home/gta/UPSTREAM/igt-gpu-tools/build/tests/' \
>>    ./build/runner/igt_runner --per-test-timeout 400 \
>>    --test-list /home/gta/igt/test-list /home/gta/igt/30
>>
>>  Number of facts: 3
>>  Fact kernel.kmod_is_loaded.amdgpu: false
>>  Fact kernel.kmod_is_loaded.i915: false
>>  Fact kernel.kmod_is_loaded.xe: false
>>  [117.461584] [1/2] core_hotunplug (hotrebind)
>>  [130.057183] Facts changed by last test:
>>   kernel.kmod_is_loaded.amdgpu changed from false to true.
>>   kernel.kmod_is_loaded.i915 changed from false to true.
>>   kernel.kmod_is_loaded.xe changed from false to true.
>>  
>>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file 
> ../runner/executor.c:2717:
>>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>>  Stack trace:
>>    #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>    #1 ../runner/executor.c:1020 execute()
>>    #2 ../runner/runner.c:40 main()
>>    #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
>>    #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>    #5 [_start+0x25]
>>  Test (null) failed.
>>  **** DEBUG ****
>>  (igt_runner:2246) INFO: Number of facts: 3
>>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.amdgpu: false
>>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.i915: false
>>  (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.xe: false
>>  (igt_runner:2246) CRITICAL: Test assertion failure function execute, file 
> ../runner/executor.c:2717:
>>  (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>>  (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>>  (igt_runner:2246) igt_core-INFO: Stack trace:
>>  (igt_runner:2246) igt_core-INFO:   #0 ../lib/igt_core.c:2051 
> __igt_fail_assert()
>>  (igt_runner:2246) igt_core-INFO:   #1 ../runner/executor.c:1020 execute()
>>  (igt_runner:2246) igt_core-INFO:   #2 ../runner/runner.c:40 main()
>>  (igt_runner:2246) igt_core-INFO:   #3 ../sysdeps/nptl/
> libc_start_call_main.h:58 __libc_start_call_main()
>>  (igt_runner:2246) igt_core-INFO:   #4 ../csu/libc-start.c:128 
> __libc_start_main@@GLIBC_2.34()
>>  (igt_runner:2246) igt_core-INFO:   #5 [_start+0x25]
>>  ****  END  ****
>>  FAIL (-1.000s)
> 
> From CI perspective, what failed?  A test executed just before the failure, 
> potentially already reported as successful?  A test about to be run next?  
> In fact, the core_hotunplug test hasn't failed, only some modules were loaded, 
> and unwanted i915 among them and bound to the device, because the CI 
> environment was not protected from that unwanted scenario to happen.
> 
> Besides, you would have to negotiate with CI and Bug Filing, and also with 
> developers resolving issues, if and how they would handle such "failures", 
> reported for (null) tests.

I mention on the first paragraph that aborting was an arbitrary decision. I am convinced now that aborting is not a good idea, and the facts will only add logging about changes to the environment.

Thanks a million!

> 
> Thanks,
> Janusz
> 
>>
>> Signed-off-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
>> ---
>>  runner/executor.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 191 insertions(+)
>>
>> diff --git a/runner/executor.c b/runner/executor.c
>> index ac73e1dde..f151898ef 100644
>> --- a/runner/executor.c
>> +++ b/runner/executor.c
>> @@ -36,10 +36,185 @@
>>  #include "output_strings.h"
>>  #include "runnercomms.h"
>>  
>> +#include "igt_kmod.h"
>> +
>>  #define KMSG_HEADER "[IGT] "
>>  #define KMSG_WARN 4
>>  #define GRACEFUL_EXITCODE -SIGHUP
>>  
>> +
>> +/* Fact checking functions.
>> + *
>> + * WHAT: THIS IS AN RFC
>> + *
>> + * These functions collect facts, such as "Is the i915 kmod loaded?" when
>> + * igt-runner starts and then recheck the facts after every test is run.
>> + * Currently the code will abort on fact changes, but this can obviously
>> + * be altered to suit our needs.
>> + *
>> + * TODO
>> + *
>> + * I have only implemented facts for checking if gpu kmods are loaded. I
>> + * want facts for:
>> + *  - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>> + *    model, codename
>> + *  - DRM Cards: association between card number and pci address
>> + *
>> + * In the future I want to divide facts into two groups: static and 
> dynamic.
>> + * Static facts are these I talk about here, facts that are expected to be
>> + * relevant to all tests such as which GPUs are present at the PCI bus.
>> + *
>> + * Dynamic facts are those defined by the tests themselves. The idea is to
>> + * create a convenient mechanism for tests to define facts, and to have
>> + * igt-runner to "dynamically" add these before start running the tests.
>> + *
>> + * I also want to have a section on results.json to include the facts.
>> + *
>> + * WHY
>> + *
>> + * Currently igt-runner expects each test to clean up after themselves to
>> + * not change the "state" of the machine. However, tests do change the
>> + * "state" of the machine causing negative side effects downstream.
>> + * As an example, at the time of writing, "igt@core_hotunplug@hotrebind"
>> + * will load gpu kmods and will not unload them after it is done. Then
>> + * trying to run "igt@xe_module_load@reload-no-display" will fail because
>> + * the i915 driver is loaded.
>> + *
>> + * HOW
>> + *
>> + * Notice that when creating a new fact we provide a name for the fact
>> + * and a function that can get the value. Also notice that we cannot
>> + * change a fact after it is initially set. The rationale is that the
>> + * facts will be set before the first test is run, then they will be
>> + * verified after each test.
>> + *
>> + */
>> +
>> +/* Type for the fact checking function. It returns the value. Returns
>> + * NULL on failure.
>> + */
>> +typedef const char* (*fact_checker)(const char *name);
>> +
>> +/* Struct for each fact with a name, a pointer to the checker function,
>> + * a bool value, and a string value.
>> + */
>> +typedef struct {
>> +	const char *name;
>> +	const char *initial_value; /* Supposed to be set only once */
>> +	fact_checker checker;
>> +} igt_runner_fact;
>> +
>> +/* Struct for the fact list */
>> +typedef struct {
>> +	igt_runner_fact *facts;
>> +	int num_facts;
>> +} igt_runner_fact_list;
>> +
>> +/* Add a new fact if it doesn't exist. Return false on failure such as when
>> + * the fact already exist.
>> + */
>> +static bool set_fact(igt_runner_fact_list *list, const char *name, 
> fact_checker checker)
>> +{
>> +	/* Check if the fact exist */
>> +	for (int i = 0; i < list->num_facts; i++) {
>> +		if (strcmp(list->facts[i].name, name) == 0)
>> +			return false;
>> +	}
>> +
>> +	/* Add the fact */
>> +	list->num_facts++;
>> +	list->facts = realloc(list->facts, list->num_facts * 
> sizeof(igt_runner_fact));
>> +	list->facts[list->num_facts - 1].name = name;
>> +	list->facts[list->num_facts - 1].checker = checker;
>> +	list->facts[list->num_facts - 1].initial_value = checker(name);
>> +
>> +	/* Check if initial_value is NULL */
>> +	if (list->facts[list->num_facts - 1].initial_value == NULL)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +/* Get a fact by name. Return NULL if fact is not found. */
>> +static igt_runner_fact *get_fact(igt_runner_fact_list *list, const char 
> *name)
>> +{
>> +	for (int i = 0; i < list->num_facts; i++) {
>> +		if (strcmp(list->facts[i].name, name) == 0)
>> +			return &list->facts[i];
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +/* print_all_facts: pretty print all facts */
>> +static void print_all_facts(igt_runner_fact_list *list)
>> +{
>> +	igt_info("Number of facts: %d\n", list->num_facts);
>> +	for (int i = 0; i < list->num_facts; i++)
>> +		igt_info("Fact %s: %s\n", list->facts[i].name, list-
>> facts[i].initial_value);
>> +}
>> +
>> +/* get_fact_changes() check if there are fact changes compared to *list.
>> + * Returns:
>> + *  - NULL if there are no changes
>> + *  - A string with the changes if there are any
>> + *
>> + * Free the returned string with g_free()
>> + */
>> +static char *get_fact_changes(igt_runner_fact_list *list)
>> +{
>> +	char *changes = NULL;
>> +	const char *new_value = NULL;
>> +
>> +	for (int i = 0; i < list->num_facts; i++) {
>> +		new_value = list->facts[i].checker(list->facts[i].name);
>> +		if (list->facts[i].initial_value != new_value) {
>> +			if (changes == NULL) {
>> +				changes = malloc(1);
>> +				changes[0] = '\0';
>> +			}
>> +
>> +			changes = g_strdup_printf("%s %s changed from %s 
> to %s.\n",
>> +						  changes,
>> +						  list-
>> facts[i].name,
>> +						  list-
>> facts[i].initial_value,
>> +						  new_value);
>> +		}
>> +	}
>> +
>> +	return changes;
>> +}
>> +
>> +/* Check if the kernel module at the end of name is loaded */
>> +static const char *fact_check_kmod_loaded(const char *name)
>> +{
>> +	const char *module = NULL;
>> +
>> +	/* Name will be something like "kernel.kmod_is_loaded.i915".
>> +	 * Save the string after the last dot on module.
>> +	 */
>> +	module = strrchr(name, '.');
>> +	if (module == NULL)
>> +		return NULL;
>> +
>> +	module++; /* Skip the dot */
>> +
>> +	/* Call igt_kmod_is_loaded and convert bool to string */
>> +	return igt_kmod_is_loaded(module) ? "true" : "false";
>> +}
>> +
>> +/* Function to init all the facts */
>> +static void init_facts(igt_runner_fact_list *list)
>> +{
>> +	/* Init the list */
>> +	list->num_facts = 0;
>> +	list->facts = NULL;
>> +
>> +	set_fact(list, "kernel.kmod_is_loaded.amdgpu", 
> fact_check_kmod_loaded);
>> +	set_fact(list, "kernel.kmod_is_loaded.i915", 
> fact_check_kmod_loaded);
>> +	set_fact(list, "kernel.kmod_is_loaded.xe", fact_check_kmod_loaded);
>> +}
>> +
>>  static struct {
>>  	int *fds;
>>  	size_t num_dogs;
>> @@ -2306,6 +2481,13 @@ bool execute(struct execute_state *state,
>>  	sigset_t sigmask;
>>  	double time_spent = 0.0;
>>  	bool status = true;
>> +	igt_runner_fact_list facts;
>> +
>> +	/* Init the facts */
>> +	init_facts(&facts);
>> +
>> +	/* Print facts */
>> +	print_all_facts(&facts);
>>  
>>  	if (state->dry) {
>>  		outf("Dry run, not executing. Invoke igt_resume if you 
> want to execute.\n");
>> @@ -2437,6 +2619,7 @@ bool execute(struct execute_state *state,
>>  		char *job_name;
>>  		int result;
>>  		bool already_written = false;
>> +		char *fact_changes = NULL;
>>  
>>  		if (should_die_because_signal(sigfd)) {
>>  			status = false;
>> @@ -2525,6 +2708,14 @@ bool execute(struct execute_state *state,
>>  			state->time_left = time_left;
>>  			return execute(state, settings, job_list);
>>  		}
>> +
>> +		/* Abort if facts changed */
>> +		fact_changes = get_fact_changes(&facts);
>> +		if (fact_changes) {
>> +			/* Print fact_changes */
>> +			errf("Facts changed by last test:\n%s\n", 
> fact_changes);
>> +			igt_assert(fact_changes == NULL);
>> +		}
>>  	}
>>  
>>  	if ((timefd = openat(resdirfd, "endtime.txt", O_CREAT | O_WRONLY | 
> O_EXCL, 0666)) >= 0) {
>>
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-10-30  7:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-28  7:08 [PATCH i-g-t] RFC: igt-runner fact checking Peter Senna Tschudin
2024-10-28 10:06 ` Zbigniew Kempczyński
2024-10-28 12:53 ` Krzysztofik, Janusz
2024-10-30  7:15   ` Peter Senna Tschudin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox