Igt-dev Archive on lore.kernel.org
From: Peter Senna Tschudin <peter.senna@linux.intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: igt-dev@lists.freedesktop.org,
	"Francois Dugast" <francois.dugast@intel.com>,
	"Zbigniew Kempczyński" <zbigniew.kempczynski@intel.com>,
	"Kamil Konieczny" <kamil.konieczny@linux.intel.com>
Subject: Re: [PATCH i-g-t 0/4] Device scan fixes
Date: Thu, 19 Dec 2024 09:44:31 +0100
Message-ID: <f1c12cc0-dbd8-428a-82d9-dede1e6f4164@linux.intel.com>
In-Reply-To: <3izfilgj2eycb4fvdudz62wxq6fwarehczg5z75qpbo7l5xxvk@zdh4xgda2npo>

[-- Attachment #1: Type: text/plain, Size: 6358 bytes --]

Hi Lucas,

On 18.12.2024 23:04, Lucas De Marchi wrote:
> On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 18.12.2024 06:13, Lucas De Marchi wrote:
>>> This started with the goal of fixing xe_wedged (besides the fix to the
>>> kernel that is in flight), however the issue of device "disappearing
>>> from bus" looks similar to several other issues we occasionally have -
>>> it may be the same issue. Try to fix it by forcing scans.
>>
>> Is the hypothesis that while an IGT test is running something may
>> break the association between GPU and device node? I am asking because
>> the drm device is always closed after a test ends.
> 
> it depends on what the test is doing, whether the fd is closed, etc. Note
> that multiple subtests are executed without running a new program. Even
> some under-the-hood behavior that we have with e.g. i915 (it keeps a dup
> fd open) may change the behavior if you load/unload or
> bind/unbind. Example:
> 
> RPL (00:02.0) + BMG (03:00.0)
>     # ./build/tools/lsgpu
>     card1                    Intel Battlemage (Gen20)          drm:/dev/dri/card1
>     └─renderD129                                               drm:/dev/dri/renderD129
>     card0                    Intel Raptorlake_s (Gen12)        drm:/dev/dri/card0
>     └─renderD128                                               drm:/dev/dri/renderD128
>
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
> Rebind RPL:
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
> 
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
> 
> Great, nothing breaks, right?
> 
> Rebind both in a different order (shouldn't happen if both cards are
> backed by the same module, but can perfectly happen in an i915 + xe
> scenario as the modules may be ready in a different order):
> 
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
> 
> Simulate one thing that may happen in igt: rebind, but leak an fd:
> 
>     # exec 3<>  /dev/dri/by-path/pci-0000:03:00.0-card
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
>     lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
> 
> So... because we had an fd open, it now became card2 rather than card1.
> Also note that simply calling close(fd) in igt doesn't work as the
> cached fds will throw igt through the wrong path.
> 
> So... instead of adding band-aids everywhere in igt, I think we should
> fix the design to stop caching the wrong thing. IMO a first step would
> be "disable the cache and see how much it impacts".
> 
> Lucas De Marchi
> 
>>
>> I tested this by printing the value of _opened_fds_count from
>> lib/drmtest.c before each test, and the value is always 0.
>>
>> _opened_fds_count being zero means that the cache is empty, right? So
>> if the cache is empty before each test starts, under which conditions
>> can the potential problem manifest?


I could finally detect the issue (the attached patch should be applied on top of my facts series) by:
 - Creating peter_drm_stats(): prints drm cache entries with fd and card number
 - Instrumenting igt@core_hotunplug@hotrebind with calls to igt_facts() and peter_drm_stats()

My goal is to create a test that we can use to evaluate possible fixes. Can you help me create a
test that is more meaningful for us to use as a benchmark? See the output:

Starting subtest: hotrebind
...
[5147.892783] [FACT Before it starts] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
[5147.897991] [DRM_CACHE] Before it starts: fd 5, card: 1
[5148.968237] [DRM_CACHE] After local_drm_open_driver(): fd 5, card: 1
Unloaded audio driver snd_hda_intel
Realoading snd_hda_intel
[5150.025062] [FACT After driver_unbind() and driver_bind()] changed: hardware.pci.drm_card_at_addr.0000:03:00.0: card1 -> card0

// Here the cache is wrong
[5150.030141] [DRM_CACHE] After driver_unbind() and driver_bind(): fd 5, card: 1

Opened device: /dev/dri/card0
Opened device: /dev/dri/renderD128

// healthcheck() fixes the cache by calling igt_devices_scan(true)
[5151.100086] [DRM_CACHE] After healthcheck(): fd 8, card: 0

Subtest hotrebind: SUCCESS (3.227s)
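
As a possible building block for such a test, here is a rough sketch of a
helper that turns a node path into a card index, so that the index seen
before and after the rebind can be compared. card_index_from_node() is a
hypothetical name and is not part of IGT or of the attached patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of IGT: extracts N from a node path whose
 * last component is "cardN". Returns -1 for anything else, including
 * render nodes.
 */
static int card_index_from_node(const char *node_path)
{
	const char *base;
	char trailing;
	int index;

	assert(node_path != NULL);

	/* Look only at the last path component */
	base = strrchr(node_path, '/');
	base = base ? base + 1 : node_path;

	/* The extra %c rejects trailing garbage such as "card1x" */
	if (sscanf(base, "card%d%c", &index, &trailing) != 1 || index < 0)
		return -1;

	return index;
}
```

A test could call realpath() on /dev/dri/by-path/pci-0000:03:00.0-card before
and after driver_bind(), feed both results to this helper, and compare the
outcome with the card number reported for the cached fd.
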

[-- Attachment #2: debug.patch --]
[-- Type: text/plain, Size: 2594 bytes --]

diff --git a/lib/drmtest.c b/lib/drmtest.c
index 2dd4540b8..6b7c35e20 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -35,6 +35,7 @@
 #include <sys/ioctl.h>
 #include <string.h>
 #include <sys/mman.h>
+#include <sys/sysmacros.h>
 #include <signal.h>
 #include <pciaccess.h>
 #include <stdlib.h>
@@ -418,6 +419,37 @@ static bool _is_already_opened(const char *path, int as_idx)
 	return false;
 }
 
+void peter_drm_stats(const char *msg)
+{
+	struct timespec uptime_ts;
+	char *uptime = NULL;
+
+	if (clock_gettime(CLOCK_BOOTTIME, &uptime_ts) != 0)
+		return;
+
+	/* asprintf() leaves uptime undefined on failure, so bail out */
+	if (asprintf(&uptime,
+		     "%ld.%06ld",
+		     uptime_ts.tv_sec,
+		     uptime_ts.tv_nsec / 1000) < 0)
+		return;
+
+	for (int i = 0; i < _opened_fds_count; i++) {
+		unsigned long int st_rdev;
+		int fd, card;
+		fd = _opened_fds[i].fd;
+		st_rdev = _opened_fds[i].stat.st_rdev;
+		card =  gnu_dev_minor(st_rdev);
+		igt_info("[%s] [DRM_CACHE] %s: fd %d, card: %d\n",
+			 uptime,
+			 msg ? msg : "",
+			 fd,
+			 card);
+	}
+
+	free(uptime);
+}
+
 static int __search_and_open(const char *base, int offset, unsigned int chipset, int as_idx)
 {
 	const char *forced;
diff --git a/lib/drmtest.h b/lib/drmtest.h
index 27e5a18e2..d3713e9b9 100644
--- a/lib/drmtest.h
+++ b/lib/drmtest.h
@@ -145,6 +145,7 @@ bool is_vc4_device(int fd);
 bool is_xe_device(int fd);
 bool is_intel_device(int fd);
 enum intel_driver get_intel_driver(int fd);
+extern void peter_drm_stats(const char *msg);
 
 /**
  * do_or_die:
diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
index 7f17f4423..838404d8b 100644
--- a/tests/core_hotunplug.c
+++ b/tests/core_hotunplug.c
@@ -39,6 +39,8 @@
 #include "igt_kmod.h"
 #include "igt_sysfs.h"
 #include "sw_sync.h"
+#include "igt_facts.h"
+
 /**
  * TEST: core hotunplug
  * Description: Examine behavior of a driver on device hot unplug
@@ -614,15 +616,29 @@ static void hotunplug_rescan(struct hotunplug *priv)
 
 static void hotrebind(struct hotunplug *priv)
 {
+	igt_facts_lists_init();
+
+	igt_facts("Before it starts");
+	peter_drm_stats("Before it starts");
+
 	pre_check(priv);
 
 	priv->fd.drm = local_drm_open_driver(false, "", " for hot rebind");
 
+	igt_facts("After local_drm_open_driver()");
+	peter_drm_stats("After local_drm_open_driver()");
+
 	driver_unbind(priv, "hot ", 60);
 
 	driver_bind(priv, 0);
 
+	igt_facts("After driver_unbind() and driver_bind()");
+	peter_drm_stats("After driver_unbind() and driver_bind()");
+
 	igt_assert_f(healthcheck(priv, false), "%s\n", priv->failure);
+
+	igt_facts("After healthcheck()");
+	peter_drm_stats("After healthcheck()");
 }
 
 static void hotreplug(struct hotunplug *priv)

Thread overview: 20+ messages
2024-12-18  5:13 [PATCH i-g-t 0/4] Device scan fixes Lucas De Marchi
2024-12-18  5:13 ` [PATCH i-g-t 1/4] lib/drmtest: Fix drm_close_driver() Lucas De Marchi
2024-12-18  5:13 ` [PATCH i-g-t 2/4] lib/igt_sysfs: Fix close when rebinding Lucas De Marchi
2024-12-18  5:13 ` [PATCH i-g-t 3/4] lib/igt_sysfs: Move close to be common to all actions Lucas De Marchi
2024-12-18  5:13 ` [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload Lucas De Marchi
2024-12-18  6:07   ` Peter Senna Tschudin
2024-12-18  6:17   ` Zbigniew Kempczyński
2024-12-20 19:10     ` Lucas De Marchi
2024-12-18  6:34   ` Zbigniew Kempczyński
2024-12-19 16:35     ` Lucas De Marchi
2024-12-20  6:59       ` Zbigniew Kempczyński
2024-12-20 18:52         ` Lucas De Marchi
2024-12-18  8:20   ` Kamil Konieczny
2024-12-18  8:16 ` [PATCH i-g-t 0/4] Device scan fixes Peter Senna Tschudin
2024-12-18 22:04   ` Lucas De Marchi
2024-12-19  8:44     ` Peter Senna Tschudin [this message]
2024-12-18 20:55 ` ✓ i915.CI.BAT: success for " Patchwork
2024-12-18 23:00 ` ✓ Xe.CI.BAT: " Patchwork
2024-12-19 10:18 ` ✗ i915.CI.Full: failure " Patchwork
2024-12-19 14:13 ` ✗ Xe.CI.Full: " Patchwork
