From: Peter Senna Tschudin <peter.senna@linux.intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: igt-dev@lists.freedesktop.org,
"Francois Dugast" <francois.dugast@intel.com>,
"Zbigniew Kempczyński" <zbigniew.kempczynski@intel.com>,
"Kamil Konieczny" <kamil.konieczny@linux.intel.com>
Subject: Re: [PATCH i-g-t 0/4] Device scan fixes
Date: Thu, 19 Dec 2024 09:44:31 +0100
Message-ID: <f1c12cc0-dbd8-428a-82d9-dede1e6f4164@linux.intel.com>
In-Reply-To: <3izfilgj2eycb4fvdudz62wxq6fwarehczg5z75qpbo7l5xxvk@zdh4xgda2npo>
[-- Attachment #1: Type: text/plain, Size: 6358 bytes --]
Hi Lucas,
On 18.12.2024 23:04, Lucas De Marchi wrote:
> On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 18.12.2024 06:13, Lucas De Marchi wrote:
>>> This started with the goal of fixing xe_wedged (besides the fix to the
>>> kernel that is in flight), however the issue of device "disappearing
>>> from bus" looks similar to several other issues we occasionally have -
>>> it may be the same issue. Try to fix it by forcing scans.
>>
>> Is the hypothesis that while an IGT test is running something may
>> break the association between GPU and device node? I am asking because
>> the drm device is always closed after a test ends.
>
> it depends what the test is doing, if the fd is closed etc... note that
> multiple subtests are executed without running a new program. Even some
> under-the-hood behavior that we have with e.g. i915 (it keeps a dup fd
> open) may change the behavior if you load/unload or
> bind/unbind. Example:
>
> RPL (00:02.0) + BMG (03:00.0)
> # ./build/tools/lsgpu
> card1 Intel Battlemage (Gen20) drm:/dev/dri/card1
> └─renderD129 drm:/dev/dri/renderD129
> card0 Intel Raptorlake_s (Gen12) drm:/dev/dri/card0
> └─renderD128 drm:/dev/dri/renderD128
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
> Rebind RPL:
> # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
> # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
>
> Great, nothing breaks, right?
>
> Rebind both in a different order (shouldn't happen if both cards are
> backed by the same module, but can perfectly happen in a i915 + xe
> scenario as the modules may be ready in different order):
>
> # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
> # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
>
> Simulate one thing that may happen in igt, a leaked fd:
>
> # exec 3<> /dev/dri/by-path/pci-0000:03:00.0-card
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
> lrwxrwxrwx 1 root root 8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
> lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
>
> So... because we had an fd open, now it became card2 rather than 1.
> Also note that simply calling close(fd) in igt doesn't work as the
> cached fds will throw igt through the wrong path.
>
> So.... instead of adding bandaid everywhere in igt, I think we should
> fix the design to stop caching the wrong thing. IMO a first step would
> be "disable the cache and see how much it impacts".
>
> Lucas De Marchi
>
>>
>> I tested this by printing the value of _opened_fds_count from
>> lib/drmtest.c before each test, and the value is always 0.
>>
>> _opened_fds_count being zero means that the cache is empty, right? So
>> if the cache is empty before each test starts, under which conditions
>> can the potential problem manifest?
I could finally detect the issue (the patch should be applied on top of my facts series) by:
- Creating peter_drm_stats(), which prints the DRM cache entries with fd and card number
- Instrumenting igt@core_hotunplug@hotrebind with calls to igt_facts() and peter_drm_stats()

My goal is to create a test that we can use to evaluate possible fixes. Can you help me create
a test that is more meaningful for us to use as a benchmark? See the output:
Starting subtest: hotrebind
...
[5147.892783] [FACT Before it starts] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
[5147.897991] [DRM_CACHE] Before it starts: fd 5, card: 1
[5148.968237] [DRM_CACHE] After local_drm_open_driver(): fd 5, card: 1
Unloaded audio driver snd_hda_intel
Realoading snd_hda_intel
[5150.025062] [FACT After driver_unbind() and driver_bind()] changed: hardware.pci.drm_card_at_addr.0000:03:00.0: card1 -> card0
// Here the cache is wrong
[5150.030141] [DRM_CACHE] After driver_unbind() and driver_bind(): fd 5, card: 1
Opened device: \/dev\/dri\/card0
Opened device: \/dev\/dri\/renderD128
// healthcheck() fixes the cache by calling igt_devices_scan(true)
[5151.100086] [DRM_CACHE] After healthcheck(): fd 8, card: 0
Subtest hotrebind: SUCCESS (3.227s)
[-- Attachment #2: debug.patch --]
[-- Type: text/plain, Size: 2594 bytes --]
diff --git a/lib/drmtest.c b/lib/drmtest.c
index 2dd4540b8..6b7c35e20 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -35,6 +35,7 @@
#include <sys/ioctl.h>
#include <string.h>
#include <sys/mman.h>
+#include <sys/sysmacros.h>
#include <signal.h>
#include <pciaccess.h>
#include <stdlib.h>
@@ -418,6 +419,36 @@ static bool _is_already_opened(const char *path, int as_idx)
 	return false;
 }
+void peter_drm_stats(const char *msg)
+{
+	struct timespec uptime_ts;
+	char *uptime = NULL;
+
+	if (clock_gettime(CLOCK_BOOTTIME, &uptime_ts) != 0)
+		return;
+
+	/* Format the uptime like a dmesg timestamp; bail out if allocation fails */
+	if (asprintf(&uptime,
+		     "%ld.%06ld",
+		     uptime_ts.tv_sec,
+		     uptime_ts.tv_nsec / 1000) < 0)
+		return;
+
+	/* Print every cached fd together with the minor of the node it was opened on */
+	for (int i = 0; i < _opened_fds_count; i++) {
+		unsigned long int st_rdev;
+		int fd, card;
+		fd = _opened_fds[i].fd;
+		st_rdev = _opened_fds[i].stat.st_rdev;
+		card = gnu_dev_minor(st_rdev);
+		igt_info("[%s] [DRM_CACHE] %s: fd %d, card: %d\n",
+			 uptime,
+			 msg ? msg : "",
+			 fd,
+			 card);
+	}
+
+	free(uptime);
+}
+
static int __search_and_open(const char *base, int offset, unsigned int chipset, int as_idx)
{
const char *forced;
diff --git a/lib/drmtest.h b/lib/drmtest.h
index 27e5a18e2..d3713e9b9 100644
--- a/lib/drmtest.h
+++ b/lib/drmtest.h
@@ -145,6 +145,7 @@ bool is_vc4_device(int fd);
bool is_xe_device(int fd);
bool is_intel_device(int fd);
enum intel_driver get_intel_driver(int fd);
+extern void peter_drm_stats(const char *msg);
/**
* do_or_die:
diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
index 7f17f4423..838404d8b 100644
--- a/tests/core_hotunplug.c
+++ b/tests/core_hotunplug.c
@@ -39,6 +39,8 @@
#include "igt_kmod.h"
#include "igt_sysfs.h"
#include "sw_sync.h"
+#include "igt_facts.h"
+
/**
* TEST: core hotunplug
* Description: Examine behavior of a driver on device hot unplug
@@ -614,15 +616,29 @@ static void hotunplug_rescan(struct hotunplug *priv)
static void hotrebind(struct hotunplug *priv)
{
+ igt_facts_lists_init();
+
+ igt_facts("Before it starts");
+ peter_drm_stats("Before it starts");
+
pre_check(priv);
priv->fd.drm = local_drm_open_driver(false, "", " for hot rebind");
+ igt_facts("After local_drm_open_driver()");
+ peter_drm_stats("After local_drm_open_driver()");
+
driver_unbind(priv, "hot ", 60);
driver_bind(priv, 0);
+ igt_facts("After driver_unbind() and driver_bind()");
+ peter_drm_stats("After driver_unbind() and driver_bind()");
+
igt_assert_f(healthcheck(priv, false), "%s\n", priv->failure);
+
+ igt_facts("After healthcheck()");
+ peter_drm_stats("After healthcheck()");
}
static void hotreplug(struct hotunplug *priv)