Hi Lucas,

On 18.12.2024 23:04, Lucas De Marchi wrote:
> On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 18.12.2024 06:13, Lucas De Marchi wrote:
>>> This started with the goal of fixing xe_wedged (besides the fix to the
>>> kernel that is in flight), however the issue of device "disappearing
>>> from bus" looks similar to several other issues we occasionally have -
>>> it may be the same issue. Try to fix it by forcing scans.
>>
>> Is the hypothesis that while an IGT test is running something may
>> break the association between GPU and device node? I am asking because
>> the drm device is always closed after a test ends.
>
> it depends what the test is doing, if the fd is closed etc... note that
> multiple subtests are executed without running a new program. Even some
> under-the-hood behavior that we have with e.g. i915 (it keeps a dup fd
> open) may change the behavior if you load/unload or bind/unbind.
> Example:
>
> RPL (00:02.0) + BMG (03:00.0)
>     # ./build/tools/lsgpu
>     card1                    Intel Battlemage (Gen20)          drm:/dev/dri/card1
>     └─renderD129                                               drm:/dev/dri/renderD129
>     card0                    Intel Raptorlake_s (Gen12)        drm:/dev/dri/card0
>     └─renderD128                                               drm:/dev/dri/renderD128
>
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
>
> Rebind RPL:
>
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
>
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
>
> Great, nothing breaks, right?
>
> Rebind both in a different order (shouldn't happen if both cards are
> backed by the same module, but can perfectly happen in a i915 + xe
> scenario as the modules may be ready in different order):
>
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
>
> Simulate one thing that may happen in igt: but leak an fd:
>
>     # exec 3<>  /dev/dri/by-path/pci-0000:03:00.0-card
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
>     lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
>
> So... because we had an fd open, now it became card2 rather than 1.
> Also note that simply calling close(fd) in igt doesn't work as the
> cached fds will throw igt through the wrong path.
>
> So.... instead of adding bandaid everywhere in igt, I think we should
> fix the design to stop caching the wrong thing. IMO a first step would
> be "disable the cache and see how much it impacts".
>
> Lucas De Marchi
>
>>
>> I tested this by printing the value of _opened_fds_count from
>> lib/drmtest.c before each test, and the value is always 0.
>>
>> _opened_fds_count being zero means that the cache is empty, right? So
>> if the cache is empty before each test starts, under which conditions
>> can the potential problem manifest?

I could finally detect the issue (the patch should be applied on top of
my facts series) by:
 - Creating peter_drm_stats(), which prints the drm cache entries with
   fd and card number
 - Instrumenting igt@core_hotunplug@hotrebind with calls to igt_facts()
   and peter_drm_stats()

My goal is to create a test that we can use to evaluate possible fixes.
Can you help me create a test that is more meaningful for us to use as
a benchmark?

See the output:

Starting subtest: hotrebind
...
[5147.892783] [FACT Before it starts] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
[5147.897991] [DRM_CACHE] Before it starts: fd 5, card: 1
[5148.968237] [DRM_CACHE] After local_drm_open_driver(): fd 5, card: 1
Unloaded audio driver snd_hda_intel
Realoading snd_hda_intel
[5150.025062] [FACT After driver_unbind() and driver_bind()] changed: hardware.pci.drm_card_at_addr.0000:03:00.0: card1 -> card0

// Here the cache is wrong
[5150.030141] [DRM_CACHE] After driver_unbind() and driver_bind(): fd 5, card: 1
Opened device: /dev/dri/card0
Opened device: /dev/dri/renderD128

// healthcheck() fixes the cache by calling igt_devices_scan(true)
[5151.100086] [DRM_CACHE] After healthcheck(): fd 8, card: 0
Subtest hotrebind: SUCCESS (3.227s)
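
For reference, the comparison I am doing by eye in the log above (cached
card number vs. what the by-path symlink says for the same PCI address)
could be sketched in shell roughly like this. This is only a minimal
sketch: card_for_pci and cache_is_fresh are hypothetical names that do
not exist in IGT, and the optional directory argument is an assumption
added so the check can be tried against a scratch directory instead of
the real /dev/dri/by-path.

```shell
#!/bin/sh
# Hypothetical helper (not part of IGT): print the card node name
# (e.g. "card0") that the by-path symlink for a PCI address points to.
#   $1: PCI address, e.g. 0000:03:00.0
#   $2: by-path directory (optional, defaults to /dev/dri/by-path)
card_for_pci() {
    dir="${2:-/dev/dri/by-path}"
    link="$dir/pci-$1-card"
    # Fail if there is no symlink for this address (device unbound).
    [ -L "$link" ] || return 1
    target=$(readlink "$link") || return 1
    # The link target looks like "../card0"; keep only the node name.
    basename "$target"
}

# Hypothetical check (not part of IGT): return 0 if the cached card
# number still matches the symlink, 1 if it is stale, 2 if the device
# has no by-path entry at all.
#   $1: PCI address
#   $2: cached card number, e.g. 1
#   $3: by-path directory (optional)
cache_is_fresh() {
    current=$(card_for_pci "$1" "$3") || return 2
    [ "$current" = "card$2" ]
}
```

With the values from the hotrebind log, "cache_is_fresh 0000:03:00.0 1"
would be expected to fail after driver_unbind() and driver_bind(),
because the address is then reported as card0 while the cache still
says card 1.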