Intel-GFX Archive on lore.kernel.org
From: "André Almeida" <andrealmeid@igalia.com>
To: Alex Deucher <alexdeucher@gmail.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Raag Jadav" <raag.jadav@intel.com>,
	intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	simona@ffwll.ch, intel-gfx@lists.freedesktop.org,
	joonas.lahtinen@linux.intel.com, dri-devel@lists.freedesktop.org,
	himal.prasad.ghimiray@intel.com, lucas.demarchi@intel.com,
	tursulin@ursulin.net, francois.dugast@intel.com,
	jani.nikula@linux.intel.com, airlied@gmail.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	andi.shyti@linux.intel.com, matthew.d.roper@intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	kernel-dev@igalia.com, "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH v7 1/5] drm: Introduce device wedged event
Date: Fri, 18 Oct 2024 14:56:17 -0300
Message-ID: <3fac9971-8d26-4d52-badb-2b14b3f84263@igalia.com>
In-Reply-To: <CADnq5_M62YZRvBT7sQwrZTiHrUsifaqqgrWOD_z+YY=EiBtEcA@mail.gmail.com>

On 18/10/2024 12:31, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>>
>> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
>>> Hi Raag,
>>>
>>> On 30/09/2024 04:38, Raag Jadav wrote:
>>>> Introduce device wedged event, which will notify userspace of wedged
>>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>>> useful especially in cases where the device is no longer operating as
>>>> expected even after a hardware reset and has become unrecoverable from
>>>> driver context.
>>>>
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>>       =============== ==================================
>>>>       Recovery method Consumer expectations
>>>>       =============== ==================================
>>>>       rebind          unbind + rebind driver
>>>>       bus-reset       unbind + reset bus device + rebind
>>>>       reboot          reboot system
>>>>       =============== ==================================
>>>>
>>>>
>>>
>>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
>>>
>>> The motivation was that amdgpu was getting stuck after every GPU reset, and
>>> there was just a black screen. The uevent would then trigger a daemon to
>>> reset the compositor and get things back together. As you can see in my
>>> thread, the feature was blocked in favor of getting better overall GPU reset
>>> from the kernel side.
>>>
>>> Which kinds of scenarios make i915/xe need userspace involvement? I tested
>>> a bunch of resets in i915 but never managed to get the driver stuck.
>>
>> 2 scenarios:
>>
>> 1. Multiple levels of reset have failed and the device was declared wedged.
>> This is rare indeed, as the resets have improved a lot.
>> 2. Debug case. We can boot the driver with an option to declare the device
>> wedged at any timeout, so the device can be debugged.
>>
>>>
>>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
>>> intervention.
>>
>> How do you trigger that?
> 
> What do you mean by bus reset?  I think Christian is just referring
> to a full adapter reset (as opposed to a queue reset or something more
> fine grained).  The driver can reset the device via MMIO or firmware,
> depending on the device.  I think there are also PCI helpers for
> things like PCI FLR.
> 

I was referring to AMD_RESET_PCI:

"Does a full bus reset using core Linux subsystem PCI reset and does a 
secondary bus reset or FLR, depending on what the underlying hardware 
supports."

And that can be triggered by using `amdgpu.reset_method=5` as the module 
option (the parameter is backed by the driver's `amdgpu_reset_method` 
variable).
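
For reference, the WEDGED=<method> table quoted above could be consumed by
a userspace helper roughly like the sketch below. This is only an
illustration: the function names, the step strings, and the sample
ACTION/SUBSYSTEM lines are made up for the example, and only the three
method names and their expected actions come from the patch description.

```python
# Hypothetical helper: maps the WEDGED=<method> value from the v7 patch
# description to the consumer steps listed in its table. Nothing here
# touches sysfs or udev; it only models the lookup.
RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent_env(lines):
    """Turn KEY=VALUE uevent environment lines into a dict."""
    env = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE pairs
            env[key.strip()] = value.strip()
    return env


def recovery_steps(env):
    """Return the consumer steps for a wedged event, or [] if not one."""
    method = env.get("WEDGED")
    if method is None:
        return []
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")
    return list(RECOVERY_STEPS[method])


if __name__ == "__main__":
    # Sample environment; ACTION and SUBSYSTEM values are assumptions.
    sample = ["ACTION=change", "SUBSYSTEM=drm", "WEDGED=bus-reset"]
    print(recovery_steps(parse_uevent_env(sample)))
```

A real consumer would hook this up to a udev rule or a netlink listener
and perform the actual unbind/reset/rebind, which the sketch leaves out.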

