From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Alex Deucher <alexdeucher@gmail.com>
Cc: "André Almeida" <andrealmeid@igalia.com>,
	"Raag Jadav" <raag.jadav@intel.com>,
	intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	simona@ffwll.ch, intel-gfx@lists.freedesktop.org,
	joonas.lahtinen@linux.intel.com, dri-devel@lists.freedesktop.org,
	himal.prasad.ghimiray@intel.com, lucas.demarchi@intel.com,
	tursulin@ursulin.net, francois.dugast@intel.com,
	jani.nikula@linux.intel.com, airlied@gmail.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	andi.shyti@linux.intel.com, matthew.d.roper@intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	kernel-dev@igalia.com, "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH v7 1/5] drm: Introduce device wedged event
Date: Thu, 24 Oct 2024 13:48:48 -0400
Message-ID: <ZxqIgODaguPVxJvD@intel.com>
In-Reply-To: <CADnq5_PmHnYDvQpGNCF_3xP0a84EKsEuMqrj0MuUC=TyKTTrDg@mail.gmail.com>

On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote:
> >
> > On 18/10/2024 12:31, Alex Deucher wrote:
> > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> > >>
> > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > >>> Hi Raag,
> > >>>
> > >>> On 30/09/2024 04:38, Raag Jadav wrote:
> > >>>> Introduce device wedged event, which will notify userspace of wedged
> > >>>> (hung/unusable) state of the DRM device through a uevent. This is
> > >>>> useful especially in cases where the device is no longer operating as
> > >>>> expected even after a hardware reset and has become unrecoverable from
> > >>>> driver context.
> > >>>>
> > >>>> Purpose of this implementation is to provide drivers a generic way to
> > >>>> recover with the help of userspace intervention. Different drivers may
> > >>>> have different ideas of a "wedged device" depending on their hardware
> > >>>> implementation, and hence the vendor agnostic nature of the event.
> > >>>> It is up to the drivers to decide when they see the need for recovery
> > >>>> and how they want to recover from the available methods.
> > >>>>
> > >>>> Current implementation defines three recovery methods, out of which,
> > >>>> drivers can choose to support any one or multiple of them. Preferred
> > >>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
> > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event
> > >>>> and take respective action to recover the device.
> > >>>>
> > >>>>       =============== ==================================
> > >>>>       Recovery method Consumer expectations
> > >>>>       =============== ==================================
> > >>>>       rebind          unbind + rebind driver
> > >>>>       bus-reset       unbind + reset bus device + rebind
> > >>>>       reboot          reboot system
> > >>>>       =============== ==================================
> > >>>>
> > >>>>
> > >>>
> > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> > >>>
> > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and
> > >>> there was just a black screen. The uevent would then trigger a daemon to
> > >>> reset the compositor and get things back together. As you can see in my
> > >>> thread, the feature was blocked in favor of getting better overall GPU reset
> > >>> from the kernel side.
> > >>>
> > >>> Which scenarios make i915/xe need userspace involvement? I tested a
> > >>> bunch of resets in i915 but never managed to get the driver stuck.
> > >>
> > >> 2 scenarios:
> > >>
> > >> 1. Multiple levels of reset have failed and the device was declared wedged.
> > >> This is indeed rare, as the resets have improved a lot.
> > >> 2. Debug case. We can boot the driver with an option to declare the device
> > >> wedged at any timeout, so the device can be debugged.
> > >>
> > >>>
> > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > >>> intervention.
> > >>
> > >> How do you trigger that?
> > >
> > > What do you mean by bus reset?  I think Christian is just referring
> > > to a full adapter reset (as opposed to a queue reset or something more
> > > fine grained).  Driver can reset the device via MMIO or firmware,
> > > depending on the device.  I think there are also PCI helpers for
> > > things like PCI FLR.
> > >
> >
> > I was referring to AMD_RESET_PCI:
> >
> > "Does a full bus reset using core Linux subsystem PCI reset and does a
> > secondary bus reset or FLR, depending on what the underlying hardware
> > supports."
> >
> > And that can be triggered by using `amdgpu_reset_method=5` as the module
> > option.
> >
> 
> That option doesn't actually do anything useful on most AMD GPUs.  We
> don't support FLR on most boards and SBR doesn't work once the driver
> has been loaded except for really old chips.  That said, internally
> these all end up being mode1 or mode2 resets which the driver can
> trigger directly and which are the defaults.

Okay, this is the same for us then.
And this is the main reason that we have this option:
- unbind + reset bus device + rebind

Unbind by itself needs to be a supported and working case regardless
of the reset state. Then this sequence should be fine.

AFAIK there's no way for the driver itself to request the bus
reset.
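
For illustration only, here is a rough sketch of how a userspace consumer
might turn the WEDGED=<method> value into the "unbind + reset bus device +
rebind" sequence. It assumes a PCI device and the standard PCI sysfs files
(drivers/<driver>/unbind, drivers/<driver>/bind, devices/<BDF>/reset). The
function names, the "xe" driver name and the BDF are hypothetical and not
part of the patch.

```python
from pathlib import PurePosixPath

# Assumption: the wedged device is a PCI device exposed under this sysfs root.
SYSFS_PCI = PurePosixPath("/sys/bus/pci")


def recovery_steps(method, driver, bdf):
    """Return the ordered (sysfs_path, value) writes for a WEDGED=<method> value.

    Only the two sysfs-based methods are handled here; "reboot" has no
    sysfs sequence and is rejected along with unknown values.
    """
    unbind = (str(SYSFS_PCI / "drivers" / driver / "unbind"), bdf)
    bind = (str(SYSFS_PCI / "drivers" / driver / "bind"), bdf)
    reset = (str(SYSFS_PCI / "devices" / bdf / "reset"), "1")

    if method == "rebind":
        return [unbind, bind]
    if method == "bus-reset":
        # The driver is unbound before the reset and bound again afterwards.
        return [unbind, reset, bind]
    raise ValueError(f"no sysfs sequence for recovery method: {method!r}")


def apply_steps(steps):
    """Perform the writes in order (requires root and a real device)."""
    for path, value in steps:
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    # Hypothetical driver name and PCI address, printed rather than applied.
    for path, value in recovery_steps("bus-reset", "xe", "0000:03:00.0"):
        print(f"echo {value} > {path}")
```

Keeping the step list separate from apply_steps() lets the sequence be
inspected or logged before anything is written to sysfs.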

> 
> Alex
