From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Vivekanandan, Balasubramani" <balasubramani.vivekanandan@intel.com>
Subject: Re: [PATCH v2] drm/xe/devcoredump: Defer devcoredump initialization during probe
Date: Mon, 28 Jul 2025 15:01:22 -0400
Message-ID: <aIfJAvs2OZ5oS2ce@intel.com>
In-Reply-To: <8ca6af6970ef166f66b6786f55156af876133ddd.camel@intel.com>
On Mon, Jul 28, 2025 at 01:56:07PM -0400, Summers, Stuart wrote:
> On Mon, 2025-07-28 at 14:17 +0530, Balasubramani Vivekanandan wrote:
> > Initializing devcoredump before the GT looks harmless, but it leads
> > to a problem during driver unbind. Because of this order, the GT/Engine
> > release functions are called before the xe devcoredump release function
> > (xe_driver_devcoredump_fini), leading to the following kernel crash [1],
> > because the devcoredump functions might still use GT/Engine
> > data structures after they are freed.
> >
> > The following crash is observed while running the IGT
> > xe_wedged@wedged-at-any-timeout. The test forces a wedged state by
> > submitting a workload which hangs, then does an unbind/rebind of the
> > driver to recover from the wedged state.
> > The hung workload leads to a devcoredump. The crash is
> > noticed when the devcoredump capture races with the driver unbind.
> > During driver unbind, the release function hw_engine_fini() is
> > called, which assigns NULL to hwe->gt. But the same data structure is
> > accessed during the coredump capture in the function
> > xe_engine_snapshot_print() by reading snapshot->hwe->gt.
> >
> > With this patch, we make sure the devcoredump is stopped before
> > deinitializing the core driver functions.
> >
> > [1]:
> > BUG: kernel NULL pointer dereference, address: 0000000000000000
> > Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe]
> > RIP: 0010:xe_engine_snapshot_print+0x47/0x420 [xe]
> > Call Trace:
> > <TASK>
> > ? drm_printf+0x64/0x90
> > __xe_devcoredump_read+0x23f/0x2d0 [xe]
> > ? __pfx___drm_printfn_coredump+0x10/0x10
> > ? __pfx___drm_puts_coredump+0x10/0x10
> > xe_devcoredump_deferred_snap_work+0x17a/0x190 [xe]
> > process_one_work+0x22e/0x6f0
> > worker_thread+0x1e8/0x3d0
> > ? __pfx_worker_thread+0x10/0x10
> > kthread+0x11f/0x250
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork+0x47/0x70
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork_asm+0x1a/0x30
> >
> > v2: Detailed commit description (Rodrigo)
Thanks for that. Now I can see the path, but I also agree with
Stuart below...
> >
> > Fixes: 4209d635a823 ("drm/xe: Remove devcoredump during driver
> > release")
> > Signed-off-by: Balasubramani Vivekanandan
> > <balasubramani.vivekanandan@intel.com>
>
> So I can see how this fixes the problem from your description and
> looking over the code. I thought generally though we were trying to
> decouple the devcoredump from the underlying structures.
> xe_engine_snapshot_print() is grabbing a lot of information from the GT
> at the time of the print rather than purely as a snapshot which doesn't
> seem right to me - we should be taking the snapshot at the time of the
> error and the print should just be relaying that info.
>
> So not that your change is bad, but I think it masks a problem we have
> in the implementation of that engine print. If we call
> xe_guc_capture_get_reg_desc_list() at the time of failure rather than
> from the print itself, do we still see the same problem?
Indeed, the real fix is to entirely decouple the capture from the read.
The capture should be done at snapshot time, and the read should not
depend on the GT. However, this might not be the only case, and we
probably need a quick fix for now.
Perhaps we go with this patch, but mark it with a FIXME comment and ensure
we have a gitlab issue + VLK opened for this work...
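
To make the "decouple the capture from the read" idea concrete, here is a
minimal userspace sketch. It is not the xe code: struct gt, struct hw_engine,
struct engine_snapshot and the helper names are hypothetical stand-ins, and
the only thing it tries to show is that the snapshot copies what it needs at
capture time, so the print path never follows a pointer back to the engine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the driver structures; not the real xe types. */
struct gt { int id; };
struct hw_engine { struct gt *gt; const char *name; };

/* The snapshot owns copies of everything the read path needs. */
struct engine_snapshot {
	char name[16];
	int gt_id;
};

/* Capture time: the engine and its GT are still alive, so copy now. */
static struct engine_snapshot *engine_snapshot_capture(const struct hw_engine *hwe)
{
	struct engine_snapshot *s;

	assert(hwe && hwe->gt);
	s = calloc(1, sizeof(*s));
	if (!s)
		return NULL;
	snprintf(s->name, sizeof(s->name), "%s", hwe->name);
	s->gt_id = hwe->gt->id;
	return s;
}

/* Read time: only the snapshot is touched, never hwe or hwe->gt. */
static int engine_snapshot_print(const struct engine_snapshot *s,
				 char *buf, size_t len)
{
	return snprintf(buf, len, "engine %s on GT%d", s->name, s->gt_id);
}

/* Capture, simulate the unbind-time teardown, then print from the copy. */
static int snapshot_demo(char *buf, size_t len)
{
	struct gt gt = { .id = 0 };
	struct hw_engine hwe = { .gt = &gt, .name = "rcs0" };
	struct engine_snapshot *s = engine_snapshot_capture(&hwe);
	int n;

	if (!s)
		return -1;
	hwe.gt = NULL;	/* models hw_engine_fini() clearing hwe->gt */
	n = engine_snapshot_print(s, buf, len);
	free(s);
	return n;
}
```

In this model the print still works after hwe.gt has been cleared, which is
the property we would want from the real read path.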
>
> Thanks,
> Stuart
>
> > ---
> > drivers/gpu/drm/xe/xe_device.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c
> > index d04a0ae018e6..ae48cd3c7bf0 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -821,10 +821,6 @@ int xe_device_probe(struct xe_device *xe)
> > return err;
> > }
> >
> > - err = xe_devcoredump_init(xe);
> > - if (err)
> > - return err;
> > -
> > /*
> > * From here on, if a step fails, make sure a Driver-FLR is
> > triggereed
> > */
> > @@ -889,6 +885,10 @@ int xe_device_probe(struct xe_device *xe)
> > XE_WA(xe->tiles->media_gt, 15015404425_disable))
> > XE_DEVICE_WA_DISABLE(xe, 15015404425);
> >
> > + err = xe_devcoredump_init(xe);
> > + if (err)
> > + return err;
> > +
> > xe_nvm_init(xe);
> >
> > err = xe_heci_gsc_init(xe);
>
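
For reference, a toy model of why moving the init call changes the unbind
order. It assumes the release callbacks run in reverse order of registration,
which is how devm-managed actions behave; whether every release in this path
is registered that way is an assumption here. struct release_stack,
add_action(), release_all() and probe_order() are made-up names for
illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ACTIONS 8

/* Toy release stack: actions run in reverse order of registration. */
struct release_stack {
	const char *names[MAX_ACTIONS];
	int count;
};

static void add_action(struct release_stack *rs, const char *name)
{
	assert(rs->count < MAX_ACTIONS);
	rs->names[rs->count++] = name;
}

/* Pop every action, appending its name to log, separated by spaces. */
static void release_all(struct release_stack *rs, char *log, size_t len)
{
	log[0] = '\0';
	while (rs->count > 0) {
		const char *name = rs->names[--rs->count];
		size_t used = strlen(log);

		snprintf(log + used, len - used, "%s%s", used ? " " : "", name);
	}
}

/* Register the two release actions in the given probe order, then unbind. */
static void probe_order(int devcoredump_first, char *log, size_t len)
{
	struct release_stack rs = { .count = 0 };

	if (devcoredump_first) {
		add_action(&rs, "devcoredump_fini");
		add_action(&rs, "hw_engine_fini");
	} else {
		add_action(&rs, "hw_engine_fini");
		add_action(&rs, "devcoredump_fini");
	}
	release_all(&rs, log, len);
}
```

With devcoredump registered first (the old order), hw_engine_fini runs before
devcoredump_fini at unbind; with the patch's order, devcoredump_fini runs
first, so the coredump is stopped before the engine state goes away.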