From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Vivekanandan, Balasubramani" <balasubramani.vivekanandan@intel.com>
Cc: "Summers, Stuart" <stuart.summers@intel.com>,
"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2] drm/xe/devcoredump: Defer devcoredump initialization during probe
Date: Tue, 29 Jul 2025 10:37:31 -0400 [thread overview]
Message-ID: <aIjcq41Hb1nUq3Wg@intel.com> (raw)
In-Reply-To: <aIiBrqrmcs8eXB2-@bvivekan-mob1>
On Tue, Jul 29, 2025 at 01:39:18PM +0530, Vivekanandan, Balasubramani wrote:
> On 28.07.2025 15:01, Rodrigo Vivi wrote:
> > On Mon, Jul 28, 2025 at 01:56:07PM -0400, Summers, Stuart wrote:
> > > On Mon, 2025-07-28 at 14:17 +0530, Balasubramani Vivekanandan wrote:
> > > > Doing devcoredump initializing before GT though look harmless, it
> > > > leads
> > > > to problem during driver unbind. Because of this order, GT/Engine
> > > > release functions will be called before xe devcoredump release
> > > > function
> > > > (xe_driver_devcoredump_fini) leading to the following kernel crash[1]
> > > > because the devcoredump functions might still use GT/Engine
> > > > datastructures after those are freed.
> > > >
> > > > The following crash is observed while running the IGT
> > > > xe_wedged@wedged-at-any-timeout. The test forces a wedged state by
> > > > submitting a worload which hangs. Then does a unbind/rebind of the
> > > > driver to recover from the wedged state.
> > > > The hanged worload leads to a devcoredump. The following crash is
> > > > noticed when the devcoredump capture races with the driver unbind.
> > > > During driver unbind, the release function hw_engine_fini() will be
> > > > called which assigns NULL to hwe->gt. But the same data structure is
> > > > accessed during the coredump capture in the function
> > > > xe_engine_snapshot_print by reading snapshot->hwe->gt.
> > > >
> > > > With this patch, we make sure the devcoredump is stopped before
> > > > deinitializing the core driver functions.
> > > >
> > > > [1]:
> > > > BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe]
> > > > RIP: 0010:xe_engine_snapshot_print+0x47/0x420 [xe]
> > > > Call Trace:
> > > > <TASK>
> > > > ? drm_printf+0x64/0x90
> > > > __xe_devcoredump_read+0x23f/0x2d0 [xe]
> > > > ? __pfx___drm_printfn_coredump+0x10/0x10
> > > > ? __pfx___drm_puts_coredump+0x10/0x10
> > > > xe_devcoredump_deferred_snap_work+0x17a/0x190 [xe]
> > > > process_one_work+0x22e/0x6f0
> > > > worker_thread+0x1e8/0x3d0
> > > > ? __pfx_worker_thread+0x10/0x10
> > > > kthread+0x11f/0x250
> > > > ? __pfx_kthread+0x10/0x10
> > > > ret_from_fork+0x47/0x70
> > > > ? __pfx_kthread+0x10/0x10
> > > > ret_from_fork_asm+0x1a/0x30
> > > >
> > > > v2: Detailed commit description (Rodrigo)
> >
> > Thanks for that, now I could see the path, but now I agree with
> > Stuart below...
> >
> > > >
> > > > Fixes: 4209d635a823 ("drm/xe: Remove devcoredump during driver
> > > > release")
> > > > Signed-off-by: Balasubramani Vivekanandan
> > > > <balasubramani.vivekanandan@intel.com>
> > >
> > > So I can see how this fixes the problem from your description and
> > > looking over the code. I thought generally though we were trying to
> > > decouple the devcoredump from the underlying structures.
> > > xe_engine_snapshot_print() is grabbing a lot of information from the GT
> > > at the time of the print rather than purely as a snapshot which doesn't
> > > seem right to me - we should be taking the snapshot at the time of the
> > > error and the print should just be relaying that info.
> > >
> > > So not that your change is bad, but I think it masks a problem we have
> > > in the implementation of that engine print. If we call
> > > xe_guc_capture_get_reg_desc_list() at the time of failure rather than
> > > from the print itself, do we still see the same problem?
> >
> > Indeed the real fix is to entirely decouple the capture from the read.
> > capture should be done at the snapshot time.
> > Read should not depend on the gt. Although this might not be the only
> > case and we probably need some quick fix for now.
> >
> > Perhaps we go with this patch, but mark as a FIXME comment and ensure
> > we have a gitlab/issue + VLK opened for this work...
>
> I have created a VLK to track the requested change. I didn't have
> permission to create a gitlab issue. I have applied for access.
>
> I believe we should have this patch to fix the order of
> initialization/release of the devcoredump. Looking for r-b if there are
> no other comments.
Please add a big FIXME comment near the xe_guc_capture_get_reg_desc_list
stating what needs to be done
>
> Regards,
> Bala
>
> >
> > >
> > > Thanks,
> > > Stuart
> > >
> > > > ---
> > > > drivers/gpu/drm/xe/xe_device.c | 8 ++++----
> > > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > > b/drivers/gpu/drm/xe/xe_device.c
> > > > index d04a0ae018e6..ae48cd3c7bf0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > @@ -821,10 +821,6 @@ int xe_device_probe(struct xe_device *xe)
> > > > return err;
> > > > }
> > > >
> > > > - err = xe_devcoredump_init(xe);
> > > > - if (err)
> > > > - return err;
> > > > -
> > > > /*
> > > > * From here on, if a step fails, make sure a Driver-FLR is
> > > > triggereed
> > > > */
> > > > @@ -889,6 +885,10 @@ int xe_device_probe(struct xe_device *xe)
> > > > XE_WA(xe->tiles->media_gt, 15015404425_disable))
> > > > XE_DEVICE_WA_DISABLE(xe, 15015404425);
> > > >
> > > > + err = xe_devcoredump_init(xe);
> > > > + if (err)
> > > > + return err;
> > > > +
> > > > xe_nvm_init(xe);
> > > >
> > > > err = xe_heci_gsc_init(xe);
> > >
next prev parent reply other threads:[~2025-07-29 14:38 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-28 8:47 [PATCH v2] drm/xe/devcoredump: Defer devcoredump initialization during probe Balasubramani Vivekanandan
2025-07-28 17:07 ` ✓ CI.KUnit: success for drm/xe/devcoredump: Defer devcoredump initialization during probe (rev3) Patchwork
2025-07-28 17:56 ` [PATCH v2] drm/xe/devcoredump: Defer devcoredump initialization during probe Summers, Stuart
2025-07-28 19:01 ` Rodrigo Vivi
2025-07-29 8:09 ` Vivekanandan, Balasubramani
2025-07-29 14:37 ` Rodrigo Vivi [this message]
2025-07-28 17:57 ` ✓ Xe.CI.BAT: success for drm/xe/devcoredump: Defer devcoredump initialization during probe (rev3) Patchwork
2025-07-28 19:57 ` ✗ Xe.CI.Full: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aIjcq41Hb1nUq3Wg@intel.com \
--to=rodrigo.vivi@intel.com \
--cc=balasubramani.vivekanandan@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=stuart.summers@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox