Re: [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall registers get reset

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Dixit, Ashutosh" <ashutosh.dixit@intel.com>
To: Harish Chegondi <harish.chegondi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Subject: Re: [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall registers get reset
Date: Tue, 23 Dec 2025 17:47:58 -0800	[thread overview]
Message-ID: <874ipgmz01.wl-ashutosh.dixit@intel.com> (raw)
In-Reply-To: <aUsoQy89xwWAAkmM@intel.com>

On Tue, 23 Dec 2025 15:39:47 -0800, Harish Chegondi wrote:
>
> On Mon, Dec 22, 2025 at 09:08:04PM -0800, Dixit, Ashutosh wrote:
> > On Mon, 22 Dec 2025 14:37:53 -0800, Harish Chegondi wrote:
> > >
> > > On Thu, Dec 18, 2025 at 11:53:02AM -0800, Dixit, Ashutosh wrote:
> > > > On Sun, 07 Dec 2025 22:16:11 -0800, Harish Chegondi wrote:
> > > > >
> > > >
> > > Hi Ashutosh,
> > > > Hi Harish,
> > > >
> > > > > If a reset (GT or engine) happens during EU stall data sampling, all the
> > > > > EU stall registers can get reset to 0. This will result in EU stall data
> > > > > buffers' read and write pointer register values to be out of sync with
> > > > > the cached values. This can result in read() returning invalid data. To
> > > > > prevent this, check the value of a EU stall base register. If it is zero,
> > > > > it indicates a reset may have happened that wiped the register to zero.
> > > > > If this happens, return EBADFD from read() upon which the user space
> > > > > should close the fd and open a new fd for a new EU stall data
> > > > > collection session.
> > > > >
> > > > > Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> > > > > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > > > > Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_eu_stall.c | 15 +++++++++++++++
> > > > >  1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > > index 97dfb7945b7a..02c0beb4559f 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > > @@ -541,9 +541,24 @@ static ssize_t xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *st
> > > > >	size_t total_size = 0;
> > > > >	u16 group, instance;
> > > > >	unsigned int xecore;
> > > > > +	u32 base_reg_value;
> > > > >	int ret = 0;
> > > > >
> > > > >	mutex_lock(&stream->xecore_buf_lock);
> > > > > +	/* If a GT or engine reset happens during EU stall data sampling,
> > > > > +	 * all EU stall registers get reset to 0 and the cached values of
> > > > > +	 * EU stall data buffers' read and write pointers are out of sync
> > > > > +	 * with the register values. This can cause invalid data to be
> > > > > +	 * returned from read(). To prevent this, check the value of a
> > > > > +	 * EU stall base register. If it is zero, return -EBADFD. The
> > > > > +	 * user is expected to close the fd and open a new fd.
> > > > > +	 */
> > > > > +	base_reg_value = xe_gt_mcr_unicast_read_any(gt, XEHPC_EUSTALL_BASE);
> > > > > +	if (unlikely(!base_reg_value)) {
> > > > > +		xe_gt_dbg(gt, "EU stall base register has been reset to 0\n");
> > > > > +		mutex_unlock(&stream->xecore_buf_lock);
> > > > > +		return -EBADFD;
> > > > > +	}
> > > >
> > > > So I am seeing two problems here:
> > > >
> > > > 1. We are doing register read every read() call, rather than just when a
> > > >    reset happens.
> > > >
> > > > 2. The other issue is should reset itself unblock a blocked poll() or
> > > >    blocking read() call? If we don't do that, it is possible that poll()
> > > >    or blocking read() remains blocked indefinitely and so either the
> > > >    non-blocking read() doesn't get called at all, or a blocking read()
> > > >    remains indefinitely blocked. So that we never actually return -EBADFD
> > > >    even though a reset has happened.
> > > >
> > > >    (Note that, for exec(), I believe any blocked fences will unblock and
> > > >    return error etc. if a reset happens during an exec() call (see
> > > >    reset_status()), so EU stall should probably do something similar).
> > > >
> > > > So to address these two issues how about doing something like this:
> > > >
> > > > 1. Call an EU stall callback from xe_guc_exec_queue_reset_handler(). In the
> > > >    callback, if an EU stall stream is open on that gt, check if
> > > >    XEHPC_EUSTALL_BASE is 0 and set a stream variable stream->reset under a
> > > >    suitable lock (likely xecore_buf_lock).
> > >
> > > I thought about this and I think this can be racy - if read() is called
> > > after the engine and the EU stall registers got reset but before the
> > > stream->reset is set, the read() would return bad EU stall data. I think
> > > the XEHPC_EUSTALL_BASE register should be checked either in the polling
> > > thread or in read (as in this patch) to avoid any race conditions.
> >
> > In that case we need to ask GuC team about "pre context reset" g2h/h2g
> > messages/callbacks.
> >
> > Is it possible to deduce that reset has happened from the write_ptr
> > register which is read in eu_stall_data_buf_poll? Say what happens to the
> > mask bits when reset occurs (why are the mask bits there in the first place
> > in a register which cannot be written by SW)? Then we can use that to set
> > stream->reset, and not have to introduce an extra register read.
> The write_ptr gets reset to 0, but it happens when the buffer wraps
> around too.

Yeah, I know, that's why I said check the mask bits.

> I am not sure about the mask bits, but I can check what
> happens to them after a reset.
> >
> > In a sense read() cannot return bad data since it uses cached read/wrt
> > ptr's. So as long as we catch the reset while updating write_ptr, we should
> > be ok.
> >
> > Or, can we can drop the constraint that read() will never return bad
> > data. If a reset happens, read() can return bad data.
> >
> > Or do you have any other ideas to address the issues I mentioned above?
>
> You earlier suggested about checking the base register in the polling
> thread. If we set the stream->reset in the polling thread, it can be
> used in both read() and poll(). Do you think this would address the
> issues you mentioned above ?

Let's first check write_ptr. If it works out we can use that. Otherwise my
preference I think is to avoid introducing an additional register read
anywhere. What we could do is hook into the context reset callback and if
context reset occurs we just unblock any blocked threads, followed by
returning -EBADFD from read(), even if we return bad data. We revisit if
UMD's insist bad data is unacceptable (even in case of a reset).

>
> >
> > >
> > > >
> > > > 2. From eu_stall_data_buf_poll(), if stream->reset is set, return true to
> > > >    wake up any waiters. We may also need to set POLLERR or POLLHUP revents.
> > > >
> > > > 3. Now from read(), if stream->reset is set return -EBADFD.
> > > >
> > > > So I think something like this solves both problems mentioned above.
> > > >
> > > > So could you please look into this and see if this is possible? Or any
> > > > other thoughts about this?
> > > >
> > > > Thanks.
> > > > --
> > > > Ashutosh
> > > Thank You
> > > Harish.
> > > >
> > > > >	if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) {
> > > > >		if (!stream->data_drop.reported_to_user) {
> > > > >			stream->data_drop.reported_to_user = true;
> > > > > --
> > > > > 2.43.0
> > > > >

next prev parent reply	other threads:[~2025-12-24  1:48 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-08  6:16 [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall registers get reset Harish Chegondi
2025-12-08  6:32 ` ✓ CI.KUnit: success for series starting with [1/1] " Patchwork
2025-12-08  7:56 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-08  8:48 ` ✗ Xe.CI.Full: failure " Patchwork
2025-12-12 21:18 ` [PATCH 1/1] " Dixit, Ashutosh
2025-12-16 23:53   ` Harish Chegondi
2025-12-18 19:53 ` Dixit, Ashutosh
2025-12-22 22:37   ` Harish Chegondi
2025-12-23  5:08     ` Dixit, Ashutosh
2025-12-23 23:39       ` Harish Chegondi
2025-12-24  1:47         ` Dixit, Ashutosh [this message]
  -- strict thread matches above, loose matches on Subject: below --
2025-10-01  6:38 Harish Chegondi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874ipgmz01.wl-ashutosh.dixit@intel.com \
    --to=ashutosh.dixit@intel.com \
    --cc=harish.chegondi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=umesh.nerlige.ramappa@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox