From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <intel-xe-bounces@lists.freedesktop.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id E8B79FCD0C3
	for <intel-xe@archiver.kernel.org>; Wed, 18 Mar 2026 06:57:35 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 8113D10E069;
	Wed, 18 Mar 2026 06:57:35 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
	dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="GHIIJwr3";
	dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 7ED4910E069
 for <intel-xe@lists.freedesktop.org>; Wed, 18 Mar 2026 06:57:34 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1773817055; x=1805353055;
 h=date:message-id:from:to:cc:subject:in-reply-to:
 references:mime-version;
 bh=1BdpBN7n3hCu27Vz91uzdKiKi7YFf9td3r4cOKOIZLg=;
 b=GHIIJwr3QTiTDSgGeEetTZ4ona6gDww9UPeqrZ+V1h/m5xYB8Bglw65W
 DRD2zMvzl6DkrcWJJKLYo6cMm+a58gYPgALqWv44kmivZXZMVi0FPVvnv
 ZbQaQUqecbA6S9G9CZ77a6Kqmrvh0Oxcoymtfrp+CcQI8ScyZZWZjnMLw
 5YleFpvrywsjiPWft68wIBEQp7YV4av0euiJHKZBG7I/dTTz8zMnp0z44
 Ury5SujDAT53gIF1FQls+/M3gfyL1+0Mw1nLC+S1czRpldFWhi0ALnCq0
 3OwmpOvo2mVR8/tzA6FXFIqka6ICkF+Uw1DrrBGDaVMvnw8th2ZSLfeRR w==;
X-CSE-ConnectionGUID: knmcT68PQaqX+/5cIKJ41Q==
X-CSE-MsgGUID: fo7hAOlaRXWdbks0/7ughw==
X-IronPort-AV: E=McAfee;i="6800,10657,11732"; a="78473864"
X-IronPort-AV: E=Sophos;i="6.23,127,1770624000"; d="scan'208";a="78473864"
Received: from orviesa004.jf.intel.com ([10.64.159.144])
 by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 17 Mar 2026 23:57:34 -0700
X-CSE-ConnectionGUID: 7ADtsFxnRaej8bRj0qSkJg==
X-CSE-MsgGUID: C9gJNDvERt6GesO55fZwyQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,127,1770624000"; d="scan'208";a="226987417"
Received: from chungjes-mobl1.amr.corp.intel.com (HELO adixit-MOBL3.intel.com)
 ([10.125.64.16])
 by orviesa004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 17 Mar 2026 23:57:34 -0700
Date: Tue, 17 Mar 2026 23:57:32 -0700
Message-ID: <87se9xpqub.wl-ashutosh.dixit@intel.com>
From: "Dixit, Ashutosh" <ashutosh.dixit@intel.com>
To: Harish Chegondi <harish.chegondi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <felix.j.degrood@intel.com>,
 <matias.a.cabral@intel.com>, <joshua.santosh.ranjan@intel.com>
Subject: Re: [PATCH v2 1/1] drm/xe/eustall: Return EBADFD from read if EU
 stall registers get reset
In-Reply-To: <52d991cc7e8bec514bb582717a1c42033672d4a5.1773683739.git.harish.chegondi@intel.com>
References: <52d991cc7e8bec514bb582717a1c42033672d4a5.1773683739.git.harish.chegondi@intel.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue)
 FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0
 Emacs/30.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue")
Content-Type: text/plain; charset=US-ASCII
X-BeenThere: intel-xe@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel Xe graphics driver <intel-xe.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-xe>
List-Post: <mailto:intel-xe@lists.freedesktop.org>
List-Help: <mailto:intel-xe-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=subscribe>
Errors-To: intel-xe-bounces@lists.freedesktop.org
Sender: "Intel-xe" <intel-xe-bounces@lists.freedesktop.org>

On Mon, 16 Mar 2026 10:58:56 -0700, Harish Chegondi wrote:
>
> If a reset (GT or engine) happens during EU stall data sampling, all the
> EU stall registers can get reset to 0. This will result in EU stall data
> buffers' read and write pointer register values to be out of sync with
> the cached values. This will result in read() returning invalid data. To
> prevent this, check the value of a EU stall base register. If it is zero,
> it indicates a reset may have happened that wiped the register to zero.
> If this happens, return EBADFD from read() upon which the user space
> should close the fd and open a new fd for a new EU stall data
> collection session.
>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
> ---
> v2: Move base register check from read to the poll function
>
>  drivers/gpu/drm/xe/xe_eu_stall.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> index c34408cfd292..7e14de73a2c9 100644
> --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> @@ -44,6 +44,7 @@ struct per_xecore_buf {
>  struct xe_eu_stall_data_stream {
>	bool pollin;
>	bool enabled;
> +	bool reset_detected;
>	int wait_num_reports;
>	int sampling_rate_mult;
>	wait_queue_head_t poll_wq;
> @@ -428,6 +429,17 @@ static bool eu_stall_data_buf_poll(struct xe_eu_stall_data_stream *stream)
>			set_bit(xecore, stream->data_drop.mask);
>		xecore_buf->write = write_ptr;
>	}
> +	/* If a GT or engine reset happens during EU stall sampling,
> +	 * all EU stall registers get reset to 0 and the cached values of
> +	 * the EU stall data buffers' read pointers are out of sync with
> +	 * the register values. This causes invalid data to be returned
> +	 * from read(). To prevent this, check the value of a EU stall base
> +	 * register. If it is zero, there has been a reset.
> +	 */

As previously discussed, the best way would have been to not have to do
this. We would just plug into the handler for the reset message from GuC,
rather than to implement a reset detection here (and in other places such
as OA). But looks like if we do that, because of the way EUSS registers are
reset, we can return bad EUSS data. So looks like there is no way around
doing this "reset detection" here and a solution with the GuC reset handler
would always be racy. Just for the record.

> +	if (unlikely(!xe_gt_mcr_unicast_read_any(gt, XEHPC_EUSTALL_BASE))) {
> +		stream->reset_detected = true;
> +		min_data_present = true;

I don't believe we need to set 'min_data_present = true' if we are setting
'stream->reset_detected = true', correct? See if statement at the bottom.

Also, since the write pointer itself gets reset during reset, didn't we
want to do this register read only when the write pointer is 0 (to avoid an
extra register read every 5 ms)?

> +	}
>	mutex_unlock(&stream->xecore_buf_lock);
>
>	return min_data_present;
> @@ -554,6 +566,15 @@ static ssize_t xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *st
>		}
>		stream->data_drop.reported_to_user = false;
>	}
> +	/* If EU stall registers got reset due to a GT/engine reset,
> +	 * continuing with the read() will return invalid data to
> +	 * the user space. Just return -EBADFD instead.
> +	 */
> +	if (unlikely(stream->reset_detected)) {
> +		xe_gt_dbg(gt, "EU stall base register has been reset\n");
> +		mutex_unlock(&stream->xecore_buf_lock);
> +		return -EBADFD;

The other option is to return -EIO here and implement
DRM_XE_OBSERVATION_IOCTL_STATUS and return status from that. Let me think
some more about this.

> +	}
>
>	for_each_dss_steering(xecore, gt, group, instance) {
>		ret = xe_eu_stall_data_buf_read(stream, buf, count, &total_size,
> @@ -692,6 +713,7 @@ static int xe_eu_stall_stream_enable(struct xe_eu_stall_data_stream *stream)
>		xecore_buf->write = write_ptr;
>		xecore_buf->read = write_ptr;
>	}
> +	stream->reset_detected = false;

So after reset, if a stream is disabled and re-enabled, we expect things to
work again and EUSS data to be correct (without re-opening a new stream)?

>	stream->data_drop.reported_to_user = false;
>	bitmap_zero(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS);
>
> @@ -717,7 +739,7 @@ static void eu_stall_data_buf_poll_work_fn(struct work_struct *work)
>		container_of(work, typeof(*stream), buf_poll_work.work);
>	struct xe_gt *gt = stream->gt;
>
> -	if (eu_stall_data_buf_poll(stream)) {
> +	if (stream->reset_detected || eu_stall_data_buf_poll(stream)) {
>		stream->pollin = true;
>		wake_up(&stream->poll_wq);
>	}
> --
> 2.43.0
>