From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <intel-xe-bounces@lists.freedesktop.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id B8213D6E2DC
	for <intel-xe@archiver.kernel.org>; Thu, 18 Dec 2025 19:53:05 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 7885510E477;
	Thu, 18 Dec 2025 19:53:05 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
	dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="CUVjPkxH";
	dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 055E510E477
 for <intel-xe@lists.freedesktop.org>; Thu, 18 Dec 2025 19:53:03 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1766087584; x=1797623584;
 h=date:message-id:from:to:cc:subject:in-reply-to:
 references:mime-version;
 bh=sX8e0kTO1GBtY5aQdRE4l0OA+iopSPv3rIi/MoivuBY=;
 b=CUVjPkxHhxX85IK4qrQUW5YlvxxgMPqi4tmRJW/uAA3KMqS0QA0BiW5/
 MP2TNQNmkt3yxXZPJNEnHum5aLpZ4N5PMTGbEGUO2xcCsRIWioDEHVGRI
 RWCgTi8FyIlCeM2l+9Hg6u6TJLIuI1GHomp7xk/8dyDfptImKGKvSiZLf
 pzrQx9IHPNAVvXHBQ4rYyhxwdV57V0kXRzUkCQ8fCTOSYxs0TQcoxeXX5
 6fYcZV7MnfiOyegkpeO/ZrDynTuRR7JahWk4hG3nUc6TBPg5kuiVrTnWT
 npbi3AdrqYg3/e89W66qUiLNbCnDpl+BjwgCr6ncQ1QL9kpwrUrN/Bys6 Q==;
X-CSE-ConnectionGUID: 4PUZgw0lSQmcJnpqB8BB2w==
X-CSE-MsgGUID: pxl2woPWQ0+/J8a3uSbkeQ==
X-IronPort-AV: E=McAfee;i="6800,10657,11646"; a="68020751"
X-IronPort-AV: E=Sophos;i="6.21,159,1763452800"; d="scan'208";a="68020751"
Received: from fmviesa003.fm.intel.com ([10.60.135.143])
 by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 18 Dec 2025 11:53:04 -0800
X-CSE-ConnectionGUID: 7StFm9cUT7mmcf0IVddhKQ==
X-CSE-MsgGUID: NOqmZVD/TjaDy/jpfSkZeA==
X-ExtLoop1: 1
Received: from wrannila-mobl.amr.corp.intel.com (HELO adixit-MOBL3.intel.com)
 ([10.125.85.60])
 by fmviesa003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 18 Dec 2025 11:53:02 -0800
Date: Thu, 18 Dec 2025 11:53:02 -0800
Message-ID: <87fr97worl.wl-ashutosh.dixit@intel.com>
From: "Dixit, Ashutosh" <ashutosh.dixit@intel.com>
To: Harish Chegondi <harish.chegondi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
 Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Subject: Re: [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall
 registers get reset
In-Reply-To: <6d78578c015b12e7ae243727ca7ed4b93551075d.1765174462.git.harish.chegondi@intel.com>
References: <6d78578c015b12e7ae243727ca7ed4b93551075d.1765174462.git.harish.chegondi@intel.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue)
 FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0
 Emacs/30.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue")
Content-Type: text/plain; charset=US-ASCII
X-BeenThere: intel-xe@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel Xe graphics driver <intel-xe.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-xe>
List-Post: <mailto:intel-xe@lists.freedesktop.org>
List-Help: <mailto:intel-xe-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-xe>,
 <mailto:intel-xe-request@lists.freedesktop.org?subject=subscribe>
Errors-To: intel-xe-bounces@lists.freedesktop.org
Sender: "Intel-xe" <intel-xe-bounces@lists.freedesktop.org>

On Sun, 07 Dec 2025 22:16:11 -0800, Harish Chegondi wrote:
>

Hi Harish,

> If a reset (GT or engine) happens during EU stall data sampling, all the
> EU stall registers can get reset to 0. This will result in EU stall data
> buffers' read and write pointer register values to be out of sync with
> the cached values. This can result in read() returning invalid data. To
> prevent this, check the value of a EU stall base register. If it is zero,
> it indicates a reset may have happened that wiped the register to zero.
> If this happens, return EBADFD from read() upon which the user space
> should close the fd and open a new fd for a new EU stall data
> collection session.
>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_eu_stall.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> index 97dfb7945b7a..02c0beb4559f 100644
> --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> @@ -541,9 +541,24 @@ static ssize_t xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *st
>	size_t total_size = 0;
>	u16 group, instance;
>	unsigned int xecore;
> +	u32 base_reg_value;
>	int ret = 0;
>
>	mutex_lock(&stream->xecore_buf_lock);
> +	/* If a GT or engine reset happens during EU stall data sampling,
> +	 * all EU stall registers get reset to 0 and the cached values of
> +	 * EU stall data buffers' read and write pointers are out of sync
> +	 * with the register values. This can cause invalid data to be
> +	 * returned from read(). To prevent this, check the value of a
> +	 * EU stall base register. If it is zero, return -EBADFD. The
> +	 * user is expected to close the fd and open a new fd.
> +	 */
> +	base_reg_value = xe_gt_mcr_unicast_read_any(gt, XEHPC_EUSTALL_BASE);
> +	if (unlikely(!base_reg_value)) {
> +		xe_gt_dbg(gt, "EU stall base register has been reset to 0\n");
> +		mutex_unlock(&stream->xecore_buf_lock);
> +		return -EBADFD;
> +	}

So I am seeing two problems here:

1. We are doing a register read on every read() call, rather than only when
   a reset happens.

2. The other issue is whether a reset should itself unblock a blocked poll()
   or blocking read() call. If it does not, poll() or a blocking read() can
   remain blocked indefinitely, so either the non-blocking read() never gets
   called or the blocking read() never returns. In both cases we never
   actually return -EBADFD even though a reset has happened.

   (Note that, for exec(), I believe any blocked fences will unblock and
   return error etc. if a reset happens during an exec() call (see
   reset_status()), so EU stall should probably do something similar).
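
To illustrate what user space would need to do once the error is actually
delivered, here is a rough sketch of how a consumer might classify the
result of read() on the stream fd. The enum and the function name are made
up for illustration, and only the EBADFD case (close the fd and open a new
one) comes from the patch description; how the other errno values are
treated is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

#ifndef EBADFD
#define EBADFD 77 /* Linux value; fallback so the sketch builds elsewhere */
#endif

/* Hypothetical actions a consumer loop could take after read(). */
enum stream_action {
	STREAM_CONSUME,     /* n bytes of EU stall data are available */
	STREAM_RETRY,       /* transient condition, go back to poll() */
	STREAM_REOPEN,      /* reset detected: close the fd, open a new one */
	STREAM_OTHER_ERROR, /* anything else is left to the caller */
};

/* Map the return value of read() and the saved errno to an action. */
static enum stream_action classify_read(ssize_t n, int err)
{
	if (n >= 0)
		return STREAM_CONSUME;

	/* A failed read() must have set errno to a positive value. */
	assert(err > 0);

	switch (err) {
	case EBADFD:
		return STREAM_REOPEN;
	case EAGAIN:
	case EINTR:
		return STREAM_RETRY;
	default:
		return STREAM_OTHER_ERROR;
	}
}
```

A consumer would call classify_read(n, errno) right after each read(). The
point is that STREAM_REOPEN can only ever be reached if the reset wakes up
poll() or the blocking read() in the first place.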

So to address these two issues, how about doing something like this:

1. Call an EU stall callback from xe_guc_exec_queue_reset_handler(). In the
   callback, if an EU stall stream is open on that gt, check if
   XEHPC_EUSTALL_BASE is 0 and set a stream variable stream->reset under a
   suitable lock (likely xecore_buf_lock).

2. From eu_stall_data_buf_poll(), if stream->reset is set, return true to
   wake up any waiters. We may also need to set POLLERR or POLLHUP in
   revents.

3. Now from read(), if stream->reset is set, return -EBADFD.
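
To make the three steps above a bit more concrete, here is a small
user-space model of the flow. It is only a sketch and not driver code:
struct model_stream and the model_*() functions are hypothetical stand-ins,
base_reg stands in for the value read from XEHPC_EUSTALL_BASE, and locking
(likely xecore_buf_lock in the real driver) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADFD
#define EBADFD 77 /* Linux value; fallback so the sketch builds elsewhere */
#endif

/* Hypothetical model of an open EU stall stream. */
struct model_stream {
	uint32_t base_reg;  /* stands in for XEHPC_EUSTALL_BASE */
	size_t bytes_ready; /* data currently available to read */
	bool reset;         /* sticky flag, set once a reset is detected */
};

/* Step 1: called from the reset handler. The register is checked only
 * here, not on every read().
 */
static void model_reset_callback(struct model_stream *s)
{
	assert(s);
	if (s->base_reg == 0)
		s->reset = true;
}

/* Step 2: wait condition for poll() and blocking read(). A reset wakes
 * up waiters even when no data is available.
 */
static bool model_poll_ready(const struct model_stream *s)
{
	assert(s);
	return s->reset || s->bytes_ready > 0;
}

/* Step 3: read() fails with -EBADFD once the flag is set, otherwise it
 * consumes up to count bytes of pending data.
 */
static long model_read(struct model_stream *s, size_t count)
{
	size_t n;

	assert(s);
	if (s->reset)
		return -EBADFD;

	n = count < s->bytes_ready ? count : s->bytes_ready;
	s->bytes_ready -= n;
	return (long)n;
}
```

In this model a waiter with no data pending becomes ready as soon as
model_reset_callback() sees a zero base register, and the next model_read()
returns -EBADFD, which is the behavior asked for in points 2 and 3.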

So I think something like this solves both problems mentioned above.

So could you please look into this and see if this is possible? Or any
other thoughts about this?

Thanks.
--
Ashutosh

>	if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) {
>		if (!stream->data_drop.reported_to_user) {
>			stream->data_drop.reported_to_user = true;
> --
> 2.43.0
>