From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 43BACE6B261 for ; Tue, 23 Dec 2025 05:08:08 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C976210E047; Tue, 23 Dec 2025 05:08:07 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="VtqDkpZw"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.16]) by gabe.freedesktop.org (Postfix) with ESMTPS id D389510E047 for ; Tue, 23 Dec 2025 05:08:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1766466486; x=1798002486; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version; bh=yOBG8HwcgjuYF0TsNrf8m6wsDKrJq6+jIlh282WLQQQ=; b=VtqDkpZwHXRmQZaytYdHGFt/JUauyNIAldk9k+BNAm3VrWwMFQODOR6r LrA+EolQGdQ6+iQgJhMixpmILgxPPpFC3RQcTzAxw08PA4qjXg0n/o+NK feseGWzgd6SknP4bH5uXY+sQy/wK5GQVk4TnkJvmn1IDTBRcVr25zajJf sVeO9CJEWNzCWKh+EOZwQFv62im/4q4pyPLMaoYm5NZXLTsiR/q30TMnA dxZS3Ze06d+vBWC4B2n3xQnpspiQj4f54MkrAZmxRbFEPWH+8rm5Yo5kI opIhoK/oe1yhIQf6tBro/mGY94UMNSEPqF7iBfRUu1HA9yJZ+vk63vBwm Q==; X-CSE-ConnectionGUID: uoToeJVLT76hFrGgVf1s2A== X-CSE-MsgGUID: IZlKJ+D0RhmJDx/DzdwevQ== X-IronPort-AV: E=McAfee;i="6800,10657,11650"; a="68473173" X-IronPort-AV: E=Sophos;i="6.21,170,1763452800"; d="scan'208";a="68473173" Received: from orviesa001.jf.intel.com ([10.64.159.141]) by orvoesa108.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 22 Dec 2025 21:08:05 -0800 X-CSE-ConnectionGUID: I8Y2OnGKSiiq3nDXo/AEMQ== X-CSE-MsgGUID: hztBsgi9RqSHbBLiZjXzNA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,170,1763452800"; d="scan'208";a="237102512" Received: from blsantia-mobl1.amr.corp.intel.com (HELO adixit-MOBL3.intel.com) ([10.125.115.214]) by smtpauth.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 22 Dec 2025 21:08:05 -0800 Date: Mon, 22 Dec 2025 21:08:04 -0800 Message-ID: <87ecolwzt7.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: Harish Chegondi Cc: , Umesh Nerlige Ramappa Subject: Re: [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall registers get reset In-Reply-To: References: <6d78578c015b12e7ae243727ca7ed4b93551075d.1765174462.git.harish.chegondi@intel.com> <87fr97worl.wl-ashutosh.dixit@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/30.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Mon, 22 Dec 2025 14:37:53 -0800, Harish Chegondi wrote: > > On Thu, Dec 18, 2025 at 11:53:02AM -0800, Dixit, Ashutosh wrote: > > On Sun, 07 Dec 2025 22:16:11 -0800, Harish Chegondi wrote: > > > > > > Hi Ashutosh, > > Hi Harish, > > > > > If a reset (GT or engine) happens during EU stall data sampling, all the > > > EU stall registers can get reset to 0. This will result in EU stall data > > > buffers' read and write pointer register values to be out of sync with > > > the cached values. This can result in read() returning invalid data. To > > > prevent this, check the value of a EU stall base register. If it is zero, > > > it indicates a reset may have happened that wiped the register to zero. > > > If this happens, return EBADFD from read() upon which the user space > > > should close the fd and open a new fd for a new EU stall data > > > collection session. > > > > > > Cc: Ashutosh Dixit > > > Cc: Umesh Nerlige Ramappa > > > Signed-off-by: Harish Chegondi > > > --- > > > drivers/gpu/drm/xe/xe_eu_stall.c | 15 +++++++++++++++ > > > 1 file changed, 15 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c > > > index 97dfb7945b7a..02c0beb4559f 100644 > > > --- a/drivers/gpu/drm/xe/xe_eu_stall.c > > > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c > > > @@ -541,9 +541,24 @@ static ssize_t xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *st > > > size_t total_size = 0; > > > u16 group, instance; > > > unsigned int xecore; > > > + u32 base_reg_value; > > > int ret = 0; > > > > > > mutex_lock(&stream->xecore_buf_lock); > > > + /* If a GT or engine reset happens during EU stall data sampling, > > > + * all EU stall registers get reset to 0 and the cached values of > > > + * EU stall data buffers' read and write pointers are out of sync > > > + * with the register values. This can cause invalid data to be > > > + * returned from read(). To prevent this, check the value of a > > > + * EU stall base register. If it is zero, return -EBADFD. The > > > + * user is expected to close the fd and open a new fd. > > > + */ > > > + base_reg_value = xe_gt_mcr_unicast_read_any(gt, XEHPC_EUSTALL_BASE); > > > + if (unlikely(!base_reg_value)) { > > > + xe_gt_dbg(gt, "EU stall base register has been reset to 0\n"); > > > + mutex_unlock(&stream->xecore_buf_lock); > > > + return -EBADFD; > > > + } > > > > So I am seeing two problems here: > > > > 1. We are doing register read every read() call, rather than just when a > > reset happens. > > > > 2. The other issue is should reset itself unblock a blocked poll() or > > blocking read() call? If we don't do that, it is possible that poll() > > or blocking read() remains blocked indefinitely and so either the > > non-blocking read() doesn't get called at all, or a blocking read() > > remains indefinitely blocked. So that we never actually return -EBADFD > > even though a reset has happened. > > > > (Note that, for exec(), I believe any blocked fences will unblock and > > return error etc. if a reset happens during an exec() call (see > > reset_status()), so EU stall should probably do something similar). > > > > So to address these two issues how about doing something like this: > > > > 1. Call an EU stall callback from xe_guc_exec_queue_reset_handler(). In the > > callback, if an EU stall stream is open on that gt, check if > > XEHPC_EUSTALL_BASE is 0 and set a stream variable stream->reset under a > > suitable lock (likely xecore_buf_lock). > > I thought about this and I think this can be racy - if read() is called > after the engine and the EU stall registers got reset but before the > stream->reset is set, the read() would return bad EU stall data. I think > the XEHPC_EUSTALL_BASE register should be checked either in the polling > thread or in read (as in this patch) to avoid any race conditions. In that case we need to ask GuC team about "pre context reset" g2h/h2g messages/callbacks. Is it possible to deduce that reset has happened from the write_ptr register which is read in eu_stall_data_buf_poll? Say what happens to the mask bits when reset occurs (why are the mask bits there in the first place in a register which cannot be written by SW)? Then we can use that to set stream->reset, and not have to introduce an extra register read. In a sense read() cannot return bad data since it uses cached read/wrt ptr's. So as long as we catch the reset while updating write_ptr, we should be ok. Or, can we can drop the constraint that read() will never return bad data. If a reset happens, read() can return bad data. Or do you have any other ideas to address the issues I mentioned above? > > > > > 2. From eu_stall_data_buf_poll(), if stream->reset is set, return true to > > wake up any waiters. We may also need to set POLLERR or POLLHUP revents. > > > > 3. Now from read(), if stream->reset is set return -EBADFD. > > > > So I think something like this solves both problems mentioned above. > > > > So could you please look into this and see if this is possible? Or any > > other thoughts about this? > > > > Thanks. > > -- > > Ashutosh > Thank You > Harish. > > > > > if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) { > > > if (!stream->data_drop.reported_to_user) { > > > stream->data_drop.reported_to_user = true; > > > -- > > > 2.43.0 > > >