From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 984CAE74900 for ; Wed, 24 Dec 2025 01:48:02 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2AD1E10E32A; Wed, 24 Dec 2025 01:48:02 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="cyVBBFX5"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.18]) by gabe.freedesktop.org (Postfix) with ESMTPS id 476AD10E323 for ; Wed, 24 Dec 2025 01:48:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1766540880; x=1798076880; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version; bh=39ZMGb8qC7MtJAiMD3ZRRHsKQ9POw0TEMgqTaKUvddo=; b=cyVBBFX5nIDKxreaKfyE4EuB2bLsEavpmFLLvHmLhT+L4usnDPEQzcD/ RKIno+SwASFgLXFJChXBkuMR9Qs1TPq69NTHCZQlHErs47xmgEJw+XbKS sDsnArUloVl0eMplUQWPFoYdvGfFXEFwXkm91rZnM2DxMRg+/jjv3/MLv oQfhheD7Ng65MewxPyGrKpboomQYWmN14nO1UpiQ8cQxMJBsn+AL6eYXz ck/Tc5zegZ53KnLN2doJ1l+xiXqtnNrnvyqUzpBeGzEKTZ+MY0PwI+S1v T+mw6pxARz2obi0jdqMQm1z+hnVG2bvvWsFIskVsYpvgpoR7pCbgpwj+c w==; X-CSE-ConnectionGUID: Y6FOaMjTSvqmCnV12zTtpA== X-CSE-MsgGUID: tUNa5Vi6RqaCapQt9bWVsA== X-IronPort-AV: E=McAfee;i="6800,10657,11651"; a="67580209" X-IronPort-AV: E=Sophos;i="6.21,172,1763452800"; d="scan'208";a="67580209" Received: from orviesa007.jf.intel.com ([10.64.159.147]) by fmvoesa112.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Dec 2025 17:47:59 -0800 X-CSE-ConnectionGUID: 2y28QJ4lTLaXJtkEn65riQ== X-CSE-MsgGUID: Z+nULdN8Q+GRHuw15QNAyQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.21,172,1763452800"; d="scan'208";a="199888252" Received: from ssapate-mobl2.gar.corp.intel.com (HELO adixit-MOBL3.intel.com) ([10.125.67.90]) by orviesa007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Dec 2025 17:47:59 -0800 Date: Tue, 23 Dec 2025 17:47:58 -0800 Message-ID: <874ipgmz01.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: Harish Chegondi Cc: , Umesh Nerlige Ramappa Subject: Re: [PATCH 1/1] drm/xe/eustall: Return EBADFD from read if EU stall registers get reset In-Reply-To: References: <6d78578c015b12e7ae243727ca7ed4b93551075d.1765174462.git.harish.chegondi@intel.com> <87fr97worl.wl-ashutosh.dixit@intel.com> <87ecolwzt7.wl-ashutosh.dixit@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/30.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Tue, 23 Dec 2025 15:39:47 -0800, Harish Chegondi wrote: > > On Mon, Dec 22, 2025 at 09:08:04PM -0800, Dixit, Ashutosh wrote: > > On Mon, 22 Dec 2025 14:37:53 -0800, Harish Chegondi wrote: > > > > > > On Thu, Dec 18, 2025 at 11:53:02AM -0800, Dixit, Ashutosh wrote: > > > > On Sun, 07 Dec 2025 22:16:11 -0800, Harish Chegondi wrote: > > > > > > > > > > > > Hi Ashutosh, > > > > Hi Harish, > > > > > > > > > If a reset (GT or engine) happens during EU stall data sampling, all the > > > > > EU stall registers can get reset to 0. This will result in EU stall data > > > > > buffers' read and write pointer register values to be out of sync with > > > > > the cached values. This can result in read() returning invalid data. To > > > > > prevent this, check the value of a EU stall base register. If it is zero, > > > > > it indicates a reset may have happened that wiped the register to zero. > > > > > If this happens, return EBADFD from read() upon which the user space > > > > > should close the fd and open a new fd for a new EU stall data > > > > > collection session. > > > > > > > > > > Cc: Ashutosh Dixit > > > > > Cc: Umesh Nerlige Ramappa > > > > > Signed-off-by: Harish Chegondi > > > > > --- > > > > > drivers/gpu/drm/xe/xe_eu_stall.c | 15 +++++++++++++++ > > > > > 1 file changed, 15 insertions(+) > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c > > > > > index 97dfb7945b7a..02c0beb4559f 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_eu_stall.c > > > > > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c > > > > > @@ -541,9 +541,24 @@ static ssize_t xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *st > > > > > size_t total_size = 0; > > > > > u16 group, instance; > > > > > unsigned int xecore; > > > > > + u32 base_reg_value; > > > > > int ret = 0; > > > > > > > > > > mutex_lock(&stream->xecore_buf_lock); > > > > > + /* If a GT or engine reset happens during EU stall data sampling, > > > > > + * all EU stall registers get reset to 0 and the cached values of > > > > > + * EU stall data buffers' read and write pointers are out of sync > > > > > + * with the register values. This can cause invalid data to be > > > > > + * returned from read(). To prevent this, check the value of a > > > > > + * EU stall base register. If it is zero, return -EBADFD. The > > > > > + * user is expected to close the fd and open a new fd. > > > > > + */ > > > > > + base_reg_value = xe_gt_mcr_unicast_read_any(gt, XEHPC_EUSTALL_BASE); > > > > > + if (unlikely(!base_reg_value)) { > > > > > + xe_gt_dbg(gt, "EU stall base register has been reset to 0\n"); > > > > > + mutex_unlock(&stream->xecore_buf_lock); > > > > > + return -EBADFD; > > > > > + } > > > > > > > > So I am seeing two problems here: > > > > > > > > 1. We are doing register read every read() call, rather than just when a > > > > reset happens. > > > > > > > > 2. The other issue is should reset itself unblock a blocked poll() or > > > > blocking read() call? If we don't do that, it is possible that poll() > > > > or blocking read() remains blocked indefinitely and so either the > > > > non-blocking read() doesn't get called at all, or a blocking read() > > > > remains indefinitely blocked. So that we never actually return -EBADFD > > > > even though a reset has happened. > > > > > > > > (Note that, for exec(), I believe any blocked fences will unblock and > > > > return error etc. if a reset happens during an exec() call (see > > > > reset_status()), so EU stall should probably do something similar). > > > > > > > > So to address these two issues how about doing something like this: > > > > > > > > 1. Call an EU stall callback from xe_guc_exec_queue_reset_handler(). In the > > > > callback, if an EU stall stream is open on that gt, check if > > > > XEHPC_EUSTALL_BASE is 0 and set a stream variable stream->reset under a > > > > suitable lock (likely xecore_buf_lock). > > > > > > I thought about this and I think this can be racy - if read() is called > > > after the engine and the EU stall registers got reset but before the > > > stream->reset is set, the read() would return bad EU stall data. I think > > > the XEHPC_EUSTALL_BASE register should be checked either in the polling > > > thread or in read (as in this patch) to avoid any race conditions. > > > > In that case we need to ask GuC team about "pre context reset" g2h/h2g > > messages/callbacks. > > > > Is it possible to deduce that reset has happened from the write_ptr > > register which is read in eu_stall_data_buf_poll? Say what happens to the > > mask bits when reset occurs (why are the mask bits there in the first place > > in a register which cannot be written by SW)? Then we can use that to set > > stream->reset, and not have to introduce an extra register read. > The write_ptr gets reset to 0, but it happens when the buffer wraps > around too. Yeah, I know, that's why I said check the mask bits. > I am not sure about the mask bits, but I can check what > happens to them after a reset. > > > > In a sense read() cannot return bad data since it uses cached read/wrt > > ptr's. So as long as we catch the reset while updating write_ptr, we should > > be ok. > > > > Or, can we can drop the constraint that read() will never return bad > > data. If a reset happens, read() can return bad data. > > > > Or do you have any other ideas to address the issues I mentioned above? > > You earlier suggested about checking the base register in the polling > thread. If we set the stream->reset in the polling thread, it can be > used in both read() and poll(). Do you think this would address the > issues you mentioned above ? Let's first check write_ptr. If it works out we can use that. Otherwise my preference I think is to avoid introducing an additional register read anywhere. What we could do is hook into the context reset callback and if context reset occurs we just unblock any blocked threads, followed by returning -EBADFD from read(), even if we return bad data. We revisit if UMD's insist bad data is unacceptable (even in case of a reset). > > > > > > > > > > > > > > 2. From eu_stall_data_buf_poll(), if stream->reset is set, return true to > > > > wake up any waiters. We may also need to set POLLERR or POLLHUP revents. > > > > > > > > 3. Now from read(), if stream->reset is set return -EBADFD. > > > > > > > > So I think something like this solves both problems mentioned above. > > > > > > > > So could you please look into this and see if this is possible? Or any > > > > other thoughts about this? > > > > > > > > Thanks. > > > > -- > > > > Ashutosh > > > Thank You > > > Harish. > > > > > > > > > if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) { > > > > > if (!stream->data_drop.reported_to_user) { > > > > > stream->data_drop.reported_to_user = true; > > > > > -- > > > > > 2.43.0 > > > > >