From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id E084CC7EE29 for ; Mon, 22 May 2023 20:20:15 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 64C4310E380; Mon, 22 May 2023 20:20:15 +0000 (UTC) Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by gabe.freedesktop.org (Postfix) with ESMTPS id AFF4710E380 for ; Mon, 22 May 2023 20:20:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1684786813; x=1716322813; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version; bh=nxffeg8aJkMiro1axohdpLdwvbTKbe9HIUam5Qp8yTU=; b=idkqgg5luiovVHw3qthzDDElQDC53NR2RD6YSrdhr/nywXBcXr7X0mI5 NIeqaP/fli32uK6Cf1NAfYNUCLm40pyUE28RMi3wC/op2bRs8NmwKCmO0 ci0a3cFL0D+pJuvb99ssbuODWPX0PZrBc+ThiLh0Oq66+Shfi6Hwvcw5v g43GY5PTnYyZgyb5Dv60ybbfs7+K/jMqxFkqcuqfkfqE/6NLfjcpo6CKX DT9I0biKgw/BI3yYkuDF1+4Hw51wyi8ikmhPV/Jcnm5KAsSHMrhqUS12s pzrldZp9HfZL8PPMIL8US99gaqXLcEnj93u4M9k9H4fAPhLGlcFt5CeoB w==; X-IronPort-AV: E=McAfee;i="6600,9927,10718"; a="439385279" X-IronPort-AV: E=Sophos;i="6.00,184,1681196400"; d="scan'208";a="439385279" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 22 May 2023 13:20:13 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10718"; a="793456941" X-IronPort-AV: E=Sophos;i="6.00,184,1681196400"; d="scan'208";a="793456941" Received: from adixit-mobl.amr.corp.intel.com (HELO adixit-arch.intel.com) ([10.251.0.15]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 22 May 2023 13:20:13 -0700 Date: Mon, 22 May 2023 13:20:12 -0700 Message-ID: <87mt1wqfsz.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: Umesh Nerlige Ramappa In-Reply-To: <20230519225642.134778-1-umesh.nerlige.ramappa@intel.com> References: <20230519225642.134778-1-umesh.nerlige.ramappa@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/28.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII Subject: Re: [Intel-gfx] [PATCH] i915/perf: Avoid reading OA reports before they land X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Lionel G Landwerlin , intel-gfx@lists.freedesktop.org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Fri, 19 May 2023 15:56:42 -0700, Umesh Nerlige Ramappa wrote: > Hi Umesh, > On DG2, capturing OA reports while running heavy render workloads > sometimes results in invalid OA reports where 64-byte chunks inside > reports have stale values. Under memory pressure, high OA sampling rates > (13.3 us) and heavy render workload, occassionally, the OA HW TAIL > pointer does not progress as fast as the sampling rate. When these > glitches occur, the TAIL pointer takes approx. 200us to progress. While > this is expected behavior from the HW perspective, invalid reports are > not expected. > > In oa_buffer_check_unlocked(), when we execute the if condition, we are > updating the oa_buffer.tail to the aging tail and then setting pollin > based on this tail value, however, we do not have a chance to rewind and > validate the reports prior to setting pollin. The validation happens > in a subsequent call to oa_buffer_check_unlocked(). If a read occurs > before this validation, then we end up reading reports up until this > oa_buffer.tail value which includes invalid reports. Though found on > DG2, this affects all platforms. > > Set the pollin only in the else condition in oa_buffer_check_unlocked. > > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7484 > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7757 > Signed-off-by: Umesh Nerlige Ramappa > --- > drivers/gpu/drm/i915/i915_perf.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c > index 19d5652300ee..61536e3c4ac9 100644 > --- a/drivers/gpu/drm/i915/i915_perf.c > +++ b/drivers/gpu/drm/i915/i915_perf.c > @@ -545,7 +545,7 @@ static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream) > u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma); > int report_size = stream->oa_buffer.format->size; > unsigned long flags; > - bool pollin; > + bool pollin = false; > u32 hw_tail; > u64 now; > u32 partial_report_size; > @@ -620,10 +620,10 @@ static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream) > stream->oa_buffer.tail = gtt_offset + tail; > stream->oa_buffer.aging_tail = gtt_offset + hw_tail; > stream->oa_buffer.aging_timestamp = now; > - } > > - pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset, > - stream->oa_buffer.head - gtt_offset) >= report_size; > + pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset, > + stream->oa_buffer.head - gtt_offset) >= report_size; > + } The issue has been correctly identified above. But seems that the real cause for the issue is not that pollin statement above is misplaced but that updating the tail via aging is unreliable (at least with the present timeout as you mention above). Also, it is not clear why we have tail aging at all, since it seems we can detect when reports land (by checking report_id and timestamp). So rather than move the pollin into the else, we should just eliminate the if () part. And if we are eliminating the if () we can just eliminate the concept of tail aging from the code (and comments) and rely solely on explicit detection of reports landing. Separately, there seems to be another related bug in the code, I have sent a patch for that here: https://patchwork.freedesktop.org/series/118151/ Thanks. -- Ashutosh