From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 457A7C433EF for ; Wed, 8 Jun 2022 17:39:48 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id AA4B41128D4; Wed, 8 Jun 2022 17:39:44 +0000 (UTC) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3E42510F337; Wed, 8 Jun 2022 17:39:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654709983; x=1686245983; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version; bh=Nn6IF+tbUDc474ZjdSe8A8uHwLOl/FApHm3m2uWzBys=; b=ixuykNttCAaYHxOYWMtrvfd/oRxy55dPrrJFXHS3VZLS4nLT+2YunM7n 8LITubQwsE3rFHo4giCD1TuH+p6kAD5ahLXBuC/Hn0DxJk4THQoW4Kr8P o7vSbGaMn6Uzwz2I2w5QaXBj2P1GDunbodka4czzX0mVX0OZCMW/kmbmS QGFVEt8ne/g8BNNY3uZPdPkrdXsVeO/4mmO+7rxasEqzCfOs/bbB7KKRo oqW1NbTk4AUlMvqZdKoQC1VUWI6+3oTzYjObDu7wodOOVd9C9eDGucgpf U+2sN3jmcDcUVisHKuDRniovA90kzxOzxNF/Mp3IPdVMpBZU5kQbL/4ZD w==; X-IronPort-AV: E=McAfee;i="6400,9594,10372"; a="302344490" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="302344490" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 10:39:42 -0700 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="759576251" Received: from adixit-mobl1.amr.corp.intel.com (HELO adixit-arch.intel.com) ([10.212.170.194]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 10:39:42 -0700 Date: Wed, 08 Jun 2022 10:39:41 -0700 Message-ID: <87tu8vjbqa.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: John Harrison In-Reply-To: References: <20220515060506.22084-1-vinay.belgaumkar@intel.com> <87y1y8jeer.wl-ashutosh.dixit@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/28.1 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: jeff.mcgee@intel.com, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Tue, 07 Jun 2022 16:15:19 -0700, John Harrison wrote: > > On 6/7/2022 15:29, Dixit, Ashutosh wrote: > > On Sat, 14 May 2022 23:05:06 -0700, Vinay Belgaumkar wrote: > >> SLPC min/max frequency updates require H2G calls. We are seeing > >> timeouts when GuC channel is backed up and it is unable to respond > >> in a timely fashion causing warnings and affecting CI. > >> > >> This is seen when waitboosting happens during a stress test. > >> this patch updates the waitboost path to use a non-blocking > >> H2G call instead, which returns as soon as the message is > >> successfully transmitted. > > Overall I think this patch is trying to paper over problems in the blocking > > H2G CT interface (specifically the 1 second timeout in > > wait_for_ct_request_update()). So I think we should address that problem in > > the interface directly rather than having each client (SLPC and any future > > client) work around the problem. Following points: > > > > 1. This patch seems to assume that it is 'ok' to ignore the return code > > from FW for a waitboost request (arguing waitboost is best effort so > > it's ok to 'fire and forget'). But the return code is still useful > > e.g. in cases where we see performance issues and want to go back and > > investigate if FW rejected any waitboost requests. > > You still get errors reported in the GuC log. Indeed, some errors (or at > least error reasons) are only visible in the log not in the return code. OK, so we at least have this method for debug available. > > 2. We are already seeing that a 1 second timeout is not sufficient. So why > > not simply increase that timeout? > > > > 3. In fact if we are saying that the CT interface is a "reliable" interface > > (implying no message loss), to ensure reliability that timeout should > > not simply be increased, it should be made "infinite" (in quotes). > > > > 4. Maybe it would have been best to not have a "blocking" H2G interface at > > all (with the wait in wait_for_ct_request_update()). Just have an > > asynchronous interface (which mirrors the actual interface between FW > > and i915) in which clients register callbacks which are invoked when FW > > responds. If this is too big a change we can probably continue with the > > current blocking interface after increasing the timeout as mentioned > > above. > > > > 5. Finally, the waitboost request is just the most likely to get stuck at > > the back of a full CT queue since it happens during normal > > operation. Actually any request, say one initiated from sysfs, can also > > get similarly stuck at the back of a full queue. So any solution should > > also address that situation (where the return code is needed and > > similarly for a future client of the "blocking" (REQUEST/RESPONSE) > > interface). > The blocking interface is only intended for init time operations, not > runtime. In that case we should probably have code to enforce this in i915. > Stuff where the operation is meant to be synchronous and the KMD > should not proceed until it has an ack back from the GuC that the update > has taken place. All runtime operations are expected to be asynchronous. If > a response is required, then it should be sent via an async > callback. E.g. context de-registration is a 'fire and forget' H2G call but > gets a 'deregistration complete' G2H notification when it is safe for the > KMD to free up the associated storage. At present all GuC interactions in intel_guc_slpc.c (in i915) do *not* follow this. They use the REQUEST/RESPONSE FW interface which is pushed through the blocking H2G CT interface in i915. If we are serious about this this needs a GuC FW change to use bi-directional EVENT's used in the asynchronous interface (with corresponding changes in intel_guc_slpc.c). > There is an 'errors only' H2G mechanism. That will not send an ack back in > the case of a successful H2G but will send back an error notification in > the case of a failure. All async H2Gs should really be using that > mechanism. I think Michal W did post a patch for it and I was meant to be > reviewing it but it dropped of my radar due to other higher priorities. These I believe are referred to as FAST_REQUEST's in GuC FW. That success is not communicated back to the KMD might be an issue in cases where KMD needs to know whether a particular operation was successful (such as for operations initiated via sysfs). Thanks. -- Ashutosh