From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 6 Oct 2025 09:59:30 +0100
Subject: Re: [PATCH v4 08/34] drm/xe: Don't change LRC ring head on job resubmission
From: Matthew Auld
To: Matthew Brost
Cc: intel-xe@lists.freedesktop.org
References: <20251002055402.1865880-1-matthew.brost@intel.com>
 <20251002055402.1865880-9-matthew.brost@intel.com>
 <0f7ae66b-1272-4164-9b41-faac00f42f73@intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 05/10/2025 07:53, Matthew Brost wrote:
> On Sat, Oct 04, 2025 at 10:25:34PM -0700, Matthew Brost wrote:
>> On Thu, Oct 02, 2025 at 03:15:13PM +0100, Matthew Auld wrote:
>>> On 02/10/2025 06:53, Matthew Brost wrote:
>>>> Now that we save the job's head during submission, it's no longer
>>>> necessary to adjust the LRC ring head during resubmission.
>>>> Instead, a software-based adjustment of the tail will overwrite the
>>>> old jobs in place. For some odd reason, adjusting the LRC ring head
>>>> didn't work on parallel queues, which was causing issues in our CI.
>>>>
>>>> v6:
>>>>  - Also set LRC tail to head so queue is idle coming out of reset
>>>>
>>>> Signed-off-by: Matthew Brost
>>>> Reviewed-by: Tomasz Lis
>>>> ---
>>>>  drivers/gpu/drm/xe/xe_guc_submit.c | 10 ++++++++--
>>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> index 3a534d93505f..70306f902ba5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> @@ -2008,11 +2008,17 @@ static void guc_exec_queue_start(struct xe_exec_queue *q)
>>>>  	struct xe_gpu_scheduler *sched = &q->guc->sched;
>>>>
>>>>  	if (!exec_queue_killed_or_banned_or_wedged(q)) {
>>>> +		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
>>>>  		int i;
>>>>
>>>>  		trace_xe_exec_queue_resubmit(q);
>>>> -		for (i = 0; i < q->width; ++i)
>>>> -			xe_lrc_set_ring_head(q->lrc[i], q->lrc[i]->ring.tail);
>>>> +		if (job) {
>>>> +			for (i = 0; i < q->width; ++i) {
>>>> +				q->lrc[i]->ring.tail = job->ptrs[i].head;
>>>> +				xe_lrc_set_ring_tail(q->lrc[i],
>>>> +						     xe_lrc_ring_head(q->lrc[i]));
>>>
>>> IIRC the sched pending_list stuff can also give back pending jobs that
>>> have completed on the hw, but are still kept pending until the final
>>> free_job()?
>>>
>>> Suppose we have a pending_list like:
>>>
>>> [pending/complete, pending/complete, actual pending kernel job that
>>> never completed/ran]
>>>
>>> IIUC the sw ring.tail will actually go backwards to the head of the
>>> first pending/complete job in the pending_list, with the hw tail being
>>> reset to the current hw head here. But on the next submit the sw
>>> ring.tail is where the commands are emitted to, and on the next update
>>> of the hw tail it will be synced to the sw ring.tail? But if that
>>> happens won't we get hw tail < hw head (since we used the head of an
>>> already complete job for the sw tail), which will make the hw think
>>> there is a massive ring wrap, so it will execute garbage until it wraps
>>> back around to the tail?
>>>
>>
>> Let me tweak this flow to use the skip_emit / last_replay flow
>> introduced later in the series to avoid this issue.
>>
>
> Actually this flow works just fine. The GuC state is completely lost
> during this flow - the context is not even registered. By the time the
> context is registered, the LRC head will be at the original position of
> the first pending job and the LRC tail will be at the end of the first
> job.

Can you share some more info here on the flow? I'm seeing the LRC hw head
being at the correct position, which must have moved past anything already
completed by the hw, right? But here the sw lrc tail is potentially being
moved backwards to the head of something already complete (the job we pick
is pending only because free_job() has not run yet, so the job has already
signalled/run). Below, when we do something like xe_sched_resubmit_jobs(),
the hw tail is then updated to the sw tail, but then we end up with hw
tail < hw head?

Also, the flow I'm thinking about here is forced suspend/resume.

>
> Matt
>
>> Matt
>>
>>>> +			}
>>>> +		}
>>>>
>>>>  		xe_sched_resubmit_jobs(sched);
>>>>  	}
>>>
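
[Editor's sketch] The ring-wrap concern debated above is easiest to see with
a toy model. The C program below is not the Xe/GuC implementation; the ring
size, the variable names and bytes_to_execute() are assumptions made purely
to illustrate why a hw tail programmed behind the hw head reads as a
near-full wrap rather than an idle ring.

/*
 * Toy model of a ring buffer whose consumer runs from head towards tail
 * modulo the ring size. RING_SIZE and all offsets are hypothetical.
 */
#include <stdio.h>
#include <stdint.h>

#define RING_SIZE 0x4000u /* hypothetical power-of-two ring size, in bytes */

/* Bytes the consumer will chew through before it considers the ring idle. */
static uint32_t bytes_to_execute(uint32_t head, uint32_t tail)
{
	/* Modular distance head -> tail; tail == head means nothing to do. */
	return (tail - head) & (RING_SIZE - 1);
}

int main(void)
{
	uint32_t hw_head = 0x2000;    /* head after hw consumed completed jobs */
	uint32_t good_tail = 0x2400;  /* tail ahead of head: normal submission */
	uint32_t stale_tail = 0x1000; /* tail rewound behind head */

	/* Normal case: only the freshly emitted commands are pending. */
	printf("tail > head: 0x%x bytes pending\n",
	       (unsigned)bytes_to_execute(hw_head, good_tail));

	/*
	 * Rewound case: the consumer does not know the tail "went backwards";
	 * it sees a near-full ring and would run through stale contents until
	 * it wraps back around to the tail.
	 */
	printf("tail < head: 0x%x bytes of stale ring contents\n",
	       (unsigned)bytes_to_execute(hw_head, stale_tail));

	return 0;
}

Run as-is, the first case reports 0x400 pending bytes while the second
reports 0x3000 - almost the whole ring - which is the "execute garbage until
it wraps back around to tail" behaviour described in the review above.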