From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mout-p-103.mailbox.org (mout-p-103.mailbox.org [80.241.56.161]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ACAB528C87D for ; Thu, 22 May 2025 14:59:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=80.241.56.161 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747925991; cv=none; b=ouVzdCDTGoYjpHcrCOfEg92OJ0Y2ShHfVd6beM47vwbj1HtOvX9dRWakTET7UdjfCcKD3MsWdYmBoutYSH5tZH5NGDMXELVtnLMavHBt+VhL9lEvO/vu6YAaP15yBB7lsQloiN2/DnzEf2Fikvth12tfKwGcMLgnk+6VAY2q7fU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747925991; c=relaxed/simple; bh=SZsJPIA2a2ieFci0A6gsILiqOpySQNMgu2fJyHA10HQ=; h=Message-ID:Subject:From:To:Cc:Date:In-Reply-To:References: Content-Type:MIME-Version; b=hvyT+yIRPidjrCk7KonetzQkqOQUVFEFEk+dlEbnlsBP37AXB4e4vz3M+rdXMqUba57Bv82l0FkcbRxsrQv/xEWilZZ2GukZ5xWCix0HeVQltRciPSMAUbkRAq0Ge8f3wj9DRAXwVDZc1l2/fHjgXKMKB375ucnintiPLHmbm6g= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=mailbox.org; spf=pass smtp.mailfrom=mailbox.org; dkim=pass (2048-bit key) header.d=mailbox.org header.i=@mailbox.org header.b=IWvWyL2e; arc=none smtp.client-ip=80.241.56.161 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=mailbox.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=mailbox.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=mailbox.org header.i=@mailbox.org header.b="IWvWyL2e" Received: from smtp2.mailbox.org (smtp2.mailbox.org [10.196.197.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by mout-p-103.mailbox.org (Postfix) with ESMTPS id 4b3BK81Wh9z9sm6; Thu, 22 May 2025 16:59:40 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mailbox.org; s=mail20150812; t=1747925980; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=9hIgVgAiYv5vVjOkhA8xQvDe3ANpNjv4wOvq4t4WLpw=; b=IWvWyL2eylEhzKYLqRvx37xxF0K39+Y1KjagxvStpljHNSnVQw6Y2oWfQVruXmiEWzIliH JT8yK42yp2jwPCe6k0H4BlAWkwdserfNxKdrST09RjotvtB5PhZkWuO1AfqdirdVntY7C4 ze2aLUcCpyo5xJbdayKHTqq2ZGS9s1rwu9UQZ71D+X6UnlBA5AJCei1Hk5Txrd4he6QpmD DZrJu9cuNE4LUL0DG0Ijo0fZG+3iKp2rH/CEwU2M7a1jfpMZHMuK0mhyP0iuWKG3Hyv9gj LeLMjVODuVDg2w/tyQbwwbJZ/0O0ZhSn1NY2W8Ac7fy0aVyrkzgJFJrrqW/bdw== Message-ID: Subject: Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method From: Philipp Stanner Reply-To: phasta@kernel.org To: Tvrtko Ursulin , Philipp Stanner , Lyude Paul , Danilo Krummrich , David Airlie , Simona Vetter , Matthew Brost , Christian =?ISO-8859-1?Q?K=F6nig?= , Maarten Lankhorst , Maxime Ripard , Thomas Zimmermann Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org Date: Thu, 22 May 2025 16:59:35 +0200 In-Reply-To: References: <20250522082742.148191-2-phasta@kernel.org> <20250522082742.148191-4-phasta@kernel.org> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-MBO-RS-ID: b9071dfc324416f6776 X-MBO-RS-META: jpe4deru47jan9sex4qyichcktzu8sqo On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote: >=20 > On 22/05/2025 09:27, Philipp Stanner wrote: > > The drm_gpu_scheduler now supports a callback to help > > drm_sched_fini() > > avoid memory leaks. This callback instructs the driver to signal > > all > > pending hardware fences. > >=20 > > Implement the new callback > > drm_sched_backend_ops.cancel_pending_fences(). > >=20 > > Have the callback use drm_mock_sched_job_complete() with a new > > error > > field for the fence error. > >=20 > > Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there > > is > > no party for which checking for a CANCELED status would be useful > > currently. > >=20 > > Signed-off-by: Philipp Stanner > > --- > > =C2=A0 .../gpu/drm/scheduler/tests/mock_scheduler.c=C2=A0 | 67 +++++++-= ------ > > ----- > > =C2=A0 drivers/gpu/drm/scheduler/tests/sched_tests.h |=C2=A0 4 +- > > =C2=A0 2 files changed, 25 insertions(+), 46 deletions(-) > >=20 > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c > > index f999c8859cf7..eca47f0395bc 100644 > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c > > @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct > > drm_mock_sched_entity *entity) > > =C2=A0=C2=A0 drm_sched_entity_destroy(&entity->base); > > =C2=A0 } > > =C2=A0=20 > > -static void drm_mock_sched_job_complete(struct drm_mock_sched_job > > *job) > > +static void drm_mock_sched_job_complete(struct drm_mock_sched_job > > *job, int err) > > =C2=A0 { > > =C2=A0=C2=A0 struct drm_mock_scheduler *sched =3D > > =C2=A0=C2=A0 drm_sched_to_mock_sched(job->base.sched); > > @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct > > drm_mock_sched_job *job) > > =C2=A0=C2=A0 lockdep_assert_held(&sched->lock); > > =C2=A0=20 > > =C2=A0=C2=A0 job->flags |=3D DRM_MOCK_SCHED_JOB_DONE; > > - list_move_tail(&job->link, &sched->done_list); > > + list_del(&job->link); > > + if (err) > > + dma_fence_set_error(&job->hw_fence, err); > > =C2=A0=C2=A0 dma_fence_signal(&job->hw_fence); > > =C2=A0=C2=A0 complete(&job->done); > > =C2=A0 } > > @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer > > *hrtimer) > > =C2=A0=C2=A0 break; > > =C2=A0=20 > > =C2=A0=C2=A0 sched->hw_timeline.cur_seqno =3D job- > > >hw_fence.seqno; > > - drm_mock_sched_job_complete(job); > > + drm_mock_sched_job_complete(job, 0); > > =C2=A0=C2=A0 } > > =C2=A0=C2=A0 spin_unlock_irqrestore(&sched->lock, flags); > > =C2=A0=20 > > @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job > > *sched_job) > > =C2=A0=20 > > =C2=A0 static void mock_sched_free_job(struct drm_sched_job *sched_job) > > =C2=A0 { > > - struct drm_mock_scheduler *sched =3D > > - drm_sched_to_mock_sched(sched_job->sched); > > =C2=A0=C2=A0 struct drm_mock_sched_job *job =3D > > drm_sched_job_to_mock_job(sched_job); > > - unsigned long flags; > > =C2=A0=20 > > - /* Remove from the scheduler done list. */ > > - spin_lock_irqsave(&sched->lock, flags); > > - list_del(&job->link); > > - spin_unlock_irqrestore(&sched->lock, flags); > > =C2=A0=C2=A0 dma_fence_put(&job->hw_fence); > > - > > =C2=A0=C2=A0 drm_sched_job_cleanup(sched_job); > > =C2=A0=20 > > =C2=A0=C2=A0 /* Mock job itself is freed by the kunit framework. */ > > =C2=A0 } > > =C2=A0=20 > > +static void mock_sched_cancel_pending_fences(struct > > drm_gpu_scheduler *gsched) >=20 > "gsched" feels like a first time invention. Maybe drm_sched? Alright >=20 > > +{ > > + struct drm_mock_sched_job *job, *next; > > + struct drm_mock_scheduler *sched; > > + unsigned long flags; > > + > > + sched =3D container_of(gsched, struct drm_mock_scheduler, > > base); > > + > > + spin_lock_irqsave(&sched->lock, flags); > > + list_for_each_entry_safe(job, next, &sched->job_list, > > link) > > + drm_mock_sched_job_complete(job, -ECANCELED); > > + spin_unlock_irqrestore(&sched->lock, flags); >=20 > Canceling of the timers belongs in this call back I think. Otherwise=20 > jobs are not fully cancelled. I wouldn't say so =E2=80=93 the timers represent things like the hardware interrupts. And those must be deactivated by the driver itself. See, one big reason why I like my approach is that the contract between driver and scheduler is made very simple: "Driver, signal all fences that you ever returned through run_job() to this scheduler!" That always works, and the driver always has all those fences. It's based on the most fundamental agreement regarding dma_fences: they must all be signaled. >=20 > Hm, I also think, conceptually, the order of first canceling the > timer=20 > and then signaling the fence should be kept. That's the case here, no? It must indeed be kept, otherwise the timers could fire after everything is torn down -> UAF >=20 > At the moment it does not matter hugely, since the timer does not > signal=20 > the jobs directly and will not find unlinked jobs, but if that > changes=20 > in the future, the reversed order could cause double signaling. So if > you keep it in the correct logical order that potential gotcha is=20 > avoided. Basically just keep the two pass approach verbatim, as is in > the current drm_mock_sched_fini. >=20 > The rest of the conversion is I think good. :) >=20 > Only a slight uncertainty after I cross-referenced with my version=20 > (->cancel_job()) around why I needed to add signaling to=20 > mock_sched_timedout_job() and manual job cleanup to the timeout test. > It=20 > was more than a month ago that I wrote it so can't remember right > now.=20 > You checked for memory leaks and the usual stuff? Hm, it seems I indeed ran into that leak that you fixed (in addition to the other stuff) in your RFC, for the timeout tests. We should fix that in a separate patch, probably. P. >=20 > Regards, >=20 > Tvrtko >=20 > > +} > > + > > =C2=A0 static const struct drm_sched_backend_ops drm_mock_scheduler_ops > > =3D { > > =C2=A0=C2=A0 .run_job =3D mock_sched_run_job, > > =C2=A0=C2=A0 .timedout_job =3D mock_sched_timedout_job, > > - .free_job =3D mock_sched_free_job > > + .free_job =3D mock_sched_free_job, > > + .cancel_pending_fences =3D mock_sched_cancel_pending_fences, > > =C2=A0 }; > > =C2=A0=20 > > =C2=A0 /** > > @@ -265,7 +274,6 @@ struct drm_mock_scheduler > > *drm_mock_sched_new(struct kunit *test, long timeout) > > =C2=A0=C2=A0 sched->hw_timeline.context =3D dma_fence_context_alloc(1); > > =C2=A0=C2=A0 atomic_set(&sched->hw_timeline.next_seqno, 0); > > =C2=A0=C2=A0 INIT_LIST_HEAD(&sched->job_list); > > - INIT_LIST_HEAD(&sched->done_list); > > =C2=A0=C2=A0 spin_lock_init(&sched->lock); > > =C2=A0=20 > > =C2=A0=C2=A0 return sched; > > @@ -280,38 +288,11 @@ struct drm_mock_scheduler > > *drm_mock_sched_new(struct kunit *test, long timeout) > > =C2=A0=C2=A0 */ > > =C2=A0 void drm_mock_sched_fini(struct drm_mock_scheduler *sched) > > =C2=A0 { > > - struct drm_mock_sched_job *job, *next; > > - unsigned long flags; > > - LIST_HEAD(list); > > + struct drm_mock_sched_job *job; > > =C2=A0=20 > > - drm_sched_wqueue_stop(&sched->base); > > - > > - /* Force complete all unfinished jobs. */ > > - spin_lock_irqsave(&sched->lock, flags); > > - list_for_each_entry_safe(job, next, &sched->job_list, > > link) > > - list_move_tail(&job->link, &list); > > - spin_unlock_irqrestore(&sched->lock, flags); > > - > > - list_for_each_entry(job, &list, link) > > + list_for_each_entry(job, &sched->job_list, link) > > =C2=A0=C2=A0 hrtimer_cancel(&job->timer); > > =C2=A0=20 > > - spin_lock_irqsave(&sched->lock, flags); > > - list_for_each_entry_safe(job, next, &list, link) > > - drm_mock_sched_job_complete(job); > > - spin_unlock_irqrestore(&sched->lock, flags); > > - > > - /* > > - * Free completed jobs and jobs not yet processed by the > > DRM scheduler > > - * free worker. > > - */ > > - spin_lock_irqsave(&sched->lock, flags); > > - list_for_each_entry_safe(job, next, &sched->done_list, > > link) > > - list_move_tail(&job->link, &list); > > - spin_unlock_irqrestore(&sched->lock, flags); > > - > > - list_for_each_entry_safe(job, next, &list, link) > > - mock_sched_free_job(&job->base); > > - > > =C2=A0=C2=A0 drm_sched_fini(&sched->base); > > =C2=A0 } > > =C2=A0=20 > > @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct > > drm_mock_scheduler *sched, > > =C2=A0=C2=A0 if (sched->hw_timeline.cur_seqno < job- > > >hw_fence.seqno) > > =C2=A0=C2=A0 break; > > =C2=A0=20 > > - drm_mock_sched_job_complete(job); > > + drm_mock_sched_job_complete(job, 0); > > =C2=A0=C2=A0 found++; > > =C2=A0=C2=A0 } > > =C2=A0 unlock: > > diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h > > b/drivers/gpu/drm/scheduler/tests/sched_tests.h > > index 27caf8285fb7..22e530d87791 100644 > > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h > > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h > > @@ -32,9 +32,8 @@ > > =C2=A0=C2=A0 * > > =C2=A0=C2=A0 * @base: DRM scheduler base class > > =C2=A0=C2=A0 * @test: Backpointer to owning the kunit test case > > - * @lock: Lock to protect the simulated @hw_timeline, @job_list > > and @done_list > > + * @lock: Lock to protect the simulated @hw_timeline and @job_list > > =C2=A0=C2=A0 * @job_list: List of jobs submitted to the mock GPU > > - * @done_list: List of jobs completed by the mock GPU > > =C2=A0=C2=A0 * @hw_timeline: Simulated hardware timeline has a @context= , > > @next_seqno and > > =C2=A0=C2=A0 * @cur_seqno for implementing a struct dma_fence > > signaling the > > =C2=A0=C2=A0 * simulated job completion. > > @@ -49,7 +48,6 @@ struct drm_mock_scheduler { > > =C2=A0=20 > > =C2=A0=C2=A0 spinlock_t lock; > > =C2=A0=C2=A0 struct list_head job_list; > > - struct list_head done_list; > > =C2=A0=20 > > =C2=A0=C2=A0 struct { > > =C2=A0=C2=A0 u64 context; >=20