From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4C35EC5AE59 for ; Tue, 3 Jun 2025 09:32:03 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 51B5210E69E; Tue, 3 Jun 2025 09:32:02 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.b="AStVO9L5"; dkim-atps=neutral Received: from tor.source.kernel.org (tor.source.kernel.org [172.105.4.254]) by gabe.freedesktop.org (Postfix) with ESMTPS id 7769110E618; Tue, 3 Jun 2025 09:32:01 +0000 (UTC) Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by tor.source.kernel.org (Postfix) with ESMTP id 8EA0C61126; Tue, 3 Jun 2025 09:32:00 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5B145C4CEEF; Tue, 3 Jun 2025 09:31:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1748943120; bh=vu+k3OVXrurN0RDNX/7oaLaHr4+RCz+6uUmJJ1xRlw8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=AStVO9L5QlzmiP0MOdqk9vID9qjB2nKFZAiYkkRJNlgh2IPqsaAqOUmMkURZHmfKn W6vtH6yGFxN4IcLZNLjTfSi8/Bmg/AdRAkohcidBeOXfMtoXzo4lUoC2HEClnQU9KU VHh6wWn3JlpncWWJSWKgQuPJrPgn1+nfVG4W+qFMV/TfEptRICmBgrt3ukB1f4/hhn Qri4Yhjk0lI622PpglqmaRZDx1V7PhX9IbGKAojfPJxrj0O/7qRbUEu4jLRY1FwUSS jO1elk1dlm1pfE8XQoEUPStE6cQneFNkXq9gnin2vNwqxBsLG4nCs5Pc6B9CuPatPS Ny629OCECm08g== From: Philipp Stanner To: Lyude Paul , Danilo Krummrich , David Airlie , Simona Vetter , Matthew Brost , Philipp Stanner , =?UTF-8?q?Christian=20K=C3=B6nig?= , Maarten Lankhorst , Maxime Ripard , Thomas Zimmermann , Sumit Semwal , Tvrtko Ursulin , Pierre-Eric Pelloux-Prayer Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org Subject: [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback Date: Tue, 3 Jun 2025 11:31:26 +0200 Message-ID: <20250603093130.100159-3-phasta@kernel.org> X-Mailer: git-send-email 2.49.0 In-Reply-To: <20250603093130.100159-2-phasta@kernel.org> References: <20250603093130.100159-2-phasta@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-BeenThere: nouveau@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Nouveau development list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: nouveau-bounces@lists.freedesktop.org Sender: "Nouveau" Since its inception, the GPU scheduler can leak memory if the driver calls drm_sched_fini() while there are still jobs in flight. The simplest way to solve this in a backwards compatible manner is by adding a new callback, drm_sched_backend_ops.cancel_job(), which instructs the driver to signal the hardware fence associated with the job. Afterwards, the scheduler can savely use the established free_job() callback for freeing the job. Implement the new backend_ops callback cancel_job(). Suggested-by: Tvrtko Ursulin Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++---------- include/drm/gpu_scheduler.h | 9 +++++++ 2 files changed, 30 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index d20726d7adf0..3f14f1e151fa 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_ } EXPORT_SYMBOL(drm_sched_init); +static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched) +{ + struct drm_sched_job *job, *tmp; + + /* All other accessors are stopped. No locking necessary. */ + list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) { + sched->ops->cancel_job(job); + list_del(&job->list); + sched->ops->free_job(job); + } +} + /** * drm_sched_fini - Destroy a gpu scheduler * @@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init); * * Tears down and cleans up the scheduler. * - * This stops submission of new jobs to the hardware through - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job() - * will not be called for all jobs still in drm_gpu_scheduler.pending_list. - * There is no solution for this currently. Thus, it is up to the driver to make - * sure that: - * - * a) drm_sched_fini() is only called after for all submitted jobs - * drm_sched_backend_ops.free_job() has been called or that - * b) the jobs for which drm_sched_backend_ops.free_job() has not been called - * after drm_sched_fini() ran are freed manually. - * - * FIXME: Take care of the above problem and prevent this function from leaking - * the jobs in drm_gpu_scheduler.pending_list under any circumstances. + * This stops submission of new jobs to the hardware through &struct + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job + * is implemented, all jobs will be canceled through it and afterwards cleaned + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not + * implemented, memory could leak. */ void drm_sched_fini(struct drm_gpu_scheduler *sched) { @@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) /* Confirm no work left behind accessing device structures */ cancel_delayed_work_sync(&sched->work_tdr); + /* Avoid memory leaks if supported by the driver. */ + if (sched->ops->cancel_job) + drm_sched_kill_remaining_jobs(sched); + if (sched->own_submit_wq) destroy_workqueue(sched->submit_wq); sched->ready = false; diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index e62a7214e052..81dcbfc8c223 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -512,6 +512,15 @@ struct drm_sched_backend_ops { * and it's time to clean it up. */ void (*free_job)(struct drm_sched_job *sched_job); + + /** + * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences + * get signaled in drm_sched_fini(). + * + * Drivers need to signal the passed job's hardware fence with + * -ECANCELED in this callback. They must not free the job. + */ + void (*cancel_job)(struct drm_sched_job *sched_job); }; /** -- 2.49.0