All of lore.kernel.org
 help / color / mirror / Atom feed
From: Philipp Stanner <phasta@kernel.org>
To: "Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 0/6] drm/sched: Fix memory leaks in drm_sched_fini()
Date: Thu, 24 Apr 2025 11:55:30 +0200	[thread overview]
Message-ID: <20250424095535.26119-2-phasta@kernel.org> (raw)

Changes in v2:
  - Port kunit tests to new cleanup approach (Tvrtko), however so far
    still with the fence context-kill approach (Me. to be discussed.)
  - Improve len(pending_list) > 0 warning print. (Me)
  - Remove forgotten RFC comments. (Me)
  - Remove timeout boolean, since it's surplus. (Me)
  - Fix misunderstandings in the cover letter. (Tvrtko)

Changes since the RFC:
  - (None)

Howdy,

as many of you know, we have potential memory leaks in drm_sched_fini()
which have been tried to be solved by various parties with various
methods in the past.

In our past discussions, we came to the conclusion, that the simplest
solution, blocking in drm_sched_fini(), is not possible because it could
cause processes ignoring SIGKILL and blocking for too long (which could
turn out to be an effective way to generate a funny email from Linus,
though :) )

Another idea was to have submitted jobs refcount the scheduler. I
investigated this and we found that this then *additionally* would
require us to have *the scheduler* refcount everything *in the driver*
that is accessed through the still running callbacks; since the driver
would want to unload possibly after a non-blocking drm_sched_fini()
call. So that's also no solution.

This RFC here is a new approach, somewhat based on the original
waitque-idea. It looks as follows:

1. Have drm_sched_fini() block until the pending_list becomes empty with
   a waitque, as a first step.
2. Provide the scheduler with a callback with which it can instruct the
   driver to kill the associated fence context. This will cause all
   pending hardware fences to get signalled. (Credit to Danilo, whose
   idea this was)
3. In drm_sched_fini(), first switch off submission of new jobs and
   timeouts (the latter might not be strictly necessary, but is probably
   cleaner).
4. Then, call the aformentioned callback, ensuring that free_job() will
   be called for all remaining jobs relatively quickly. This has the
   great advantage that the jobs get cleaned up through the standard
   mechanism.
5. Once all jobs are gone, also switch off the free_job() work item and
   then proceed as usual.

Furthermore, since there is now such a callback, we can provide an
if-branch checking for its existence. If the driver doesn't provide it,
drm_sched_fini() operates in "legacy mode". So none of the existing
drivers should notice a difference and we remain fully backwards
compatible.

Our glorious beta-tester is Nouveau, which so far had its own waitque
solution, which is now obsolete.

The scheduler's unit tests are also ported to the new method.

I've tested this on a desktop environment with Nouveau. Works fine and
solves the problem (though we did discover an unrelated problem inside
Nouveau in the process).

It also works with the unit tests.

I'm looking forward to your input and feedback. I really hope we can
work this RFC into something that can provide users with a more
reliable, clean scheduler API.

Philipp

Philipp Stanner (6):
  drm/sched: Fix teardown leaks with waitqueue
  drm/sched: Prevent teardown waitque from blocking too long
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown
  drm/sched: Port unit tests to new cleanup design

 drivers/gpu/drm/nouveau/nouveau_abi16.c       |  4 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       | 39 +++++---
 drivers/gpu/drm/nouveau/nouveau_sched.h       | 12 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 98 +++++++++++++++----
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 34 ++++---
 include/drm/gpu_scheduler.h                   | 17 ++++
 8 files changed, 154 insertions(+), 60 deletions(-)

-- 
2.48.1


             reply	other threads:[~2025-04-24  9:56 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24  9:55 Philipp Stanner [this message]
2025-04-24  9:55 ` [PATCH v2 1/6] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
2025-05-16  9:19   ` Tvrtko Ursulin
2025-04-24  9:55 ` [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long Philipp Stanner
2025-05-16  9:33   ` Tvrtko Ursulin
2025-05-16  9:53     ` Danilo Krummrich
2025-05-16 10:19       ` Tvrtko Ursulin
2025-05-16 10:54         ` Danilo Krummrich
2025-05-16 11:35           ` Tvrtko Ursulin
2025-05-16 12:00             ` Danilo Krummrich
2025-05-16 15:48               ` Tvrtko Ursulin
2025-05-16  9:54     ` Philipp Stanner
2025-05-16 10:34       ` Tvrtko Ursulin
2025-04-24  9:55 ` [PATCH v2 3/6] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-05-16  9:40   ` Tvrtko Ursulin
2025-04-24  9:55 ` [PATCH v2 4/6] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-04-24  9:55 ` [PATCH v2 5/6] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
2025-04-24  9:55 ` [PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design Philipp Stanner
2025-05-08 11:03   ` Philipp Stanner
2025-05-08 12:51     ` Tvrtko Ursulin
2025-05-12  8:00       ` Philipp Stanner
2025-05-14  8:30         ` Tvrtko Ursulin
2025-05-14  9:19           ` Philipp Stanner
2025-05-16  9:00             ` Tvrtko Ursulin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250424095535.26119-2-phasta@kernel.org \
    --to=phasta@kernel.org \
    --cc=airlied@gmail.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lyude@redhat.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=matthew.brost@intel.com \
    --cc=mripard@kernel.org \
    --cc=nouveau@lists.freedesktop.org \
    --cc=simona@ffwll.ch \
    --cc=tvrtko.ursulin@igalia.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.