* [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini()
@ 2025-04-07 15:22 Philipp Stanner
  2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

Changes since the RFC:
  - (None)

Howdy,

as many of you know, we have potential memory leaks in drm_sched_fini()
which various parties have tried to solve with various methods in the
past.

In our past discussions, we came to the conclusion that the simplest
solution, blocking in drm_sched_fini(), is not possible because it could
cause processes to ignore SIGKILL and block for too long (which could
turn out to be an effective way to generate a funny email from Linus,
though :) )

Another idea was to have submitted jobs refcount the scheduler. I
investigated this and we found that this then *additionally* would
require us to have *the scheduler* refcount everything *in the driver*
that is accessed through the still-running callbacks, since the driver
might want to unload after a non-blocking drm_sched_fini() call. So
that's also no solution.

This RFC here is a new approach, somewhat based on the original
waitqueue idea. It looks as follows:

1. Have drm_sched_fini() block on a waitqueue until the pending_list
   becomes empty, as a first step.
2. Provide the scheduler with a callback with which it can instruct the
   driver to kill the associated fence context. This will cause all
   pending hardware fences to get signalled. (Credit to Danilo, whose
   idea this was)
3. In drm_sched_fini(), first switch off submission of new jobs and
   timeouts (the latter might not be strictly necessary, but is probably
   cleaner).
4. Then, call the aforementioned callback, ensuring that free_job() will
   be called for all remaining jobs relatively quickly. This has the
   great advantage that the jobs get cleaned up through the standard
   mechanism.
5. Once all jobs are gone, also switch off the free_job() work item and
   then proceed as usual.

Furthermore, since there is now such a callback, we can provide an
if-branch checking for its existence. If the driver doesn't provide it,
drm_sched_fini() operates in "legacy mode". So none of the existing
drivers should notice a difference and we remain fully backwards
compatible.

Our glorious beta-tester is Nouveau, which so far had its own waitqueue
solution, which is now obsolete. The last two patches port Nouveau and
remove that waitqueue.

I've tested this on a desktop environment with Nouveau. Works fine and
solves the problem (though we did discover an unrelated problem inside
Nouveau in the process).

Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.

I'm looking forward to your input and feedback. I really hope we can
work this RFC into something that can provide users with a more
reliable, clean scheduler API.

Philipp

Philipp Stanner (5):
  drm/sched: Fix teardown leaks with waitqueue
  drm/sched: Prevent teardown waitque from blocking too long
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_abi16.c |   4 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c   |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c |  39 +++++----
 drivers/gpu/drm/nouveau/nouveau_sched.h |  12 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |   8 +-
 drivers/gpu/drm/scheduler/sched_main.c  | 111 +++++++++++++++++++-----
 include/drm/gpu_scheduler.h             |  19 ++++
 7 files changed, 146 insertions(+), 49 deletions(-)

-- 
2.48.1



Thread overview: 31+ messages
2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
2025-04-17  7:49   ` Philipp Stanner
2025-04-07 15:22 ` [PATCH 2/5] drm/sched: Prevent teardown waitque from blocking too long Philipp Stanner
2025-04-07 15:22 ` [PATCH 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-04-17  7:45   ` Philipp Stanner
2025-04-17 11:27     ` Tvrtko Ursulin
2025-04-17 12:11       ` Danilo Krummrich
2025-04-17 14:20         ` Tvrtko Ursulin
2025-04-17 14:48           ` Danilo Krummrich
2025-04-17 16:08             ` Tvrtko Ursulin
2025-04-17 17:07               ` Danilo Krummrich
2025-04-22  6:06               ` Philipp Stanner
2025-04-22 10:39                 ` Tvrtko Ursulin
2025-04-22 11:13                   ` Danilo Krummrich
2025-04-22 12:00                     ` Philipp Stanner
2025-04-22 13:25                       ` Tvrtko Ursulin
2025-04-22 12:07                     ` Tvrtko Ursulin
2025-04-22 12:21                       ` Philipp Stanner
2025-04-22 12:32                       ` Danilo Krummrich
2025-04-22 13:39                         ` Tvrtko Ursulin
2025-04-22 13:46                           ` Philipp Stanner
2025-04-22 14:08                           ` Danilo Krummrich
2025-04-22 14:16                             ` Philipp Stanner
2025-04-22 14:52                               ` Danilo Krummrich
2025-04-23  7:34                                 ` Tvrtko Ursulin
2025-04-23  8:48                                   ` Danilo Krummrich
2025-04-23 10:10                                     ` Tvrtko Ursulin
2025-04-23 10:26                                       ` Danilo Krummrich
2025-04-07 15:22 ` [PATCH 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-04-07 15:22 ` [PATCH 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
